CJA SDR Generator - Batch Processing Guide

Overview

The CJA SDR Generator supports batch processing of multiple data views in parallel worker processes, which typically improves throughput 3-4x compared with processing the same data views one at a time.

Quick Start

Single Data View

# Process a single data view
cja_auto_sdr dv_677ea9291244fd082f02dd42

Multiple Data Views (Automatic Batch Mode)

# Automatically triggers parallel batch processing
cja_auto_sdr dv_12345 dv_67890 dv_abcde

Note: When you provide multiple data view IDs, the script automatically enables parallel processing with auto-detected workers (based on CPU cores and workload). The --batch flag is optional.
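The exact auto-detection heuristic is not spelled out in this guide. As a rough sketch only, one plausible approach caps the worker count by the number of data views, the number of CPU cores, and a fixed ceiling. The function name `pick_worker_count` and the ceiling of 8 are assumptions made for illustration, not the tool's actual logic.

```python
import os


def pick_worker_count(num_data_views: int, max_workers: int = 8) -> int:
    """Hypothetical heuristic: never use more workers than jobs, cores, or the cap."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(num_data_views, cores, max_workers))
```

With this sketch a single data view always gets one worker, and a large batch is bounded by the machine's cores or the ceiling, whichever is smaller.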

Batch Processing with Custom Configuration

# Explicitly use batch mode with custom settings
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde dv_11111 --workers 8

Command-Line Arguments

Required Arguments

  • DATA_VIEW_ID [DATA_VIEW_ID ...] - One or more data view IDs (must start with dv_)

Optional Arguments

| Argument | Description | Default |
|---|---|---|
| --profile NAME / -p | Use named profile from ~/.cja/orgs/<NAME>/ | None |
| --batch | Explicitly enable batch mode (optional with multiple data views) | Auto-detect (parallel if multiple data views) |
| --workers N | Number of parallel workers (1-256), or auto for intelligent detection | auto |
| --log-format FORMAT | Log output format: text or json (for Splunk/ELK/CloudWatch) | text |
| --output-dir PATH | Output directory for generated files | Current directory |
| --config-file PATH | Path to CJA configuration file (ignored if --profile used) | config.json |
| --continue-on-error | Continue processing if one data view fails | Stop on first error |
| --log-level LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) | INFO |
| --enable-cache | Enable validation result caching | Disabled |
| --clear-cache | Clear cache before processing (use with --enable-cache) | - |
| --cache-size N | Maximum cached entries (>= 1) | 1000 |
| --cache-ttl N | Cache time-to-live in seconds (>= 1) | 3600 |
| --shared-cache | Share validation cache across batch workers | Disabled |
| --api-auto-tune | Enable automatic API worker tuning | Disabled |
| --api-min-workers N | Minimum workers for auto-tuning | 1 |
| --api-max-workers N | Maximum workers for auto-tuning | 10 |
| --circuit-breaker | Enable circuit breaker pattern | Disabled |
| --circuit-failure-threshold N | Failures before opening circuit | 5 |
| --circuit-timeout N | Recovery timeout in seconds | 30 |
| --include-segments | Include segments inventory in output | Disabled |
| --include-derived | Include derived fields inventory in output | Disabled |
| --include-calculated | Include calculated metrics inventory in output | Disabled |
| --inventory-only | Output only inventory sheets (requires --include-*) | Disabled |
| -h, --help | Show help message and exit | - |
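For readers unfamiliar with the circuit breaker pattern behind --circuit-breaker, the following is a minimal generic sketch of how a failure threshold and a recovery timeout usually interact. It is not the tool's internal implementation; the class and method names are hypothetical, and only the default values (5 failures, 30 seconds) are taken from the table above.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; not the tool's actual implementation."""

    def __init__(self, failure_threshold=5, timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock  # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the recovery timeout has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.timeout

    def record_success(self):
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the timeout.
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each API call and report the outcome with `record_success()` or `record_failure()`, so that repeated failures pause further calls until the timeout passes.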

Usage Examples

Basic Examples

# Single data view
cja_auto_sdr dv_12345

# Multiple data views (automatically triggers parallel batch processing)
cja_auto_sdr dv_12345 dv_67890 dv_abcde

# Explicitly use batch mode (same result as above when multiple data views)
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde

Advanced Examples

# Use a profile for credentials (recommended for multi-org)
cja_auto_sdr --profile client-a dv_12345 dv_67890 dv_abcde

# Custom number of workers (conservative for shared API)
cja_auto_sdr --batch dv_12345 dv_67890 --workers 2

# Custom output directory
cja_auto_sdr dv_12345 --output-dir ./reports

# Continue processing even if some data views fail
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --continue-on-error

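# Batch processing with custom log level
# (list IDs explicitly; an unquoted dv_* is expanded by the shell against local file names)
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --log-level WARNING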

# Full production example
cja_auto_sdr --batch \
  dv_12345 dv_67890 dv_abcde \
  --workers 4 \
  --output-dir ./sdr_reports \
  --continue-on-error \
  --log-level INFO

Reading Data Views from a File

# Create a file with data view IDs (one per line)
cat > dataviews.txt <<EOF
dv_12345
dv_67890
dv_abcde
dv_99999
EOF

# Process all data views from file
cja_auto_sdr --batch $(cat dataviews.txt)

# With continue-on-error
cja_auto_sdr --batch \
  $(cat dataviews.txt) \
  --continue-on-error \
  --output-dir ./batch_reports
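If the ID file is maintained by hand, a small wrapper can be more forgiving than `$(cat ...)`. The sketch below is one possible approach in Python: it skips blank lines and `#` comment lines (a convenience assumed here, not something the CLI itself supports) and builds the same argument list as the command above. The function names are hypothetical.

```python
from pathlib import Path


def parse_data_view_ids(text):
    """Return stripped, non-empty, non-comment lines from the given text."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def load_data_view_ids(path):
    """Read the ID file from disk and parse it."""
    return parse_data_view_ids(Path(path).read_text(encoding="utf-8"))


def build_command(ids, output_dir="./batch_reports"):
    """Build the argv list for the batch invocation shown above."""
    return ["cja_auto_sdr", "--batch", *ids,
            "--continue-on-error", "--output-dir", output_dir]
```

The resulting list can be passed to `subprocess.run(build_command(load_data_view_ids("dataviews.txt")), check=True)`, which avoids shell word-splitting altogether.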

Error Handling

No Arguments Provided

$ cja_auto_sdr

usage: cja_auto_sdr [-h] [--batch] ... DATA_VIEW_ID [DATA_VIEW_ID ...]
cja_auto_sdr: error: the following arguments are required: DATA_VIEW_ID

Invalid Data View ID Format

$ cja_auto_sdr invalid_id test123

ERROR: Invalid data view ID format: invalid_id, test123
       Data view IDs should start with 'dv_'
       Example: dv_677ea9291244fd082f02dd42
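To catch malformed IDs before invoking the CLI (for example in a wrapper script), the prefix rule can be checked up front. This is a minimal sketch that tests only the documented `dv_` prefix; the function names are hypothetical, and the tool may apply stricter checks than this.

```python
def split_valid_invalid(ids):
    """Partition IDs by the documented rule: they must start with 'dv_'."""
    valid = [i for i in ids if i.startswith("dv_")]
    invalid = [i for i in ids if not i.startswith("dv_")]
    return valid, invalid


def format_error(invalid):
    """Mirror the first line of the error message shown above."""
    return "ERROR: Invalid data view ID format: " + ", ".join(invalid)
```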

Help Output

$ cja_auto_sdr --help

# Displays full help with all options and examples

Performance Comparison

Single Data View Processing

1 data view × 35s = 35 seconds per data view

Multiple Data Views (Automatic Parallel Batch Processing with 4 Workers)

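10 data views / 4 workers × 35s = ~87.5 seconds (ideal lower bound, about 1.5 minutes)
In practice: 3 rounds × 35s = ~105 seconds, because 10 data views do not divide evenly across 4 workers
Improvement: roughly 3-4x faster than processing sequentially (350 seconds), or 70-75% time savings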

Note: Multiple data views automatically trigger parallel batch processing for optimal performance.
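A simple division treats the work as perfectly divisible, but a single data view cannot be split across workers. As a cross-check, the sketch below estimates batch time in whole rounds, under the simplifying assumptions that every data view takes the same time and that there is no startup or API overhead. The function name is hypothetical.

```python
import math


def estimate_batch_seconds(num_views, workers, seconds_per_view):
    """Round-based estimate: each round processes up to `workers` data views."""
    rounds = math.ceil(num_views / workers)
    return rounds * seconds_per_view


sequential = estimate_batch_seconds(10, 1, 35)  # 10 rounds of 35s
parallel = estimate_batch_seconds(10, 4, 35)    # 3 rounds of 35s
speedup = sequential / parallel
```

For 10 data views at 35 seconds each, this gives 350 seconds sequentially and 105 seconds with 4 workers, a speedup of about 3.3x.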

Worker Optimization

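| Workers | Best For | Performance (theoretical upper bound) |
|---|---|---|
| 1 | Testing, debugging | Baseline (100%) |
| 2 | Shared API, conservative | Up to ~2x faster |
| 4 | Balanced, good starting point | Up to ~4x faster |
| 8 | Dedicated infrastructure | Up to ~8x faster |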

Note: Actual performance depends on API rate limits, network latency, and system resources.

Batch Processing Output

Console Output

Processing 10 data view(s) in batch mode with 4 workers...

2026-01-07 12:00:00 - INFO - ============================================================
2026-01-07 12:00:00 - INFO - BATCH PROCESSING START
2026-01-07 12:00:00 - INFO - ============================================================
2026-01-07 12:00:00 - INFO - Data views to process: 10
2026-01-07 12:00:00 - INFO - Parallel workers: 4
2026-01-07 12:00:00 - INFO - Continue on error: True
2026-01-07 12:00:00 - INFO - Output directory: .
2026-01-07 12:00:00 - INFO - ============================================================

2026-01-07 12:00:15 - INFO - ✓ dv_12345: SUCCESS (14.5s)
2026-01-07 12:00:16 - INFO - ✓ dv_67890: SUCCESS (15.2s)
2026-01-07 12:00:18 - ERROR - ✗ dv_abc123: FAILED - Data view validation failed
2026-01-07 12:00:20 - INFO - ✓ dv_def456: SUCCESS (16.1s)
...

============================================================
BATCH PROCESSING SUMMARY
============================================================
Total data views: 10
Successful: 8
Failed: 2
Success rate: 80.0%
Total duration: 125.3s
Average per data view: 15.7s

Successful Data Views:
  ✓ dv_12345         Production Analytics        14.5s
  ✓ dv_67890         Development Analytics       15.2s
  ✓ dv_def456        Testing Analytics           16.1s
  ...

Failed Data Views:
  ✗ dv_abc123        Data view validation failed
  ✗ dv_xyz789        No metrics or dimensions found

============================================================
Throughput: 4.8 data views per minute
============================================================
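The summary figures can be reproduced from three inputs. The sketch below is an assumption about how such figures are typically derived, not the tool's code; it happens to match the sample above when throughput counts all attempted data views, including failed ones.

```python
def summarize(total, successful, duration_seconds):
    """Derive failure count, success rate (%), and throughput (views/minute)."""
    success_rate = 100.0 * successful / total if total else 0.0
    throughput = 60.0 * total / duration_seconds if duration_seconds > 0 else 0.0
    return {
        "failed": total - successful,
        "success_rate": round(success_rate, 1),
        "throughput_per_min": round(throughput, 1),
    }
```

Calling `summarize(10, 8, 125.3)` yields 2 failures, an 80.0% success rate, and 4.8 data views per minute.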

Log Files

Batch Mode:

  • logs/SDR_Batch_Generation_YYYYMMDD_HHMMSS.log - Main batch log

Single Mode:

  • logs/SDR_Generation_{DATA_VIEW_ID}_YYYYMMDD_HHMMSS.log - Per data view log
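When locating logs from a script, the two naming patterns above can be expressed as a small helper. This is only a sketch of the documented patterns; the function name is hypothetical, and it assumes the timestamp is the local start time of the run.

```python
from datetime import datetime


def log_file_name(now, data_view_id=None):
    """Build a log path: batch pattern if no ID is given, per-view otherwise."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    if data_view_id is None:
        return f"logs/SDR_Batch_Generation_{stamp}.log"
    return f"logs/SDR_Generation_{data_view_id}_{stamp}.log"
```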

Scheduled Processing

Linux/macOS (Cron Job)

# Add to crontab (crontab -e)
# Note: In crontab, % has special meaning (newline), so it must be escaped with \
# Note: Each crontab entry must be written on a single line; backslash line continuation is not supported

# Process all data views nightly at 2 AM
0 2 * * * cd /path/to/project && cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --output-dir /reports/$(date +\%Y\%m\%d) --continue-on-error --log-level WARNING

# Process weekly on Sunday at midnight (%G is the ISO year that matches ISO week %V)
0 0 * * 0 cd /path/to/project && cja_auto_sdr --batch $(cat /path/to/dataviews.txt) --workers 8 --output-dir /weekly_reports/$(date +\%G_week_\%V) --continue-on-error

Windows (Task Scheduler)

# Create a scheduled task to run nightly at 2 AM
$action = New-ScheduledTaskAction -Execute "C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe" `
  -Argument "--batch dv_12345 dv_67890 dv_abcde --output-dir C:\reports --continue-on-error --log-level WARNING" `
  -WorkingDirectory "C:\path\to\project"
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "CJA SDR Nightly" -Description "Generate CJA SDR reports"

# Or create a weekly task for Sunday at midnight
$weeklyAction = New-ScheduledTaskAction -Execute "C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe" `
  -Argument "--batch dv_12345 dv_67890 --workers 8 --output-dir C:\weekly_reports --continue-on-error" `
  -WorkingDirectory "C:\path\to\project"
$weeklyTrigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 12am
Register-ScheduledTask -Action $weeklyAction -Trigger $weeklyTrigger -TaskName "CJA SDR Weekly"

Alternatively via Task Scheduler GUI:

  1. Open Task Scheduler (search "Task Scheduler" in Start menu)
  2. Click "Create Basic Task..."
  3. Set schedule (Daily/Weekly)
  4. Action: "Start a program"
  5. Program: C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe
  6. Arguments: --batch dv_12345 --output-dir C:\reports
  7. Start in: C:\path\to\project

Best Practices

1. Worker Configuration

# Conservative (shared API with rate limits)
--workers 2

# Balanced (works well for most cases; the default is --workers auto)
--workers 4

# Aggressive (dedicated infrastructure)
--workers 8

2. Error Handling

# Stop on first error (default, good for testing)
cja_auto_sdr --batch dv_1 dv_2 dv_3

# Continue on error (good for production, get as many as possible)
cja_auto_sdr --batch dv_1 dv_2 dv_3 --continue-on-error

3. Output Organization

# Organize by date
--output-dir ./reports/$(date +%Y/%m/%d)

# Organize by environment
--output-dir ./reports/production
--output-dir ./reports/staging
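The two layouts above can also be combined. As an illustrative sketch, a wrapper script could compute an output directory that nests the date under the environment; the function name and the environment-then-date ordering are assumptions, not a convention the tool enforces.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_output_dir(base, day, environment=None):
    """Return base[/environment]/YYYY/MM/DD as a POSIX-style path string."""
    parts = [base]
    if environment:
        parts.append(environment)
    parts.append(day.strftime("%Y/%m/%d"))
    return str(PurePosixPath(*parts))
```

The returned string can be passed directly as the value of --output-dir.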

4. Logging Levels

# Development/debugging
--log-level DEBUG

# Production (default)
--log-level INFO

# Production (quiet, only warnings/errors)
--log-level WARNING

Troubleshooting

Issue: "No module named 'cjapy'"

Solution: Use uv run to execute the script so that it runs inside the project's environment with its dependencies installed:

uv run cja_auto_sdr dv_12345

Issue: "error: the following arguments are required: DATA_VIEW_ID"

Solution: Provide at least one data view ID:

cja_auto_sdr dv_12345

Issue: "Invalid data view ID format"

Solution: Ensure data view IDs start with dv_:

# Wrong
cja_auto_sdr 12345

# Correct
cja_auto_sdr dv_12345

Issue: Permission denied writing Excel file

Solution: Close any open Excel files or specify a different output directory:

cja_auto_sdr dv_12345 --output-dir ./new_reports

Issue: API rate limiting

Solution: Reduce the number of workers:

cja_auto_sdr --batch dv_1 dv_2 dv_3 --workers 2

Migration from Old Version

Before (Hardcoded Data View)

# Old way: edit the script to change the data view
data_view = "dv_677ea9291244fd082f02dd42"  # hardcoded inside the script

# ...then run it with no arguments
cja_auto_sdr

After (Command-Line Arguments)

# New way: Specify data view(s) as arguments
cja_auto_sdr dv_677ea9291244fd082f02dd42

# Or multiple at once
cja_auto_sdr dv_12345 dv_67890

Technical Details

Multiprocessing Architecture

  • ProcessPoolExecutor: True parallelism (separate processes)
  • No GIL limitations: Full CPU utilization
  • Isolated processing: Each data view runs in its own process
  • Fault tolerance: One failure doesn't affect others
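The bullets above describe a common pattern with Python's `concurrent.futures`. The following is a minimal sketch of that pattern, not the tool's source: `process_data_view` is a placeholder worker, `run_batch` is a hypothetical name, and the `executor_cls` parameter is added here only so the logic can be exercised with threads in tests.

```python
import concurrent.futures as cf


def process_data_view(data_view_id):
    """Placeholder worker; the real tool would generate an SDR here."""
    if not data_view_id.startswith("dv_"):
        raise ValueError(f"Invalid data view ID format: {data_view_id}")
    return f"{data_view_id}: SUCCESS"


def run_batch(ids, workers=4, continue_on_error=False,
              executor_cls=cf.ProcessPoolExecutor):
    """Run one job per data view; a failure in one job does not crash the others."""
    results, failures = {}, {}
    with executor_cls(max_workers=workers) as pool:
        futures = {pool.submit(process_data_view, dv): dv for dv in ids}
        for future in cf.as_completed(futures):
            dv = futures[future]
            try:
                results[dv] = future.result()
            except Exception as exc:
                failures[dv] = str(exc)
                if not continue_on_error:
                    for pending in futures:
                        pending.cancel()  # only cancels jobs that have not started
                    break
    return results, failures
```

On platforms that start worker processes with spawn (Windows, and macOS by default), a call such as `run_batch(["dv_12345", "dv_67890"], workers=2)` should sit under an `if __name__ == "__main__":` guard.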

Memory Management

  • Each worker process has its own memory space
  • No shared state between workers by default (--shared-cache opts in to a shared validation cache)
  • Automatic cleanup after completion
  • Suitable for processing large datasets

API Efficiency

  • Parallel API calls to CJA endpoints
  • ThreadPoolExecutor for I/O-bound API fetching within each process
  • Optimized to minimize API call overhead
  • Respects API rate limits (adjust workers as needed)
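Inside a single worker process, I/O-bound fetches can overlap using threads. The sketch below illustrates the idea with stand-in fetchers; the names `fetch_components`, `fake_metrics`, and `fake_dimensions` are hypothetical, and real fetchers would call the CJA API instead of returning canned values.

```python
import concurrent.futures as cf


def fetch_components(data_view_id, fetchers, max_threads=3):
    """Run each fetcher(data_view_id) in its own thread; collect results by name."""
    with cf.ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {name: pool.submit(fn, data_view_id)
                   for name, fn in fetchers.items()}
        # result() blocks until that fetch finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


# Stand-in fetchers used only for illustration.
def fake_metrics(data_view_id):
    return [f"{data_view_id}/metric_a"]


def fake_dimensions(data_view_id):
    return [f"{data_view_id}/dimension_a"]
```

Because the threads spend most of their time waiting on the network, the GIL is not a bottleneck here, which is why threads are a reasonable fit within each process.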

Support

For issues, questions, or feature requests:

  1. Check this guide first
  2. Review error messages and logs
  3. Try with --log-level DEBUG for detailed output
  4. Use --help to see all available options

See Also