CJA SDR Generator - Batch Processing Guide

Overview

The CJA SDR Generator supports batch processing of multiple data views in parallel worker processes, typically improving throughput 3-4x over processing them one at a time.

Quick Start

Single Data View

# Process a single data view
cja_auto_sdr dv_677ea9291244fd082f02dd42

Multiple Data Views (Automatic Batch Mode)

# Automatically triggers parallel batch processing
cja_auto_sdr dv_12345 dv_67890 dv_abcde

Note: When you provide multiple data view IDs, the script automatically enables parallel processing with auto-detected workers (based on CPU cores and workload). The --batch flag is optional.
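
The exact auto-detection heuristic is not documented in this guide. As a rough illustration only, one plausible rule caps the worker count by the number of data views, the available CPU cores, and the documented 1-256 range. The function name `auto_workers` and the rule itself are assumptions, not the tool's actual logic.

```python
import os


def auto_workers(num_data_views, cpu_count=None, max_workers=256):
    """Hypothetical heuristic: never use more workers than data views or CPU cores."""
    # os.cpu_count() can return None, so fall back to a single core.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Clamp to at least 1 worker and at most the documented upper bound.
    return max(1, min(num_data_views, cpus, max_workers))


if __name__ == "__main__":
    print(auto_workers(3, cpu_count=8))   # limited by the number of data views
    print(auto_workers(10, cpu_count=4))  # limited by CPU cores
```

Passing `--workers N` explicitly bypasses whatever detection the tool performs.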

Batch Processing with Custom Configuration

# Explicitly use batch mode with custom settings
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde dv_11111 --workers 8

Command-Line Arguments

Required Arguments

DATA_VIEW_ID [DATA_VIEW_ID ...] - One or more data view IDs (must start with dv_)

Optional Arguments

Argument	Description	Default
`--profile NAME` / `-p`	Use named profile from `~/.cja/orgs/<NAME>/`	None
`--batch`	Explicitly enable batch mode (optional with multiple data views)	Auto-detect (parallel if multiple data views)
`--workers N`	Number of parallel workers (1-256), or `auto` for intelligent detection	auto
`--log-format FORMAT`	Log output format: `text` or `json` (for Splunk/ELK/CloudWatch)	text
`--output-dir PATH`	Output directory for generated files	Current directory
`--config-file PATH`	Path to CJA configuration file (ignored if `--profile` used)	config.json
`--continue-on-error`	Continue processing if one data view fails	Stop on first error
`--log-level LEVEL`	Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)	INFO
`--enable-cache`	Enable validation result caching	Disabled
`--clear-cache`	Clear cache before processing (use with --enable-cache)	-
`--cache-size N`	Maximum cached entries (>= 1)	1000
`--cache-ttl N`	Cache time-to-live in seconds (>= 1)	3600
`--shared-cache`	Share validation cache across batch workers	Disabled
`--api-auto-tune`	Enable automatic API worker tuning	Disabled
`--api-min-workers N`	Minimum workers for auto-tuning	1
`--api-max-workers N`	Maximum workers for auto-tuning	10
`--circuit-breaker`	Enable circuit breaker pattern	Disabled
`--circuit-failure-threshold N`	Failures before opening circuit	5
`--circuit-timeout N`	Recovery timeout in seconds	30
`--include-segments`	Include segments inventory in output	Disabled
`--include-derived`	Include derived fields inventory in output	Disabled
`--include-calculated`	Include calculated metrics inventory in output	Disabled
`--inventory-only`	Output only inventory sheets (requires `--include-*`)	Disabled
`-h, --help`	Show help message and exit	-
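
To make the `--circuit-failure-threshold` and `--circuit-timeout` options more concrete, here is a minimal sketch of the general circuit breaker pattern. It is not the tool's implementation: the class name, method names, and the injectable `clock` parameter are all assumptions chosen for illustration. The defaults mirror the table above (5 failures, 30 seconds).

```python
import time


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not the tool's own class)."""

    def __init__(self, failure_threshold=5, timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Circuit is open: allow a trial call only after the recovery timeout.
        return self.clock() - self.opened_at >= self.timeout

    def record_success(self):
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Too many failures: open the circuit and remember when.
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each API call and report the outcome with `record_success()` or `record_failure()`.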

Usage Examples

Basic Examples

# Single data view
cja_auto_sdr dv_12345

# Multiple data views (automatically triggers parallel batch processing)
cja_auto_sdr dv_12345 dv_67890 dv_abcde

# Explicitly use batch mode (same result as above when multiple data views)
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde

Advanced Examples

# Use a profile for credentials (recommended for multi-org)
cja_auto_sdr --profile client-a dv_12345 dv_67890 dv_abcde

# Custom number of workers (conservative for shared API)
cja_auto_sdr --batch dv_12345 dv_67890 --workers 2

# Custom output directory
cja_auto_sdr dv_12345 --output-dir ./reports

# Continue processing even if some data views fail
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --continue-on-error

# Batch processing with custom log level
cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --log-level WARNING

# Full production example
cja_auto_sdr --batch \
  dv_12345 dv_67890 dv_abcde \
  --workers 4 \
  --output-dir ./sdr_reports \
  --continue-on-error \
  --log-level INFO

Reading Data Views from a File

# Create a file with data view IDs (one per line)
cat > dataviews.txt <<EOF
dv_12345
dv_67890
dv_abcde
dv_99999
EOF

# Process all data views from file
cja_auto_sdr --batch $(cat dataviews.txt)

# With continue-on-error
cja_auto_sdr --batch \
  $(cat dataviews.txt) \
  --continue-on-error \
  --output-dir ./batch_reports
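
If the ID file is maintained by hand, a small wrapper can be more forgiving than `$(cat ...)`. The sketch below reads the file, skips blank lines and `#` comments (the comment convention is an assumption, not something the tool defines), and builds the argument list. The helper names `read_data_view_ids` and `build_command` are hypothetical.

```python
from pathlib import Path


def read_data_view_ids(path):
    """Return data view IDs from a text file, one per line."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (assumed convention).
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def build_command(ids, continue_on_error=True):
    """Build the cja_auto_sdr argument list using flags documented above."""
    cmd = ["cja_auto_sdr", "--batch", *ids]
    if continue_on_error:
        cmd.append("--continue-on-error")
    return cmd


if __name__ == "__main__":
    import subprocess

    # Runs the CLI with the IDs from dataviews.txt; requires cja_auto_sdr on PATH.
    subprocess.run(build_command(read_data_view_ids("dataviews.txt")), check=False)
```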

Error Handling

No Arguments Provided

$ cja_auto_sdr

usage: cja_auto_sdr [-h] [--batch] ... DATA_VIEW_ID [DATA_VIEW_ID ...]
cja_auto_sdr: error: the following arguments are required: DATA_VIEW_ID

Invalid Data View ID Format

$ cja_auto_sdr invalid_id test123

ERROR: Invalid data view ID format: invalid_id, test123
       Data view IDs should start with 'dv_'
       Example: dv_677ea9291244fd082f02dd42
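
The check behind this error can be approximated as follows. This is a sketch assuming the only rule is the `dv_` prefix stated above; the function names are hypothetical and the real validation may be stricter.

```python
def find_invalid_ids(data_view_ids):
    """Return the IDs that do not start with the required 'dv_' prefix."""
    return [dv for dv in data_view_ids if not dv.startswith("dv_")]


def format_invalid_id_error(invalid):
    """Format a message shaped like the error shown above."""
    return "ERROR: Invalid data view ID format: " + ", ".join(invalid)


if __name__ == "__main__":
    bad = find_invalid_ids(["invalid_id", "test123", "dv_12345"])
    if bad:
        print(format_invalid_id_error(bad))
```

Running the same check before invoking the tool from a script catches malformed IDs without starting a batch.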

Help Output

$ cja_auto_sdr --help

# Displays full help with all options and examples

Performance Comparison

Single Data View Processing

1 data view × 35s = 35 seconds per data view

Multiple Data Views (Automatic Parallel Batch Processing with 4 Workers)

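Sequential: 10 data views × 35s = 350 seconds (about 6 minutes)
Parallel: ceil(10 / 4) = 3 rounds × 35s = ~105 seconds (under 2 minutes)
Improvement: ~3.3x faster than processing sequentially (70% time savings). The ideal figure of 10 / 4 × 35s = 87.5 seconds (4x) is reached only when the data views divide evenly across workers.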

Note: Multiple data views automatically trigger parallel batch processing for optimal performance.
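
For planning purposes, the arithmetic can be sketched as below, assuming every data view takes the same time and ignoring API rate limits and startup overhead. Dividing the work evenly gives an ideal lower bound; because each data view runs whole on one worker, a round-based estimate is usually closer. Both function names are hypothetical.

```python
import math


def ideal_batch_seconds(num_data_views, workers, seconds_each):
    """Lower bound: total work spread perfectly evenly across workers."""
    return num_data_views / workers * seconds_each


def estimated_batch_seconds(num_data_views, workers, seconds_each):
    """Round-based estimate: workers process whole data views in rounds."""
    rounds = math.ceil(num_data_views / workers)
    return rounds * seconds_each


if __name__ == "__main__":
    print(ideal_batch_seconds(10, 4, 35))      # 87.5
    print(estimated_batch_seconds(10, 4, 35))  # 105
```

With 8 data views and 4 workers the two estimates coincide at 70 seconds, which is the case where the full 4x speedup is achievable.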

Worker Optimization

Workers	Best For	Performance (ideal upper bound)
1	Testing, debugging	Baseline (100%)
2	Shared API, conservative	Up to ~2x faster
4	Balanced, general use	Up to ~4x faster
8	Dedicated infrastructure	Up to ~8x faster

Note: Actual performance depends on API rate limits, network latency, and system resources.

Batch Processing Output

Console Output

Processing 10 data view(s) in batch mode with 4 workers...

2026-01-07 12:00:00 - INFO - ============================================================
2026-01-07 12:00:00 - INFO - BATCH PROCESSING START
2026-01-07 12:00:00 - INFO - ============================================================
2026-01-07 12:00:00 - INFO - Data views to process: 10
2026-01-07 12:00:00 - INFO - Parallel workers: 4
2026-01-07 12:00:00 - INFO - Continue on error: False
2026-01-07 12:00:00 - INFO - Output directory: .
2026-01-07 12:00:00 - INFO - ============================================================

2026-01-07 12:00:15 - INFO - ✓ dv_12345: SUCCESS (14.5s)
2026-01-07 12:00:16 - INFO - ✓ dv_67890: SUCCESS (15.2s)
2026-01-07 12:00:18 - ERROR - ✗ dv_abc123: FAILED - Data view validation failed
2026-01-07 12:00:20 - INFO - ✓ dv_def456: SUCCESS (16.1s)
...

============================================================
BATCH PROCESSING SUMMARY
============================================================
Total data views: 10
Successful: 8
Failed: 2
Success rate: 80.0%
Total duration: 125.3s
Average per data view: 15.7s

Successful Data Views:
  ✓ dv_12345         Production Analytics        14.5s
  ✓ dv_67890         Development Analytics       15.2s
  ✓ dv_def456        Testing Analytics           16.1s
  ...

Failed Data Views:
  ✗ dv_abc123        Data view validation failed
  ✗ dv_xyz789        No metrics or dimensions found

============================================================
Throughput: 4.8 data views per minute
============================================================
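
The summary figures can be reproduced from three numbers in the sample output. This is a sketch of the arithmetic only, assuming throughput counts all data views (successful or not) over the total duration; the function name is hypothetical.

```python
def summarize(total, successful, duration_seconds):
    """Compute summary metrics like those printed at the end of a batch run."""
    return {
        "failed": total - successful,
        "success_rate": round(100 * successful / total, 1),
        # Data views per minute over the whole run.
        "throughput_per_minute": round(total / duration_seconds * 60, 1),
    }


if __name__ == "__main__":
    print(summarize(total=10, successful=8, duration_seconds=125.3))
```

For the sample run above this yields 2 failures, an 80.0% success rate, and 4.8 data views per minute.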

Log Files

Batch Mode:

logs/SDR_Batch_Generation_YYYYMMDD_HHMMSS.log - Main batch log

Single Mode:

logs/SDR_Generation_{DATA_VIEW_ID}_YYYYMMDD_HHMMSS.log - Per data view log
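
When locating logs from a script, the naming patterns above can be rebuilt with `strftime`. The helper below is a hypothetical sketch; in particular, choosing batch naming whenever more than one ID is given is an assumption, since an explicit `--batch` with a single ID may also produce a batch log.

```python
from datetime import datetime
from pathlib import Path


def log_path(data_view_ids, now=None, log_dir="logs"):
    """Build a log file path following the documented naming patterns."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    if len(data_view_ids) > 1:
        # Assumed: multiple IDs means batch mode.
        name = f"SDR_Batch_Generation_{stamp}.log"
    else:
        name = f"SDR_Generation_{data_view_ids[0]}_{stamp}.log"
    return Path(log_dir) / name


if __name__ == "__main__":
    print(log_path(["dv_12345"]))
    print(log_path(["dv_12345", "dv_67890"]))
```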

Scheduled Processing

Linux/macOS (Cron Job)

# Add to crontab (crontab -e)
# Note: In crontab, % has special meaning (newline), so it must be escaped with \

# Process all data views nightly at 2 AM
# (each crontab entry must stay on one line; cron does not support backslash continuation)
0 2 * * * cd /path/to/project && cja_auto_sdr --batch dv_12345 dv_67890 dv_abcde --output-dir /reports/$(date +\%Y\%m\%d) --continue-on-error --log-level WARNING

# Process weekly on Sunday at midnight (single-line crontab entry)
0 0 * * 0 cd /path/to/project && cja_auto_sdr --batch $(cat /path/to/dataviews.txt) --workers 8 --output-dir /weekly_reports/$(date +\%Y_week_\%V) --continue-on-error

Windows (Task Scheduler)

# Create a scheduled task to run nightly at 2 AM
$action = New-ScheduledTaskAction -Execute "C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe" `
  -Argument "--batch dv_12345 dv_67890 dv_abcde --output-dir C:\reports --continue-on-error --log-level WARNING" `
  -WorkingDirectory "C:\path\to\project"
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "CJA SDR Nightly" -Description "Generate CJA SDR reports"

# Or create a weekly task for Sunday at midnight
$weeklyAction = New-ScheduledTaskAction -Execute "C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe" `
  -Argument "--batch dv_12345 dv_67890 --workers 8 --output-dir C:\weekly_reports --continue-on-error" `
  -WorkingDirectory "C:\path\to\project"
$weeklyTrigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 12am
Register-ScheduledTask -Action $weeklyAction -Trigger $weeklyTrigger -TaskName "CJA SDR Weekly"

Alternatively via Task Scheduler GUI:

1. Open Task Scheduler (search "Task Scheduler" in Start menu)
2. Click "Create Basic Task..."
3. Set schedule (Daily/Weekly)
4. Action: "Start a program"
5. Program: C:\path\to\project\.venv\Scripts\cja_auto_sdr.exe
6. Arguments: --batch dv_12345 --output-dir C:\reports
7. Start in: C:\path\to\project

Best Practices

1. Worker Configuration

# Conservative (shared API with rate limits)
--workers 2

# Balanced (works well for most cases; the default is auto)
--workers 4

# Aggressive (dedicated infrastructure)
--workers 8

2. Error Handling

# Stop on first error (default, good for testing)
cja_auto_sdr --batch dv_1 dv_2 dv_3

# Continue on error (good for production, get as many as possible)
cja_auto_sdr --batch dv_1 dv_2 dv_3 --continue-on-error

3. Output Organization

# Organize by date
--output-dir ./reports/$(date +%Y/%m/%d)

# Organize by environment
--output-dir ./reports/production
--output-dir ./reports/staging

4. Logging Levels

# Development/debugging
--log-level DEBUG

# Production (default)
--log-level INFO

# Production (quiet, only warnings/errors)
--log-level WARNING

Troubleshooting

Issue: "No module named 'cjapy'"

Solution: Use uv run to execute the script so that it runs inside the project environment, where cjapy is installed:

uv run cja_auto_sdr dv_12345

Issue: "error: the following arguments are required: DATA_VIEW_ID"

Solution: Provide at least one data view ID:

cja_auto_sdr dv_12345

Issue: "Invalid data view ID format"

Solution: Ensure data view IDs start with dv_:

# Wrong
cja_auto_sdr 12345

# Correct
cja_auto_sdr dv_12345

Issue: Permission denied writing Excel file

Solution: Close any open Excel files or specify a different output directory:

cja_auto_sdr dv_12345 --output-dir ./new_reports

Issue: API rate limiting

Solution: Reduce the number of workers:

cja_auto_sdr --batch dv_1 dv_2 dv_3 --workers 2

Migration from Old Version

Before (Hardcoded Data View)

# Old way: Edit script to change data view
data_view = "dv_677ea9291244fd082f02dd42"
cja_auto_sdr

After (Command-Line Arguments)

# New way: Specify data view(s) as arguments
cja_auto_sdr dv_677ea9291244fd082f02dd42

# Or multiple at once
cja_auto_sdr dv_12345 dv_67890

Technical Details

Multiprocessing Architecture

ProcessPoolExecutor: True parallelism (separate processes)
No GIL limitations: Full CPU utilization
Isolated processing: Each data view runs in its own process
Fault tolerance: One failure doesn't affect others
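
The architecture above can be sketched with the standard library. This is a minimal illustration, not the tool's code: `run_batch`, `process_one`, and the `executor_cls` parameter are hypothetical names, and the stop-on-first-error behavior is modeled by cancelling work that has not started yet.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_batch(data_view_ids, process_one, max_workers=4,
              continue_on_error=False, executor_cls=ProcessPoolExecutor):
    """Run process_one(dv_id) for each ID in parallel; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to its data view ID.
        futures = {pool.submit(process_one, dv): dv for dv in data_view_ids}
        for future in as_completed(futures):
            dv = futures[future]
            try:
                future.result()
                succeeded.append(dv)
            except Exception as exc:
                # A failure in one worker is recorded without affecting the others.
                failed[dv] = str(exc)
                if not continue_on_error:
                    # Cancel anything that has not started, then stop collecting.
                    for pending in futures:
                        pending.cancel()
                    break
    return succeeded, failed
```

With `ProcessPoolExecutor`, `process_one` must be a picklable top-level function and the call should sit under an `if __name__ == "__main__":` guard. Passing `ThreadPoolExecutor` as `executor_cls` is a convenient way to try the control flow without spawning processes.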

Memory Management

Each worker process has its own memory space
No shared state between workers by default (the validation cache is shared only when --shared-cache is enabled)
Automatic cleanup after completion
Suitable for processing large datasets

API Efficiency

Parallel API calls to CJA endpoints
ThreadPoolExecutor for I/O-bound API fetching within each process
Optimized to minimize API call overhead
Respects API rate limits (adjust workers as needed)
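
Within a single worker process, I/O-bound fetches can overlap using threads. The sketch below shows the general pattern with placeholder callables; `fetch_components` and the fetcher names are hypothetical and do not correspond to actual cjapy calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_components(fetchers, max_workers=3):
    """Run zero-argument fetch callables concurrently; return results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the requests overlap.
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        # result() blocks until each fetch finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Placeholder fetchers standing in for real API calls.
    results = fetch_components({
        "metrics": lambda: ["metric_a", "metric_b"],
        "dimensions": lambda: ["dimension_a"],
    })
    print(results)
```

Threads suit this step because the time is spent waiting on the network rather than on CPU work, so the GIL is not the bottleneck.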

Support

For issues, questions, or feature requests:

Check this guide first
Review error messages and logs
Try with --log-level DEBUG for detailed output
Use --help to see all available options
