Apache Airflow orchestration for the IGH Data Pipeline.
This project provides Airflow DAGs to orchestrate the IGH data pipeline:
- Ingestion: Sync data from Microsoft Dataverse to Bronze SQLite database
- Transform: Process data from Bronze to Silver and Gold layers
- Deployment: Deploy validated data to production

Prerequisites:
- Docker and Docker Compose
- Python 3.11+ (for development)
- UV (for dependency management)

```bash
# Clone the repository
git clone <repository-url>
cd igh-airflow
# Copy environment template
cp .env.example .env
# Install dependencies first
uv sync
# Generate security keys
uv run python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Add output to AIRFLOW__CORE__FERNET_KEY in .env
uv run python -c "import secrets; print(secrets.token_hex(32))"
# Add output to AIRFLOW__WEBSERVER__SECRET_KEY in .env
# Set your user ID (Linux only)
echo "AIRFLOW_UID=$(id -u)" >> .env
# Start Airflow
docker compose up -d
# Access Airflow UI at http://localhost:8080
# Default credentials: airflow / airflow
```

```text
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  igh_ingestion  │───▶│  igh_transform  │───▶│ igh_deployment  │
│  (manual only)  │    │  (manual only)  │    │  (manual only)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
     Dataverse           Bronze → Silver       Gold → Production
      → Bronze            Silver → Gold          (atomic swap)
```
| DAG | Tasks | Description |
|---|---|---|
| igh_ingestion | 1 | Sync Dataverse to Bronze DB |
| igh_transform | 2 | Transform Bronze→Silver→Gold |
| igh_deployment | 1 | Deploy to production with atomic swap |
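
To make the table concrete, here is a minimal sketch of the atomic-swap pattern the deployment DAG is described as using; the paths, task ID, and helper function are illustrative assumptions, not the contents of dags/igh_deployment_dag.py:

```python
# Illustrative sketch only, not the actual dags/igh_deployment_dag.py.
# Assumption: the Gold database is staged next to the production file and
# swapped in with os.replace(), which is atomic within a single filesystem.
import os
import shutil

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

GOLD_DB = "/opt/airflow/data/gold/star_schema.db"
PRODUCTION_DB = "/opt/airflow/data/production/igh.db"


def deploy_with_atomic_swap() -> None:
    staging = PRODUCTION_DB + ".staging"
    shutil.copy2(GOLD_DB, staging)      # stage a full copy first
    os.replace(staging, PRODUCTION_DB)  # atomic rename; readers never see a partial file


with DAG(
    dag_id="igh_deployment",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,  # manual only, matching the table above
    catchup=False,
) as dag:
    PythonOperator(task_id="deploy_to_production", python_callable=deploy_with_atomic_swap)
```

Staging the copy before the rename is what makes the swap safe: os.replace either succeeds completely or leaves the old production file untouched.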
```text
igh-airflow/
├── dags/
│   ├── __init__.py
│   ├── igh_ingestion_dag.py     # Dataverse sync
│   ├── igh_transform_dag.py     # Bronze→Silver→Gold transforms
│   └── igh_deployment_dag.py    # Production deployment
├── config/
│   ├── __init__.py
│   └── settings.py              # PipelineConfig dataclass (sketched below)
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_ingestion_dag.py
│   ├── test_transform_dag.py
│   └── test_deployment_dag.py
├── docker/
│   ├── Dockerfile               # Production image
│   └── entrypoint.sh
├── docker-compose.yml           # Local development
├── .env.example
├── .gitignore
├── .python-version
├── CLAUDE.md
├── pyproject.toml
└── README.md
```
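
config/settings.py centralizes the pipeline paths in a PipelineConfig dataclass. The real fields are not reproduced here; a plausible sketch, assuming it falls back from the *_DB_PATH environment variables documented below:

```python
# Hypothetical sketch of config/settings.py; the field names and defaults are
# assumptions based on the environment variables documented below.
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    bronze_db_path: str = field(
        default_factory=lambda: os.getenv("BRONZE_DB_PATH", "/opt/airflow/data/bronze/dataverse.db")
    )
    silver_db_path: str = field(
        default_factory=lambda: os.getenv("SILVER_DB_PATH", "/opt/airflow/data/silver/igh_silver.db")
    )
    gold_db_path: str = field(
        default_factory=lambda: os.getenv("GOLD_DB_PATH", "/opt/airflow/data/gold/star_schema.db")
    )
    production_db_path: str = field(
        default_factory=lambda: os.getenv("PRODUCTION_DB_PATH", "/opt/airflow/data/production/igh.db")
    )
```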
Configure these in your .env file:

| Variable | Default | Description |
|---|---|---|
| AIRFLOW_UID | 50000 | User ID for Airflow processes |
| BRONZE_DB_PATH | /opt/airflow/data/bronze/dataverse.db | Bronze database path |
| SILVER_DB_PATH | /opt/airflow/data/silver/igh_silver.db | Silver database path |
| GOLD_DB_PATH | /opt/airflow/data/gold/star_schema.db | Gold star-schema database path |
| PRODUCTION_DB_PATH | /opt/airflow/data/production/igh.db | Production database path |
Configure these in the Airflow UI (Admin → Connections):
| Connection ID | Type | Fields |
|---|---|---|
| dataverse_api | HTTP | Host: API URL, Login: Client ID, Password: Client Secret |
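
At runtime, a task can resolve these fields through Airflow's connection API. A small sketch (variable names are illustrative; call this inside a task, not at DAG parse time):

```python
# Sketch: resolving the dataverse_api connection defined in the table above.
# Field mapping: Host = API URL, Login = Client ID, Password = Client Secret.
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("dataverse_api")
api_url = conn.host            # Dataverse API URL
client_id = conn.login         # OAuth client ID
client_secret = conn.password  # OAuth client secret, decrypted via the Fernet key
```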
Configure these in the Airflow UI (Admin → Variables):
| Variable | Default | Description |
|---|---|---|
| bronze_db_path | /data/bronze/dataverse.db | Bronze database path |
| silver_db_path | /data/silver/igh_silver.db | Silver database path |
| gold_db_path | /data/gold/star_schema.db | Gold star-schema database path |
| production_db_path | /data/production/igh.db | Production database path |
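
Tasks can read these with Variable.get, using the documented default as a fallback when a Variable has not been set in the UI; for example:

```python
# Sketch: reading pipeline paths from Airflow Variables, with fallbacks
# matching the defaults in the table above.
from airflow.models import Variable

bronze_db_path = Variable.get("bronze_db_path", default_var="/data/bronze/dataverse.db")
gold_db_path = Variable.get("gold_db_path", default_var="/data/gold/star_schema.db")
```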

```bash
# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Install with dev dependencies
uv sync --all-groups
```

This project depends on two internal Akvo packages installed directly from GitHub (not PyPI):
| Package | Repository | Purpose |
|---|---|---|
| igh-data-sync | akvo/igh-data-sync | Dataverse ingestion |
| igh-data-transform | akvo/igh-data-transform | Bronze→Silver→Gold transforms |
Their source URLs and branches are configured in pyproject.toml under [tool.uv.sources], and uv.lock pins each package to a specific commit hash. Running uv sync alone will not pull newer commits — you must explicitly upgrade the lock entry.

```bash
# Update a single package to the latest commit on its configured branch
uv lock --upgrade-package igh-data-sync && uv sync
uv lock --upgrade-package igh-data-transform && uv sync
# Update both at once
uv lock --upgrade-package igh-data-sync --upgrade-package igh-data-transform && uv sync
```

To switch a package to a different branch, edit pyproject.toml:

```toml
[tool.uv.sources]
igh-data-transform = { git = "https://github.com/akvo/igh-data-transform.git", branch = "new-branch" }
```

Then run uv lock --upgrade-package igh-data-transform && uv sync to resolve
the new branch. To point back to the default branch, remove the branch key.
After updating packages, rebuild the Docker image so containers use the new code:

```bash
docker compose build --no-cache && docker compose up -d
```

```bash
# Run all tests
uv run pytest tests/ -v
# Run with coverage
uv run pytest tests/ --cov=dags --cov=config
# Run specific test file
uv run pytest tests/test_ingestion_dag.py -v
```

```bash
# Check code style
uv run ruff check dags/ config/ tests/
# Auto-fix issues
uv run ruff check --fix dags/ config/ tests/
# Format code
uv run ruff format dags/ config/ tests/
```

```bash
# Start Airflow
docker compose up -d
# View logs
docker compose logs -f airflow-scheduler
# Access Airflow shell
docker compose exec airflow-webserver bash
# List DAGs
docker compose exec airflow-webserver airflow dags list
# Trigger DAG manually
docker compose exec airflow-webserver airflow dags trigger igh_ingestion
# Stop Airflow
docker compose down
# Stop and remove volumes
docker compose down -v
```

```bash
docker build -f docker/Dockerfile -t igh-airflow:latest .
```

```bash
docker run -d \
  -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://... \
  -e AIRFLOW__CORE__FERNET_KEY=... \
  -e AIRFLOW__WEBSERVER__SECRET_KEY=... \
  -v /data:/opt/airflow/data \
  igh-airflow:latest webserver
```

Copyright Akvo Foundation