
IGH Airflow

Apache Airflow orchestration for the IGH Data Pipeline.

Overview

This project provides Airflow DAGs to orchestrate the IGH data pipeline:

  • Ingestion: Sync data from Microsoft Dataverse into the Bronze SQLite database
  • Transform: Process data from the Bronze layer into the Silver and Gold layers
  • Deployment: Deploy validated data to production

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.11+ (for development)
  • UV (for dependency management)

Local Development

# Clone the repository
git clone <repository-url>
cd igh-airflow

# Copy environment template
cp .env.example .env

# Install dependencies first (the key generation below needs them)
uv sync

# Generate security keys
uv run python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Add output to AIRFLOW__CORE__FERNET_KEY in .env

uv run python -c "import secrets; print(secrets.token_hex(32))"
# Add output to AIRFLOW__WEBSERVER__SECRET_KEY in .env

# Set your user ID (Linux only)
echo "AIRFLOW_UID=$(id -u)" >> .env

# Start Airflow
docker compose up -d

# Access Airflow UI at http://localhost:8080
# Default credentials: airflow / airflow

Architecture

Pipeline Flow

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   igh_ingestion │───▶│  igh_transform  │───▶│ igh_deployment  │
│  (manual only)  │    │  (manual only)  │    │  (manual only)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
   Dataverse            Bronze → Silver        Gold → Production
   → Bronze             Silver → Gold          (atomic swap)
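
The "atomic swap" in the deployment stage is not spelled out in this README. Below is a minimal sketch of the usual file-swap pattern, assuming the default database paths from the Configuration section; the function and its name are illustrative, not the project's actual code:

import os
import shutil

GOLD_DB = "/opt/airflow/data/gold/star_schema.db"
PRODUCTION_DB = "/opt/airflow/data/production/igh.db"

def deploy_with_atomic_swap(src: str = GOLD_DB, dst: str = PRODUCTION_DB) -> None:
    """Stage the validated Gold DB beside production, then swap atomically."""
    tmp = dst + ".tmp"
    shutil.copy2(src, tmp)  # stage a full copy next to the target
    os.replace(tmp, dst)    # atomic rename: readers never see a half-written file

os.replace is atomic only when the temporary file and the destination live on the same filesystem, which is why the copy is staged in the production directory.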

DAG Details

DAG              Tasks   Description
igh_ingestion    1       Sync Dataverse to Bronze DB
igh_transform    2       Transform Bronze → Silver → Gold
igh_deployment   1       Deploy to production with atomic swap
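
All three DAGs run on manual trigger only, i.e. with no schedule. A minimal sketch of how such a DAG is declared in Airflow 2.x; the callable body and the start date are placeholders, not the contents of dags/igh_ingestion_dag.py:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def sync_dataverse_to_bronze() -> None:
    # Placeholder: the real task delegates to the igh-data-sync package.
    ...

with DAG(
    dag_id="igh_ingestion",
    schedule=None,                    # manual trigger only
    start_date=datetime(2024, 1, 1),  # illustrative
    catchup=False,
) as dag:
    PythonOperator(
        task_id="sync_dataverse_to_bronze",
        python_callable=sync_dataverse_to_bronze,
    )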

Project Structure

igh-airflow/
├── dags/
│   ├── __init__.py
│   ├── igh_ingestion_dag.py    # Dataverse sync
│   ├── igh_transform_dag.py    # Bronze→Silver→Gold transforms
│   └── igh_deployment_dag.py   # Production deployment
├── config/
│   ├── __init__.py
│   └── settings.py             # PipelineConfig dataclass
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_ingestion_dag.py
│   ├── test_transform_dag.py
│   └── test_deployment_dag.py
├── docker/
│   ├── Dockerfile              # Production image
│   └── entrypoint.sh
├── docker-compose.yml          # Local development
├── .env.example
├── .gitignore
├── .python-version
├── CLAUDE.md
├── pyproject.toml
└── README.md

Configuration

Environment Variables

Variable             Default                                   Description
AIRFLOW_UID          50000                                     User ID for Airflow processes
BRONZE_DB_PATH       /opt/airflow/data/bronze/dataverse.db     Bronze database path
SILVER_DB_PATH       /opt/airflow/data/silver/igh_silver.db    Silver database path
GOLD_DB_PATH         /opt/airflow/data/gold/star_schema.db     Gold star-schema database path
PRODUCTION_DB_PATH   /opt/airflow/data/production/igh.db       Production database path
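
config/settings.py centralizes these in a PipelineConfig dataclass (see Project Structure). Its real fields are not shown here; a minimal sketch of how a dataclass could read the variables above, with field names and defaults assumed from the table:

import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.getenv(name, default)

@dataclass
class PipelineConfig:
    # Field names and defaults mirror the table above; both are assumptions.
    bronze_db_path: str = field(default_factory=lambda: _env(
        "BRONZE_DB_PATH", "/opt/airflow/data/bronze/dataverse.db"))
    silver_db_path: str = field(default_factory=lambda: _env(
        "SILVER_DB_PATH", "/opt/airflow/data/silver/igh_silver.db"))
    gold_db_path: str = field(default_factory=lambda: _env(
        "GOLD_DB_PATH", "/opt/airflow/data/gold/star_schema.db"))
    production_db_path: str = field(default_factory=lambda: _env(
        "PRODUCTION_DB_PATH", "/opt/airflow/data/production/igh.db"))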

Airflow Connections

Configure these in the Airflow UI (Admin → Connections):

Connection ID   Type   Fields
dataverse_api   HTTP   Host: API URL, Login: Client ID, Password: Client Secret
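
Task code can then resolve the connection with Airflow's standard hook API; a generic sketch, not code from this repository:

from airflow.hooks.base import BaseHook

def dataverse_credentials() -> tuple[str, str, str]:
    """Return (API URL, Client ID, Client Secret) from the dataverse_api connection."""
    conn = BaseHook.get_connection("dataverse_api")
    return conn.host, conn.login, conn.password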

Airflow Variables

Configure these in the Airflow UI (Admin → Variables):

Variable             Default                      Description
bronze_db_path       /data/bronze/dataverse.db    Bronze database path
silver_db_path       /data/silver/igh_silver.db   Silver database path
gold_db_path         /data/gold/star_schema.db    Gold star-schema database path
production_db_path   /data/production/igh.db      Production database path
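
In DAG code these are typically read with Variable.get, using default_var so the defaults above apply when a Variable is unset; a generic sketch:

from airflow.models import Variable

# Returns the UI-configured value, or default_var when the Variable is unset.
bronze_db_path = Variable.get("bronze_db_path",
                              default_var="/data/bronze/dataverse.db")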

Development

Install Dependencies

# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Install with dev dependencies
uv sync --all-groups

Updating Internal Packages

This project depends on two internal Akvo packages installed directly from GitHub (not PyPI):

Package              Repository                Purpose
igh-data-sync        akvo/igh-data-sync        Dataverse ingestion
igh-data-transform   akvo/igh-data-transform   Bronze → Silver → Gold transforms

Their source URLs and branches are configured in pyproject.toml under [tool.uv.sources], and uv.lock pins each package to a specific commit hash. Running uv sync alone will not pull newer commits — you must explicitly upgrade the lock entry.

# Update a single package to the latest commit on its configured branch
uv lock --upgrade-package igh-data-sync && uv sync
uv lock --upgrade-package igh-data-transform && uv sync

# Update both at once
uv lock --upgrade-package igh-data-sync --upgrade-package igh-data-transform && uv sync

To switch a package to a different branch, edit pyproject.toml:

[tool.uv.sources]
igh-data-transform = { git = "https://github.com/akvo/igh-data-transform.git", branch = "new-branch" }

Then run uv lock --upgrade-package igh-data-transform && uv sync to resolve the new branch. To point back to the default branch, remove the branch key.

After updating packages, rebuild the Docker image so containers use the new code:

docker compose build --no-cache && docker compose up -d

Running Tests

# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=dags --cov=config

# Run specific test file
uv run pytest tests/test_ingestion_dag.py -v

Linting

# Check code style
uv run ruff check dags/ config/ tests/

# Auto-fix issues
uv run ruff check --fix dags/ config/ tests/

# Format code
uv run ruff format dags/ config/ tests/

Docker Commands

# Start Airflow
docker compose up -d

# View logs
docker compose logs -f airflow-scheduler

# Access Airflow shell
docker compose exec airflow-webserver bash

# List DAGs
docker compose exec airflow-webserver airflow dags list

# Trigger DAG manually
docker compose exec airflow-webserver airflow dags trigger igh_ingestion

# Stop Airflow
docker compose down

# Stop and remove volumes
docker compose down -v

Production Deployment

Build Production Image

docker build -f docker/Dockerfile -t igh-airflow:latest .

Run Production Container

docker run -d \
  -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://... \
  -e AIRFLOW__CORE__FERNET_KEY=... \
  -e AIRFLOW__WEBSERVER__SECRET_KEY=... \
  -v /data:/opt/airflow/data \
  igh-airflow:latest webserver

License

Copyright Akvo Foundation
