The data import pipeline for Open Parliament TV. It fetches parliamentary proceedings and media feeds, parses them into a unified per-session JSON file, enriches that file with named-entity linking, sentence-level audio alignment, and named-entity recognition, then validates and publishes the result for the platform to ingest.
For the wider system context — repositories, data flow, the Stage 2 format — see the Architecture repo. The pipeline stages map to PIPELINE.md; the file format produced by the pipeline is specified in STAGE2-FORMAT.md.
Currently implemented: the German Bundestag (optv/parliaments/DE/).
python3 -m pip install -r requirements.txt
# fetch + process the current period's data into <data_dir>
./optv/parliaments/DE/update <data_dir>
# or run the workflow manually with finer control:
./optv/parliaments/DE/workflow.py --period=21 <data_dir> \
--download-original --merge-speeches \
--link-entities --align-sentences --extract-entities

<data_dir> is the per-parliament data directory, expected to be a sibling clone of OpenParliamentTV-Data-DE. Each --* flag is opt-in and idempotent; --force re-runs an already-completed stage. A lockfile (<data_dir>/optv.lock) blocks concurrent runs.
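As an illustration of the run guard described above, a minimal sketch of a <data_dir>/optv.lock lockfile (names are illustrative; the actual guard lives in workflow.py and may differ):

```python
# Sketch of a lockfile guard; illustrative only, not the workflow.py implementation.
import sys
from pathlib import Path

def acquire_lock(data_dir: Path) -> Path:
    """Create <data_dir>/optv.lock; refuse to run if another process holds it."""
    lock = data_dir / "optv.lock"
    try:
        lock.touch(exist_ok=False)  # atomic create-or-fail
    except FileExistsError:
        sys.exit(f"Lockfile {lock} exists; another run appears to be active.")
    return lock

def release_lock(lock: Path) -> None:
    lock.unlink(missing_ok=True)
```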
External dependencies: aeneas needs ffmpeg and espeak; the NER stage needs a spaCy model (declared per parliament in manifest.yaml as locale.spacy_model, de_core_news_md for DE) and an entity-fishing API endpoint passed via --ner-api-endpoint.
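A quick pre-flight check for these dependencies might look like the following (a hedged sketch; the pipeline does not necessarily perform this check itself):

```python
# Pre-flight check for the external dependencies listed above; illustrative only.
import shutil
import spacy

# aeneas shells out to these binaries for audio conversion and synthesis.
for binary in ("ffmpeg", "espeak"):
    if shutil.which(binary) is None:
        raise SystemExit(f"{binary} not found on PATH (required by aeneas)")

# Model name comes from manifest.yaml (locale.spacy_model); de_core_news_md for DE.
if not spacy.util.is_package("de_core_news_md"):
    raise SystemExit("spaCy model de_core_news_md is not installed")
```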
optv/
├── parliaments/
│   └── DE/                 # German Bundestag — only currently implemented parliament
│       ├── manifest.yaml   # per-parliament metadata read by Conductor (stages, periods, …)
│       ├── workflow.py     # main orchestration entry point
│       ├── common.py       # Config class, SessionStatus, file naming
│       ├── scraper/        # fetch proceedings (TEI XML) and media (RSS)
│       ├── parsers/        # XML/RSS → intermediate JSON
│       ├── merger/         # join media + proceedings into Stage 2
│       ├── update          # shell wrapper: --period=21 --retry-count=20
│       └── Makefile        # download + merge targets driven by file mtimes
└── shared/                 # cross-parliament infrastructure
    ├── align.py            # forced sentence alignment (aeneas)
    ├── nel.py              # named-entity linking (Wikidata)
    ├── ner.py              # named-entity recognition (spaCy + entity-fishing)
    ├── schema/             # Stage 2 JSON schemas + reference doc
    ├── validators/         # structural + semantic validators, CLI
    └── docs/EXAMPLES/      # example Stage 2 documents
manifest.yaml is the per-parliament metadata file. The Conductor reads it to know which stages a parliament supports, which entity dump to use, and the retry defaults — see optv/parliaments/DE/manifest.yaml for the canonical example.
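For orientation, reading the one manifest key this README names could look like this (a sketch; consult optv/parliaments/DE/manifest.yaml for the full, authoritative layout):

```python
# Illustrative read of per-parliament metadata; only locale.spacy_model is
# named in this README, everything else lives in the real manifest.
import yaml
from pathlib import Path

manifest = yaml.safe_load(Path("optv/parliaments/DE/manifest.yaml").read_text())
spacy_model = manifest["locale"]["spacy_model"]  # de_core_news_md for DE
print(spacy_model)
```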
Each stage produces a side-by-side cache file per session (e.g. 21001-merged.json, 21001-aligned.json, 21001-ner.json) and runs only when its input is newer than its output.
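The freshness rule amounts to a per-session mtime comparison; a minimal sketch with assumed helper names (not the pipeline's actual code):

```python
# Sketch of the per-session cache convention and mtime-based skipping;
# function and argument names are illustrative.
from pathlib import Path

def cache_file(cache_dir: Path, session: str, stage: str) -> Path:
    # e.g. cache/merged/21001-merged.json, cache/aligned/21001-aligned.json
    return cache_dir / stage / f"{session}-{stage}.json"

def needs_rerun(input_file: Path, output_file: Path) -> bool:
    """Run a stage only when its input is newer than its output."""
    if not output_file.exists():
        return True
    return input_file.stat().st_mtime > output_file.stat().st_mtime
```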
| Stage | Module / script | Input | Output |
|---|---|---|---|
| Fetch | scraper/fetch_proceedings.py, scraper/fetch_media.py | parliament APIs | original/{proceedings,media}/ |
| Parse | parsers/proceedings2json.py, parsers/media2json.py | TEI XML, RSS | intermediate JSON |
| Merge | merger/merge_session.py | proceedings + media JSON | cache/merged/*-merged.json |
| NEL | optv/shared/nel.py | merged JSON + entity dump | people[].wid, faction normalisation |
| Align | optv/shared/align.py | merged JSON + audio | cache/aligned/*-aligned.json (sentence timings) |
| NER | optv/shared/ner.py | aligned JSON + entity-fishing API | cache/ner/*-ner.json (sentence entities) |
| Publish | publish_as_processed() in workflow.py | latest cache file | processed/*-session.json |
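The Align stage relies on aeneas for forced alignment. The standard aeneas task API for sentence-level alignment looks roughly like this (a sketch of the technique, not necessarily how optv/shared/align.py invokes it):

```python
# Standard aeneas task API for forced alignment; a sketch of the technique,
# not the actual optv/shared/align.py code. Paths are placeholders.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One sentence per line in the plain-text input; "deu" = German (ISO 639-3).
config = "task_language=deu|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/tmp/21001.mp3"
task.text_file_path_absolute = "/tmp/21001-sentences.txt"

ExecuteTask(task).execute()
for fragment in task.sync_map_leaves():  # one fragment per sentence, with begin/end times
    print(fragment)
```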
For the conceptual stage breakdown (parliament-agnostic), see Architecture/PIPELINE.md.
Stage 2 schemas and conventions: optv/shared/schema/README.md. Standalone CLI:
python -m optv.shared.validators.cli --dir <data_dir>/processed --schema full
python -m optv.shared.validators.cli --file session.json --no-semantic

The publish step also runs validation and logs findings; warnings do not block.
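Conceptually, the publish step copies the most advanced cache file into processed/ and runs the validators on it; a minimal sketch under assumed names (the authoritative logic is publish_as_processed() in workflow.py):

```python
# Conceptual sketch of the publish step; illustrative only, the real logic is
# publish_as_processed() in workflow.py.
import logging
import shutil
import subprocess
import sys
from pathlib import Path

def publish(latest_cache: Path, processed_dir: Path, session: str) -> Path:
    target = processed_dir / f"{session}-session.json"
    shutil.copy2(latest_cache, target)
    # Run the standalone validator CLI on the published file; findings are
    # logged, but warnings never undo the publish.
    result = subprocess.run(
        [sys.executable, "-m", "optv.shared.validators.cli", "--file", str(target)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        logging.warning("validation reported issues:\n%s", result.stdout or result.stderr)
    return target
```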