The data import pipeline for Open Parliament TV. It fetches parliamentary proceedings and media feeds, parses them into a unified per-session JSON file, enriches that file with named-entity linking, sentence-level audio alignment, and named-entity recognition, then validates and publishes the result for the platform to ingest.
For the wider system context — repositories, data flow, the Stage 2 format — see the Architecture repo. The pipeline stages map to PIPELINE.md; the file format produced by the pipeline is specified in STAGE2-FORMAT.md.
Currently implemented: the German Bundestag (optv/parliaments/DE/).
python3 -m pip install -r requirements.txt
# fetch + process the current period's data into <data_dir>
./optv/parliaments/DE/update <data_dir>
# or run the workflow manually with finer control:
./optv/parliaments/DE/workflow.py --period=21 <data_dir> \
--download-original --merge-speeches \
--link-entities --align-sentences --extract-entities

<data_dir> is the per-parliament data directory, expected to be a sibling clone of OpenParliamentTV-Data-DE. Each --* flag is opt-in and idempotent; --force re-runs an already-completed stage. A lockfile (<data_dir>/optv.lock) blocks concurrent runs.
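As an illustration of the run guard described above, a minimal sketch of a <data_dir>/optv.lock lockfile (names are illustrative; the actual guard lives in workflow.py and may differ):

```python
# Sketch of a lockfile guard; illustrative only, not the workflow.py implementation.
import sys
from pathlib import Path

def acquire_lock(data_dir: Path) -> Path:
    """Create <data_dir>/optv.lock; refuse to run if another process holds it."""
    lock = data_dir / "optv.lock"
    try:
        lock.touch(exist_ok=False)  # atomic create-or-fail
    except FileExistsError:
        sys.exit(f"Lockfile {lock} exists; another run appears to be active.")
    return lock

def release_lock(lock: Path) -> None:
    lock.unlink(missing_ok=True)
```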
External dependencies: aeneas needs ffmpeg and espeak; the NER stage needs a spaCy model (declared per parliament in manifest.yaml as locale.spacy_model, de_core_news_md for DE) and an entity-fishing API endpoint passed via --ner-api-endpoint.
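A quick pre-flight check for these dependencies might look like the following (a hedged sketch; the pipeline does not necessarily perform this check itself):

```python
# Pre-flight check for the external dependencies listed above; illustrative only.
import shutil
import spacy

# aeneas shells out to these binaries for audio conversion and synthesis.
for binary in ("ffmpeg", "espeak"):
    if shutil.which(binary) is None:
        raise SystemExit(f"{binary} not found on PATH (required by aeneas)")

# Model name comes from manifest.yaml (locale.spacy_model); de_core_news_md for DE.
if not spacy.util.is_package("de_core_news_md"):
    raise SystemExit("spaCy model de_core_news_md is not installed")
```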
optv/
├── parliaments/
│   └── DE/                 # German Bundestag — only currently implemented parliament
│       ├── manifest.yaml   # per-parliament metadata read by Conductor (stages, periods, …)
│       ├── workflow.py     # main orchestration entry point
│       ├── common.py       # Config class, SessionStatus, file naming
│       ├── scraper/        # fetch proceedings (TEI XML) and media (RSS)
│       ├── parsers/        # XML/RSS → intermediate JSON
│       ├── merger/         # join media + proceedings into Stage 2
│       ├── update          # shell wrapper: --period=21 --retry-count=20
│       └── Makefile        # download + merge targets driven by file mtimes
└── shared/                 # cross-parliament infrastructure
    ├── align.py            # forced sentence alignment (aeneas)
    ├── nel.py              # named-entity linking (Wikidata)
    ├── ner.py              # named-entity recognition (spaCy + entity-fishing)
    ├── schema/             # Stage 2 JSON schemas + reference doc
    ├── validators/         # structural + semantic validators, CLI
    └── docs/EXAMPLES/      # example Stage 2 documents
manifest.yaml is the per-parliament metadata file. The Conductor reads it to know which stages a parliament supports, which entity dump to use, and the retry defaults — see optv/parliaments/DE/manifest.yaml for the canonical example.
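For orientation, reading the one manifest key this README names could look like this (a sketch; consult optv/parliaments/DE/manifest.yaml for the full, authoritative layout):

```python
# Illustrative read of per-parliament metadata; only locale.spacy_model is
# named in this README, everything else lives in the real manifest.
import yaml
from pathlib import Path

manifest = yaml.safe_load(Path("optv/parliaments/DE/manifest.yaml").read_text())
spacy_model = manifest["locale"]["spacy_model"]  # de_core_news_md for DE
print(spacy_model)
```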
Each stage produces a side-by-side cache file per session (e.g. 21001-merged.json, 21001-aligned.json, 21001-ner.json) and runs only when its input is newer than its output.
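The freshness rule amounts to a per-session mtime comparison; a minimal sketch with assumed helper names (not the pipeline's actual code):

```python
# Sketch of the per-session cache convention and mtime-based skipping;
# function and argument names are illustrative.
from pathlib import Path

def cache_file(cache_dir: Path, session: str, stage: str) -> Path:
    # e.g. cache/merged/21001-merged.json, cache/aligned/21001-aligned.json
    return cache_dir / stage / f"{session}-{stage}.json"

def needs_rerun(input_file: Path, output_file: Path) -> bool:
    """Run a stage only when its input is newer than its output."""
    if not output_file.exists():
        return True
    return input_file.stat().st_mtime > output_file.stat().st_mtime
```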
| Stage | Module / script | Input | Output |
|---|---|---|---|
| Fetch | scraper/fetch_proceedings.py, scraper/fetch_media.py | parliament APIs | original/{proceedings,media}/ |
| Parse | parsers/proceedings2json.py, parsers/media2json.py | TEI XML, RSS | intermediate JSON |
| Merge | merger/merge_session.py | proceedings + media JSON | cache/merged/*-merged.json |
| NEL | optv/shared/nel.py | merged JSON + entity dump | people[].wid, faction normalisation |
| Align | optv/shared/align.py | merged JSON + audio | cache/aligned/*-aligned.json (sentence timings) |
| NER | optv/shared/ner.py | aligned JSON + entity-fishing API | cache/ner/*-ner.json (sentence entities) |
| Publish | publish_as_processed() in workflow.py | latest cache file | processed/*-session.json |
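The Align stage relies on aeneas for forced alignment. The standard aeneas task API for sentence-level alignment looks roughly like this (a sketch of the technique, not necessarily how optv/shared/align.py invokes it):

```python
# Standard aeneas task API for forced alignment; a sketch of the technique,
# not the actual optv/shared/align.py code. Paths are placeholders.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One sentence per line in the plain-text input; "deu" = German (ISO 639-3).
config = "task_language=deu|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/tmp/21001.mp3"
task.text_file_path_absolute = "/tmp/21001-sentences.txt"

ExecuteTask(task).execute()
for fragment in task.sync_map_leaves():  # one fragment per sentence, with begin/end times
    print(fragment)
```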
For the conceptual stage breakdown (parliament-agnostic), see Architecture/PIPELINE.md.
Stage 2 schemas and conventions: optv/shared/schema/README.md. Standalone CLI:
python -m optv.shared.validators.cli --dir <data_dir>/processed --schema full
python -m optv.shared.validators.cli --file session.json --no-semantic

The publish step also runs validation and logs findings; warnings do not block.
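Conceptually, the publish step copies the most advanced cache file into processed/ and runs the validators on it; a minimal sketch under assumed names (the authoritative logic is publish_as_processed() in workflow.py):

```python
# Conceptual sketch of the publish step; illustrative only, the real logic is
# publish_as_processed() in workflow.py.
import logging
import shutil
import subprocess
import sys
from pathlib import Path

def publish(latest_cache: Path, processed_dir: Path, session: str) -> Path:
    target = processed_dir / f"{session}-session.json"
    shutil.copy2(latest_cache, target)
    # Run the standalone validator CLI on the published file; findings are
    # logged, but warnings never undo the publish.
    result = subprocess.run(
        [sys.executable, "-m", "optv.shared.validators.cli", "--file", str(target)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        logging.warning("validation reported issues:\n%s", result.stdout or result.stderr)
    return target
```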