
# Open Parliament TV - Tools

The data import pipeline for Open Parliament TV. It fetches parliamentary proceedings and media feeds, parses them into a unified per-session JSON file, enriches the result with named-entity linking, sentence-level audio alignment, and named-entity recognition, then validates and publishes it for the platform to ingest.

For the wider system context (repositories, data flow, the Stage 2 format), see the Architecture repo. The pipeline stages map to `PIPELINE.md`; the file format produced by the pipeline is specified in `STAGE2-FORMAT.md`.

Currently implemented: the German Bundestag (`optv/parliaments/DE/`).

## Quick start

```shell
python3 -m pip install -r requirements.txt

# fetch + process the current period's data into <data_dir>
./optv/parliaments/DE/update <data_dir>

# or run the workflow manually with finer control:
./optv/parliaments/DE/workflow.py --period=21 <data_dir> \
    --download-original --merge-speeches \
    --link-entities --align-sentences --extract-entities
```

`<data_dir>` is the per-parliament data directory, expected to be a sibling clone of OpenParliamentTV-Data-DE. Each `--*` flag is opt-in and idempotent; `--force` re-runs an already-completed stage. A lockfile (`<data_dir>/optv.lock`) blocks concurrent runs.
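The lockfile behaviour can be pictured as an atomic create-if-absent. This is a minimal sketch, assuming the lock simply stores the holder's PID; the actual logic in workflow.py may differ:

```python
import os
from pathlib import Path

def acquire_lock(data_dir: str) -> bool:
    """Atomically create <data_dir>/optv.lock; return False if it already exists."""
    lock = Path(data_dir) / "optv.lock"
    try:
        # O_EXCL makes creation atomic: exactly one concurrent run can win
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True

def release_lock(data_dir: str) -> None:
    """Remove the lockfile so the next run can proceed."""
    (Path(data_dir) / "optv.lock").unlink(missing_ok=True)
```

A second run against the same `<data_dir>` then fails fast instead of clobbering half-written cache files.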

External dependencies: aeneas needs ffmpeg and espeak; the NER stage needs a spaCy model (declared per parliament in `manifest.yaml` as `locale.spacy_model`, `de_core_news_md` for DE) and an entity-fishing API endpoint passed via `--ner-api-endpoint`.

## Layout

```
optv/
├── parliaments/
│   └── DE/                  # German Bundestag — only currently implemented parliament
│       ├── manifest.yaml    # per-parliament metadata read by Conductor (stages, periods, …)
│       ├── workflow.py      # main orchestration entry point
│       ├── common.py        # Config class, SessionStatus, file naming
│       ├── scraper/         # fetch proceedings (TEI XML) and media (RSS)
│       ├── parsers/         # XML/RSS → intermediate JSON
│       ├── merger/          # join media + proceedings into Stage 2
│       ├── update           # shell wrapper: --period=21 --retry-count=20
│       └── Makefile         # download + merge targets driven by file mtimes
└── shared/                  # cross-parliament infrastructure
    ├── align.py             # forced sentence alignment (aeneas)
    ├── nel.py               # named-entity linking (Wikidata)
    ├── ner.py               # named-entity recognition (spaCy + entity-fishing)
    ├── schema/              # Stage 2 JSON schemas + reference doc
    ├── validators/          # structural + semantic validators, CLI
    └── docs/EXAMPLES/       # example Stage 2 documents
```

`manifest.yaml` is the per-parliament metadata file. The Conductor reads it to know which stages a parliament supports, which entity dump to use, and the retry defaults; see `optv/parliaments/DE/manifest.yaml` for the canonical example.
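A rough picture of what such a manifest might contain. Only `locale.spacy_model` is named in this README; the other keys below are hypothetical placeholders, so consult the real file for exact names:

```yaml
# Illustrative sketch only — not a copy of optv/parliaments/DE/manifest.yaml
locale:
  spacy_model: de_core_news_md     # model loaded by the NER stage
stages: [fetch, parse, merge, nel, align, ner]   # hypothetical key
retry:
  count: 20                        # hypothetical; cf. update's --retry-count=20
```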

## Pipeline stages

Each stage produces a side-by-side cache file per session (e.g. `21001-merged.json`, `21001-aligned.json`, `21001-ner.json`) and runs only when its input is newer than its output.

| Stage | Module / script | Input | Output |
|---|---|---|---|
| Fetch | `scraper/fetch_proceedings.py`, `scraper/fetch_media.py` | parliament APIs | `original/{proceedings,media}/` |
| Parse | `parsers/proceedings2json.py`, `parsers/media2json.py` | TEI XML, RSS | intermediate JSON |
| Merge | `merger/merge_session.py` | proceedings + media JSON | `cache/merged/*-merged.json` |
| NEL | `optv/shared/nel.py` | merged JSON + entity dump | `people[].wid`, faction normalisation |
| Align | `optv/shared/align.py` | merged JSON + audio | `cache/aligned/*-aligned.json` (sentence timings) |
| NER | `optv/shared/ner.py` | aligned JSON + entity-fishing API | `cache/ner/*-ner.json` (sentence entities) |
| Publish | `publish_as_processed()` in `workflow.py` | latest cache file | `processed/*-session.json` |
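The "input newer than output" freshness rule that gates each stage can be sketched as follows. This is illustrative; the Makefile and workflow.py hold the authoritative logic:

```python
from pathlib import Path

def needs_rerun(input_path: str, output_path: str) -> bool:
    """Make-style freshness check: run a stage only when its input
    is newer than its output (or the output does not exist yet)."""
    inp, out = Path(input_path), Path(output_path)
    if not out.exists():
        return True  # stage never produced its output: run it
    return inp.stat().st_mtime > out.stat().st_mtime
```

Passing `--force` would simply bypass this check for the selected stage.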

For the conceptual stage breakdown (parliament-agnostic), see `Architecture/PIPELINE.md`.

## Validation

Stage 2 schemas and conventions: `optv/shared/schema/README.md`. Standalone CLI:

```shell
python -m optv.shared.validators.cli --dir <data_dir>/processed --schema full
python -m optv.shared.validators.cli --file session.json --no-semantic
```

The publish step also runs validation and logs findings; warnings do not block.
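As an illustration of the errors-vs-warnings split, a toy validator might look like this. The key names are hypothetical placeholders; the real rules live in `optv/shared/validators` and the schemas under `optv/shared/schema/`:

```python
import json
from pathlib import Path

def validate_session(path: str) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a session file.
    Errors block publishing; warnings are only logged."""
    doc = json.loads(Path(path).read_text())
    errors, warnings = [], []
    if "session" not in doc:                 # hypothetical required key
        errors.append("missing top-level 'session'")
    for i, speech in enumerate(doc.get("speeches", [])):
        if not speech.get("media"):          # hypothetical soft rule
            warnings.append(f"speech {i}: no media reference")
    return errors, warnings
```

Publishing proceeds whenever `errors` is empty, regardless of how many warnings were logged.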

## Adding a new parliament

See `docs/ADDING-A-PARLIAMENT.md`.
