A framework for generating synthetic multilingual data to train faithfulness judges for text summarization.
This repository provides tools to:
- Generate faithful and unfaithful summaries from multilingual datasets (WikiLingua)
- Generate labeled training data for faithfulness judges using LLM-as-a-judge
Scripts run inside the official vLLM Docker container, which bundles compatible versions of vLLM, PyTorch, and Transformers.
```bash
docker pull vllm/vllm-openai:latest
```

Additional Python dependencies (installed inside the container):
```bash
pip install hydra-core omegaconf datasets
```

Repository layout:

```
multilingual-faithfulness/
├── conf/                    # Hydra configuration files
│   ├── config.yaml          # Main configuration
│   └── task/                # Task-specific configs
│       ├── gen_data.yaml    # Training data generation
│       └── gen_summs.yaml   # Summary generation
├── data/                    # Benchmark datasets (CSV)
│   ├── llm_aggrefact.csv
│   ├── mface.csv
│   └── memerag.csv
├── scripts/                 # Executable scripts
│   ├── gen_data.py          # Training data generation
│   └── gen_summs.py         # Summary generation
├── src/                     # Library modules
│   ├── data_loader.py       # WikiLingua dataset loader
│   ├── gen_data.py          # Data generation functions
│   ├── gen_summs.py         # Summary generation functions
│   ├── corrupt.py           # Summary corruption strategies (sketched below)
│   ├── llm_inference/       # LLM inference utilities (vLLM)
│   └── utils/               # Helper functions and prompts
├── bash_files/              # Example shell scripts
└── requirements.txt
```
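To give a flavor of what the corruption strategies in `src/corrupt.py` do, here is a minimal sketch of one common approach, entity swapping. All names below are illustrative and do not reflect the repository's actual API:

```python
# Minimal sketch of an entity-swap corruption, in the spirit of src/corrupt.py.
# Function and argument names are hypothetical, not the repository's real API.
import random

def swap_entities(summary: str, entities: list[str], seed: int = 0) -> str:
    """Turn a faithful summary into an unfaithful one by replacing one
    entity mentioned in the summary with a different entity from the
    source article, introducing a factual inconsistency."""
    rng = random.Random(seed)
    present = [e for e in entities if e in summary]
    if len(present) < 2:
        return summary  # too few entities to swap; leave the summary intact
    old, new = rng.sample(present, 2)
    return summary.replace(old, new, 1)

# Example: one entity from the source article replaces another in the summary.
print(swap_entities("Marie moved to Lisbon in 2019.", ["Marie", "Lisbon"]))
```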
All scripts should be run inside the vLLM Docker container:
```bash
docker run --gpus all --rm \
  -v /path/to/repo:/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host --entrypoint bash \
  vllm/vllm-openai:latest -c \
  "pip install hydra-core omegaconf datasets && \
   cd /workspace && \
   python3 scripts/<script>.py <args>"
```
Generate faithful and corrupted summaries from WikiLingua:

```bash
python3 scripts/gen_summs.py task=gen_summs \
  model.base_llm=Qwen/Qwen3-4B-Instruct-2507 \
  task.gen_summs.total_datapoints=14000 \
  vllm.num_gpus=4 \
  vllm.max_model_len=8192
```
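Both scripts are Hydra entry points, so the dotted arguments above are standard Hydra overrides: `task=gen_summs` swaps in `conf/task/gen_summs.yaml`, while keys like `vllm.num_gpus=4` override individual config values. A rough sketch of such an entry point, assuming `conf/config.yaml` selects a task config through its defaults list (the repository's actual scripts may differ):

```python
# Hypothetical Hydra entry point illustrating how the command-line overrides
# resolve against conf/config.yaml; not the repository's actual script.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # task=gen_summs composes conf/task/gen_summs.yaml into cfg.task, and
    # dotted overrides such as vllm.max_model_len=8192 replace single keys.
    print(cfg.model.base_llm)                         # Qwen/Qwen3-4B-Instruct-2507
    print(cfg.vllm.num_gpus, cfg.vllm.max_model_len)  # 4 8192

if __name__ == "__main__":
    main()
```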
Create labeled training data for the faithfulness judge:

```bash
python3 scripts/gen_data.py task=gen_data \
  model.base_llm=Qwen/Qwen3-4B-Instruct-2507 \
  task.data_gen.n_samples=1000 \
  task.data_gen.summaries_path=./output/data/corrupt_v2 \
  vllm.num_gpus=4 \
  vllm.max_model_len=8192
```
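The judging step amounts to asking the model a faithfulness question about each (article, summary) pair and parsing the verdict into a label. A minimal sketch using vLLM's offline API, with illustrative prompt wording and parsing (the repository's actual prompts live under `src/utils/`):

```python
# Illustrative LLM-as-a-judge labeling step; prompt wording and parsing are
# hypothetical, not the repository's actual implementation.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B-Instruct-2507", max_model_len=8192)
params = SamplingParams(temperature=0.0, max_tokens=4)

def judge(article: str, summary: str) -> int:
    """Return 1 if the model judges the summary faithful, else 0."""
    prompt = (
        f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
        "Is the summary faithful to the article? Answer Yes or No."
    )
    verdict = llm.generate([prompt], params)[0].outputs[0].text.strip()
    return 1 if verdict.lower().startswith("yes") else 0
```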
If you use this work, please cite:

```bibtex
@inproceedings{alfano2026multilingual,
  title     = {Multilingual Self-Taught Faithfulness Evaluators},
  author    = {Carlo Alfano and Aymen Al Marjani and Zeno Jonke and Amin Mantrach and Saab Mansour and Marcello Federico},
  year      = {2026},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2026}
}
```

See CONTRIBUTING for more information.
This library is licensed under the CC-BY-4.0 License. See the LICENSE file.