A Final Year Research and Development Project
Department of Computer Science & Engineering, University of Moratuwa
This project aims to build an AI-powered, real-time English-to-Sinhala dubbing system that preserves speaker identity and emotional tone. Users supply English videos (or real-time streams) and receive synchronized Sinhala-dubbed output, which makes the system well suited to accessibility, education, and low-resource language preservation.
The system supports both offline (Phase 1) and real-time (Phase 2) dubbing pipelines.
- Automatic Speech Recognition (ASR) using [Faster-Whisper]
- Neural Machine Translation (NMT) using [Meta NLLB-200] with CTranslate2 for fast inference
- Text-to-Speech (TTS) synthesis using a dual approach:
  - Fine-tuned [XTTS_v2] for Sinhala
  - [GPT-4o-mini-TTS] + [Seed-VC] for real-time speaker cloning
- Voice preservation
- Real-time streaming pipeline with sentence-aligned buffering
- Audio-video synchronization using time-stretching
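To illustrate what sentence-aligned buffering could look like, here is a minimal sketch. It assumes the ASR stage emits partial text fragments and that a sentence ends at `.`, `!` or `?`; the function name `sentence_chunks` and this punctuation rule are illustrative assumptions, not the project's actual implementation.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Abbreviations such as "Dr." would need extra handling in a real system.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of partial ASR text fragments."""
    buffer = ""
    for fragment in fragments:
        buffer = f"{buffer} {fragment}".strip()
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
        # Emit immediately if the remaining text already ends a sentence,
        # so it does not wait for the next fragment.
        if buffer and buffer[-1] in ".!?":
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush any trailing text when the stream ends


if __name__ == "__main__":
    stream = ["Hello there.", "How are", "you today? I am", "fine."]
    for sentence in sentence_chunks(stream):
        print(sentence)
```

Buffering by sentence means the translation stage always receives a complete unit of meaning, at the cost of waiting for the sentence to finish.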
The system follows a modular pipeline that processes English input (audio or video) and produces Sinhala dubbed output. It works in both offline and real-time modes.
- Input (English Audio/Video)
  The system accepts English video or audio files. In real-time mode, audio is processed in streamed chunks.
- ASR: Automatic Speech Recognition
  Uses Faster-Whisper to transcribe English speech into text. In real-time mode, Voice Activity Detection (VAD) segments the audio into manageable units.
- NMT: Neural Machine Translation
  Transcripts are translated into Sinhala using Meta NLLB-200 (distilled 1.3B) with CTranslate2 for low-latency execution.
- TTS: Text-to-Speech Synthesis
  Sinhala text is converted to Sinhala speech using one of two options:
  - XTTS_v2 (Fine-tuned): Default choice for high-quality, low-latency synthesis.
  - GPT-4o-mini-TTS + Seed-VC: Generates a base voice with GPT-4o-mini-TTS, then applies speaker voice cloning with Seed-VC.
- Synchronization & Post-Processing
  Synthesized Sinhala audio is time-stretched using Librosa or Rubberband to align with the original English video timing.
- Output (Dubbed Sinhala Audio/Video)
  The final Sinhala audio replaces the English audio track in the video.
  - In offline mode: The full video is re-rendered.
  - In real-time mode: Output is played with ~2 s of latency buffering.
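The synchronization step can be sketched as a stretch-rate calculation. The example below is a minimal sketch, not the project's code: the function name `stretch_rate` and the clamp limits (0.8 to 1.25) are assumptions chosen for illustration. The rate follows the convention of `librosa.effects.time_stretch`, where a rate above 1 shortens the audio.

```python
def stretch_rate(
    synth_duration: float,
    target_duration: float,
    min_rate: float = 0.8,   # assumed lower bound, not from the project
    max_rate: float = 1.25,  # assumed upper bound, not from the project
) -> float:
    """Return the rate that fits synthesized audio into the original time slot.

    A rate above 1 speeds the audio up (shorter output); below 1 slows it down.
    The result is clamped so the speech is not distorted too heavily.
    """
    if synth_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    rate = synth_duration / target_duration
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    # Sinhala audio is 2.2 s, the original English segment lasted 2.0 s.
    rate = stretch_rate(2.2, 2.0)
    print(f"rate={rate:.3f}, stretched duration={2.2 / rate:.3f}s")
    # With librosa installed, the rate could then be applied as:
    #   y_sync = librosa.effects.time_stretch(y, rate=rate)
```

When the required rate falls outside the clamp range, the stretched audio will not match the slot exactly, so a real pipeline would need another strategy for the remainder, such as trimming or padding silence.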