ComfyUI nodes for KugelAudio - Open-source text-to-speech with voice cloning for 24 European languages
Powered by an AR + Diffusion architecture
- Single Speaker TTS: Convert text to speech
- Voice Cloning: Clone any voice from reference audio (5-30 seconds)
- Multi-Speaker: Generate conversations with up to 6 speakers (Speaker 1-6)
- Natural Pacing: Configurable pause (0.0-2.0s) between speakers
- Watermark Detection: All output contains inaudible watermark (AudioSeal)
- 24 European Languages: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, Turkish
- 4-bit Quantization: Reduces VRAM from ~19GB to ~8GB
- Multiple Attention Types: Auto/SageAttention/FlashAttention/SDPA/Eager
- Progress Tracking: Real-time progress bars for long generations
- Text Chunking: Automatic sentence-boundary splitting for long texts
- Open ComfyUI Manager
- Click "Install Custom Nodes"
- Search for "KugelAudio"
- Click Install
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git

Bundled package: The kugelaudio-open folder is included and must be installed. ComfyUI will try to auto-install it on first launch. If that fails, install manually (see below).
✅ Windows (Standard & Portable)
✅ macOS (Intel & Apple Silicon)
✅ Linux (Standard & Portable)
The auto-installer detects your Python environment automatically and installs to the correct location.
If you see errors on startup about missing kugelaudio-open package, install manually:
Windows Portable (recommended):
For first-time installation, use:
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open

For reinstalling after code changes (safer - won't break dependencies):
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Or use the provided batch files in the ComfyUI-KugelAudio folder.
Two batch files are provided for Windows Portable:
| Script | When to Use | What It Does | Command Used |
|---|---|---|---|
| `install_portable.bat` | First-time installation | Installs kugelaudio-open in editable mode (`-e` flag). Creates a link so code changes take effect after restarting ComfyUI. | `pip install -e ./kugelaudio-open` |
| `reinstall_no-deps.bat` | After modifying code (recommended!) | Safely reinstalls kugelaudio-open without touching dependencies (`--no-deps --force-reinstall -e`). Use this when you've edited code or applied fixes and want changes to take effect without risking breaking your environment. | `pip install --no-deps --force-reinstall -e ./kugelaudio-open` |
Why editable mode (`-e` flag) is CRITICAL:
- Without `-e`: code is copied to Python's site-packages, so editing files in `kugelaudio-open/` has no effect until you reinstall!
- With `-e`: Python creates a link to your `kugelaudio-open/` folder, so code changes take effect immediately after restarting ComfyUI.
- This is essential for development, bug fixes, and applying updates!
Why --no-deps is important for reinstalls:
- Without it: pip might try to reinstall dependencies (torch, transformers, etc.) which could break your ComfyUI environment
- With it: Only the kugelaudio-open package is reinstalled, keeping all your existing dependencies safe
When should I reinstall?
- After editing any files in the `kugelaudio-open/` folder
- After pulling git updates that modify kugelaudio-open
- When fixes don't seem to be taking effect
- If you're told to "reinstall in editable mode"
Having issues? See Troubleshooting for more solutions.
Windows Portable (python_embeded):
# Navigate to node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
# First install (editable mode)
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open
# Safe reinstall after edits (editable + no deps)
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Standard Python (Windows/Linux/macOS):
# First install (editable mode)
pip install -e ./kugelaudio-open
# Safe reinstall after edits (editable + no deps)
pip install --no-deps --force-reinstall -e ./kugelaudio-open

- Python 3.10+
- PyTorch 2.0+ (usually included with ComfyUI)
- Transformers 4.40+ (usually included with ComfyUI)
- ~19GB VRAM for full precision (7B model)
- ~8GB VRAM with 4-bit quantization (requires bitsandbytes)
- CUDA-capable GPU recommended (CPU/MPS supported but slower)
VRAM Comparison
| Mode | VRAM | Quality | Attention Types |
|---|---|---|---|
| Full Precision | ~19GB | Best | All (Sage/Flash/SDPA/Eager) |
| 4-bit Quantization | ~8GB | Slight reduction | SDPA/Eager only |
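These figures are roughly consistent with simple parameter-count arithmetic. A back-of-envelope sketch (the 5 GB overhead constant for activations, KV cache, and the full-precision diffusion head is an illustrative assumption, not a measured value):

```python
def estimated_vram_gb(params_billions: float, bytes_per_param: float,
                      overhead_gb: float = 5.0) -> float:
    """Rough VRAM estimate: weight storage plus a fixed overhead for
    activations, KV cache, and the full-precision diffusion head."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params * N bytes ~= N GB
    return weights_gb + overhead_gb

full_precision = estimated_vram_gb(7, 2.0)  # bfloat16: 2 bytes/param -> ~19 GB
four_bit = estimated_vram_gb(7, 0.5)        # 4-bit: 0.5 bytes/param -> ~8.5 GB
```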
For 4-bit Quantization (Lower VRAM)
4-bit quantization requires bitsandbytes. If you encounter issues:
# Standard installation
pip install bitsandbytes

Note: 4-bit quantization only supports SDPA and Eager attention types.
Node Reference
All nodes include a Device dropdown to control where inference runs:
| Option | Description |
|---|---|
| `auto` | Automatically selects the best available device (CUDA → MPS → CPU) |
| `cuda` | Force NVIDIA GPU (recommended for best performance) |
| `mps` | Force Apple Silicon GPU (may have compatibility issues) |
| `cpu` | Force CPU execution (slower but most compatible) |
Apple Silicon Users: MPS may produce mps_matmul errors during generation. If you encounter crashes, manually select cpu from the device dropdown.
4-bit Quantization: Requires CUDA GPU. Automatically disabled for CPU/MPS devices.
CPU Optimization: If using CPU mode with low utilization, set thread count:
import torch
torch.set_num_threads(8)  # Match your CPU cores

Precision by Device:
The model automatically uses optimal precision for each device:
| Device | Precision | Notes |
|---|---|---|
| CUDA | bfloat16 | Best performance/quality balance on NVIDIA GPUs |
| MPS | float16 | Required for Apple Silicon (bfloat16 not supported) |
| CPU | float32 | Full precision for best compatibility |
- MPS uses fp16 for speed, but may still have compatibility issues
- CPU uses fp32 because PyTorch has limited fp16 support on CPU
- CUDA uses bfloat16 for optimal performance without quality loss
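The fallback chain and precision table above can be sketched in plain Python. The availability flags stand in for real checks such as `torch.cuda.is_available()`; this is an illustration, not the node's actual code:

```python
def resolve_device(preference: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Mimic the 'auto' fallback chain: CUDA -> MPS -> CPU.
    Availability flags are stand-ins for real hardware checks."""
    if preference != "auto":
        return preference  # an explicit choice always wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# Precision table from above, keyed by resolved device
DTYPE_FOR_DEVICE = {"cuda": "bfloat16", "mps": "float16", "cpu": "float32"}
```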
Generate speech from text with full control over generation parameters.
Inputs:
- `text`: Text to synthesize
- `model`: Model selection (auto-downloads on first run)
- `device`: Device selection (auto/cuda/mps/cpu)
- `attention_type`: Attention implementation (auto/sage_attn/flash_attn/sdpa/eager)
- `use_4bit`: Enable 4-bit quantization (~8GB VRAM, requires CUDA, SDPA/Eager only)
- `cfg_scale`: Guidance scale (1.0-10.0, default 3.0) - higher = more adherence to text
- `max_new_tokens`: Max generation length (512-4096, default 2048)
- `language`: Optional language hint (auto-detects if not set)
- `keep_loaded`: Keep model in VRAM (faster subsequent runs)
- `output_stereo`: Output stereo audio
- `seed`: Random seed for reproducibility (default 42)
- `max_words_per_chunk`: Split long text at sentence boundaries (100-500, default 250)
- `do_sample`: Enable sampling for varied output (default False)
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
Clone any voice using a short reference audio sample (5-30 seconds recommended).
Same inputs as TTS plus:
- `voice_prompt`: Reference audio file for voice cloning
- Higher quality reference = better voice similarity
Generate conversations with up to 6 speakers with automatic pause between speakers.
Inputs:
- `text`: Conversation text (use the `Speaker N:` format, N = 1-6)
- `pause_between_speakers`: Silence between speaker turns (0.0-2.0 seconds, default 0.2s)
- Voice inputs for each speaker (optional)
- All TTS options (cfg_scale, attention type, etc.)
Text Format:
Speaker 1: Hello, I'm the first speaker.
Speaker 2: Hi there, I'm the second speaker.
Speaker 3: I'm the third speaker!
Speaker 4: And I'm the fourth.
Speaker 5: Adding a fifth voice here!
Speaker 6: And the sixth speaker!
Optional voice inputs:
- `speaker1_voice` through `speaker6_voice`: Voice samples for each speaker
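A minimal sketch of how the `Speaker N:` format could be parsed and how turns could be joined with the configured pause. This is illustrative only; the node's actual parser and audio handling may differ, and the 24kHz rate comes from the output format described below:

```python
import re

SPEAKER_RE = re.compile(r"^Speaker\s+([1-6]):\s*(.+)$")

def parse_conversation(text: str):
    """Split 'Speaker N: ...' lines into (speaker_number, utterance) turns."""
    turns = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        m = SPEAKER_RE.match(line)
        if not m:
            raise ValueError(f"line missing 'Speaker N:' prefix (N = 1-6): {line!r}")
        turns.append((int(m.group(1)), m.group(2)))
    return turns

def pause_samples(pause_s: float, sample_rate: int = 24000):
    """Silence inserted between speaker turns (0.0-2.0 s at 24kHz)."""
    return [0.0] * int(pause_s * sample_rate)
```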
All KugelAudio output contains an inaudible watermark using Facebook's AudioSeal technology. This node detects whether audio was generated by KugelAudio.
Returns:
- `detected`: String ("Detected" / "Not Detected")
- `confidence`: Float (0.0-1.0)
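The relationship between the two outputs can be sketched as a simple threshold. Note the 0.5 cutoff is an illustrative assumption, not the node's documented threshold:

```python
def detection_label(confidence: float, threshold: float = 0.5) -> str:
    """Map an AudioSeal confidence score (0.0-1.0) to the node's string output.
    The 0.5 cutoff is an assumption for illustration."""
    return "Detected" if confidence >= threshold else "Not Detected"
```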
Audio Format:
- Input: Any sample rate, mono or stereo (auto-converted)
- Output: 24kHz mono (optionally stereo)
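A sketch of the channel conversions implied above. Averaging mixdown and channel duplication are common conventions but assumptions here; resampling is omitted:

```python
def to_mono(left, right):
    """Mix a stereo pair down to mono by averaging the channels."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def to_stereo(mono):
    """Duplicate a mono signal into two identical channels."""
    return [list(mono), list(mono)]
```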
The 4-bit toggle quantizes the LLM component (7B parameters), keeping the diffusion head and tokenizers at full precision for best audio quality.
Attention Type Compatibility:
| Mode | Available Attention Types |
|---|---|
| Full Precision | Auto → SageAttention → FlashAttention 2 → SDPA → Eager |
| 4-bit | Auto (falls back to SDPA) → SDPA → Eager only |
Tips:
- SageAttention: Fastest (CUDA only, GPU-optimized kernels)
- FlashAttention 2: Fast (CUDA only)
- SDPA: PyTorch optimized (all platforms)
- Eager: Standard/slowest (all platforms, required for 4-bit)
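The fallback logic described above might look like this sketch. The flags stand in for real capability checks (illustrative, not the node's actual code):

```python
def pick_attention(requested: str, use_4bit: bool,
                   has_sage: bool, has_flash: bool, has_cuda: bool) -> str:
    """Sketch of the fallback order: SageAttention -> FlashAttention 2 -> SDPA,
    with 4-bit mode restricted to SDPA/Eager."""
    if use_4bit:
        # 4-bit supports only SDPA and Eager; 'auto' falls back to SDPA
        return requested if requested in ("sdpa", "eager") else "sdpa"
    if requested != "auto":
        return requested
    if has_cuda and has_sage:
        return "sage_attn"
    if has_cuda and has_flash:
        return "flash_attn"
    return "sdpa"
```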
On first run:
- The kugelaudio-open package auto-installs from the bundled folder
- The model (kugelaudio-0-open, 7B parameters) automatically downloads to `ComfyUI/models/kugelaudio/`
Both happen automatically on first generation.
Benchmark Results
KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing.
Human Preference Benchmark (A/B Testing): 339 human evaluations comparing KugelAudio against leading TTS models.
OpenSkill Ranking:
| Rank | Model | Score | Record | Win Rate |
|---|---|---|---|---|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
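The Win Rate column is consistent with wins as a fraction of decisive matches (ties excluded), which you can verify against the Record column:

```python
def win_rate(wins: int, losses: int) -> float:
    """Win rate over decisive matches only (ties excluded), in percent."""
    return 100.0 * wins / (wins + losses)

assert round(win_rate(71, 20), 1) == 78.0  # KugelAudio
assert round(win_rate(15, 91), 1) == 14.2  # CosyVoice v3
```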
Model Specs:
| Model | Parameters | Quality | RTF | VRAM |
|---|---|---|---|---|
| kugelaudio-0-open | 7B | Best | 1.00 | ~19GB / ~8GB (4-bit) |
RTF = Real-Time Factor (generation time / audio duration).
Troubleshooting
- Enable 4-bit quantization: Reduces VRAM from ~19GB to ~8GB
- Use SDPA or Eager attention: Required with 4-bit mode
- Reduce max_words_per_chunk: Lower chunk size reduces peak memory
- Restart ComfyUI: Sometimes VRAM doesn't release properly
- Close other GPU applications: Free up GPU memory
- Use SageAttention (CUDA only): Most memory-efficient attention type
Windows:
pip install bitsandbytes

Linux:
pip install bitsandbytes

macOS (Apple Silicon):
pip install bitsandbytes

Note: 4-bit quantization on macOS with MPS may have limited compatibility.
- Check internet connection
- Try manual download:
huggingface-cli download kugelaudio/kugelaudio-0-open --local-dir ComfyUI/models/kugelaudio/kugelaudio-0-open
- Set HF_TOKEN environment variable if using gated model
If the auto-installer fails or you need to reinstall the bundled package:
Find your Python path:
- Standard: `python`
- Portable Windows: `python_embeded\python.exe`
Install the bundled package:
# Navigate to the custom node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
# Install using portable Python (replace /path/to/ComfyUI with your actual path) or run .bat file
C:\path\to\ComfyUI\python_embeded\python.exe -m pip install ./kugelaudio-open
# For standard Python installation
python -m pip install ./kugelaudio-open
Verify installation:
C:\path\to\ComfyUI\python_embeded\python.exe -c "import kugelaudio_open; print('kugelaudio-open installed successfully')"

Note: The bundled package is located at ComfyUI/custom_nodes/ComfyUI-KugelAudio/kugelaudio-open/
- Static/noise: Disable 4-bit quantization, use full precision
- Robot voice: Increase cfg_scale (try 3.0-5.0)
- Clipping/distortion: Lower cfg_scale (try 2.0-3.0)
- Slow generation: Use SageAttention with CUDA GPU
If you see warnings about attention types:
- 4-bit mode requires SDPA or Eager
- Auto mode will automatically select the best compatible option
- SageAttention and FlashAttention require CUDA
Apple Silicon (MPS) Issues:
- MPS may cause `mps_matmul` errors during generation
- If you see "incompatible dimensions" or "LLVM ERROR", select `cpu` from the Device dropdown
- MPS is auto-detected but not always stable with this model
CPU Mode:
- Select `cpu` from the Device dropdown
- 4-bit quantization requires CUDA (automatically disabled on CPU)
- Set PyTorch threads for better utilization:
  import torch
  torch.set_num_threads(8)  # Match your CPU cores
- Expect significantly slower generation than GPU
Forcing Specific Device:
- Use the Device dropdown in any KugelAudio node
- `auto`: Tries CUDA → MPS → CPU (with warnings for MPS)
- `cuda`: NVIDIA GPU only
- `mps`: Apple Silicon GPU only (may have issues)
- `cpu`: CPU only (most compatible)
- Set `max_words_per_chunk` to split long text (recommended: 200-300)
- Check for proper sentence-ending punctuation
- Progress bar shows stage completion in ComfyUI console
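Sentence-boundary chunking as described can be sketched with a greedy splitter. This is an illustration of the idea; the node's actual splitter may behave differently:

```python
import re

def chunk_text(text: str, max_words: int = 250):
    """Greedily pack whole sentences into chunks of at most max_words words.
    Sentences are split on sentence-ending punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```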
- Format text exactly as `Speaker N: Text` (N = 1-6)
- Ensure voice inputs are provided if using voice cloning
- Each line must have a speaker prefix
- Maximum 6 speakers (1-6)
- Watermark detection works on generated audio only
- Very short audio clips may have reduced detection accuracy
- Output shows "Detected" or "Not Detected" as string
- Accessibility: Text-to-speech for visually impaired users
- Content Creation: Podcasts, videos, audiobooks, e-learning
- Voice Assistants: Chatbots and virtual assistants
- Language Learning: Pronunciation practice and language education
- Creative Projects: With proper consent and attribution
- Creating deepfakes or misleading content
- Impersonating individuals without explicit consent
- Fraud, deception, or scams
- Harassment or abuse
- Any illegal activities
- VRAM Requirements: Requires ~19GB VRAM for full precision, ~8GB with 4-bit quantization
- Speed: Approximately 1.0x real-time on modern GPUs
- Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
- Language Quality Variation: Quality may vary across languages based on training data distribution
MIT License - Same as KugelAudio
This model would not have been possible without the contributions of many individuals and organizations:
- Microsoft VibeVoice Team: Foundation architecture
- YODAS2 Dataset: Training data (~200,000 hours)
- Qwen Team: Language model backbone
- Facebook AudioSeal: Audio watermarking
- Carlos Menke: For invaluable efforts in gathering datasets and extensive benchmarking
- AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing GPU resources (8x H100)
@software{kugelaudio2026,
title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
author = {Kratzenstein, Kajo and Menke, Carlos},
year = {2026},
institution = {Hasso-Plattner-Institut},
url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}