Skip to content

Saganaki22/ComfyUI-KugelAudio

Repository files navigation

ComfyUI-KugelAudio

ComfyUI nodes for KugelAudio - Open-source text-to-speech with voice cloning for 24 European languages
Powered by an AR + Diffusion architecture

GitHub Repository HuggingFace Model Python 3.10+ ComfyUI License: MIT

Screenshot 2026-02-02 202247

Features

  • Single Speaker TTS: Convert text to speech
  • Voice Cloning: Clone any voice from reference audio (5-30 seconds)
  • Multi-Speaker: Generate conversations with up to 6 speakers (Speaker 1-6)
  • Natural Pacing: Configurable pause (0.0-2.0s) between speakers
  • Watermark Detection: All output contains inaudible watermark (AudioSeal)
  • 24 European Languages: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, Turkish
  • 4-bit Quantization: Reduce VRAM from ~19GB to ~8GB (4-bit)
  • Multiple Attention Types: Auto/SageAttention/FlashAttention/SDPA/Eager
  • Progress Tracking: Real-time progress bars for long generations
  • Text Chunking: Automatic sentence-boundary splitting for long texts

Installation

Method 1: ComfyUI Manager (Recommended)

  1. Open ComfyUI Manager
  2. Click "Install Custom Nodes"
  3. Search for "KugelAudio"
  4. Click Install

Method 2: Manual Installation

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git

Bundled package: The kugelaudio-open folder is included and must be installed. ComfyUI will try to auto-install it on first launch. If that fails, install manually (see below).

Platform Support

βœ… Windows (Standard & Portable)
βœ… macOS (Intel & Apple Silicon)
βœ… Linux (Standard & Portable)

The auto-installer detects your Python environment automatically and installs to the correct location.

Manual Installation (if auto-install fails)

If you see errors on startup about missing kugelaudio-open package, install manually:

Windows Portable (recommended):

For first-time installation, use:

cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open

For reinstalling after code changes (safer - won't break dependencies):

cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Or use the provided batch files in the ComfyUI-KugelAudio folder.

Installation Scripts Explained

Two batch files are provided for Windows Portable:

Script When to Use What It Does Command Used
install_portable.bat First-time installation Installs kugelaudio-open in editable mode (-e flag). Creates a link so code changes take effect after restarting ComfyUI. pip install -e ./kugelaudio-open
reinstall_no-deps.bat After modifying code (Recommended!) Safely reinstalls kugelaudio-open without touching dependencies (--no-deps --force-reinstall -e). Use this when you've edited code or applied fixes and want changes to take effect without risking breaking your environment. pip install --no-deps --force-reinstall -e ./kugelaudio-open

Why editable mode (-e flag) is CRITICAL:

  • Without -e: Code is copied to Python's site-packages. Editing kugelaudio-open/ files won't do anything until you reinstall!
  • With -e: Python creates a link to your kugelaudio-open/ folder. Code changes take effect immediately after restarting ComfyUI.
  • This is essential for development, bug fixes, and applying updates!

Why --no-deps is important for reinstalls:

  • Without it: pip might try to reinstall dependencies (torch, transformers, etc.) which could break your ComfyUI environment
  • With it: Only the kugelaudio-open package is reinstalled, keeping all your existing dependencies safe

When should I reinstall?

  • After editing any files in kugelaudio-open/ folder
  • After pulling git updates that modify kugelaudio-open
  • When fixes don't seem to be taking effect
  • If you're told to "reinstall in editable mode"

Having issues? See Troubleshooting for more solutions.

Manual CLI Commands Reference

Windows Portable (python_embeded):

# Navigate to node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio

# First install (editable mode)
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open

# Safe reinstall after edits (editable + no deps)
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Standard Python (Windows/Linux/macOS):

# First install (editable mode)
pip install -e ./kugelaudio-open

# Safe reinstall after edits (editable + no deps)
pip install --no-deps --force-reinstall -e ./kugelaudio-open

Requirements

  • Python 3.10+
  • PyTorch 2.0+ (usually included with ComfyUI)
  • Transformers 4.40+ (usually included with ComfyUI)
  • ~19GB VRAM for full precision (7B model)
  • ~8GB VRAM with 4-bit quantization (requires bitsandbytes)
  • CUDA-capable GPU recommended (CPU/MPS supported but slower)
VRAM Comparison
Mode VRAM Quality Attention Types
Full Precision ~19GB Best All (Sage/Flash/SDPA/Eager)
4-bit Quantization ~8GB Slight reduction SDPA/Eager only
For 4-bit Quantization (Lower VRAM)

4-bit quantization requires bitsandbytes. If you encounter issues:

# Standard installation
pip install bitsandbytes

Note: 4-bit quantization only supports SDPA and Eager attention types.

Node Reference

Device Selection

All nodes include a Device dropdown to control where inference runs:

Option Description
auto Automatically detects best available (CUDA β†’ MPS β†’ CPU)
cuda Force NVIDIA GPU (recommended for best performance)
mps Force Apple Silicon GPU (may have compatibility issues)
cpu Force CPU execution (slower but most compatible)

Apple Silicon Users: ⚠️ MPS may cause mps_matmul errors during generation. If you encounter crashes, manually select cpu from the device dropdown.

4-bit Quantization: Requires CUDA GPU. Automatically disabled for CPU/MPS devices.

CPU Optimization: If using CPU mode with low utilization, set thread count:

import torch
torch.set_num_threads(8)  # Match your CPU cores

Precision by Device:

The model automatically uses optimal precision for each device:

Device Precision Notes
CUDA bfloat16 Best performance/quality balance on NVIDIA GPUs
MPS float16 Required for Apple Silicon (bfloat16 not supported)
CPU float32 Full precision for best compatibility
  • MPS uses fp16 for speed, but may still have compatibility issues
  • CPU uses fp32 because PyTorch has limited fp16 support on CPU
  • CUDA uses bfloat16 for optimal performance without quality loss

KugelAudio TTS

Generate speech from text with full control over generation parameters.

Inputs:

  • text: Text to synthesize
  • model: Model selection (auto-downloads on first run)
  • device: Device selection (auto/cuda/mps/cpu)
  • attention_type: Attention implementation (auto/sage_attn/flash_attn/sdpa/eager)
  • use_4bit: Enable 4-bit quantization (~8GB VRAM, requires CUDA, SDPA/Eager only)
  • cfg_scale: Guidance scale (1.0-10.0, default 3.0) - higher = more adherence to text
  • max_new_tokens: Max generation length (512-4096, default 2048)
  • language: Optional language hint (auto-detects if not set)
  • keep_loaded: Keep model in VRAM (faster subsequent runs)
  • output_stereo: Output stereo audio
  • seed: Random seed for reproducibility (default 42)
  • max_words_per_chunk: Split long text at sentence boundaries (100-500, default 250)
  • do_sample: Enable sampling for varied output (default False)
  • temperature: Sampling temperature (0.1-2.0, default 1.0)

KugelAudio Voice Clone

Clone any voice using a short reference audio sample (5-30 seconds recommended).

Same inputs as TTS plus:

  • voice_prompt: Reference audio file for voice cloning
  • Higher quality reference = better voice similarity

KugelAudio Multi-Speaker

Generate conversations with up to 6 speakers with automatic pause between speakers.

Inputs:

  • text: Conversation text (use Speaker N: format, N=1-6)
  • pause_between_speakers: Silence between speaker turns (0.0-2.0 seconds, default 0.2s)
  • Voice inputs for each speaker (optional)
  • All TTS options (cfg_scale, attention type, etc.)

Text Format:

Speaker 1: Hello, I'm the first speaker.
Speaker 2: Hi there, I'm the second speaker.
Speaker 3: I'm the third speaker!
Speaker 4: And I'm the fourth.
Speaker 5: Adding a fifth voice here!
Speaker 6: And the sixth speaker!

Optional voice inputs:

  • speaker1_voice through speaker6_voice: Voice samples for each speaker

KugelAudio Watermark Check

All KugelAudio output contains an inaudible watermark using Facebook's AudioSeal technology. This node detects whether audio was generated by KugelAudio.

Returns:

  • detected: String ("Detected" / "Not Detected")
  • confidence: Float (0.0-1.0)

Audio Format:

  • Input: Any sample rate, mono or stereo (auto-converted)
  • Output: 24kHz mono (optionally stereo)

Quantization Details

Click to expand

The 4-bit toggle quantizes the LLM component (7B parameters), keeping the diffusion head and tokenizers at full precision for best audio quality.

Attention Type Compatibility:

Mode Available Attention Types
Full Precision Auto β†’ SageAttention β†’ FlashAttention 2 β†’ SDPA β†’ Eager
4-bit Auto (falls back to SDPA) β†’ SDPA β†’ Eager only

Tips:

  • SageAttention: Fastest (CUDA only, GPU-optimized kernels)
  • FlashAttention 2: Fast (CUDA only)
  • SDPA: PyTorch optimized (all platforms)
  • Eager: Standard/slowest (all platforms, required for 4-bit)

Model Auto-Download

On first run:

  1. The kugelaudio-open package auto-installs from the bundled folder
  2. The model (kugelaudio-0-open, 7B parameters) automatically downloads to ComfyUI/models/kugelaudio/

Both happen automatically on first generation.

Benchmark Results

KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing.

Human Preference Benchmark (A/B Testing): 339 human evaluations comparing KugelAudio against leading TTS models.

OpenSkill Ranking:

Rank Model Score Record Win Rate
πŸ₯‡ 1 KugelAudio 26 71W / 20L / 23T 78.0%
πŸ₯ˆ 2 ElevenLabs Multi v2 25 56W / 34L / 22T 62.2%
πŸ₯‰ 3 ElevenLabs v3 21 64W / 34L / 16T 65.3%
4 Cartesia 21 55W / 38L / 19T 59.1%
5 VibeVoice 10 30W / 74L / 8T 28.8%
6 CosyVoice v3 9 15W / 91L / 8T 14.2%

Model Specs:

Model Parameters Quality RTF VRAM
kugelaudio-0-open 7B Best 1.00 ~19GB / ~8GB (4-bit)

RTF = Real-Time Factor (generation time / audio duration).

Troubleshooting

Voice cloning failed: 'Qwen2Config' object has no attribute 'pad_token_id'

run install_portable.bat in ComfyUI\custom_nodes\ComfyUI-KugelAudio

Out of Memory (OOM) Errors

  1. Enable 4-bit quantization: Reduces VRAM from ~19GB to ~8GB
  2. Use SDPA or Eager attention: Required with 4-bit mode
  3. Reduce max_words_per_chunk: Lower chunk size reduces peak memory
  4. Restart ComfyUI: Sometimes VRAM doesn't release properly
  5. Close other GPU applications: Free up GPU memory
  6. Use SageAttention (CUDA only): Most memory-efficient attention type

bitsandbytes Installation Fails

Windows:

pip install bitsandbytes

Linux:

pip install bitsandbytes

macOS (Apple Silicon):

pip install bitsandbytes

Note: 4-bit quantization on macOS with MPS may have limited compatibility.

Model Download Fails

  1. Check internet connection
  2. Try manual download:
    huggingface-cli download kugelaudio/kugelaudio-0-open --local-dir ComfyUI/models/kugelaudio/kugelaudio-0-open
  3. Set HF_TOKEN environment variable if using gated model

Manual Package Installation (Portable ComfyUI)

If the auto-installer fails or you need to reinstall the bundled package:

Find your Python path:

  • Standard: python
  • Portable Windows: python_embeded\python.exe

Install the bundled package:

# Navigate to the custom node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio

# Install using portable Python (replace /path/to/ComfyUI with your actual path) or run .bat file
C:\path\to\ComfyUI\python_embeded\python.exe -m pip install ./kugelaudio-open

# For standard Python installation
python -m pip install ./kugelaudio-open

Verify installation:

C:\path\to\ComfyUI\python_embeded\python.exe -c "import kugelaudio_open; print('kugelaudio-open installed successfully')"

Note: The bundled package is located at ComfyUI/custom_nodes/ComfyUI-KugelAudio/kugelaudio-open/

Audio Quality Issues

  1. Static/noise: Disable 4-bit quantization, use full precision
  2. Robot voice: Increase cfg_scale (try 3.0-5.0)
  3. Clipping/distortion: Lower cfg_scale (try 2.0-3.0)
  4. Slow generation: Use SageAttention with CUDA GPU

Attention Type Warnings

If you see warnings about attention types:

  • 4-bit mode requires SDPA or Eager
  • Auto mode will automatically select the best compatible option
  • SageAttention and FlashAttention require CUDA

Device Selection & Compatibility

Apple Silicon (MPS) Issues:

  • MPS may cause mps_matmul errors during generation
  • If you see "incompatible dimensions" or "LLVM ERROR", select cpu from the Device dropdown
  • MPS is auto-detected but not always stable with this model

CPU Mode:

  • Select cpu from the Device dropdown
  • 4-bit quantization requires CUDA (automatically disabled on CPU)
  • Set PyTorch threads for better utilization:
    import torch
    torch.set_num_threads(8)  # Match your CPU cores
  • Expect significantly slower generation than GPU

Forcing Specific Device:

  • Use the Device dropdown in any KugelAudio node
  • auto: Tries CUDA β†’ MPS β†’ CPU (with warnings for MPS)
  • cuda: NVIDIA GPU only
  • mps: Apple Silicon GPU only (may have issues)
  • cpu: CPU only (most compatible)

Long Text Not Processing

  1. Set max_words_per_chunk to split long text (recommended: 200-300)
  2. Check for proper sentence-ending punctuation
  3. Progress bar shows stage completion in ComfyUI console

Multi-Speaker Not Working

  1. Format text exactly: Speaker N: Text (N = 1-6)
  2. Ensure voice inputs are provided if using voice cloning
  3. Each line must have a speaker prefix
  4. Maximum 6 speakers (1-6)

Watermark Detection Issues

  • Watermark detection works on generated audio only
  • Very short audio clips may have reduced detection accuracy
  • Output shows "Detected" or "Not Detected" as string

Intended Use

Click to expand

βœ… Appropriate Uses

  • Accessibility: Text-to-speech for visually impaired users
  • Content Creation: Podcasts, videos, audiobooks, e-learning
  • Voice Assistants: Chatbots and virtual assistants
  • Language Learning: Pronunciation practice and language education
  • Creative Projects: With proper consent and attribution

❌ Prohibited Uses

  • Creating deepfakes or misleading content
  • Impersonating individuals without explicit consent
  • Fraud, deception, or scams
  • Harassment or abuse
  • Any illegal activities

Limitations

  • VRAM Requirements: Requires ~19GB VRAM for full precision, ~8GB with 4-bit quantization
  • Speed: Approximately 1.0x real-time on modern GPUs
  • Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
  • Language Quality Variation: Quality may vary across languages based on training data distribution

License

MIT License - Same as KugelAudio

Acknowledgments

This model would not have been possible without the contributions of many individuals and organizations:

Special Thanks

  • Carlos Menke: For invaluable efforts in gathering datasets and extensive benchmarking
  • AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing GPU resources (8x H100)

Citation

@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}