ComfyUI nodes for KugelAudio - Open-source text-to-speech with voice cloning for 24 European languages
Powered by an AR + Diffusion architecture
- Single Speaker TTS: Convert text to speech
- Voice Cloning: Clone any voice from reference audio (5-30 seconds)
- Multi-Speaker: Generate conversations with up to 6 speakers (Speaker 1-6)
- Natural Pacing: Configurable pause (0.0-2.0s) between speakers
- Watermark Detection: All output contains inaudible watermark (AudioSeal)
- 24 European Languages: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, Turkish
- 4-bit Quantization: Reduces VRAM from ~19GB to ~8GB
- Multiple Attention Types: Auto/SageAttention/FlashAttention/SDPA/Eager
- Progress Tracking: Real-time progress bars for long generations
- Text Chunking: Automatic sentence-boundary splitting for long texts
- Open ComfyUI Manager
- Click "Install Custom Nodes"
- Search for "KugelAudio"
- Click Install
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git

Bundled package: The kugelaudio-open folder is included and must be installed. ComfyUI will try to auto-install it on first launch. If that fails, install manually (see below).
✅ Windows (Standard & Portable)
✅ macOS (Intel & Apple Silicon)
✅ Linux (Standard & Portable)
The auto-installer detects your Python environment automatically and installs to the correct location.
If you see errors on startup about missing kugelaudio-open package, install manually:
Windows Portable (recommended):
For first-time installation, use:
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open

For reinstalling after code changes (safer - won't break dependencies):
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Or use the provided batch files in the ComfyUI-KugelAudio folder.
Two batch files are provided for Windows Portable:
| Script | When to Use | What It Does | Command Used |
|---|---|---|---|
| `install_portable.bat` | First-time installation | Installs kugelaudio-open in editable mode (`-e` flag). Creates a link so code changes take effect after restarting ComfyUI. | `pip install -e ./kugelaudio-open` |
| `reinstall_no-deps.bat` | After modifying code (recommended!) | Safely reinstalls kugelaudio-open without touching dependencies (`--no-deps --force-reinstall -e`). Use this when you've edited code or applied fixes and want changes to take effect without risking breaking your environment. | `pip install --no-deps --force-reinstall -e ./kugelaudio-open` |
Why editable mode (`-e` flag) is CRITICAL:
- Without `-e`: code is copied to Python's site-packages, so editing files in `kugelaudio-open/` has no effect until you reinstall!
- With `-e`: Python creates a link to your `kugelaudio-open/` folder, so code changes take effect immediately after restarting ComfyUI.
- This is essential for development, bug fixes, and applying updates!
Why --no-deps is important for reinstalls:
- Without it: pip might try to reinstall dependencies (torch, transformers, etc.) which could break your ComfyUI environment
- With it: Only the kugelaudio-open package is reinstalled, keeping all your existing dependencies safe
When should I reinstall?
- After editing any files in the `kugelaudio-open/` folder
- After pulling git updates that modify kugelaudio-open
- When fixes don't seem to be taking effect
- If you're told to "reinstall in editable mode"
Having issues? See Troubleshooting for more solutions.
Windows Portable (python_embeded):
# Navigate to node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
# First install (editable mode)
..\..\..\python_embeded\python.exe -m pip install -e ./kugelaudio-open
# Safe reinstall after edits (editable + no deps)
..\..\..\python_embeded\python.exe -m pip install --no-deps --force-reinstall -e ./kugelaudio-open

Standard Python (Windows/Linux/macOS):
# First install (editable mode)
pip install -e ./kugelaudio-open
# Safe reinstall after edits (editable + no deps)
pip install --no-deps --force-reinstall -e ./kugelaudio-open

- Python 3.10+
- PyTorch 2.0+ (usually included with ComfyUI)
- Transformers 4.40+ (usually included with ComfyUI)
- ~19GB VRAM for full precision (7B model)
- ~8GB VRAM with 4-bit quantization (requires bitsandbytes)
- CUDA-capable GPU recommended (CPU/MPS supported but slower)
VRAM Comparison
| Mode | VRAM | Quality | Attention Types |
|---|---|---|---|
| Full Precision | ~19GB | Best | All (Sage/Flash/SDPA/Eager) |
| 4-bit Quantization | ~8GB | Slight reduction | SDPA/Eager only |
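These figures are roughly consistent with simple parameter-count arithmetic. A back-of-envelope sketch (the 5 GB overhead constant for activations, KV cache, and the full-precision diffusion head is an illustrative assumption, not a measured value):

```python
def estimated_vram_gb(params_billions: float, bytes_per_param: float,
                      overhead_gb: float = 5.0) -> float:
    """Rough VRAM estimate: weight storage plus a fixed overhead for
    activations, KV cache, and the full-precision diffusion head."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params * N bytes ~= N GB
    return weights_gb + overhead_gb

full_precision = estimated_vram_gb(7, 2.0)  # bfloat16: 2 bytes/param -> ~19 GB
four_bit = estimated_vram_gb(7, 0.5)        # 4-bit: 0.5 bytes/param -> ~8.5 GB
```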
For 4-bit Quantization (Lower VRAM)
4-bit quantization requires bitsandbytes. If you encounter issues:
# Standard installation
pip install bitsandbytes

Note: 4-bit quantization only supports SDPA and Eager attention types.
Node Reference
All nodes include a Device dropdown to control where inference runs:
| Option | Description |
|---|---|
| `auto` | Automatically selects the best available device (CUDA → MPS → CPU) |
| `cuda` | Force NVIDIA GPU (recommended for best performance) |
| `mps` | Force Apple Silicon GPU (may have compatibility issues) |
| `cpu` | Force CPU execution (slower but most compatible) |
Apple Silicon Users: MPS may produce mps_matmul errors during generation. If you encounter crashes, manually select cpu from the device dropdown.
4-bit Quantization: Requires CUDA GPU. Automatically disabled for CPU/MPS devices.
CPU Optimization: If using CPU mode with low utilization, set thread count:
import torch
torch.set_num_threads(8)  # Match your CPU cores

Precision by Device:
The model automatically uses optimal precision for each device:
| Device | Precision | Notes |
|---|---|---|
| CUDA | bfloat16 | Best performance/quality balance on NVIDIA GPUs |
| MPS | float16 | Required for Apple Silicon (bfloat16 not supported) |
| CPU | float32 | Full precision for best compatibility |
- MPS uses fp16 for speed, but may still have compatibility issues
- CPU uses fp32 because PyTorch has limited fp16 support on CPU
- CUDA uses bfloat16 for optimal performance without quality loss
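The fallback chain and precision table above can be sketched in plain Python. The availability flags stand in for real checks such as `torch.cuda.is_available()`; this is an illustration, not the node's actual code:

```python
def resolve_device(preference: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Mimic the 'auto' fallback chain: CUDA -> MPS -> CPU.
    Availability flags are stand-ins for real hardware checks."""
    if preference != "auto":
        return preference  # an explicit choice always wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# Precision table from above, keyed by resolved device
DTYPE_FOR_DEVICE = {"cuda": "bfloat16", "mps": "float16", "cpu": "float32"}
```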
Generate speech from text with full control over generation parameters.
Inputs:
- `text`: Text to synthesize
- `model`: Model selection (auto-downloads on first run)
- `device`: Device selection (auto/cuda/mps/cpu)
- `attention_type`: Attention implementation (auto/sage_attn/flash_attn/sdpa/eager)
- `use_4bit`: Enable 4-bit quantization (~8GB VRAM, requires CUDA, SDPA/Eager only)
- `cfg_scale`: Guidance scale (1.0-10.0, default 3.0) - higher = more adherence to text
- `max_new_tokens`: Max generation length (512-4096, default 2048)
- `language`: Optional language hint (auto-detects if not set)
- `keep_loaded`: Keep model in VRAM (faster subsequent runs)
- `output_stereo`: Output stereo audio
- `seed`: Random seed for reproducibility (default 42)
- `max_words_per_chunk`: Split long text at sentence boundaries (100-500, default 250)
- `do_sample`: Enable sampling for varied output (default False)
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
Clone any voice using a short reference audio sample (5-30 seconds recommended).
Same inputs as TTS plus:
- `voice_prompt`: Reference audio file for voice cloning
- Higher quality reference = better voice similarity
Generate conversations with up to 6 speakers with automatic pause between speakers.
Inputs:
- `text`: Conversation text (use the `Speaker N:` format, N = 1-6)
- `pause_between_speakers`: Silence between speaker turns (0.0-2.0 seconds, default 0.2s)
- Voice inputs for each speaker (optional)
- All TTS options (cfg_scale, attention type, etc.)
Text Format:
Speaker 1: Hello, I'm the first speaker.
Speaker 2: Hi there, I'm the second speaker.
Speaker 3: I'm the third speaker!
Speaker 4: And I'm the fourth.
Speaker 5: Adding a fifth voice here!
Speaker 6: And the sixth speaker!
Optional voice inputs:
- `speaker1_voice` through `speaker6_voice`: Voice samples for each speaker
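A minimal sketch of how the `Speaker N:` format could be parsed and how turns could be joined with the configured pause. This is illustrative only; the node's actual parser and audio handling may differ, and the 24kHz rate comes from the output format described below:

```python
import re

SPEAKER_RE = re.compile(r"^Speaker\s+([1-6]):\s*(.+)$")

def parse_conversation(text: str):
    """Split 'Speaker N: ...' lines into (speaker_number, utterance) turns."""
    turns = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        m = SPEAKER_RE.match(line)
        if not m:
            raise ValueError(f"line missing 'Speaker N:' prefix (N = 1-6): {line!r}")
        turns.append((int(m.group(1)), m.group(2)))
    return turns

def pause_samples(pause_s: float, sample_rate: int = 24000):
    """Silence inserted between speaker turns (0.0-2.0 s at 24kHz)."""
    return [0.0] * int(pause_s * sample_rate)
```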
All KugelAudio output contains an inaudible watermark using Facebook's AudioSeal technology. This node detects whether audio was generated by KugelAudio.
Returns:
- `detected`: String ("Detected" / "Not Detected")
- `confidence`: Float (0.0-1.0)
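The relationship between the two outputs can be sketched as a simple threshold. Note the 0.5 cutoff is an illustrative assumption, not the node's documented threshold:

```python
def detection_label(confidence: float, threshold: float = 0.5) -> str:
    """Map an AudioSeal confidence score (0.0-1.0) to the node's string output.
    The 0.5 cutoff is an assumption for illustration."""
    return "Detected" if confidence >= threshold else "Not Detected"
```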
Audio Format:
- Input: Any sample rate, mono or stereo (auto-converted)
- Output: 24kHz mono (optionally stereo)
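A sketch of the channel conversions implied above. Averaging mixdown and channel duplication are common conventions but assumptions here; resampling is omitted:

```python
def to_mono(left, right):
    """Mix a stereo pair down to mono by averaging the channels."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def to_stereo(mono):
    """Duplicate a mono signal into two identical channels."""
    return [list(mono), list(mono)]
```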
The 4-bit toggle quantizes the LLM component (7B parameters), keeping the diffusion head and tokenizers at full precision for best audio quality.
Attention Type Compatibility:
| Mode | Available Attention Types |
|---|---|
| Full Precision | Auto → SageAttention → FlashAttention 2 → SDPA → Eager |
| 4-bit | Auto (falls back to SDPA) → SDPA → Eager only |
Tips:
- SageAttention: Fastest (CUDA only, GPU-optimized kernels)
- FlashAttention 2: Fast (CUDA only)
- SDPA: PyTorch optimized (all platforms)
- Eager: Standard/slowest (all platforms, required for 4-bit)
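The fallback logic described above might look like this sketch. The flags stand in for real capability checks (illustrative, not the node's actual code):

```python
def pick_attention(requested: str, use_4bit: bool,
                   has_sage: bool, has_flash: bool, has_cuda: bool) -> str:
    """Sketch of the fallback order: SageAttention -> FlashAttention 2 -> SDPA,
    with 4-bit mode restricted to SDPA/Eager."""
    if use_4bit:
        # 4-bit supports only SDPA and Eager; 'auto' falls back to SDPA
        return requested if requested in ("sdpa", "eager") else "sdpa"
    if requested != "auto":
        return requested
    if has_cuda and has_sage:
        return "sage_attn"
    if has_cuda and has_flash:
        return "flash_attn"
    return "sdpa"
```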
On first run:
- The kugelaudio-open package auto-installs from the bundled folder
- The model (kugelaudio-0-open, 7B parameters) automatically downloads to `ComfyUI/models/kugelaudio/`
Both happen automatically on first generation.
Benchmark Results
KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing.
Human Preference Benchmark (A/B Testing): 339 human evaluations comparing KugelAudio against leading TTS models.
OpenSkill Ranking:
| Rank | Model | Score | Record | Win Rate |
|---|---|---|---|---|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
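The Win Rate column is consistent with wins as a fraction of decisive matches (ties excluded), which you can verify against the Record column:

```python
def win_rate(wins: int, losses: int) -> float:
    """Win rate over decisive matches only (ties excluded), in percent."""
    return 100.0 * wins / (wins + losses)

assert round(win_rate(71, 20), 1) == 78.0  # KugelAudio
assert round(win_rate(15, 91), 1) == 14.2  # CosyVoice v3
```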
Model Specs:
| Model | Parameters | Quality | RTF | VRAM |
|---|---|---|---|---|
| kugelaudio-0-open | 7B | Best | 1.00 | ~19GB / ~8GB (4-bit) |
RTF = Real-Time Factor (generation time / audio duration).
Troubleshooting
- Enable 4-bit quantization: Reduces VRAM from ~19GB to ~8GB
- Use SDPA or Eager attention: Required with 4-bit mode
- Reduce max_words_per_chunk: Lower chunk size reduces peak memory
- Restart ComfyUI: Sometimes VRAM doesn't release properly
- Close other GPU applications: Free up GPU memory
- Use SageAttention (CUDA only): Most memory-efficient attention type
Windows:
pip install bitsandbytes

Linux:
pip install bitsandbytes

macOS (Apple Silicon):
pip install bitsandbytes

Note: 4-bit quantization on macOS with MPS may have limited compatibility.
- Check internet connection
- Try manual download:
huggingface-cli download kugelaudio/kugelaudio-0-open --local-dir ComfyUI/models/kugelaudio/kugelaudio-0-open
- Set HF_TOKEN environment variable if using gated model
If the auto-installer fails or you need to reinstall the bundled package:
Find your Python path:
- Standard: `python`
- Portable Windows: `python_embeded\python.exe`
Install the bundled package:
# Navigate to the custom node folder
cd ComfyUI/custom_nodes/ComfyUI-KugelAudio
# Install using portable Python (replace /path/to/ComfyUI with your actual path) or run .bat file
C:\path\to\ComfyUI\python_embeded\python.exe -m pip install ./kugelaudio-open
# For standard Python installation
python -m pip install ./kugelaudio-open
Verify installation:
C:\path\to\ComfyUI\python_embeded\python.exe -c "import kugelaudio_open; print('kugelaudio-open installed successfully')"

Note: The bundled package is located at ComfyUI/custom_nodes/ComfyUI-KugelAudio/kugelaudio-open/
- Static/noise: Disable 4-bit quantization, use full precision
- Robot voice: Increase cfg_scale (try 3.0-5.0)
- Clipping/distortion: Lower cfg_scale (try 2.0-3.0)
- Slow generation: Use SageAttention with CUDA GPU
If you see warnings about attention types:
- 4-bit mode requires SDPA or Eager
- Auto mode will automatically select the best compatible option
- SageAttention and FlashAttention require CUDA
Apple Silicon (MPS) Issues:
- MPS may cause `mps_matmul` errors during generation
- If you see "incompatible dimensions" or "LLVM ERROR", select `cpu` from the Device dropdown
- MPS is auto-detected but not always stable with this model
CPU Mode:
- Select `cpu` from the Device dropdown
- 4-bit quantization requires CUDA (automatically disabled on CPU)
- Set PyTorch threads for better utilization:
  import torch
  torch.set_num_threads(8)  # Match your CPU cores
- Expect significantly slower generation than GPU
Forcing Specific Device:
- Use the Device dropdown in any KugelAudio node
- `auto`: Tries CUDA → MPS → CPU (with warnings for MPS)
- `cuda`: NVIDIA GPU only
- `mps`: Apple Silicon GPU only (may have issues)
- `cpu`: CPU only (most compatible)
- Set `max_words_per_chunk` to split long text (recommended: 200-300)
- Check for proper sentence-ending punctuation
- Progress bar shows stage completion in ComfyUI console
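Sentence-boundary chunking as described can be sketched with a greedy splitter. This is an illustration of the idea; the node's actual splitter may behave differently:

```python
import re

def chunk_text(text: str, max_words: int = 250):
    """Greedily pack whole sentences into chunks of at most max_words words.
    Sentences are split on sentence-ending punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```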
- Format text exactly as `Speaker N: Text` (N = 1-6)
- Ensure voice inputs are provided if using voice cloning
- Each line must have a speaker prefix
- Maximum 6 speakers (1-6)
- Watermark detection works on generated audio only
- Very short audio clips may have reduced detection accuracy
- Output shows "Detected" or "Not Detected" as string
- Accessibility: Text-to-speech for visually impaired users
- Content Creation: Podcasts, videos, audiobooks, e-learning
- Voice Assistants: Chatbots and virtual assistants
- Language Learning: Pronunciation practice and language education
- Creative Projects: With proper consent and attribution
- Creating deepfakes or misleading content
- Impersonating individuals without explicit consent
- Fraud, deception, or scams
- Harassment or abuse
- Any illegal activities
- VRAM Requirements: Requires ~19GB VRAM for full precision, ~8GB with 4-bit quantization
- Speed: Approximately 1.0x real-time on modern GPUs
- Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
- Language Quality Variation: Quality may vary across languages based on training data distribution
MIT License - Same as KugelAudio
This model would not have been possible without the contributions of many individuals and organizations:
- Microsoft VibeVoice Team: Foundation architecture
- YODAS2 Dataset: Training data (~200,000 hours)
- Qwen Team: Language model backbone
- Facebook AudioSeal: Audio watermarking
- Carlos Menke: For invaluable efforts in gathering datasets and extensive benchmarking
- AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing GPU resources (8x H100)
@software{kugelaudio2026,
title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
author = {Kratzenstein, Kajo and Menke, Carlos},
year = {2026},
institution = {Hasso-Plattner-Institut},
url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}