Live Demo: https://huggingface.co/spaces/MWill727/Sound2Scene
Sound2Scene is an end-to-end AI project that turns audio into generated visuals. It combines speech recognition, transcript processing, audio feature analysis, sentiment detection, and text-to-image generation into one pipeline.
You upload an audio file, and the app creates an image based on what it hears and how the track sounds.
At a high level:
- Takes an audio file (mp3, wav, flac, ogg, m4a, or aac)
- Converts it into a WAV file for processing
- Optionally separates vocals using Demucs
- Transcribes the audio using Whisper
- Cleans and filters the transcript
- Selects meaningful lyric snippets
- Estimates mood using sentiment analysis
- Extracts lightweight audio features such as energy, tempo feel, and tonal brightness
- Builds a descriptive prompt
- Generates an image using Stable Diffusion XL
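As a rough illustration of the "lightweight audio features" step, the sketch below computes two simple descriptors from raw samples: RMS amplitude as an energy proxy and zero-crossing rate as a crude brightness proxy. It is not taken from the app. The function names, the threshold values, and the use of plain Python lists (the app itself lists NumPy) are all assumptions for illustration, and tempo feel is left out.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude: a rough loudness/energy proxy."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs that change sign.

    Higher values suggest more high-frequency content ("brighter" sound).
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


def describe(samples, energy_cut=0.2, brightness_cut=0.1):
    """Map the numeric features to coarse labels.

    The cut-off values are arbitrary placeholders, not tuned numbers.
    """
    return {
        "energy": "high" if rms_energy(samples) >= energy_cut else "low",
        "brightness": "bright" if zero_crossing_rate(samples) >= brightness_cut else "warm",
    }
```

Coarse labels like these are easier to drop into a text prompt than raw numbers, which is one plausible reason to keep the features lightweight.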
The flow inside the app looks like this:
- Audio input
- Audio conversion with Pydub / ffmpeg
- Optional vocal separation with Demucs
- Whisper transcription through Hugging Face Transformers
- Transcript cleanup and filtering
- Lyric snippet selection
- Sentiment analysis
- Lightweight audio feature extraction
- Prompt construction
- Image generation with Stable Diffusion XL
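The text-side steps in this flow (transcript cleanup, snippet selection, prompt construction) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the app's actual logic: the filler-word list, the "most distinct words" scoring rule, the prompt template, and all function names are hypothetical. The `mood` argument stands in for whatever label the sentiment step produces.

```python
import re

# Assumed list of low-information lyric fillers (hypothetical).
FILLER = {"yeah", "oh", "ooh", "la", "na", "uh", "hey"}


def clean_lines(transcript):
    """Normalise whitespace, drop short or filler-only lines, remove repeats."""
    seen, kept = set(), []
    for raw in transcript.splitlines():
        line = re.sub(r"\s+", " ", raw).strip()
        words = re.findall(r"[a-z']+", line.lower())
        if len(words) < 3 or all(w in FILLER for w in words):
            continue
        key = " ".join(words)
        if key not in seen:  # repeated choruses appear only once
            seen.add(key)
            kept.append(line)
    return kept


def pick_snippets(lines, k=2):
    """Keep the k lines carrying the most distinct non-filler words."""
    def score(line):
        return len(set(re.findall(r"[a-z']+", line.lower())) - FILLER)
    return sorted(lines, key=score, reverse=True)[:k]


def build_prompt(snippets, mood, energy, brightness):
    """Join lyric snippets and audio/mood labels into one image prompt."""
    scene = "; ".join(snippets) if snippets else "an abstract soundscape"
    return f"{scene}, {mood} mood, {energy} energy, {brightness} tones, cinematic illustration"
```

When no usable lyric lines survive the filter, the prompt falls back to an abstract description built only from mood and audio labels.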
Tech stack:
- Python
- PyTorch
- Gradio
- Hugging Face Transformers
- Diffusers
- Stable Diffusion XL
- Whisper
- Demucs
- Pydub
- SoundFile
- NumPy
Install the Python dependencies:
    pip install -r requirements.txt
The project also needs ffmpeg for audio conversion.
On macOS:
    brew install ffmpeg
Clone the repo and install dependencies:
    git clone https://github.com/MWill727/Sound2Scene.git
    cd Sound2Scene
    pip install -r requirements.txt
Run the app:
    python app.py
Then open the local Gradio link in your browser.
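Because audio conversion depends on ffmpeg, it can help to confirm the binary is on your PATH before launching the app. The check below is a small standalone sketch and is not part of the repository. The function name `ffmpeg_available` is hypothetical.

```python
import shutil


def ffmpeg_available(binary="ffmpeg"):
    """Return True when the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; install it before running app.py")
```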
To use the app:
- Upload an audio file
- Choose ASR mode:
  - speed uses Whisper Small
  - quality uses Whisper Large v3
- Optionally enable vocal isolation
- Click Submit
- View the generated prompt and image
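The options in these steps could be validated and resolved before any model is loaded, roughly as sketched below. This is an illustration rather than the app's code. The function name `resolve_request` and the returned dictionary shape are hypothetical, and the model identifiers are the public Hugging Face Hub names for Whisper Small and Whisper Large v3, assumed to be the checkpoints the two modes refer to.

```python
from pathlib import Path

# Input formats listed in this README.
SUPPORTED = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"}

# Assumed Hub IDs for the two ASR modes (not confirmed by the repository).
ASR_MODELS = {
    "speed": "openai/whisper-small",
    "quality": "openai/whisper-large-v3",
}


def resolve_request(audio_path, asr_mode="speed", isolate_vocals=False):
    """Check the upload and options, returning the settings for one run."""
    suffix = Path(audio_path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported audio format: {suffix or '(none)'}")
    if asr_mode not in ASR_MODELS:
        raise ValueError(f"unknown ASR mode: {asr_mode!r}")
    return {
        "audio": str(audio_path),
        "asr_model": ASR_MODELS[asr_mode],
        "isolate_vocals": bool(isolate_vocals),
    }
```

Failing early on an unsupported file or unknown mode avoids spending time on model loading for a request that cannot succeed.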
Below are a few example outputs generated by Sound2Scene.
Input audio: America - A Horse with No Name
Generated image:
Input audio: Tracy Chapman - Fast Car
Generated image:
Input audio: Adele - Daydreamer
Generated image:
Notes and limitations:
- CPU mode works, but image generation can be slow
- A GPU is recommended for faster generation
- Vocal isolation improves lyric transcription in some cases, but it also adds extra processing time
- Works best with clear vocals
- Heavy background noise can reduce transcription quality
- Instrumental or low-lyric audio may produce more abstract prompts
- The system combines pretrained models rather than training a single end-to-end model
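The CPU and GPU notes above usually translate into a device check at startup. The sketch below shows one common pattern and is not the app's confirmed behavior. The helper names are hypothetical, the fallback to CPU when PyTorch is not installed exists only so the snippet runs anywhere, and half precision on CUDA is a typical choice for diffusion models rather than something this README states.

```python
def pick_device():
    """Prefer CUDA, then Apple MPS, otherwise CPU."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def pick_dtype_name(device):
    """Half precision is a common GPU choice; full precision elsewhere."""
    return "float16" if device == "cuda" else "float32"


if __name__ == "__main__":
    device = pick_device()
    print(f"device={device}, dtype={pick_dtype_name(device)}")
```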
Possible future improvements:
- Better prompt generation from audio features
- More detailed handling of instrumental tracks
- More accurate lyric transcription
- Multiple image outputs per song
- Stronger alignment between specific lyric moments and image details
- Optional user controls for image style
License: MIT


