Sound2Scene

Live Demo: https://huggingface.co/spaces/MWill727/Sound2Scene

Sound2Scene is an end-to-end AI project that turns audio into generated visuals. It combines speech recognition, transcript processing, audio feature analysis, sentiment detection, and text-to-image generation into one pipeline.

You upload an audio file, and the app generates an image from the transcribed lyrics, their estimated mood, and the audio characteristics of the track.

What it does

At a high level:

Takes an audio file (mp3, wav, flac, ogg, m4a, or aac)
Converts it into a WAV file for processing
Optionally separates vocals using Demucs
Transcribes the audio using Whisper
Cleans and filters the transcript
Selects meaningful lyric snippets
Estimates mood using sentiment analysis
Extracts lightweight audio features such as energy, tempo feel, and tonal brightness
Builds a descriptive prompt
Generates an image using Stable Diffusion XL
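
The transcript cleanup and snippet selection steps could look roughly like the sketch below. This is not the code in app.py: the function names (clean_transcript, select_snippets), the filler-word list, and the scoring rule (prefer lines with more distinct non-filler words) are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical filler words; the real app may filter differently.
FILLER = {"oh", "ooh", "yeah", "la", "na", "uh", "mm", "hey"}


def clean_transcript(text: str) -> List[str]:
    """Split a raw transcript into normalized lines, dropping repeated lines."""
    lines: List[str] = []
    seen = set()
    for raw in re.split(r"[\n.!?;]+", text):
        # Lowercase, strip anything that is not a letter, apostrophe, or space.
        line = re.sub(r"[^a-z' ]+", " ", raw.lower())
        line = re.sub(r"\s+", " ", line).strip()
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


def select_snippets(lines: List[str], k: int = 3) -> List[str]:
    """Pick up to k lines with the most distinct non-filler words, in original order."""

    def score(line: str) -> int:
        return len({w for w in line.split() if w not in FILLER})

    # Require at least two content words so pure filler lines are ignored.
    candidates = [(score(l), i, l) for i, l in enumerate(lines) if score(l) >= 2]
    top = sorted(candidates, key=lambda c: (-c[0], c[1]))[:k]
    return [l for _, _, l in sorted(top, key=lambda c: c[1])]
```

Ranking by distinct content words is only one possible heuristic. It tends to favor verse lines over repeated hooks, which may or may not match what the app treats as "meaningful".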

Pipeline

The flow inside the app looks like this:

Audio input
Audio conversion with Pydub / ffmpeg
Optional vocal separation with Demucs
Whisper transcription through Hugging Face Transformers
Transcript cleanup and filtering
Lyric snippet selection
Sentiment analysis
Lightweight audio feature extraction
Prompt construction
Image generation with Stable Diffusion XL
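
To make the last stages more concrete, here is a minimal sketch of lightweight feature extraction feeding prompt construction. The README does not say how the features are computed, so RMS energy and zero-crossing rate are used here as simple stand-ins for "energy" and "tonal brightness" (tempo is left out). The function names, thresholds, and prompt wording are hypothetical, and the sketch uses plain Python rather than NumPy to stay dependency-free.

```python
import math
from typing import Dict, List, Sequence


def audio_features(samples: Sequence[float]) -> Dict[str, float]:
    """Compute RMS energy and zero-crossing rate for mono samples in [-1, 1]."""
    n = len(samples)
    if n < 2:
        return {"energy": 0.0, "brightness": 0.0}
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Count sign changes between neighbouring samples; more crossings
    # roughly indicates more high-frequency content.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return {"energy": rms, "brightness": crossings / (n - 1)}


def build_prompt(snippets: List[str], mood: str, feats: Dict[str, float]) -> str:
    """Combine lyric snippets, a mood label, and audio descriptors into one prompt."""
    # Thresholds are arbitrary illustration values, not tuned.
    energy = "dynamic, high-contrast" if feats["energy"] > 0.2 else "calm, soft"
    tone = "bright colors" if feats["brightness"] > 0.1 else "warm, muted colors"
    # Fall back to an abstract subject when no usable lyrics were found.
    subject = "; ".join(snippets) if snippets else "an abstract scene"
    return f"{subject}, {mood} mood, {energy}, {tone}"
```

The fallback subject mirrors the behavior described under Limitations, where instrumental or low-lyric audio leads to more abstract prompts.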

Tech stack

Python
PyTorch
Gradio
Hugging Face Transformers
Diffusers
Stable Diffusion XL
Whisper
Demucs
Pydub
SoundFile
NumPy

Requirements

Install the Python dependencies:

pip install -r requirements.txt

The project also needs ffmpeg for audio conversion.

On macOS:

brew install ffmpeg
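
Pydub relies on an external ffmpeg binary to decode formats such as mp3 or m4a, so it can help to check for it before starting the app. The helper below is a small sketch and is not part of the repository; the name ffmpeg_available is made up for this example.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg missing: install it before running the app")
```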

Running locally

Clone the repo and install dependencies:

git clone https://github.com/MWill727/Sound2Scene.git
cd Sound2Scene
pip install -r requirements.txt

Run the app:

python app.py

Then open the local URL that Gradio prints in the terminal (typically http://127.0.0.1:7860) in your browser.

Usage

Upload an audio file
Choose ASR mode:
- speed uses Whisper Small
- quality uses Whisper Large v3
Optionally enable vocal isolation
Click Submit
View the generated prompt and image
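
One way the ASR mode could map to a Whisper checkpoint is sketched below. The mode names come from the list above; the Hugging Face model IDs (openai/whisper-small, openai/whisper-large-v3) are the public checkpoints with those names and are assumed, not confirmed, to be what app.py loads. The helper names are hypothetical.

```python
# Assumed Hugging Face Hub IDs for the two Whisper checkpoints named above.
ASR_MODELS = {
    "speed": "openai/whisper-small",
    "quality": "openai/whisper-large-v3",
}


def resolve_asr_model(mode: str) -> str:
    """Map a UI mode to a model ID, rejecting unknown modes."""
    try:
        return ASR_MODELS[mode.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown ASR mode: {mode!r}") from None


def load_asr(mode: str):
    """Build a Transformers speech-recognition pipeline for the given mode."""
    # Imported lazily so the mapping above works without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=resolve_asr_model(mode))
```

Calling load_asr("quality") downloads a much larger model than load_asr("speed"), which is the trade-off the two modes expose.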

Examples

Below are a few example outputs generated by Sound2Scene.

Example 1:

Input audio: America - A Horse with No Name

Generated image: (image not reproduced here)

Example 2:

Input audio: Tracy Chapman - Fast Car

Generated image: (image not reproduced here)

Example 3:

Input audio: Adele - Daydreamer

Generated image: (image not reproduced here)

Notes on performance

CPU mode works, but image generation can be slow
A GPU is recommended for faster generation
Vocal isolation improves lyric transcription in some cases, but it also adds extra processing time
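
Choosing between CPU and GPU is usually a one-time check at startup. The sketch below shows a common PyTorch pattern for it; it is an assumption about how such a check might be written, not the logic in app.py, and pick_device is a made-up name.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend (Apple Silicon) only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can be passed to a model or pipeline's .to(...) call so that Stable Diffusion XL runs on the fastest available device.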

Limitations

Works best with clear vocals
Heavy background noise can reduce transcription quality
Instrumental or low-lyric audio may produce more abstract prompts
The system combines pretrained models rather than training a single end-to-end model

Future work

Better prompt generation from audio features
More detailed handling of instrumental tracks
More accurate lyric transcription
Multiple image outputs per song
Stronger alignment between specific lyric moments and image details
Optional user controls for image style

License

MIT
