
Sound2Scene

Live Demo: https://huggingface.co/spaces/MWill727/Sound2Scene

Sound2Scene is an end-to-end AI project that turns audio into generated visuals. It combines speech recognition, transcript processing, audio feature analysis, sentiment detection, and text-to-image generation into one pipeline.

You upload an audio file, and the app generates an image from two signals: the words it transcribes and the way the track sounds (mood, energy, and tone).


What it does

At a high level:

  1. Takes an audio file (mp3, wav, flac, ogg, m4a, or aac)
  2. Converts it into a WAV file for processing
  3. Optionally separates vocals using Demucs
  4. Transcribes the audio using Whisper
  5. Cleans and filters the transcript
  6. Selects meaningful lyric snippets
  7. Estimates mood using sentiment analysis
  8. Extracts lightweight audio features such as energy, tempo feel, and tonal brightness
  9. Builds a descriptive prompt
  10. Generates an image using Stable Diffusion XL
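
Steps 5 and 6 (transcript cleanup and snippet selection) can be illustrated with a small sketch. This is not the project's actual code: the filler-word list, the three-word minimum, and the "most distinct words" scoring are all assumptions chosen only to show the idea, and the function names are hypothetical.

```python
import re

# Hypothetical filler tokens to ignore; not taken from the project.
FILLER = {"oh", "ooh", "yeah", "la", "na", "uh", "mm", "hey"}


def _content_words(line: str) -> list[str]:
    """Lowercased words of a line with filler tokens removed."""
    words = re.findall(r"[a-z']+", line.lower())
    return [w for w in words if w not in FILLER]


def clean_transcript(text: str) -> list[str]:
    """Split a raw transcript into normalized lines, dropping
    repeats and lines with fewer than three content words."""
    lines, seen = [], set()
    for raw in re.split(r"[\n.!?]+", text):
        line = re.sub(r"\s+", " ", raw).strip()
        key = line.lower()
        if len(_content_words(line)) < 3 or key in seen:
            continue
        seen.add(key)
        lines.append(line)
    return lines


def select_snippets(lines: list[str], k: int = 3) -> list[str]:
    """Keep the k lines with the most distinct content words,
    returned in their original order (ties favor earlier lines)."""
    def score(i: int) -> int:
        return len(set(_content_words(lines[i])))

    best = sorted(range(len(lines)), key=lambda i: (-score(i), i))[:k]
    return [lines[i] for i in sorted(best)]
```

Calling `select_snippets(clean_transcript(text))` on a Whisper transcript would yield a short list of lines to feed into the prompt. A real implementation might weight lines by sentiment or position in the song instead.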

Pipeline

The flow inside the app looks like this:

  • Audio input
  • Audio conversion with Pydub / ffmpeg
  • Optional vocal separation with Demucs
  • Whisper transcription through Hugging Face Transformers
  • Transcript cleanup and filtering
  • Lyric snippet selection
  • Sentiment analysis
  • Lightweight audio feature extraction
  • Prompt construction
  • Image generation with Stable Diffusion XL
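
The feature extraction and prompt construction stages might look roughly like the sketch below. It uses only the standard library so it runs anywhere; the project itself lists NumPy and SoundFile, which would be the more natural tools. The choice of RMS energy and zero-crossing rate as features, the thresholds, and the descriptor wording are assumptions for illustration, not the app's real logic.

```python
import math


def rms_energy(samples: list[float]) -> float:
    """Root-mean-square amplitude of samples scaled to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: list[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ.
    Higher values loosely indicate a brighter, noisier sound."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def build_prompt(snippets: list[str], mood: str,
                 energy: float, brightness: float) -> str:
    """Combine lyric snippets, mood, and audio descriptors into one prompt.
    The 0.3 and 0.1 thresholds are illustrative, not tuned values."""
    energy_words = "dynamic, high-contrast" if energy > 0.3 else "calm, soft"
    tone_words = "bright, vivid colors" if brightness > 0.1 else "warm, muted colors"
    subject = "; ".join(snippets) if snippets else "an abstract scene"
    return f"{subject}, {mood} mood, {energy_words}, {tone_words}"
```

With no usable lyrics, `build_prompt` falls back to "an abstract scene", which mirrors the behavior described under Limitations for instrumental tracks.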

Tech stack

  • Python
  • PyTorch
  • Gradio
  • Hugging Face Transformers
  • Diffusers
  • Stable Diffusion XL
  • Whisper
  • Demucs
  • Pydub
  • SoundFile
  • NumPy

Requirements

Install the Python dependencies:

pip install -r requirements.txt

The project also needs ffmpeg for audio conversion.

On macOS:

brew install ffmpeg
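
To confirm that ffmpeg is visible before launching the app, a quick check like the following can help. It is a standalone sketch, not part of the repository, and `ffmpeg_available` is a hypothetical helper name.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("ffmpeg found" if ffmpeg_available() else "ffmpeg not found on PATH")
```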

Running locally

Clone the repo and install dependencies:

git clone https://github.com/MWill727/Sound2Scene.git
cd Sound2Scene
pip install -r requirements.txt

Run the app:

python app.py

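Then open the local URL that Gradio prints in the terminal (http://127.0.0.1:7860 by default, unless the app is configured otherwise) in your browser.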


Usage

  1. Upload an audio file
  2. Choose ASR mode:
    • speed uses Whisper Small
    • quality uses Whisper Large v3
  3. Optionally enable vocal isolation
  4. Click Submit
  5. View the generated prompt and image
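
The two ASR modes could map onto Whisper checkpoints as sketched below. The README only names "Whisper Small" and "Whisper Large v3"; the exact Hugging Face Hub IDs used here (`openai/whisper-small`, `openai/whisper-large-v3`) and the helper name `resolve_asr_model` are assumptions.

```python
# Assumed Hub IDs for the two Whisper sizes named in the Usage steps.
ASR_MODELS = {
    "speed": "openai/whisper-small",
    "quality": "openai/whisper-large-v3",
}


def resolve_asr_model(mode: str) -> str:
    """Map a UI mode name to a model ID, rejecting unknown modes."""
    key = mode.strip().lower()
    if key not in ASR_MODELS:
        valid = ", ".join(sorted(ASR_MODELS))
        raise ValueError(f"Unknown ASR mode {mode!r}; expected one of: {valid}")
    return ASR_MODELS[key]
```

The returned ID could then be passed to the Transformers `pipeline("automatic-speech-recognition", model=...)` helper, which downloads the checkpoint on first use.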

Examples

Below are a few example outputs generated by Sound2Scene.

Example 1:

Input audio: America - A Horse with No Name

Generated image:

Example 1 - Desert scene


Example 2:

Input audio: Tracy Chapman - Fast Car

Generated image:

Example 2 - Urban night scene


Example 3:

Input audio: Adele - Daydreamer

Generated image:

Example 3 - Dreamlike emotional scene

Notes on performance

  • CPU mode works, but image generation can be slow
  • A GPU is recommended for faster generation
  • Vocal isolation improves lyric transcription in some cases, but it also adds extra processing time
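
One common way to prefer a GPU and fall back to CPU is sketched below. Whether the app selects its device this way is an assumption; `pick_device` is a hypothetical helper, and the function returns "cpu" if PyTorch is not installed so the snippet runs in any environment.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # PyTorch missing: nothing to accelerate, report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon backend; guarded because older builds lack it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can be passed to a model's or pipeline's `.to(device)` call.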

Limitations

  • Works best with clear vocals
  • Heavy background noise can reduce transcription quality
  • Instrumental or low-lyric audio may produce more abstract prompts
  • The system combines pretrained models rather than training a single end-to-end model

Future work

  • Better prompt generation from audio features
  • More detailed handling of instrumental tracks
  • More accurate lyric transcription
  • Multiple image outputs per song
  • Stronger alignment between specific lyric moments and image details
  • Optional user controls for image style

License

MIT
