feat(types): add Audio class and audio field to Message for multimodal models #654

Open
Ghraven wants to merge 1 commit into ollama:main from Ghraven:feat/audio-message-field

Ghraven commented Apr 29, 2026

Summary

Adds a dedicated Audio class and audio field to Message, mirroring the existing Image pattern.

Closes #650

Motivation

Currently, audio data must be passed via the images key. This is confusing, and it leaves no clean way to support future models that accept both images and audio in the same message. This PR adds a first-class audio field so callers can pass audio data cleanly:

import ollama

ollama.chat(
    model="gemma4:e2b",
    messages=[{
        "role": "user",
        "content": "Transcribe this",
        "audio": ["recording.wav"],   # clear, not crammed into images
    }]
)

Changes

ollama/_types.py

  • Added Audio(BaseModel) class after Image, with the same serialisation logic as Image:

    • Path / bytes → the data is base64-encoded
    • str path that exists on disk → the file contents are base64-encoded
    • str with a known audio extension that doesn't exist on disk → raises ValueError with a clear message
    • str that decodes as base64 → passed through unchanged
    • Any other string → raises ValueError
    • The extension check covers: mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm
  • Added audio: Optional[Sequence[Audio]] = None field to Message (after images), with the same docstring style as images.
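For reviewers, the serialisation rules listed above can be sketched as a standalone function. This is a minimal illustration that uses only the standard library, not the code in this PR: the name encode_audio and the constant AUDIO_EXTENSIONS are hypothetical, and the actual Audio class implements the logic as a Pydantic serializer that may differ in detail (for example, in how strictly it validates base64).

```python
import base64
from pathlib import Path
from typing import Union

# Extensions listed in the PR description (the constant name is hypothetical).
AUDIO_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def encode_audio(value: Union[str, bytes, Path]) -> str:
    """Return base64 text for the given audio input, following the rules above."""
    # Raw bytes and Path objects are always base64-encoded.
    if isinstance(value, bytes):
        return base64.b64encode(value).decode()
    if isinstance(value, Path):
        return base64.b64encode(value.read_bytes()).decode()

    # A string that names an existing file is read and encoded.
    try:
        if Path(value).is_file():
            return base64.b64encode(Path(value).read_bytes()).decode()
    except (OSError, ValueError):
        # Very long base64 strings cannot be treated as paths; fall through.
        pass

    # A string that looks like an audio path but is missing gets a clear error.
    if "." in value and value.rsplit(".", 1)[-1].lower() in AUDIO_EXTENSIONS:
        raise ValueError(f"File {value} does not exist")

    # Otherwise the string must already be base64 (strict check in this sketch).
    try:
        base64.b64decode(value, validate=True)
    except ValueError:
        raise ValueError(
            "Invalid audio data, expected base64 string or path to audio file"
        ) from None
    return value
```

With this sketch, encode_audio(b"abc") and encode_audio("YWJj") both return "YWJj", while a missing path such as "no_such_recording_xyz.wav" raises ValueError rather than being sent to the server as if it were data.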

Compatibility

  • No breaking changes — audio is optional and defaults to None
  • The serialisation behaviour is consistent with Image, so the wire format is already what the Ollama server expects (raw base64)
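To make the compatibility claim concrete, here is a rough sketch of how a message payload might look with and without audio. The helper build_message is hypothetical and is not part of the library. It assumes that fields left as None are omitted from the request, and that audio entries are sent as a list of raw base64 strings under an audio key, as described above.

```python
import base64
from typing import Any, Dict, List, Optional


def build_message(
    role: str,
    content: str,
    images: Optional[List[bytes]] = None,
    audio: Optional[List[bytes]] = None,
) -> Dict[str, Any]:
    """Build a chat message dict, omitting optional fields that are None."""
    message: Dict[str, Any] = {"role": role, "content": content}
    if images is not None:
        # Existing behaviour: images are sent as raw base64 strings.
        message["images"] = [base64.b64encode(b).decode() for b in images]
    if audio is not None:
        # Assumed new behaviour: audio uses the same encoding under its own key.
        message["audio"] = [base64.b64encode(b).decode() for b in audio]
    return message


# A caller that never sets audio produces the same payload as before.
legacy = build_message("user", "hi")
with_audio = build_message("user", "Transcribe this", audio=[b"abc"])
```

Under these assumptions, existing callers are unaffected because the audio key only appears when audio is supplied.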

Development

Successfully merging this pull request may close these issues.

Add dedicated 'audio' key for multimodal models
