Audio Understand
Audio scene Q&A. Hears tone, music, SFX, language, speaker emotion. Custom Kyma endpoint, no OpenAI equivalent.
Synchronous POST endpoint for audio understanding beyond transcription. Upload a clip and ask a question to get back a natural-language answer that captures the parts a transcript loses: tone, mood, music style, ambient SFX, speaker emotion, language.
Billed per minute, rounded up (a 30-second clip costs 1 minute = 0.039).
When to use this vs transcribe
| Question | Endpoint |
|---|---|
| What words are spoken? | /v1/audio/transcriptions |
| Is the speaker angry, calm, tired? | this endpoint |
| What kind of music is playing? | this endpoint |
| Are there sirens, applause, traffic in the background? | this endpoint |
| What language is being spoken? | either |
A good rule of thumb: if the answer can be reconstructed from transcript text, use transcribe. If it depends on how something sounds, use audio-understand.
Request
multipart/form-data upload.
- file: Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB inline (~30 minutes of mono 16kHz mp3).
- question: Free-form text question about the audio. Be specific about what you want: “What is the mood?” is fine, but “Describe the music style, BPM, and any background SFX” gets more useful answers.
- model: Either the alias audio-understand (recommended) or a pinned SKU like gemini-3-flash-audio. See Audio models.
- duration_sec: Optional duration hint in seconds. When supplied, billing rounds up from this exact value. When omitted, Kyma estimates duration from file size (assumes 32 kbps mp3), which can over-estimate for high-bitrate inputs. Pass the real duration when you have it: that’s what ffprobe gives you in one line.
Response
200 OK with the answer text and a Kyma billing block. The response carries:
- The model’s answer to your question.
- The Kyma model SKU that served the request.
- How the billed duration was determined: either caller_hint (you passed duration_sec) or size_estimate (Kyma estimated from file bytes).
- Final cost charged for this request.
Pricing
| Model | Per minute |
|---|---|
| gemini-3-flash-audio | $0.000648 |
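The rounding rule ("billed per minute, rounded up") can be sketched as below. This is a minimal illustration, not Kyma's billing code: the function names are hypothetical, and the per-minute rate is left as a caller-supplied parameter rather than hard-coded. Rejecting non-positive durations is also an assumption of this sketch.

```python
import math


def billable_minutes(duration_sec: float) -> int:
    """Round a clip duration up to whole minutes, per the billing note."""
    if duration_sec <= 0:
        # Assumption: the docs do not say how zero-length clips are billed.
        raise ValueError("duration_sec must be positive")
    return math.ceil(duration_sec / 60)


def estimate_cost(duration_sec: float, rate_per_minute: float) -> float:
    """Billable minutes multiplied by a per-minute rate supplied by the caller."""
    return billable_minutes(duration_sec) * rate_per_minute


if __name__ == "__main__":
    for seconds in (30, 60, 61, 185):
        print(f"{seconds} s -> {billable_minutes(seconds)} billable minute(s)")
```

A 30-second clip and a 60-second clip both come out at 1 billable minute, while 61 seconds crosses into 2.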
Errors
| Status | error.code | When |
|---|---|---|
| 400 | invalid_request | Missing file or question field |
| 400 | not_an_audio_model | model is not an audio-understanding SKU |
| 401 | auth_error | Missing or invalid API key |
| 402 | billing_error | Insufficient credits |
| 404 | not_enabled | Audio gate not enabled on this account |
| 413 | invalid_request | File > 25 MB |
| 502 | provider_error | Upstream call failed |
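One way a client might react to the errors in this table is sketched below. It assumes the error body is JSON shaped like {"error": {"code": "..."}}, which is inferred only from the error.code column header. The action labels and the function name are illustrative and not part of the API.

```python
import json

# Suggested client-side reactions; these labels are illustrative only.
ACTIONS = {
    (400, "invalid_request"): "fix_request",
    (400, "not_an_audio_model"): "change_model",
    (401, "auth_error"): "check_api_key",
    (402, "billing_error"): "add_credits",
    (404, "not_enabled"): "enable_audio",
    (413, "invalid_request"): "shrink_file",
    (502, "provider_error"): "retry",
}


def classify_error(status: int, body_text: str) -> str:
    """Map an HTTP status plus error.code to a suggested action, or "unknown"."""
    try:
        # Assumed body shape: {"error": {"code": "<error.code>"}}
        code = json.loads(body_text)["error"]["code"]
        return ACTIONS.get((status, code), "unknown")
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not the assumed shape.
        return "unknown"
```

Note that invalid_request appears under both 400 and 413, so the sketch keys on the status and the code together. Only 502 is treated as worth retrying here. The other rows point at something the caller has to change first.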
Examples
Mood and music brief for a video clip
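A minimal sketch of this request using only the Python standard library. Several details are assumptions, since this page does not state them: the endpoint URL is a placeholder, the Bearer authorization scheme is a guess, and the helper name is hypothetical. The form field names (file, question, model, duration_sec) are the ones referenced elsewhere on this page.

```python
import urllib.request
import uuid

# Placeholder URL: the real host and path are not given on this page.
ENDPOINT_URL = "https://api.example.com/v1/audio/understand"


def build_understand_request(api_key, audio_bytes, filename, question,
                             model="audio-understand", duration_sec=None,
                             url=ENDPOINT_URL):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    fields = {"question": question, "model": model}
    if duration_sec is not None:
        fields["duration_sec"] = str(duration_sec)

    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The audio file goes last, as raw bytes.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
    )
    parts.append(audio_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    with open("clip.mp3", "rb") as fh:
        req = build_understand_request(
            "YOUR_API_KEY", fh.read(), "clip.mp3",
            "Describe the overall mood, the music style, and any background SFX.",
        )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))  # raw JSON body
```

The question asks for mood, music style and SFX in one go, following the advice in the Request section that specific questions get more useful answers.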
Speaker emotion check
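A sketch of preparing the text fields for an emotion question, with a pre-flight check against the format and size limits from the Request section. The helper name is hypothetical, and treating "25 MB" as 25 × 1024 × 1024 bytes is an assumption. The audio file itself still has to be attached as the file part of the multipart upload.

```python
import os

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".webm", ".flac"}
MAX_INLINE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB


def emotion_check_fields(filename, size_bytes, model="audio-understand"):
    """Validate the clip locally, then return the text form fields to send."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_INLINE_BYTES:
        raise ValueError("file exceeds the 25 MB inline limit")
    return {
        "question": (
            "Is the speaker angry, calm, or tired? "
            "Describe their tone and how it changes over the clip."
        ),
        "model": model,
    }
```

Checking locally avoids a round trip that would end in a 413 for an oversized file.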
Pass exact duration for accurate billing
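The sketch below shows why the hint matters. It reads the real duration with ffprobe (which must be installed and on the PATH) and compares it with a size-based estimate at 32 kbps. The estimate formula is an assumption about how a file-size estimate of this kind would work, not Kyma's actual code, and both function names are hypothetical.

```python
import subprocess

ASSUMED_BITRATE_BPS = 32_000  # 32 kbps, the bitrate the size estimate assumes


def size_estimate_sec(num_bytes: int, bitrate_bps: int = ASSUMED_BITRATE_BPS) -> float:
    """Estimate duration from file size, assuming a constant bitrate."""
    return num_bytes * 8 / bitrate_bps


def probe_duration_sec(path: str) -> float:
    """Read the container duration in seconds using ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


if __name__ == "__main__":
    # A 60-second clip encoded at 128 kbps is about 960,000 bytes.
    print(size_estimate_sec(960_000))  # 240.0 seconds under the 32 kbps assumption
```

Under that assumption, a one-minute 128 kbps clip would be estimated at four minutes. Sending the value from probe_duration_sec as duration_sec keeps billing tied to the real length.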
See also
- Audio Transcriptions - speech-to-text
- Audio models - SKUs behind the audio-understand alias
- watch-cli - open-source CLI that pairs this endpoint with transcribe to give agents full audio understanding from any social video URL