- Transcription — what was said. OpenAI Whisper API compatible.
- Audio understanding — everything else. Tone, music, SFX, language, speaker emotion. Custom Kyma endpoint.
Both endpoints take a multipart/form-data upload and accept mp3, wav, m4a, ogg, webm, and flac files up to 25 MB (about 30 min of mono 16 kHz mp3).
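As a rough illustration of the upload described above, the sketch below checks a file against the stated format and size limits, then wraps it in a multipart/form-data POST using only the Python standard library. Several things here are assumptions rather than documented behaviour: the `BASE_URL` placeholder host, bearer-token authentication, the helper name `build_upload_request`, and reading "25 MB" as 25,000,000 bytes. The `file` and `model` field names follow the OpenAI Whisper API convention.

```python
import uuid
import urllib.request
from pathlib import Path

# Formats and size limit quoted in the text above.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".webm", ".flac"}
MAX_BYTES = 25_000_000  # assumption: conservative reading of "25 MB"

# Assumption: placeholder host, not the real base URL.
BASE_URL = "https://api.example.com"


def build_upload_request(path, endpoint, fields, api_key):
    """Validate an audio file and wrap it in a multipart/form-data POST.

    The request is only built here, not sent.
    """
    audio = Path(path)
    if audio.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {audio.suffix}")
    data = audio.read_bytes()
    if len(data) > MAX_BYTES:
        raise ValueError(f"file is {len(data)} bytes, over the 25 MB limit")

    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields first (for example the model name).
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # Then the audio file itself.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{audio.name}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
    )
    parts.append(data)
    parts.append(f"\r\n--{boundary}--\r\n".encode())

    return urllib.request.Request(
        BASE_URL + endpoint,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. The same helper should cover both endpoints, for example `build_upload_request("talk.mp3", "/v1/audio/transcriptions", {"model": "transcribe"}, key)`. For `/v1/audio/understand` you would add the question as an extra field, but its field name is not given here, so check the endpoint reference before relying on one.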
Pick a model
| Model | Endpoint | Best for | Cost / min | Min billable |
|---|---|---|---|---|
| whisper-v3-turbo | /v1/audio/transcriptions | Transcripts, captions, voice agents | $0.0009 | 1 min |
| gemini-3-flash-audio | /v1/audio/understand | Tone, music, SFX, mood, language | $0.000648 | 1 min |
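To make the pricing columns concrete, here is a small cost estimator built from the rates and the 1-minute minimum in the table above. The rounding rule is an assumption: the sketch bills clips under one minute as a full minute and bills longer clips pro rata by the second, which may not match how usage is actually metered. The function name `estimate_cost` is illustrative.

```python
from decimal import Decimal

# Per-minute rates (USD) and minimum billable duration from the table above.
RATES = {
    "whisper-v3-turbo": Decimal("0.0009"),
    "gemini-3-flash-audio": Decimal("0.000648"),
}
MIN_BILLABLE_MINUTES = Decimal(1)


def estimate_cost(model, duration_seconds):
    """Estimated USD cost of one request.

    Assumption: short clips are billed as one minute, longer clips pro rata.
    The real rounding rule is not stated in the pricing table.
    """
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return RATES[model] * max(minutes, MIN_BILLABLE_MINUTES)
```

Under these assumptions a 20-second voice command and a 60-second one cost the same on either model, so batching very short clips into one file may be cheaper than sending them one by one.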
Aliases
Use these in the model field instead of pinning a specific SKU. Aliases auto-track the current best model: when a faster Whisper or a stronger Gemini lands, you don't change your code.
| Alias | Resolves to |
|---|---|
| transcribe | whisper-v3-turbo |
| audio-understand | gemini-3-flash-audio |
whisper-v3-turbo
Speech-to-text at 228x realtime inference speed. Returns transcripts with per-segment timestamps and the detected language. OpenAI Whisper API compatible.
- Cost: $0.0009 / min
- Best for: meeting notes, voice agent input, podcast captions
- Output: text + segments with timestamps
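For the captions use case, the per-segment timestamps can be turned into an SRT file with a few lines of code. The sketch below assumes the response follows the OpenAI verbose JSON shape, where `segments` is a list of objects with `start` and `end` in seconds and a `text` string. That shape is inferred from the stated Whisper API compatibility, not confirmed here, and the helper names are illustrative.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Convert a list of {"start", "end", "text"} segments into SRT text.

    Assumption: segment keys match the OpenAI verbose JSON response shape.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Blocks are separated by a blank line, as SRT expects.
    return "\n".join(blocks)
```

Given a parsed response dict named `response`, `segments_to_srt(response["segments"])` would return text that can be written straight to a `.srt` file.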
gemini-3-flash-audio
Audio understanding. Hears tone, music, SFX, language, and speaker emotion: the things a transcript drops on the floor. Ask a free-form question, get a natural-language answer.
- Cost: $0.000648 / min
- Best for: mood/tone analysis, music recognition, scene understanding
- Output: free-form answer to your question
Use them together — the decomposition trick
A video is just frames + audio. You almost never need a multimodal LLM on the full video, because each piece has a fast, near-free tool.
watch-cli is an open-source orchestrator built on exactly this pattern: install it, point it at any social video URL, and your agent gets back frames + transcript + audio scene.
| Approach | Cost / 1-hour video | Time |
|---|---|---|
| Multimodal LLM on full video | ~$5 | 30-60s |
| Decompose with Kyma audio | < $0.10 | ~10-15s |
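The "< $0.10" figure can be checked against the per-minute rates listed for the two models. The arithmetic below runs both audio models over the full 60 minutes. It assumes frame extraction happens locally at no API cost, which is not stated in the table.

```python
from decimal import Decimal

MINUTES = 60  # one-hour video

transcription = Decimal("0.0009") * MINUTES    # whisper-v3-turbo rate per minute
understanding = Decimal("0.000648") * MINUTES  # gemini-3-flash-audio rate per minute
audio_total = transcription + understanding

print(f"transcription: ${transcription}")
print(f"understanding: ${understanding}")
print(f"audio total:   ${audio_total}")
```

The total comes to about $0.093 for the hour, which is consistent with the table's upper bound.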
See also
- POST /v1/audio/transcriptions - endpoint reference for transcribe
- POST /v1/audio/understand - endpoint reference for audio Q&A
- Model aliases - why aliases stay stable when SKUs change