Audio Transcriptions
Speech-to-text. Two input modes — multipart file (25 MB, OpenAI Whisper compatible) or JSON audio_url (100 MB, Kyma extension). Synchronous, billed per minute.
POST
Synchronous endpoint. Send audio in, get the transcript back in one call. Two input modes:
- Mode 1: File upload (multipart/form-data, up to 25 MB). Drop-in OpenAI Whisper replacement.
- Mode 2: URL fetch (application/json, audio_url up to 100 MB, https only). Kyma extension: pass a public URL and Kyma fetches the bytes upstream, so you don’t proxy them through your client.
Billed per minute, rounded up (a 5-second clip costs 1 minute). Failover transcriptions are billed at the same rate as the primary.
If the primary model has a transient hiccup, the request transparently completes via a secondary or tertiary tier, and the X-Kyma-Fallback-Layer response header reports which tier served.
Choosing a mode
| | Mode 1: multipart | Mode 2: JSON audio_url |
|---|---|---|
| Content-Type | multipart/form-data | application/json |
| Source field | file (binary part) | audio_url (https string) |
| Max size | 25 MB | 100 MB |
| OpenAI SDK compatible | Yes | No — use raw HTTP |
| Never-die failover | Full chain (timestamp + plain-text tiers) | Retry + timestamp-preserving secondary |
| Best for | Local files, recordings, short clips | Cloud-hosted media, podcasts, long-form |
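The table reduces to a small decision rule. The sketch below is one way to encode it client-side; choose_mode is a hypothetical helper, and it assumes decimal megabytes because the limits above do not say whether MB means 10^6 or 2^20 bytes.

```python
from typing import Optional

MB = 1_000_000  # assumption: decimal megabytes; the limits above do not say which
MULTIPART_LIMIT = 25 * MB   # Mode 1 cap
AUDIO_URL_LIMIT = 100 * MB  # Mode 2 cap


def choose_mode(size_bytes: int, audio_url: Optional[str] = None) -> str:
    """Return "multipart" or "audio_url" for one audio source (hypothetical helper)."""
    if size_bytes > AUDIO_URL_LIMIT:
        raise ValueError("audio is larger than the 100 MB cap of Mode 2")
    if audio_url is not None:
        # Already hosted: let the service fetch it instead of proxying the bytes.
        if not audio_url.startswith("https://"):
            raise ValueError("audio_url must be https://")
        return "audio_url"
    if size_bytes > MULTIPART_LIMIT:
        raise ValueError("over 25 MB: host the file at an https URL and use Mode 2")
    return "multipart"
```

A local 5 MB recording resolves to "multipart"; a 60 MB podcast episode already hosted on an https URL resolves to "audio_url".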
Mode 2 is a Kyma extension. It is not part of the OpenAI Whisper API. The OpenAI Python and Node SDKs only support multipart upload; for audio_url mode, use requests (Python), fetch (Node), or curl.
Request
Both modes accept the same set of model / language / format parameters.
Mode 1: multipart upload
- file (required): Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB. ~30 minutes of mono 16 kHz mp3 fits comfortably.
- model: Choose by alias or SKU:
  - transcribe (default) → whisper-v3-turbo: fastest and cheapest, $0.0009/min. Best for high-volume bulk transcription.
  - transcribe-quality → gpt-4o-mini-transcribe-2025-12-15: premium accuracy on noisy / conversational / code-switching audio (Vi/En mixing), $0.00405/min.
  - Or pin a specific SKU directly (e.g. whisper-v3-turbo, gpt-4o-mini-transcribe-2025-12-15). See Audio models and Model aliases.
- language: ISO-639-1 code (e.g. en, vi, ja). Optional; both providers auto-detect when omitted. Supplying it improves accuracy on short clips.
- response_format: One of json, verbose_json, text, srt, vtt. JSON formats embed a billing block in the response body. text returns the bare transcript and srt / vtt return subtitle files; for those three, billing rides on X-Kyma-* response headers so the body stays a clean transcript or subtitle file.
- temperature: Sampling temperature 0–1. Default 0 (deterministic).
- prompt: Optional priming text. Use it to nudge the model toward known proper nouns, acronyms, or domain vocabulary in your audio.
Mode 2: JSON audio_url
- audio_url (required): Public HTTPS URL of the audio file. Kyma fetches the bytes upstream, so there is no need to download and re-upload from your client. Max 100 MB. http:// and other schemes are rejected to prevent SSRF / mixed-content. Supports mp3, wav, m4a, ogg, webm, flac.
- model: Same as Mode 1.
- language: Same as Mode 1.
- response_format: Same as Mode 1. Mode 2 still benefits from never-die failover (retry + the timestamp-preserving secondary); the plain-text tertiary tier is skipped in URL mode. See Automatic failover.
- temperature: Same as Mode 1.
- prompt: Same as Mode 1.
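Because non-https URLs are rejected with audio_url_invalid_scheme, it can be cheaper to catch them before sending. A minimal client-side sketch, assuming a hypothetical helper named check_audio_url; the server-side check remains authoritative.

```python
from urllib.parse import urlparse


def check_audio_url(audio_url: str) -> str:
    """Raise ValueError unless audio_url is an absolute https:// URL; return it unchanged."""
    parsed = urlparse(audio_url)
    if parsed.scheme != "https":
        raise ValueError(
            f"audio_url must use https://, got {parsed.scheme or 'no scheme'!r}"
        )
    if not parsed.netloc:
        raise ValueError("audio_url has no host")
    return audio_url
```

This only mirrors the scheme rule. It does not verify that the URL is publicly reachable or that the file is under 100 MB.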
Response
200 OK with the transcript and a Kyma billing block.
- text: The full transcript.
- language: Detected language (full name, e.g. "English").
- duration: Audio duration in seconds (decoded from the file, not estimated). 0 only when the plain-text tertiary tier (gemini-3-flash-audio) served the request, which does not return duration.
- segments: Per-segment timestamps and text. Present when response_format is verbose_json and a timestamp-preserving model served the request, that is, the primary (whisper-v3-turbo) or the secondary failover (whisper-1). Absent only on the plain-text tertiary tier.
The billing block reports:
- The Kyma model SKU that served the request.
- Minutes charged. Audio is billed in 1-minute increments, rounded up.
- Final cost charged for this request.
- Remaining balance after this charge.
Non-JSON formats
When response_format is text, srt, or vtt, the body is a plain transcript or subtitle file (no JSON envelope) and billing comes back on response headers (see below).
srt returns a SubRip subtitle file (application/x-subrip; charset=utf-8); vtt returns a WebVTT file (text/vtt; charset=utf-8). Both are built from the same per-segment timestamps verbose_json exposes, so the timing matches across formats.
Response headers
Returned on every 200 response:
| Header | Meaning |
|---|---|
| X-Kyma-Model | The model SKU that served the request (e.g. whisper-v3-turbo) |
| X-Kyma-Duration-Sec | Detected audio duration in seconds (0 only when the plain-text tertiary tier served) |
| X-Kyma-Billable-Minutes | Minutes charged |
| X-Kyma-Cost-USD | Final cost in USD |
| X-Kyma-Balance-USD | Remaining account balance |
| X-Kyma-Fallback | Failover model SKU that served (e.g. whisper-1, gemini-3-flash-audio). Present only when a secondary or tertiary model served; absent on the primary path and on a bare primary retry. |
| X-Kyma-Fallback-Layer | Numeric failover tier that served: 1 retry, 2 secondary (whisper-1), 3 tertiary (gemini-3-flash-audio). Absent when the primary served on the first try. |
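For text, srt, and vtt responses these headers are the only place billing appears, so it helps to parse them into one typed object. A sketch, assuming the values arrive as plain decimal strings (the table does not show the exact formatting); KymaBilling and parse_billing_headers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class KymaBilling:
    model: str
    duration_sec: float
    billable_minutes: int
    cost_usd: float
    balance_usd: float
    fallback: Optional[str] = None        # absent when the primary served
    fallback_layer: Optional[int] = None  # 1 retry, 2 secondary, 3 tertiary


def parse_billing_headers(headers: Mapping[str, str]) -> KymaBilling:
    """Collect the X-Kyma-* headers of a 200 response into one object."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    layer = h.get("x-kyma-fallback-layer")
    return KymaBilling(
        model=h["x-kyma-model"],
        duration_sec=float(h["x-kyma-duration-sec"]),
        billable_minutes=int(float(h["x-kyma-billable-minutes"])),
        cost_usd=float(h["x-kyma-cost-usd"]),
        balance_usd=float(h["x-kyma-balance-usd"]),
        fallback=h.get("x-kyma-fallback"),
        fallback_layer=int(layer) if layer is not None else None,
    )
```

A fallback_layer of None means the primary served on the first try; 1 means a retry of the primary succeeded, in which case fallback stays None as well.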
Automatic failover (never-die STT)
Transcription is never-die. On a transient hiccup with the primary model (default transcribe alias / whisper-v3-turbo SKU), Kyma transparently works through a failover chain so your request still completes: same request, same price, same response shape. You don’t opt in; it’s on by default.
Why it matters. A single transcription request is normally a single point of failure — one transient hiccup and the pipeline built on top of it (a dub, a caption job, a voice agent) dies, leaving you to write retry and fallback logic yourself. Kyma absorbs that for you: the request either completes or fails honestly — never silently mangled. You ship the feature, not the plumbing.
The chain is format-aware, because only some models preserve per-segment timestamps:
- Retry the primary once (short backoff). Absorbs the vast majority of transient blips, which recover immediately. Nothing in your response changes.
- Secondary: whisper-1. A timestamp-preserving model. Because it returns real per-segment timestamps, it serves every caller, including verbose_json, srt, and vtt.
- Tertiary: gemini-3-flash-audio. A plain-text transcription tier. Used only for plain-text transcripts (text / json). It does not return segment timestamps, so callers that need them (verbose_json / srt / vtt) never route here. URL mode (audio_url) also skips this tier (it needs inline bytes).
A client-side error (400 / 413 / 415 / 422) is surfaced immediately with no failover; failover is a reliability tool, not an error-hiding one. Rate-limit responses (429) are forwarded verbatim with a Retry-After header so your client can back off cleanly.
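The format-aware routing for the default transcribe path can be written down as a small function. This restates the rules above for reasoning about which tiers a given request can reach; it is not Kyma's implementation, and failover_tiers is a hypothetical name.

```python
from typing import List

# Formats that need per-segment timestamps, per the chain description above.
TIMESTAMP_FORMATS = {"verbose_json", "srt", "vtt"}


def failover_tiers(response_format: str, url_mode: bool = False) -> List[str]:
    """Ordered failover tiers a default-path request is eligible for."""
    tiers = [
        "retry whisper-v3-turbo",  # layer 1: retry the primary once
        "whisper-1",               # layer 2: timestamp-preserving secondary
    ]
    # Layer 3 returns plain text only and needs inline bytes, so it is skipped
    # for timestamped formats and for audio_url requests.
    if response_format not in TIMESTAMP_FORMATS and not url_mode:
        tiers.append("gemini-3-flash-audio")
    return tiers
```

A multipart request with response_format text can reach all three tiers, while an srt request or any audio_url request stops after whisper-1.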
You can see exactly which tier served via response metadata:
- X-Kyma-Fallback-Layer header: 1 retry, 2 secondary, 3 tertiary (absent when the primary served first try).
- X-Kyma-Fallback header / billing.fallback field: the secondary model SKU that served (when a secondary served).
- segments is present whenever a timestamp-preserving model served (primary or whisper-1); absent only on the plain-text tertiary tier, where duration is also 0 (billed as the 1-minute minimum).
The transcribe-quality alias (gpt-4o-mini-transcribe-2025-12-15) opts out of the chain: you chose that model for accuracy, so Kyma won’t silently swap it. Its errors surface as 502 transcription_failed.
Pricing
| Alias | SKU | Per minute | 1-hour file |
|---|---|---|---|
| transcribe (default) | whisper-v3-turbo | $0.0009 | $0.054 |
| transcribe-quality | gpt-4o-mini-transcribe-2025-12-15 | $0.00405 | $0.243 |
Failover transcriptions are billed at the same rate as the primary transcribe path: it doesn’t matter which tier ultimately served, you always pay the whisper-v3-turbo price. The Quality tier does not fall back (your choice of model is respected).
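The per-minute rounding and the 1-hour column can be checked with a few lines of arithmetic. A sketch using the rates from the table; the function names are hypothetical, and Decimal is used so the sub-cent amounts stay exact.

```python
import math
from decimal import Decimal

# Per-minute rates copied from the pricing table.
RATES = {
    "whisper-v3-turbo": Decimal("0.0009"),
    "gpt-4o-mini-transcribe-2025-12-15": Decimal("0.00405"),
}


def billable_minutes(duration_sec: float) -> int:
    """Round up to whole minutes, with a 1-minute minimum."""
    return max(1, math.ceil(duration_sec / 60))


def estimate_cost_usd(duration_sec: float, sku: str = "whisper-v3-turbo") -> Decimal:
    """Estimated charge for one request at the table rate for sku."""
    return RATES[sku] * billable_minutes(duration_sec)
```

A 5-second clip bills as 1 minute ($0.0009 on the default SKU), and a 3600-second file bills as 60 minutes ($0.054), matching the table. This is an estimate; the response's own cost fields are the source of truth.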
Errors
| Status | error.type | error.code | When |
|---|---|---|---|
| 400 | invalid_request | - | Missing file (Mode 1) or invalid JSON body (Mode 2) |
| 400 | invalid_request | audio_url_required | Mode 2 body without audio_url field |
| 400 | invalid_request | audio_url_invalid_scheme | audio_url not https:// (SSRF protection) |
| 400 | invalid_request | not_a_transcription_model | model is not a transcription SKU |
| 401 | auth_error | - | Missing or invalid API key |
| 402 | billing_error | insufficient_credits | Balance too low |
| 404 | not_enabled | - | Audio gate not enabled on this account |
| 413 | invalid_request | - | Multipart body > 25 MB. For larger files, use Mode 2 (audio_url up to 100 MB). |
| 429 | rate_limit_error | concurrent_limit_exceeded | Audio concurrency cap reached. See Rate Limits. |
| 502 | provider_error | transcription_failed | The never-die chain was exhausted (every eligible tier failed). The error message is provider-isolated; no upstream identity is leaked. |
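Of the rows above, only 429 is worth retrying automatically, since it arrives with a Retry-After header. A minimal backoff sketch, assuming Retry-After carries a number of seconds rather than an HTTP date; call_with_backoff and its send callable are hypothetical, and leaving 502 unretried is a policy choice of this sketch rather than documented guidance.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on 429, wait for Retry-After seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        # Anything other than 429 (success or a caller error) is returned as-is.
        if status != 429 or attempt == max_attempts:
            break
        # Assumption: Retry-After is delta-seconds; fall back to 1 second.
        sleep(float(headers.get("Retry-After", "1")))
    return status, headers, body
```

The sleep parameter exists so the loop can be exercised without real waiting.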
Examples
Pin a specific model
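A sketch of a Mode 1 upload that pins the whisper-v3-turbo SKU instead of using an alias. The endpoint URL is a placeholder because this page does not state the path, and bearer-token auth is an assumption carried over from OpenAI compatibility; the post parameter is only there so the function can be tried without the network.

```python
from typing import Any, BinaryIO, Callable, Optional

# Hypothetical endpoint URL: substitute the real one for your account.
API_URL = "https://api.example.com/v1/audio/transcriptions"


def transcribe_pinned(
    audio: BinaryIO,
    api_key: str,
    model: str = "whisper-v3-turbo",
    post: Optional[Callable[..., Any]] = None,
) -> dict:
    """Upload an open audio file (Mode 1) with a pinned SKU; return the JSON body."""
    if post is None:
        import requests  # third-party (pip install requests); imported only when needed
        post = requests.post
    resp = post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # assumption: bearer auth
        files={"file": audio},                           # binary part named "file"
        data={"model": model, "response_format": "json"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()
```

Typical use is to open the file in binary mode, for example with open("meeting.mp3", "rb") as f, and pass f as audio.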
Just the transcript text (full never-die chain)
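With response_format set to text, the endpoint returns the bare transcript. If the primary model has a transient hiccup, the request transparently completes via a secondary or tertiary tier and you’ll see X-Kyma-Fallback-Layer (and X-Kyma-Fallback naming the secondary model SKU) on the response.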
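A sketch of the plain-text case: request response_format text, read the body as the transcript, and check the failover header. The endpoint URL and bearer auth are assumptions, as this page states neither; transcribe_plain_text is a hypothetical name.

```python
from typing import Any, BinaryIO, Callable, Optional, Tuple

# Hypothetical endpoint URL: substitute the real one for your account.
API_URL = "https://api.example.com/v1/audio/transcriptions"


def transcribe_plain_text(
    audio: BinaryIO,
    api_key: str,
    post: Optional[Callable[..., Any]] = None,
) -> Tuple[str, Optional[int]]:
    """Return (transcript, failover_layer); the layer is None on the primary path."""
    if post is None:
        import requests  # third-party (pip install requests); imported only when needed
        post = requests.post
    resp = post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # assumption: bearer auth
        files={"file": audio},
        data={"model": "transcribe", "response_format": "text"},
        timeout=300,
    )
    resp.raise_for_status()
    # The header is absent when the primary served on the first try.
    layer = resp.headers.get("X-Kyma-Fallback-Layer")
    return resp.text, (int(layer) if layer is not None else None)
```

Multipart upload with a plain-text format is the combination that can use every tier, which is why this example is labelled as the full chain.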
URL mode — Node fetch
URL mode — Python requests
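A sketch of a Mode 2 call: a JSON body carrying audio_url, sent with requests. As elsewhere, the endpoint URL and bearer auth are assumptions not stated on this page, and transcribe_url is a hypothetical name.

```python
from typing import Any, Callable, Optional

# Hypothetical endpoint URL: substitute the real one for your account.
API_URL = "https://api.example.com/v1/audio/transcriptions"


def transcribe_url(
    audio_url: str,
    api_key: str,
    model: str = "transcribe",
    post: Optional[Callable[..., Any]] = None,
) -> dict:
    """Ask the service to fetch audio_url itself (Mode 2); return the JSON body."""
    if post is None:
        import requests  # third-party (pip install requests); imported only when needed
        post = requests.post
    resp = post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # assumption: bearer auth
        # json= serializes the body and sets Content-Type: application/json.
        json={
            "audio_url": audio_url,
            "model": model,
            "response_format": "verbose_json",
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()
```

No audio bytes pass through the client here, so the timeout mostly has to cover the upstream fetch and the transcription itself.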
The OpenAI Python SDK only supports multipart uploads; use requests directly for Mode 2.
Python (OpenAI SDK, Mode 1 only)
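A sketch of Mode 1 through the official openai package, pointed at a Kyma base URL. The base URL below is a placeholder, since this page does not state it; make_client and transcribe_with_sdk are hypothetical names.

```python
from typing import Any, BinaryIO

# Hypothetical base URL: substitute the real one for your account.
BASE_URL = "https://api.example.com/v1"


def make_client(api_key: str) -> Any:
    """Build an OpenAI SDK client that talks to BASE_URL instead of OpenAI."""
    from openai import OpenAI  # third-party (pip install openai); imported only when needed
    return OpenAI(api_key=api_key, base_url=BASE_URL)


def transcribe_with_sdk(client: Any, audio: BinaryIO, model: str = "transcribe") -> Any:
    """Multipart upload via the SDK; verbose_json asks for per-segment timestamps."""
    return client.audio.transcriptions.create(
        model=model,
        file=audio,
        response_format="verbose_json",
    )
```

Typical use is transcribe_with_sdk(make_client(api_key), f) with f opened in binary mode. The SDK has no audio_url parameter, so this path covers Mode 1 only.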
See also
- Audio Understand: the rest of the audio scene (tone, music, mood)
- Audio models: SKUs behind the transcribe alias
- Rate Limits: concurrency caps for audio endpoints
- watch-cli: open-source CLI that uses these endpoints to give any agent eyes and ears for any social video