POST /v1/audio/transcriptions
Audio Transcriptions
curl --request POST \
  --url https://kymaapi.com/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "language": "<string>",
  "response_format": "<string>",
  "temperature": 123,
  "prompt": "<string>",
  "audio_url": "<string>"
}
'
{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "segments": [
    {}
  ],
  "model": "<string>",
  "billing": {
    "billable_minutes": 123,
    "cost_usd": 123,
    "balance_usd": 123
  }
}
Synchronous endpoint. Send audio in, get the transcript back in one call. Two input modes:
  • Mode 1 — File upload (multipart/form-data, up to 25 MB). Drop-in OpenAI Whisper replacement.
  • Mode 2 — URL fetch (application/json, audio_url up to 100 MB, https only). Kyma extension — pass a public URL, Kyma fetches the bytes upstream so you don’t proxy them through your client.
The same request also benefits from automatic never-die failover when the primary model has a transient hiccup — see Automatic failover (never-die STT).
# Mode 1 — multipart upload (OpenAI-compatible)
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@meeting.mp3" \
  -F "model=transcribe"
# Mode 2 — JSON audio_url (Kyma extension, up to 100 MB)
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"audio_url": "https://cdn.example.com/podcast.mp3", "model": "transcribe"}'

Choosing a mode

| | Mode 1 (multipart) | Mode 2 (JSON audio_url) |
|---|---|---|
| Content-Type | multipart/form-data | application/json |
| Source field | file (binary part) | audio_url (https string) |
| Max size | 25 MB | 100 MB |
| OpenAI SDK compatible | Yes | No; use raw HTTP |
| Never-die failover | Full chain (timestamp + plain-text tiers) | Retry + timestamp-preserving secondary |
| Best for | Local files, recordings, short clips | Cloud-hosted media, podcasts, long-form |
Mode 2 is a Kyma extension. It is not part of the OpenAI Whisper API. The OpenAI Python and Node SDKs only support multipart upload — for audio_url mode, use requests (Python), fetch (Node), or curl.
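The size and source rules in the table can be checked on the client before any bytes are sent. The following is a minimal sketch, not part of any Kyma SDK: `choose_mode` is a hypothetical helper, and it assumes "MB" means 2**20 bytes, which this page does not specify.

```python
from urllib.parse import urlparse

# Assumption: "MB" is read as 2**20 bytes; the page does not define the unit.
MULTIPART_LIMIT_BYTES = 25 * 1024 * 1024
URL_LIMIT_BYTES = 100 * 1024 * 1024


def choose_mode(source: str, size_bytes: int) -> str:
    """Return "multipart" or "audio_url" for a source, or raise ValueError."""
    scheme = urlparse(source).scheme
    if scheme == "https":
        if size_bytes > URL_LIMIT_BYTES:
            raise ValueError("audio_url sources are limited to 100 MB")
        return "audio_url"
    if scheme in ("http", "ftp"):
        # Only https is accepted for audio_url.
        raise ValueError("audio_url must use https")
    # Anything else is treated as a local file path.
    if size_bytes > MULTIPART_LIMIT_BYTES:
        raise ValueError("file too large for multipart; host it and use audio_url")
    return "multipart"
```

A local recording under 25 MB resolves to `"multipart"`, while a hosted podcast under 100 MB resolves to `"audio_url"`.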

Request

Both modes accept the same set of model / language / format parameters.

Mode 1 — multipart upload

file (file, required)
Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB. About 30 minutes of mono 16 kHz mp3 fits comfortably.
model (string, default: "transcribe")
Choose by alias or SKU:
  • transcribe (default) → whisper-v3-turbo: fastest and cheapest, $0.0009/min. Best for high-volume bulk transcription.
  • transcribe-quality → gpt-4o-mini-transcribe-2025-12-15: premium accuracy on noisy, conversational, or code-switching audio (e.g. Vietnamese/English mixing), $0.00405/min.
  • Or pin a specific SKU directly (e.g. whisper-v3-turbo, gpt-4o-mini-transcribe-2025-12-15). See Audio models and Model aliases.
language (string)
ISO-639-1 code (e.g. en, vi, ja). Optional: both providers auto-detect when omitted. Supplying it improves accuracy on short clips.
response_format (string, default: "verbose_json")
One of: json, verbose_json, text, srt, vtt. JSON formats embed a billing block in the response body. text returns the bare transcript and srt / vtt return subtitle files; for those three, billing rides on X-Kyma-* response headers so the body stays a clean transcript or subtitle file.
temperature (number, default: 0)
Sampling temperature, from 0 to 1. Default 0 (deterministic).
prompt (string)
Optional priming text. Use it to nudge the model toward known proper nouns, acronyms, or domain vocabulary in your audio.
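To show how these parameters travel as multipart form fields, here is a rough sketch using plain HTTP rather than the OpenAI SDK. `build_form_fields` and `transcribe_file` are hypothetical helper names, the 300-second timeout is an arbitrary choice, and the upload function needs the third-party `requests` package.

```python
ALLOWED_FORMATS = {"json", "verbose_json", "text", "srt", "vtt"}


def build_form_fields(model="transcribe", language=None,
                      response_format="verbose_json", temperature=0.0, prompt=None):
    """Validate the documented parameters and return multipart form fields."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    fields = {
        "model": model,
        "response_format": response_format,
        "temperature": str(temperature),
    }
    # Optional fields are left out entirely so the server applies its defaults.
    if language:
        fields["language"] = language
    if prompt:
        fields["prompt"] = prompt
    return fields


def transcribe_file(path, api_key, **options):
    """Upload a local file (Mode 1). Imports `requests` lazily."""
    import requests

    with open(path, "rb") as audio:
        resp = requests.post(
            "https://kymaapi.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": audio},  # binary part named "file"
            data=build_form_fields(**options),
            timeout=300,  # assumption: generous timeout for long uploads
        )
    resp.raise_for_status()
    return resp
```

Calling `transcribe_file("meeting.mp3", key, language="en", prompt="Kyma, Whisper")` would send the same request as the curl example with two extra form fields.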

Mode 2 — JSON audio_url

audio_url (string, required)
Public HTTPS URL of the audio file. Kyma fetches the bytes upstream, so there is no need to download and re-upload from your client. Max 100 MB. http:// and other schemes are rejected to prevent SSRF / mixed-content. Supports mp3, wav, m4a, ogg, webm, flac.
model (string, default: "transcribe")
Same as Mode 1.
language (string)
Same as Mode 1.
response_format (string, default: "verbose_json")
Same as Mode 1. Mode 2 still benefits from never-die failover (retry + the timestamp-preserving secondary); the plain-text tertiary tier is skipped in URL mode. See Automatic failover.
temperature (number)
Same as Mode 1.
prompt (string)
Same as Mode 1.
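Because Mode 2 takes a JSON body, the request can be assembled and sanity-checked locally. This is an illustrative sketch only: `build_url_payload` is a hypothetical helper, and the client-side https check simply mirrors the server rule described above so that a bad URL fails before a round trip.

```python
import json
from urllib.parse import urlparse


def build_url_payload(audio_url, model="transcribe", **optional):
    """Return the JSON body for Mode 2, rejecting non-https URLs up front."""
    parsed = urlparse(audio_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("audio_url must be a public https:// URL")
    payload = {"audio_url": audio_url, "model": model}
    # Keep only the optional parameters the caller actually set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload)
```

The returned string can be sent as the request body with `Content-Type: application/json`.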

Response

200 OK with the transcript and a Kyma billing block.
{
  "task": "transcribe",
  "language": "English",
  "duration": 5.03,
  "text": "For too long, I have watched mortals suffer.",
  "segments": [
    {
      "id": 0,
      "start": 0,
      "end": 4.74,
      "text": "For too long, I have watched mortals suffer.",
      "tokens": [50365, 1171, 886, 938, 11, 286, 362, 6337, 6599, 1124, 9753, 13, 50602],
      "temperature": 0,
      "avg_logprob": -0.20,
      "compression_ratio": 0.85,
      "no_speech_prob": 0.0
    }
  ],
  "model": "whisper-v3-turbo",
  "billing": {
    "duration_sec": 5.03,
    "billable_minutes": 1,
    "cost_usd": 0.0009,
    "balance_usd": 41.469
  }
}
text (string)
The full transcript.
language (string)
Detected language (full name, e.g. "English").
duration (number)
Audio duration in seconds (decoded from the file, not estimated). 0 only when the plain-text tertiary tier (gemini-3-flash-audio) served the request, which does not return duration.
segments (array)
Per-segment timestamps and text. Present when response_format is verbose_json and a timestamp-preserving model served the request, that is, the primary (whisper-v3-turbo) or the secondary failover (whisper-1). Absent only on the plain-text tertiary tier.
model (string)
The Kyma model SKU that served the request.
billing.billable_minutes (number)
Minutes charged. Audio is billed in 1-minute increments, rounded up.
billing.cost_usd (number)
Final cost charged for this request.
billing.balance_usd (number)
Remaining balance after this charge.
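A client usually wants these fields in a typed structure that tolerates the absent `segments` case. One possible sketch follows; the `Transcript` dataclass and `parse_transcript` function are illustrative names, not something the API provides.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    model: str
    duration: float
    segments: list
    cost_usd: float


def parse_transcript(body: dict) -> Transcript:
    """Turn a decoded JSON response body into a Transcript."""
    billing = body.get("billing", {})
    return Transcript(
        text=body["text"],
        model=body.get("model", ""),
        duration=float(body.get("duration", 0)),
        # segments is absent on the plain-text tertiary tier, so default to [].
        segments=body.get("segments", []),
        cost_usd=float(billing.get("cost_usd", 0)),
    )
```

Defaulting `segments` to an empty list lets downstream code loop over it without first checking which tier served the request.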

Non-JSON formats

When response_format is text, srt, or vtt, the body is a plain transcript or subtitle file (no JSON envelope) and billing comes back on response headers (see below). srt returns a SubRip subtitle file (application/x-subrip; charset=utf-8); vtt returns a WebVTT file (text/vtt; charset=utf-8). Both are built from the same per-segment timestamps verbose_json exposes, so the timing matches across formats.
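To make the relationship between `segments` and the subtitle formats concrete, the sketch below renders SubRip cues from verbose_json segments on the client. It illustrates the SubRip cue layout only; it is not Kyma's server-side implementation, and `srt_timestamp` and `segments_to_srt` are hypothetical names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SubRip HH:MM:SS,mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as SubRip cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

In practice, requesting `response_format=srt` directly is simpler; a local conversion like this is mainly useful when you already hold a verbose_json response and want subtitles without a second billed request.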

Response headers

Returned on every 200 response:
| Header | Meaning |
|---|---|
| X-Kyma-Model | The model SKU that served the request (e.g. whisper-v3-turbo) |
| X-Kyma-Duration-Sec | Detected audio duration in seconds (0 only when the plain-text tertiary tier served) |
| X-Kyma-Billable-Minutes | Minutes charged |
| X-Kyma-Cost-USD | Final cost in USD |
| X-Kyma-Balance-USD | Remaining account balance |
| X-Kyma-Fallback | SKU of the failover model that served (e.g. whisper-1, gemini-3-flash-audio). Present only when a failover model served; absent on the primary path and on a bare primary retry. |
| X-Kyma-Fallback-Layer | Numeric failover tier that served: 1 = retry, 2 = secondary (whisper-1), 3 = tertiary (gemini-3-flash-audio). Absent when the primary served on the first try. |
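For text, srt, and vtt responses these headers are the only place billing appears, so a small reader is handy. The sketch below assumes the header values are plain decimal strings (the page does not show raw header values); `read_kyma_headers` is a hypothetical helper.

```python
def read_kyma_headers(headers) -> dict:
    """Collect the X-Kyma-* billing and failover headers into a plain dict."""
    # Lower-case the keys so lookups work with any mapping, not only
    # case-insensitive ones such as the headers object in `requests`.
    h = {k.lower(): v for k, v in headers.items()}
    layer = h.get("x-kyma-fallback-layer")
    return {
        "model": h.get("x-kyma-model"),
        "duration_sec": float(h.get("x-kyma-duration-sec", 0)),
        "billable_minutes": int(float(h.get("x-kyma-billable-minutes", 0))),
        "cost_usd": float(h.get("x-kyma-cost-usd", 0)),
        "balance_usd": float(h.get("x-kyma-balance-usd", 0)),
        "fallback_model": h.get("x-kyma-fallback"),
        # None means the primary served on the first try.
        "fallback_layer": int(layer) if layer is not None else None,
    }
```

With `requests`, this would be called as `read_kyma_headers(resp.headers)` after a successful response.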

Automatic failover (never-die STT)

Transcription is never-die. On a transient hiccup with the primary model (default transcribe alias / whisper-v3-turbo SKU), Kyma transparently works through a failover chain so your request still completes: same request, same price, same response shape. You don’t opt in; it’s on by default.

Why it matters. A single transcription request is normally a single point of failure: one transient hiccup and the pipeline built on top of it (a dub, a caption job, a voice agent) dies, leaving you to write retry and fallback logic yourself. Kyma absorbs that for you: the request either completes or fails honestly, never silently mangled. You ship the feature, not the plumbing.

The chain is format-aware, because only some models preserve per-segment timestamps:
  1. Retry the primary once (short backoff). Absorbs the vast majority of transient blips, which recover immediately. Nothing in your response changes.
  2. Secondary — whisper-1. A timestamp-preserving model. Because it returns real per-segment timestamps, it serves every caller, including verbose_json, srt, and vtt.
  3. Tertiary — gemini-3-flash-audio. A plain-text transcription tier. Used only for plain-text transcripts (text / json) — it does not return segment timestamps, so callers that need them (verbose_json / srt / vtt) never route here. URL mode (audio_url) also skips this tier (it needs inline bytes).
If every tier is exhausted, you get a clean error (the chain genuinely had nothing left), not a leaked upstream message.

Only genuine client errors fail fast. A bad file, a too-large upload, or an unsupported format (400 / 413 / 415 / 422) is surfaced immediately with no failover; failover is a reliability tool, not an error-hiding one. Rate-limit responses (429) are forwarded verbatim with a Retry-After header so your client can back off cleanly.

You can see exactly which tier served via response metadata:
  • X-Kyma-Fallback-Layer header — 1 retry, 2 secondary, 3 tertiary (absent when the primary served first try).
  • X-Kyma-Fallback header / billing.fallback field — the secondary model SKU that served (when a secondary served).
  • segments is present whenever a timestamp-preserving model served (primary or whisper-1); absent only on the plain-text tertiary tier, where duration is also 0 (billed as the 1-minute minimum).
The transcribe-quality alias (gpt-4o-mini-transcribe-2025-12-15) opts out of the chain — you chose that model for accuracy, so Kyma won’t silently swap it. Its errors surface as 502 transcription_failed.
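The routing rules in this section can be summarized as a small decision function. This is a reader's model of the documented behavior, not Kyma's implementation; `eligible_tiers` is a hypothetical name, and it assumes that pinning the gpt-4o-mini-transcribe-2025-12-15 SKU directly opts out in the same way as the transcribe-quality alias.

```python
TIMESTAMP_FORMATS = {"verbose_json", "srt", "vtt"}
# Assumption: the pinned SKU behaves like the transcribe-quality alias.
NO_FAILOVER_MODELS = {"transcribe-quality", "gpt-4o-mini-transcribe-2025-12-15"}


def eligible_tiers(model="transcribe", response_format="verbose_json", url_mode=False):
    """List the failover tiers a request can reach, in order."""
    if model in NO_FAILOVER_MODELS:
        return []  # the quality model is never swapped out
    tiers = ["retry", "whisper-1"]
    # The plain-text tier returns no timestamps and needs inline bytes,
    # so timestamped formats and URL mode never reach it.
    if response_format not in TIMESTAMP_FORMATS and not url_mode:
        tiers.append("gemini-3-flash-audio")
    return tiers
```

For example, a multipart request with `response_format=text` can reach all three tiers, while the same request in URL mode stops at whisper-1.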

Pricing

| Alias | SKU | Per minute | 1-hour file |
|---|---|---|---|
| transcribe (default) | whisper-v3-turbo | $0.0009 | $0.054 |
| transcribe-quality | gpt-4o-mini-transcribe-2025-12-15 | $0.00405 | $0.243 |
Billed per minute, rounded up (a 5-second clip costs 1 minute). Failover transcriptions are billed at the same rate as the primary transcribe path — it doesn’t matter which tier ultimately served, you always pay the whisper-v3-turbo price. The Quality tier does not fall back (your choice of model is respected).
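The rounding rule is easy to reproduce when estimating spend ahead of time. A minimal sketch using the per-minute rates from the table above; `estimate_cost` is a hypothetical helper, and the authoritative figure is always the `cost_usd` value returned with the response.

```python
import math

RATES_PER_MINUTE = {
    "transcribe": 0.0009,
    "transcribe-quality": 0.00405,
}


def estimate_cost(duration_sec: float, alias: str = "transcribe") -> float:
    """Estimate the charge in USD for a clip of the given length."""
    # Round up to whole minutes, with a 1-minute minimum.
    billable_minutes = max(1, math.ceil(duration_sec / 60))
    return round(billable_minutes * RATES_PER_MINUTE[alias], 6)
```

A 5-second clip and a 60-second clip both come out at one billable minute, while a 61-second clip is billed as two.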

Errors

| Status | error.type | error.code | When |
|---|---|---|---|
| 400 | invalid_request | | Missing file (Mode 1) or invalid JSON body (Mode 2) |
| 400 | invalid_request | audio_url_required | Mode 2 body without audio_url field |
| 400 | invalid_request | audio_url_invalid_scheme | audio_url not https:// (SSRF protection) |
| 400 | invalid_request | not_a_transcription_model | model is not a transcription SKU |
| 401 | auth_error | | Missing or invalid API key |
| 402 | billing_error | insufficient_credits | Balance too low |
| 404 | not_enabled | | Audio gate not enabled on this account |
| 413 | invalid_request | | Multipart body > 25 MB. For larger files, use Mode 2 (audio_url up to 100 MB). |
| 429 | rate_limit_error | concurrent_limit_exceeded | Audio concurrency cap reached. See Rate Limits. |
| 502 | provider_error | transcription_failed | The never-die chain was exhausted (every eligible tier failed). The error message is provider-isolated: no upstream identity is leaked. |
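On the client side, the table suggests three reactions: fix the request, wait and retry, or retry later. The sketch below is one way to encode that; `next_action` is a hypothetical helper, the 1-second and 5-second fallback delays are arbitrary assumptions, and only the seconds form of Retry-After is handled.

```python
from typing import Optional, Tuple

# Statuses this page describes as failing fast with no failover.
CLIENT_ERRORS = {400, 401, 402, 404, 413, 415, 422}


def next_action(status: int, retry_after: Optional[str] = None) -> Tuple[str, float]:
    """Map an HTTP status to (action, delay_seconds)."""
    if 200 <= status < 300:
        return ("ok", 0.0)
    if status == 429:
        # Retry-After may also be an HTTP date; this sketch reads seconds only.
        delay = float(retry_after) if retry_after and retry_after.isdigit() else 1.0
        return ("wait", delay)
    if status in CLIENT_ERRORS:
        return ("fix_request", 0.0)
    # 502 transcription_failed and anything unexpected: back off and try again.
    return ("retry_later", 5.0)
```

Since the server has already retried and failed over before returning 502, an immediate client retry is unlikely to help, which is why the sketch backs off instead.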

Examples

Pin a specific model

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@interview.mp3" \
  -F "model=whisper-v3-turbo" \
  -F "response_format=verbose_json" \
  -F "language=en"

Just the transcript text (full never-die chain)

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@clip.mp3" \
  -F "model=transcribe" \
  -F "response_format=text"
Returns the bare transcript. If the primary model has a transient hiccup, the request transparently completes via a secondary or tertiary tier and you’ll see X-Kyma-Fallback-Layer (and X-Kyma-Fallback naming the secondary model SKU) on the response.

URL mode — Node fetch

const response = await fetch("https://kymaapi.com/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    audio_url: "https://cdn.example.com/podcast.mp3",
    model: "transcribe",
    response_format: "verbose_json",
  }),
});
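// Surface API errors (4xx/5xx) instead of reading a transcript from an error body.
if (!response.ok) {
  throw new Error(`Transcription failed: ${response.status} ${await response.text()}`);
}
const result = await response.json();
console.log(result.text);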

URL mode — Python requests

import os
import requests

resp = requests.post(
    "https://kymaapi.com/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['KYMA_API_KEY']}"},
    json={
        "audio_url": "https://cdn.example.com/podcast.mp3",
        "model": "transcribe",
        "response_format": "verbose_json",
    },
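    timeout=300,  # assumption: generous limit for long-form audio; tune as needed
)
resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
result = resp.json()
print(result["text"])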
The OpenAI Python SDK only supports multipart uploads — use requests directly for Mode 2.

Python (OpenAI SDK — Mode 1 only)

from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="kyma-...",
)

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="transcribe",
        file=f,
    )

print(result.text)

See also

  • Audio Understand — the rest of the audio scene (tone, music, mood)
  • Audio models — SKUs behind the transcribe alias
  • Rate Limits — concurrency caps for audio endpoints
  • watch-cli — open-source CLI that uses these endpoints to give any agent eyes and ears for any social video