Audio Transcriptions

curl --request POST \
  --url https://kymaapi.com/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "language": "<string>",
  "response_format": "<string>",
  "temperature": 0,
  "prompt": "<string>",
  "audio_url": "<string>"
}
'

{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "segments": [
    {}
  ],
  "model": "<string>",
  "billing": {
    "billable_minutes": 123,
    "cost_usd": 123,
    "balance_usd": 123
  }
}

Synchronous endpoint. Send audio in, get the transcript back in one call. Two input modes:

Mode 1 — File upload (multipart/form-data, up to 25 MB). Drop-in OpenAI Whisper replacement.
Mode 2 — URL fetch (application/json, audio_url up to 100 MB, https only). A Kyma extension: pass a public URL and Kyma fetches the audio itself, so you don’t have to proxy the bytes through your client.

Both modes get automatic never-die failover when the primary model fails transiently; see Automatic failover (never-die STT).

# Mode 1 — multipart upload (OpenAI-compatible)
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@meeting.mp3" \
  -F "model=transcribe"

# Mode 2 — JSON audio_url (Kyma extension, up to 100 MB)
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"audio_url": "https://cdn.example.com/podcast.mp3", "model": "transcribe"}'

Choosing a mode

	Mode 1 — multipart	Mode 2 — JSON `audio_url`
Content-Type	`multipart/form-data`	`application/json`
Source field	`file` (binary part)	`audio_url` (https string)
Max size	25 MB	100 MB
OpenAI SDK compatible	Yes	No — use raw HTTP
Never-die failover	Full chain (timestamp + plain-text tiers)	Retry + timestamp-preserving secondary
Best for	Local files, recordings, short clips	Cloud-hosted media, podcasts, long-form

Mode 2 is a Kyma extension. It is not part of the OpenAI Whisper API. The OpenAI Python and Node SDKs only support multipart upload — for audio_url mode, use requests (Python), fetch (Node), or curl.
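
The table above reduces to a small decision rule. The sketch below is one way to encode it on the client side. `choose_mode` and the two limit constants are hypothetical names, not part of any Kyma SDK, and the code assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
from urllib.parse import urlparse

# Size caps from the comparison table; the byte interpretation is an assumption.
MULTIPART_LIMIT = 25 * 1024 * 1024
URL_LIMIT = 100 * 1024 * 1024


def choose_mode(source: str, size_bytes: int) -> str:
    """Return "multipart" or "audio_url" for a source (hypothetical helper)."""
    scheme = urlparse(source).scheme
    if scheme == "http":
        # Mode 2 only accepts https:// URLs.
        raise ValueError("audio_url must use https://")
    if scheme == "https":
        if size_bytes > URL_LIMIT:
            raise ValueError("audio_url sources are limited to 100 MB")
        return "audio_url"
    # Anything else is treated as a local file path.
    if size_bytes > MULTIPART_LIMIT:
        raise ValueError("uploads are limited to 25 MB; host the file and use audio_url")
    return "multipart"
```

A 10 MB local recording maps to `multipart`, while an 80 MB podcast on a CDN maps to `audio_url`.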

Request

Both modes accept the same set of model / language / format parameters.

Mode 1 — multipart upload

file

required

Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB, which comfortably fits about 30 minutes of mono 16 kHz mp3.

model

string

default:"transcribe"

Choose by alias or SKU:

transcribe (default) → whisper-v3-turbo — fastest and cheapest, $0.0009/min. Best for high-volume bulk transcription.
transcribe-quality → gpt-4o-mini-transcribe-2025-12-15 — premium accuracy on noisy / conversational / code-switching audio (Vi/En mixing), $0.00405/min.
Or pin a specific SKU directly (e.g. whisper-v3-turbo, gpt-4o-mini-transcribe-2025-12-15). See Audio models and Model aliases.

language

string

ISO-639-1 code (e.g. en, vi, ja). Optional — both providers auto-detect when omitted. Supplying it improves accuracy on short clips.

response_format

string

default:"verbose_json"

One of: json, verbose_json, text, srt, vtt. JSON formats embed a billing block in the response body. text returns the bare transcript and srt / vtt return subtitle files; for those three, billing rides on X-Kyma-* response headers so the body stays a clean transcript or subtitle file.

temperature

number

default:"0"

Sampling temperature 0–1. Default 0 (deterministic).

prompt

string

Optional priming text. Use it to nudge the model toward known proper nouns, acronyms, or domain vocabulary in your audio.

Mode 2 — JSON `audio_url`

audio_url

string

required

Public HTTPS URL of the audio file. Kyma fetches the bytes upstream — no need to download and re-upload from your client. Max 100 MB. http:// and other schemes are rejected to prevent SSRF / mixed-content. Supports mp3, wav, m4a, ogg, webm, flac.

model

string

default:"transcribe"

Same as Mode 1.

language

string

Same as Mode 1.

response_format

string

default:"verbose_json"

Same as Mode 1. Mode 2 still benefits from never-die failover (retry + the timestamp-preserving secondary); the plain-text tertiary tier is skipped in URL mode. See Automatic failover.

temperature

number

Same as Mode 1.

prompt

string

Same as Mode 1.
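
Since `audio_url` must be an https URL, it can be worth checking the scheme before spending a round trip. The following is a minimal standard-library sketch that validates the URL and builds, but does not send, a Mode 2 request. `build_url_mode_request` is a hypothetical helper name; the endpoint and field names come from this page.

```python
import json
import urllib.request
from urllib.parse import urlparse

ENDPOINT = "https://kymaapi.com/v1/audio/transcriptions"


def build_url_mode_request(audio_url, api_key, model="transcribe", **options):
    """Build (but do not send) a Mode 2 request; hypothetical helper."""
    parsed = urlparse(audio_url)
    if parsed.scheme != "https" or not parsed.netloc:
        # Mirrors the server-side rule: http:// and other schemes are rejected.
        raise ValueError("audio_url must be a public https:// URL")
    payload = {"audio_url": audio_url, "model": model}
    # Forward only the optional fields the caller actually set
    # (language, response_format, temperature, prompt).
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it. Building the request separately keeps the validation testable without network access.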

Response

200 OK with the transcript and a Kyma billing block.

{
  "task": "transcribe",
  "language": "English",
  "duration": 5.03,
  "text": "For too long, I have watched mortals suffer.",
  "segments": [
    {
      "id": 0,
      "start": 0,
      "end": 4.74,
      "text": "For too long, I have watched mortals suffer.",
      "tokens": [50365, 1171, 886, 938, 11, 286, 362, 6337, 6599, 1124, 9753, 13, 50602],
      "temperature": 0,
      "avg_logprob": -0.20,
      "compression_ratio": 0.85,
      "no_speech_prob": 0.0
    }
  ],
  "model": "whisper-v3-turbo",
  "billing": {
    "duration_sec": 5.03,
    "billable_minutes": 1,
    "cost_usd": 0.0009,
    "balance_usd": 41.469
  }
}

text

string

The full transcript.

language

string

Detected language (full name, e.g. "English").

duration

number

Audio duration in seconds (decoded from the file, not estimated). 0 only when the plain-text tertiary tier (gemini-3-flash-audio) served the request, which does not return duration.

segments

array

Per-segment timestamps and text. Present when response_format is verbose_json and a timestamp-preserving model served the request — that is, the primary (whisper-v3-turbo) or the secondary failover (whisper-1). Absent only on the plain-text tertiary tier.

model

string

The Kyma model SKU that served the request.

billing.billable_minutes

number

Minutes charged. Audio is billed in 1-minute increments, rounded up.

billing.cost_usd

number

Final cost charged for this request.

billing.balance_usd

number

Remaining balance after this charge.

Non-JSON formats

When response_format is text, srt, or vtt, the body is a plain transcript or subtitle file (no JSON envelope) and billing comes back on response headers (see below). srt returns a SubRip subtitle file (application/x-subrip; charset=utf-8); vtt returns a WebVTT file (text/vtt; charset=utf-8). Both are built from the same per-segment timestamps verbose_json exposes, so the timing matches across formats.
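
To illustrate how the same segment timestamps map onto a subtitle file, here is a minimal sketch that turns `verbose_json` segments into SubRip cues. It is an illustration of the SRT layout, not Kyma's implementation; `srt_timestamp` and `segments_to_srt` are hypothetical names, and the only fields assumed are `start`, `end`, and `text` as shown in the sample response.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as a SubRip HH:MM:SS,mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

For most callers, requesting `response_format=srt` directly is simpler. A local conversion like this is mainly useful when you already hold a `verbose_json` response and want subtitles without a second billed request.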

Response headers

Returned on every 200 response:

Header	Meaning
`X-Kyma-Model`	The model SKU that served the request (e.g. `whisper-v3-turbo`)
`X-Kyma-Duration-Sec`	Detected audio duration in seconds (`0` only when the plain-text tertiary tier served)
`X-Kyma-Billable-Minutes`	Minutes charged
`X-Kyma-Cost-USD`	Final cost in USD
`X-Kyma-Balance-USD`	Remaining account balance
`X-Kyma-Fallback`	SKU of the fallback model that served (e.g. `whisper-1`, `gemini-3-flash-audio`). Present only when a fallback model served; absent on the primary path and on a bare primary retry.
`X-Kyma-Fallback-Layer`	Numeric failover tier that served: `1` retry, `2` secondary (`whisper-1`), `3` tertiary (`gemini-3-flash-audio`). Absent when the primary served on the first try.
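
Because `text`, `srt`, and `vtt` responses carry billing only in headers, a small header reader is handy. The sketch below summarises the headers from the table above. `read_kyma_metadata` is a hypothetical helper, and it assumes the numeric headers arrive as plain decimal strings, which this page does not state explicitly.

```python
LAYER_NAMES = {"1": "retry", "2": "secondary", "3": "tertiary"}


def _to_float(value):
    """Parse a numeric header value, returning None when it is missing."""
    return float(value) if value is not None else None


def read_kyma_metadata(headers) -> dict:
    """Summarise the X-Kyma-* headers of a 200 response (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    layer = h.get("x-kyma-fallback-layer")
    return {
        "model": h.get("x-kyma-model"),
        "duration_sec": _to_float(h.get("x-kyma-duration-sec")),
        "billable_minutes": _to_float(h.get("x-kyma-billable-minutes")),
        "cost_usd": _to_float(h.get("x-kyma-cost-usd")),
        "balance_usd": _to_float(h.get("x-kyma-balance-usd")),
        # No layer header means the primary served on the first try.
        "served_by": "primary" if layer is None else LAYER_NAMES.get(layer.strip(), "unknown"),
        "fallback_model": h.get("x-kyma-fallback"),
    }
```

It accepts any mapping with an `items()` method, so it works with a plain dict as well as the header objects returned by common HTTP clients.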

Automatic failover (never-die STT)

Transcription is never-die. On a transient hiccup with the primary model (default transcribe alias / whisper-v3-turbo SKU), Kyma transparently works through a failover chain so your request still completes: same request, same price, same response shape. You don’t opt in; it’s on by default.

Why it matters. A single transcription request is normally a single point of failure: one transient hiccup and the pipeline built on top of it (a dub, a caption job, a voice agent) dies, leaving you to write retry and fallback logic yourself. Kyma absorbs that for you: the request either completes or fails honestly, and is never silently mangled. You ship the feature, not the plumbing.

The chain is format-aware, because only some models preserve per-segment timestamps:

Retry the primary once (short backoff). Absorbs the vast majority of transient blips, which recover immediately. Nothing in your response changes.
Secondary — whisper-1. A timestamp-preserving model. Because it returns real per-segment timestamps, it serves every caller, including verbose_json, srt, and vtt.
Tertiary — gemini-3-flash-audio. A plain-text transcription tier. Used only for plain-text transcripts (text / json) — it does not return segment timestamps, so callers that need them (verbose_json / srt / vtt) never route here. URL mode (audio_url) also skips this tier (it needs inline bytes).

If every tier is exhausted, you get a clean error (the chain genuinely had nothing left), not a leaked upstream message.

Only genuine client errors fail fast. A bad file, a too-large upload, or an unsupported format (400 / 413 / 415 / 422) is surfaced immediately with no failover; failover is a reliability tool, not an error-hiding one. Rate-limit responses (429) are forwarded verbatim with a Retry-After header so your client can back off cleanly.

You can see exactly which tier served via response metadata:

X-Kyma-Fallback-Layer header — 1 retry, 2 secondary, 3 tertiary (absent when the primary served first try).
X-Kyma-Fallback header / billing.fallback field — the SKU of the fallback model that served (present only when a secondary or tertiary model served).
segments is present whenever a timestamp-preserving model served (primary or whisper-1); absent only on the plain-text tertiary tier, where duration is also 0 (billed as the 1-minute minimum).

The transcribe-quality alias (gpt-4o-mini-transcribe-2025-12-15) opts out of the chain — you chose that model for accuracy, so Kyma won’t silently swap it. Its errors surface as 502 transcription_failed.
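
The routing rules in this section can be summarised as a small function. This is only a reading of the prose above, not Kyma's actual router. `eligible_tiers` is a hypothetical name, and treating a pinned `gpt-4o-mini-transcribe-2025-12-15` SKU the same as the `transcribe-quality` alias is an assumption.

```python
# Formats that need per-segment timestamps and therefore skip the tertiary tier.
TIMESTAMP_FORMATS = {"verbose_json", "srt", "vtt"}

# Models that opt out of the chain (the SKU entry is an assumption).
NO_FAILOVER_MODELS = {"transcribe-quality", "gpt-4o-mini-transcribe-2025-12-15"}


def eligible_tiers(model="transcribe", response_format="verbose_json", url_mode=False):
    """List the failover tiers a request could use, in order (illustrative)."""
    if model in NO_FAILOVER_MODELS:
        return []  # errors surface as 502 transcription_failed instead
    tiers = ["retry", "whisper-1"]  # layers 1 and 2 serve every caller
    if response_format not in TIMESTAMP_FORMATS and not url_mode:
        # Layer 3 is plain-text only and needs inline bytes.
        tiers.append("gemini-3-flash-audio")
    return tiers
```

Under this reading, only a multipart request asking for `text` or `json` can reach all three tiers, which matches the "Full chain" entry in the mode comparison table.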

Pricing

Alias	SKU	Per minute	1-hour file
`transcribe` (default)	`whisper-v3-turbo`	$0.0009	$0.054
`transcribe-quality`	`gpt-4o-mini-transcribe-2025-12-15`	$0.00405	$0.243

Billed per minute, rounded up (a 5-second clip costs 1 minute). Failover transcriptions are billed at the same rate as the primary transcribe path — it doesn’t matter which tier ultimately served, you always pay the whisper-v3-turbo price. The Quality tier does not fall back (your choice of model is respected).
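
The rounding rule is easy to reproduce when estimating costs up front. The sketch below uses the per-minute rates from the table and `Decimal` to avoid floating-point drift. `billable_minutes` and `estimate_cost` are hypothetical helper names, and the authoritative figure is always the `billing` block or the `X-Kyma-Cost-USD` header on the response.

```python
import math
from decimal import Decimal

# Per-minute rates copied from the pricing table.
RATES_PER_MINUTE = {
    "whisper-v3-turbo": Decimal("0.0009"),
    "gpt-4o-mini-transcribe-2025-12-15": Decimal("0.00405"),
}


def billable_minutes(duration_sec: float) -> int:
    """Round up to whole minutes, with a 1-minute minimum."""
    return max(1, math.ceil(duration_sec / 60))


def estimate_cost(duration_sec: float, sku: str = "whisper-v3-turbo") -> Decimal:
    """Estimate the charge in USD for one request (illustrative)."""
    return RATES_PER_MINUTE[sku] * billable_minutes(duration_sec)
```

A 5.03-second clip bills as 1 minute ($0.0009 on the default SKU), and a 3600-second file bills as 60 minutes ($0.054), matching the table.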

Errors

Status	`error.type`	`error.code`	When
`400`	`invalid_request`	—	Missing `file` (Mode 1) or invalid JSON body (Mode 2)
`400`	`invalid_request`	`audio_url_required`	Mode 2 body without `audio_url` field
`400`	`invalid_request`	`audio_url_invalid_scheme`	`audio_url` not `https://` (SSRF protection)
`400`	`invalid_request`	`not_a_transcription_model`	`model` is not a transcription SKU
`401`	`auth_error`	—	Missing or invalid API key
`402`	`billing_error`	`insufficient_credits`	Balance too low
`404`	`not_enabled`	—	Audio gate not enabled on this account
`413`	`invalid_request`	—	Multipart body > 25 MB. For larger files, use Mode 2 (`audio_url` up to 100 MB).
`429`	`rate_limit_error`	`concurrent_limit_exceeded`	Audio concurrency cap reached. See Rate Limits.
`502`	`provider_error`	`transcription_failed`	The never-die chain was exhausted (every eligible tier failed). The error message is provider-isolated — no upstream identity is leaked.
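
A client can turn the table above into a simple dispatch. The following is a rough sketch. `plan_for_error` and its action labels are hypothetical, and it assumes `Retry-After` carries a number of seconds, falling back to an arbitrary 1-second delay when the header is missing or in another format.

```python
def plan_for_error(status: int, retry_after=None):
    """Map an error status to (action, delay_seconds); hypothetical helper."""
    if status == 429:
        try:
            delay = float(retry_after)
        except (TypeError, ValueError):
            delay = 1.0  # assumed default when Retry-After is absent or not numeric
        return ("wait_and_retry", delay)
    if status == 413:
        # Multipart body over 25 MB: host the file and use audio_url instead.
        return ("switch_to_audio_url", None)
    if status == 502:
        # Every eligible failover tier already failed server-side.
        return ("chain_exhausted", None)
    if status in (401, 402, 404):
        # API key, balance, or audio gate problem on the account.
        return ("fix_account", None)
    if 400 <= status < 500:
        return ("fix_request", None)
    return ("unexpected", None)
```

Only the 429 branch suggests an automatic retry. Whether to retry after a 502 is left to the caller, since the chain has already retried the primary and tried the fallback models.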

Examples

Pin a specific model

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@interview.mp3" \
  -F "model=whisper-v3-turbo" \
  -F "response_format=verbose_json" \
  -F "language=en"

Just the transcript text (full never-die chain)

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@clip.mp3" \
  -F "model=transcribe" \
  -F "response_format=text"

Returns the bare transcript. If the primary model has a transient hiccup, the request transparently completes via a secondary or tertiary tier and you’ll see X-Kyma-Fallback-Layer (and X-Kyma-Fallback naming the secondary model SKU) on the response.

URL mode — Node fetch

const response = await fetch("https://kymaapi.com/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    audio_url: "https://cdn.example.com/podcast.mp3",
    model: "transcribe",
    response_format: "verbose_json",
  }),
});
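// Surface HTTP errors (4xx / 5xx) instead of reading .text from an error body.
if (!response.ok) {
  throw new Error(`Transcription failed with HTTP ${response.status}`);
}
const result = await response.json();
console.log(result.text);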

URL mode — Python requests

import os
import requests

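resp = requests.post(
    "https://kymaapi.com/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['KYMA_API_KEY']}"},
    json={
        "audio_url": "https://cdn.example.com/podcast.mp3",
        "model": "transcribe",
        "response_format": "verbose_json",
    },
    timeout=300,  # illustrative value; the call is synchronous and long files take time
)
resp.raise_for_status()  # raise on 4xx / 5xx instead of parsing an error body
result = resp.json()
print(result["text"])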

The OpenAI Python SDK only supports multipart uploads — use requests directly for Mode 2.

Python (OpenAI SDK — Mode 1 only)

from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="kyma-...",
)

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="transcribe",
        file=f,
    )

print(result.text)
