POST /v1/audio/speech
Audio Speech (TTS)
curl --request POST \
  --url https://kymaapi.com/v1/audio/speech \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "input": "<string>",
  "voice_id": "<string>",
  "response_format": "<string>",
  "voice_settings": {},
  "stream": true
}
'
Synchronous endpoint. Send text, get back audio bytes in one call. Pick a model based on quality needs vs latency budget.
curl -X POST https://kymaapi.com/v1/audio/speech \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven-multilingual-v2",
    "input": "The first move is what sets everything in motion.",
    "voice_id": "JBFqnCBsd6RMkjVDRZzb",
    "response_format": "mp3_44100_128"
  }' \
  --output speech.mp3

Request

application/json body.
model
string
default:"eleven-multilingual-v2"
One of:
  • eleven-v3 — most expressive, audio tags + emotional range, 70+ languages ($0.405/1K char)
  • eleven-multilingual-v2 — hero quality, 29 languages, expressive ($0.405/1K char)
  • eleven-flash-v2-5 — ~75ms time-to-first-byte, real-time agents ($0.20/1K char)
  • eleven-turbo-v2-5 — balanced quality + speed ($0.20/1K char)
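The per-1K-character prices above allow a rough pre-flight estimate. The sketch below assumes billing is linear in the character count and that characters are counted as Python's `len()` counts them; neither is confirmed here, and the helper name `estimate_cost_usd` is illustrative. The authoritative figure is the `X-Kyma-Cost-USD` response header.

```python
from decimal import Decimal

# USD per 1,000 input characters, copied from the model list above.
PRICE_PER_1K_CHARS = {
    "eleven-v3": Decimal("0.405"),
    "eleven-multilingual-v2": Decimal("0.405"),
    "eleven-flash-v2-5": Decimal("0.20"),
    "eleven-turbo-v2-5": Decimal("0.20"),
}


def estimate_cost_usd(model: str, text: str) -> Decimal:
    """Estimate the cost of one request (assumes linear per-character pricing)."""
    price = PRICE_PER_1K_CHARS[model]  # raises KeyError for an unlisted model
    return price * len(text) / Decimal(1000)
```

Under these assumptions, a 2,000-character input on `eleven-multilingual-v2` would come to about $0.81.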
input
string
required
Text to synthesize. Max 5000 characters per request — chunk longer text client-side. Also accepts the alias text for OpenAI compatibility.
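One possible way to do the client-side chunking is sketched below. The helper `chunk_text` is hypothetical, not part of the API; it prefers to split at the last space before the limit and falls back to a hard split when a single run of text has no spaces.

```python
def chunk_text(text: str, limit: int = 5000) -> list[str]:
    """Split text into pieces of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: list[str] = []
    rest = text.strip()
    while len(rest) > limit:
        # Look for the last space at or before the limit.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space found: hard split at the limit
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio concatenated in order. Splitting on sentence boundaries instead of spaces would likely give more natural prosody at the joins.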
voice_id
string
required
ElevenLabs voice id — opaque string from GET /v1/audio/voices. Also accepts the alias voice. There is no global default; pick one explicitly.
response_format
string
default:"mp3_44100_128"
Audio format. One of: mp3_44100_128 (default), mp3_44100_192, mp3_22050_32, pcm_16000, pcm_22050, pcm_24000, pcm_44100, ulaw_8000.
voice_settings
object
Optional fine-tuning: stability (0–1), similarity_boost (0–1), style (0–1), use_speaker_boost (boolean).
stream
boolean
default:"false"
Opt-in low-latency mode. When true, audio is delivered progressively as it is synthesized — time-to-first-audio drops to ~0.4s (vs ~1.8s for the default full-buffer path). Transparent to your code: the response is still a single complete audio/mpeg stream you can pipe straight to disk or a player. Currently applies to the MiniMax speech models; ignored by models that don’t support progressive synthesis.
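The request fields above can be assembled and checked locally before sending. This is a minimal sketch: `build_speech_payload` is a hypothetical helper, and its checks only mirror the limits stated on this page (required `input` and `voice_id`, the 5000-character cap, the 0–1 ranges). The server remains the authority on validation.

```python
from typing import Any, Optional

MAX_INPUT_CHARS = 5000
UNIT_RANGE_SETTINGS = ("stability", "similarity_boost", "style")


def build_speech_payload(
    text: str,
    voice_id: str,
    model: str = "eleven-multilingual-v2",
    response_format: str = "mp3_44100_128",
    voice_settings: Optional[dict[str, Any]] = None,
    stream: bool = False,
) -> dict[str, Any]:
    """Build the JSON body for POST /v1/audio/speech."""
    if not text:
        raise ValueError("input is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds 5000 characters; chunk it first")
    if not voice_id:
        raise ValueError("voice_id is required; there is no default voice")
    payload: dict[str, Any] = {
        "model": model,
        "input": text,
        "voice_id": voice_id,
        "response_format": response_format,
    }
    if voice_settings:
        # stability, similarity_boost and style are documented as 0-1.
        for key in UNIT_RANGE_SETTINGS:
            value = voice_settings.get(key)
            if value is not None and not 0 <= value <= 1:
                raise ValueError(f"{key} must be between 0 and 1")
        payload["voice_settings"] = dict(voice_settings)
    if stream:
        payload["stream"] = True  # opt-in; omitted otherwise (default false)
    return payload
```

Serializing the result with `json.dumps` gives a body equivalent to the `-d` argument in the curl example.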

Response

200 OK with raw audio bytes. Billing rides on X-Kyma-* response headers — the body stays a clean audio file you can pipe straight to disk or play.
Header               What
Content-Type         matches the requested format (audio/mpeg for mp3, etc.)
X-Kyma-Model         resolved model id
X-Kyma-Chars-Billed  input char count used for pricing
X-Kyma-Cost-USD      actual cost charged
X-Kyma-Balance-USD   remaining balance
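Since billing data arrives only in headers, a client might collect it into a small record after each call. The sketch below assumes the header values are plain decimal strings (for example `"0.0198"`); the exact value format is not specified here, and `SpeechBilling` and `parse_billing` are illustrative names.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping


@dataclass(frozen=True)
class SpeechBilling:
    model: str
    chars_billed: int
    cost_usd: Decimal
    balance_usd: Decimal


def parse_billing(headers: Mapping[str, str]) -> SpeechBilling:
    """Extract the X-Kyma-* billing headers from a response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return SpeechBilling(
        model=h["x-kyma-model"],
        chars_billed=int(h["x-kyma-chars-billed"]),
        cost_usd=Decimal(h["x-kyma-cost-usd"]),
        balance_usd=Decimal(h["x-kyma-balance-usd"]),
    )
```

Any mapping of header names to values works as input, including the header objects exposed by common HTTP clients.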

Errors

Status  error.code       When
400     not_a_tts_model  model is not a TTS SKU
400     voice_required   voice / voice_id missing
400     input_too_long   input text > 5000 chars
400     invalid_request  invalid JSON or missing input
401     auth_error       missing or invalid API key
402     billing_error    balance too low
502     provider_error   upstream TTS provider failure
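A client can branch on these codes to decide whether a retry makes sense. The sketch below assumes the error body is JSON shaped like `{"error": {"code": "..."}}`, which is inferred from the `error.code` column and not confirmed here. Treating only the 502 `provider_error` as retryable is a design choice of this example, not documented behavior.

```python
import json

# Assumption: only upstream provider failures are worth retrying;
# the 400/401/402 cases need a change to the request or the account.
RETRYABLE_CODES = {"provider_error"}


def classify_error(status: int, body: bytes) -> tuple[str, bool]:
    """Return (error_code, should_retry) for a non-200 response."""
    try:
        code = json.loads(body)["error"]["code"]
    except (ValueError, KeyError, TypeError):
        code = "unknown"  # body was not JSON or not the assumed shape
    if not isinstance(code, str):
        code = "unknown"
    return code, (status == 502 and code in RETRYABLE_CODES)
```

For `input_too_long`, the practical fix is to split the text into requests of at most 5000 characters instead of retrying.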

See also