Realtime Audio - Kyma API

Kyma’s realtime audio endpoint streams live conversation audio between your client and Google’s Gemini Live native-audio model. Use it for voice agents, live translation, interactive tutors, and any product where the user speaks and the model speaks back in real time.

# 1. Mint a session
curl -X POST https://api.kymaapi.com/v1/live/sessions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-native-audio-preview-12-2025",
    "config": {
      "speech_config": {
        "language_code": "vi",
        "voice_config": { "prebuilt_voice_config": { "voice_name": "Kore" } }
      }
    }
  }'
# → { "session_token": "...", "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<id>?token=...", ... }

How it works

The realtime endpoint is a two-step flow because of how browser security and Google’s Vertex AI auth model interact:

1. POST /v1/live/sessions mints a short-lived session token (5 min TTL) and returns a ws_url.
2. The client opens a WebSocket to the returned ws_url (wss://api.kymaapi.com/v1/live/proxy/{id}?token=…). Kyma terminates this connection and proxies it to the Vertex Live API using a server-side service-account credential.

Why a server-side proxy: Vertex Live API does not yet expose an ephemeral-token endpoint (js-genai#766, unscheduled). Handing a browser a long-lived Vertex OAuth token would grant project-wide Vertex AI access for an hour — unsafe. The proxy is the architecture Google itself recommends for browser clients (gemini-live-api-examples). Same model, same voices, same languages as the previous AI Studio integration — but the Vertex region unlocks 5000 concurrent sessions per project (vs the 50 cap that AI Studio enforced).

Flow

The client must wait for the BidiGenerateContentSetupComplete frame before sending audio. Frames sent before setup completes will be dropped silently.
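
If capture can start before the setup frame arrives, one way to avoid silently dropped frames is to hold outgoing messages until setup completes. The following is a minimal sketch, not part of the Kyma API: `createAudioGate` and its `send` callback are hypothetical names, and it assumes you would rather buffer early audio than discard it.

```javascript
// Holds outgoing audio messages until the server has sent setupComplete.
// `send` is any function that transmits one message (hypothetical helper).
function createAudioGate(send) {
  let ready = false;
  const pending = [];
  return {
    // Call with every parsed server frame.
    onFrame(frame) {
      if (!ready && frame.setupComplete) {
        ready = true;
        // Flush everything captured while setup was still in progress.
        for (const message of pending.splice(0)) send(message);
      }
    },
    // Call with every outgoing audio message.
    push(message) {
      if (ready) send(message);
      else pending.push(message);
    },
  };
}
```

In a browser this could be wired up as `createAudioGate((m) => ws.send(JSON.stringify(m)))`, with `onFrame` called from the WebSocket message handler.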

Quickstart — browser JS

Mint the session

const resp = await fetch("https://api.kymaapi.com/v1/live/sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      speech_config: {
        language_code: "vi",
        voice_config: { prebuilt_voice_config: { voice_name: "Kore" } },
      },
    },
  }),
});
const session = await resp.json();
// session.ws_url, session.session_token, session.heartbeat_url, session.end_url, ...

Open the WebSocket

const ws = new WebSocket(session.ws_url);
ws.binaryType = "arraybuffer";

ws.addEventListener("message", (event) => {
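  // With binaryType "arraybuffer", JSON frames may arrive as binary; decode before parsing.
  const text = typeof event.data === "string" ? event.data : new TextDecoder().decode(event.data);
  const frame = JSON.parse(text);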
  if (frame.setupComplete) {
    // Ready — safe to start sending audio.
    startCaptureLoop();
    return;
  }
  if (frame.serverContent?.modelTurn?.parts) {
    // PCM16 24kHz audio chunk arrived; see "Play audio response" below.
    handleAudioChunk(frame.serverContent.modelTurn.parts);
  }
  if (frame.serverContent?.interrupted) {
    // User started speaking over the model — stop any queued playback.
    stopPlayback();
  }
});

Capture mic with AudioWorklet

MediaRecorder outputs a compressed container format (Opus in WebM), not raw PCM. Vertex Live requires PCM16 at 16 kHz, mono. Use an AudioWorklet to tap raw Float32 samples from getUserMedia, downsample 48 kHz → 16 kHz, and quantize to Int16. See google-gemini/gemini-live-api-examples for a complete browser implementation (Apache-2.0).

Pseudo-code outline:

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 48000 });
await audioContext.audioWorklet.addModule("/pcm-recorder-worklet.js");
const source = audioContext.createMediaStreamSource(stream);
const worklet = new AudioWorkletNode(audioContext, "pcm-recorder");

worklet.port.onmessage = (event) => {
  // event.data is Int16Array at 16kHz from the worklet.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(event.data.buffer)));
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64 }],
    },
  }));
};

source.connect(worklet);
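
The outline above assumes a worklet at /pcm-recorder-worklet.js that already emits Int16 samples at 16 kHz. The conversion such a worklet would need can be sketched as a plain function. `downsampleToInt16` is a hypothetical name, and averaging each group of input samples is a deliberately simple stand-in for a proper low-pass filter, so treat this as a starting point rather than production-quality resampling.

```javascript
// Converts Float32 samples in [-1, 1] to Int16 at a lower sample rate.
// Only integer ratios (for example 48000 -> 16000) are handled here.
function downsampleToInt16(input, inputRate = 48000, outputRate = 16000) {
  const ratio = inputRate / outputRate;
  if (!Number.isInteger(ratio) || ratio < 1) {
    throw new Error("inputRate must be an integer multiple of outputRate");
  }
  const output = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < output.length; i++) {
    // Average `ratio` consecutive samples (a crude box filter).
    let sum = 0;
    for (let j = 0; j < ratio; j++) sum += input[i * ratio + j];
    // Clamp, then scale to the signed 16-bit range.
    const sample = Math.max(-1, Math.min(1, sum / ratio));
    output[i] = sample < 0 ? Math.round(sample * 32768) : Math.round(sample * 32767);
  }
  return output;
}
```

Each 128-sample render quantum at 48 kHz yields 42 output samples with two input samples left over, so a real worklet would typically accumulate input across calls before converting.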

Play audio response

The model returns PCM16 24 kHz audio chunks, base64-encoded inside JSON frames. Decode each chunk to a Float32Array and schedule it on an AudioBufferSourceNode:

const playbackContext = new AudioContext({ sampleRate: 24000 });
let nextPlayTime = playbackContext.currentTime;

function handleAudioChunk(parts) {
  for (const part of parts) {
    if (!part.inlineData?.data) continue;
    const bytes = Uint8Array.from(atob(part.inlineData.data), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = Float32Array.from(int16, v => v / 32768);
    const buffer = playbackContext.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackContext.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackContext.destination);
    nextPlayTime = Math.max(nextPlayTime, playbackContext.currentTime);
    source.start(nextPlayTime);
    nextPlayTime += buffer.duration;
  }
}
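
The message handler earlier calls `stopPlayback()` when an `interrupted` frame arrives, but the snippet above keeps no reference to the sources it schedules. One possible shape, offered as a sketch only: a small queue that remembers scheduled sources so they can be cancelled. `PlaybackQueue` is a hypothetical name, and it relies only on the standard `createBufferSource`, `start`, `stop`, and `onended` members of the Web Audio API.

```javascript
// Schedules audio buffers back to back and can cancel everything still queued.
class PlaybackQueue {
  constructor(context) {
    this.context = context; // an AudioContext, or any object with the same members
    this.nextPlayTime = 0;
    this.active = new Set();
  }

  // Schedule one AudioBuffer directly after the previously queued one.
  enqueue(buffer) {
    const source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.context.destination);
    source.onended = () => this.active.delete(source);
    this.nextPlayTime = Math.max(this.nextPlayTime, this.context.currentTime);
    source.start(this.nextPlayTime);
    this.nextPlayTime += buffer.duration;
    this.active.add(source);
  }

  // Stop playing and pending sources, then reset the schedule.
  stop() {
    for (const source of this.active) {
      source.onended = null;
      source.stop();
    }
    this.active.clear();
    this.nextPlayTime = this.context.currentTime;
  }
}
```

With this in place, `handleAudioChunk` could call `queue.enqueue(buffer)` instead of starting sources itself, and `stopPlayback` could simply be `() => queue.stop()`.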

Heartbeat and end

const heartbeat = setInterval(() => {
  fetch(session.heartbeat_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}, session.heartbeat_interval_ms); // 30000

// On disconnect:
async function endSession() {
  clearInterval(heartbeat);
  ws.close();
  await fetch(session.end_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}

Audio format reference

Direction	Format	Sample rate	Channels	Encoding
Client → server	PCM16 (signed 16-bit little-endian)	16,000 Hz	mono	base64 string in JSON `mediaChunks[].data`
Server → client	PCM16 (signed 16-bit little-endian)	24,000 Hz	mono	base64 string in JSON `inlineData.data`

Note the asymmetric sample rates. Playback buffers must be created at 24 kHz; treating the model's samples as 16 kHz audio plays the voice at two-thirds speed and lowers its pitch.
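
To get a feel for message sizes, the table can be turned into a quick calculation. This is a sketch for estimation only: the 100 ms chunk duration is an assumption chosen for illustration, not a value required by the API, and `pcm16Bytes` and `base64Length` are hypothetical helper names.

```javascript
// Bytes of raw PCM16 mono audio for a given duration (2 bytes per sample).
function pcm16Bytes(sampleRateHz, durationMs) {
  return Math.round((sampleRateHz * durationMs) / 1000) * 2;
}

// Length of the padded base64 string that encodes `byteLength` bytes.
function base64Length(byteLength) {
  return Math.ceil(byteLength / 3) * 4;
}

const upstreamChunkBytes = pcm16Bytes(16000, 100);            // 3200 bytes per 100 ms of mic audio
const upstreamChunkChars = base64Length(upstreamChunkBytes);  // 4268 base64 characters
const downstreamBytesPerSecond = pcm16Bytes(24000, 1000);     // 48000 bytes per second of model audio
```

Base64 adds roughly one third on top of the raw PCM size, before any JSON framing.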

Configuration

The mint request body accepts:

model (string, required)

Currently only gemini-2.5-flash-native-audio-preview-12-2025 is supported.

config.speech_config.language_code (string, default: "en")

BCP-47 short code. One of: ar, bn, de, en, es, fa, fr, hi, id, it, ja, ko, nl, pl, pt, ru, sv, ta, te, th, tr, ur, vi, zh

config.speech_config.voice_config.prebuilt_voice_config.voice_name (string, default: "Kore")

One of 30 prebuilt voices: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi

config.system_instruction (string)

Currently dropped on the Vertex proxy path. If you pass a custom instruction it will be ignored and the default conversational persona is used. Threading this through the token store is a planned follow-up. Contact Kyma if you need it before the upstream patch lands.

Response shape

200 OK from POST /v1/live/sessions:

{
  "session_token": "f3c2b1d4e5a6...",
  "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<uuid>?token=<session_token>",
  "expires_at": 1700000000,
  "kyma_session_id": "<uuid>",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "heartbeat_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/heartbeat",
  "end_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/end",
  "heartbeat_interval_ms": 30000
}

The session_token is single-use — once consumed by your WebSocket handshake it is invalidated, so a leaked token cannot be replayed. The token TTL is 5 minutes; if you don’t open the WebSocket within that window the session is reaped and you can mint a fresh one.
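
If there may be a delay between minting and connecting (for example while waiting for microphone permission), a client could check the token's remaining lifetime before opening the socket. The sketch below assumes `expires_at` is a Unix timestamp in seconds, which the sample value suggests but the response description does not state. `canOpenSocket` and the 5-second safety margin are illustrative choices.

```javascript
// Returns true if the session token should still be valid when the handshake happens.
// Assumption: session.expires_at is a Unix timestamp in seconds.
function canOpenSocket(session, nowMs = Date.now(), safetyMarginMs = 5000) {
  return nowMs + safetyMarginMs < session.expires_at * 1000;
}
```

When this returns false, minting a fresh session is the safer path than attempting a handshake that is likely to be rejected.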

Pricing

Item	Rate
Active session	$0.0389 / minute
Initial collateral hold	$0.20 (placed on session create)
Unused minutes	Refunded on session end

Kyma marks up Google’s wholesale rate ($0.0288/min worst case) by 1.35× — same markup convention as every other Kyma model. Sessions are settled minute-by-minute; if you end after 4m30s the 5th minute is refunded.
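
The figures above can be reproduced with a few lines of arithmetic. This sketch only restates numbers given on this page (wholesale rate, markup, hold, and the 30-minute session cap); the rounding to four decimal places is an assumption made to match the listed per-minute rate, and all constant names are hypothetical.

```javascript
const WHOLESALE_PER_MIN = 0.0288; // Google's worst-case wholesale rate, USD per minute
const MARKUP = 1.35;              // Kyma markup factor
const HOLD_USD = 0.20;            // collateral hold placed on session create
const MAX_SESSION_MIN = 30;       // maximum session duration in minutes

// Round to a fixed number of decimal places.
const roundTo = (value, places) => Math.round(value * 10 ** places) / 10 ** places;

const ratePerMin = roundTo(WHOLESALE_PER_MIN * MARKUP, 4);       // 0.0389
const minutesCoveredByHold = Math.floor(HOLD_USD / ratePerMin);  // 5 full minutes
const maxSessionCost = roundTo(ratePerMin * MAX_SESSION_MIN, 3); // 1.167
```

So the initial hold covers about five billed minutes, and a session that runs to the 30-minute cap costs roughly $1.17.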

Limits

Limit	Value	Behavior on breach
Concurrent sessions per user	Tier 0 (free): 8. Paid tiers: uncapped (bounded by balance + per-session hold). Combined across providers.	`429 too_many_sessions` (Tier 0 only)
Max session duration	30 minutes (1800s)	Auto-close, unused minutes refunded
Session token TTL	5 minutes	WS handshake rejected; mint a new session
Heartbeat interval	30 seconds	Required to prove liveness
Heartbeat timeout	90 seconds	Session reaped, unused minutes refunded

Need higher concurrency? See the "Need higher limits?" section of Rate Limits.

Errors

POST /v1/live/sessions:

Status	`error.type`	`error.code`	When
`400`	`invalid_request`	—	Body missing / invalid, unsupported language, unsupported voice
`400`	`model_not_found`	—	Wrong model id
`401`	`unauthorized`	—	Missing API key / session token
`402`	`billing_error`	`insufficient_balance`	Balance < $0.20 hold
`429`	`rate_limit_error`	`too_many_sessions`	Free-tier (Tier 0) user at the 8-session cap; paid tiers are uncapped
`500`	`internal_error`	—	DB insert or hold RPC failed
`503`	`service_unavailable`	—	Live sessions disabled or Vertex token mint failed
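
A client might map these responses to a next step along the following lines. This is a sketch under assumptions: the body shape `{ error: { type, code } }` is inferred from the column names above, the suggested actions (including retrying on 500 and 503) are one possible policy rather than documented Kyma guidance, and `classifyMintError` is a hypothetical helper.

```javascript
// Maps a failed POST /v1/live/sessions response to a suggested client action.
// Assumed body shape: { error: { type, code } }.
function classifyMintError(status, body) {
  const type = body?.error?.type ?? null;
  const code = body?.error?.code ?? null;
  let action;
  if (status === 400) action = "fix_request";            // bad body, language, voice, or model id
  else if (status === 401) action = "check_credentials"; // missing API key or session token
  else if (status === 402) action = "top_up";            // balance below the $0.20 hold
  else if (status === 429) action = "end_a_session";     // free-tier concurrent session cap
  else if (status === 500 || status === 503) action = "retry_later";
  else action = "unknown";
  return { status, type, code, action };
}
```

For example, `classifyMintError(resp.status, await resp.json())` could run whenever `resp.ok` is false, before deciding whether to surface the error to the user or retry with backoff.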

WebSocket close codes:

Code	Reason	When
`1000`	normal	Client / server closed cleanly
`1011`	`vertex_auth_failed`	Vertex SA token mint failed at proxy open
`1011`	`upstream_construct_failed`	Vertex WebSocket constructor threw

Reference

POST /v1/live/sessions — mint a session token
POST /v1/live/sessions/{id}/heartbeat — keep session alive (every 30s)
POST /v1/live/sessions/{id}/end — close session and settle billing
wss://api.kymaapi.com/v1/live/proxy/{id}?token=… — WebSocket bidirectional audio