Kyma’s realtime audio endpoint streams live conversation audio between your client and Google’s Gemini Live native-audio model. Use it for voice agents, live translation, interactive tutors, and any product where the user speaks and the model speaks back in real time.
# 1. Mint a session
curl -X POST https://api.kymaapi.com/v1/live/sessions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-native-audio-preview-12-2025",
    "config": {
      "speech_config": {
        "language_code": "vi",
        "voice_config": { "prebuilt_voice_config": { "voice_name": "Kore" } }
      }
    }
  }'
# → { "session_token": "...", "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<id>?token=...", ... }
How it works
The realtime endpoint is a two-step flow because of how browser security and Google’s Vertex AI auth model interact:
- POST mints a short-lived session token (5 min TTL) and returns a ws_url.
- WebSocket opens to wss://api.kymaapi.com/v1/live/proxy/{id}?token=… and Kyma terminates this connection, proxying it to the Vertex Live API using a server-side service-account credential.
Why a server-side proxy: Vertex Live API does not yet expose an ephemeral-token endpoint (js-genai#766, unscheduled). Handing a browser a long-lived Vertex OAuth token would grant project-wide Vertex AI access for an hour — unsafe. The proxy is the architecture Google itself recommends for browser clients (gemini-live-api-examples).
Same model, same voices, same languages as the previous AI Studio integration — but the Vertex region unlocks 5000 concurrent sessions per project (vs the 50 cap that AI Studio enforced).
Flow
The client must wait for the BidiGenerateContentSetupComplete frame before sending audio. Frames sent before setup completes will be dropped silently.
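One way to honor this ordering is to gate outgoing audio behind a flag that flips when the setup-complete frame arrives. The sketch below is a minimal illustration, not part of any Kyma SDK: `createSetupGate`, `markReady`, and the `maxQueued` bound are hypothetical names, and `send` stands in for whatever function writes a frame to the WebSocket.

```javascript
// Hypothetical helper: holds audio frames until the server confirms setup.
function createSetupGate(send, maxQueued = 50) {
  let ready = false;
  const queue = [];
  return {
    // Call when a frame with `setupComplete` arrives; flushes held frames in order.
    markReady() {
      ready = true;
      while (queue.length > 0) send(queue.shift());
    },
    // Sends immediately once ready; before that, keeps only the newest frames.
    push(frame) {
      if (ready) {
        send(frame);
        return;
      }
      queue.push(frame);
      if (queue.length > maxQueued) queue.shift();
    },
    get isReady() {
      return ready;
    },
  };
}

// Usage with a stand-in for ws.send:
const sent = [];
const gate = createSetupGate((frame) => sent.push(frame), 2);
gate.push("a");
gate.push("b");
gate.push("c"); // "a" is discarded locally because the queue is capped at 2
gate.markReady();
gate.push("d");
console.log(sent); // [ 'b', 'c', 'd' ]
```

Capping the queue keeps latency bounded: audio captured long before setup completes is usually stale by the time it could be sent.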
Quickstart — browser JS
Mint the session
const resp = await fetch("https://api.kymaapi.com/v1/live/sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      speech_config: {
        language_code: "vi",
        voice_config: { prebuilt_voice_config: { voice_name: "Kore" } },
      },
    },
  }),
});
// Surface mint failures (400, 401, 402, 429, ...) instead of parsing an error body as a session.
if (!resp.ok) throw new Error(`Session mint failed: HTTP ${resp.status}`);
const session = await resp.json();
// session.ws_url, session.session_token, session.heartbeat_url, session.end_url, ...
Open the WebSocket
const ws = new WebSocket(session.ws_url);
ws.binaryType = "arraybuffer";
const decoder = new TextDecoder();
ws.addEventListener("message", (event) => {
  // With binaryType "arraybuffer", binary frames arrive as ArrayBuffer; decode them to text before parsing.
  const text = typeof event.data === "string" ? event.data : decoder.decode(event.data);
  const frame = JSON.parse(text);
  if (frame.setupComplete) {
    // Ready: safe to start sending audio.
    startCaptureLoop();
    return;
  }
  if (frame.serverContent?.modelTurn?.parts) {
    // PCM16 24kHz audio chunk arrived; see "Play audio response".
    handleAudioChunk(frame.serverContent.modelTurn.parts);
  }
  if (frame.serverContent?.interrupted) {
    // User started speaking over the model: stop any queued playback.
    stopPlayback();
  }
});
Capture mic with AudioWorklet
MediaRecorder outputs an Opus/WebM container, not raw PCM, and Vertex Live requires PCM16 16kHz mono. Use an AudioWorklet to tap raw Float32 samples from getUserMedia, downsample 48kHz → 16kHz, and quantize to Int16. See google-gemini/gemini-live-api-examples for a complete browser implementation (Apache-2.0). Pseudo-code outline:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 48000 });
await audioContext.audioWorklet.addModule("/pcm-recorder-worklet.js");
const source = audioContext.createMediaStreamSource(stream);
const worklet = new AudioWorkletNode(audioContext, "pcm-recorder");
worklet.port.onmessage = (event) => {
  // event.data is Int16Array at 16kHz from the worklet.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(event.data.buffer)));
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64 }],
    },
  }));
};
source.connect(worklet);
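The worklet file referenced above (/pcm-recorder-worklet.js) is not shown on this page. As a rough sketch of the conversion it would need to perform, the function below averages each group of input samples to go from 48 kHz to 16 kHz, then clamps and scales to Int16. The name `downsampleToInt16` and the simple box-filter averaging are assumptions for illustration; a production worklet may prefer a proper low-pass filter.

```javascript
// Hypothetical helper: Float32 samples in [-1, 1] at inputRate -> Int16 samples at outputRate.
function downsampleToInt16(input, inputRate = 48000, outputRate = 16000) {
  if (inputRate < outputRate) {
    throw new RangeError("inputRate must be >= outputRate");
  }
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Average the window of input samples that maps onto output sample i.
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    // Clamp to [-1, 1] so loud input cannot overflow the Int16 range.
    const sample = Math.max(-1, Math.min(1, sum / (end - start)));
    out[i] = sample < 0 ? Math.round(sample * 32768) : Math.round(sample * 32767);
  }
  return out;
}

const pcm = downsampleToInt16(new Float32Array([1, 1, 1, -1, -1, -1, 0, 0, 0]));
console.log(Array.from(pcm)); // [ 32767, -32768, 0 ]
```

Inside a real AudioWorkletProcessor this would run on each `process()` call and post the resulting Int16Array to the main thread.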
Play audio response
The model returns PCM16 24kHz audio chunks, base64-encoded inside JSON. Decode each chunk to a Float32Array and schedule it on an AudioBufferSourceNode:
const playbackContext = new AudioContext({ sampleRate: 24000 });
let nextPlayTime = playbackContext.currentTime;
function handleAudioChunk(parts) {
  for (const part of parts) {
    if (!part.inlineData?.data) continue;
    const bytes = Uint8Array.from(atob(part.inlineData.data), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = Float32Array.from(int16, v => v / 32768);
    const buffer = playbackContext.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackContext.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackContext.destination);
    // Schedule back-to-back; never schedule in the past if playback fell behind.
    nextPlayTime = Math.max(nextPlayTime, playbackContext.currentTime);
    source.start(nextPlayTime);
    nextPlayTime += buffer.duration;
  }
}
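The message handler calls stopPlayback() when the model is interrupted, but its body is not shown on this page. One possible shape, sketched under the assumption that you keep a reference to every scheduled source node, is below. `createPlaybackQueue` and its method names are hypothetical; the Web Audio calls it relies on (createBuffer, createBufferSource, start, stop) are standard.

```javascript
// Hypothetical helper: schedules chunks back-to-back and can cancel everything queued.
function createPlaybackQueue(context) {
  const active = new Set();
  let nextPlayTime = 0;
  return {
    enqueue(float32, sampleRate = 24000) {
      const buffer = context.createBuffer(1, float32.length, sampleRate);
      buffer.copyToChannel(float32, 0);
      const source = context.createBufferSource();
      source.buffer = buffer;
      source.connect(context.destination);
      // Forget the node once it finishes playing on its own.
      source.onended = () => active.delete(source);
      nextPlayTime = Math.max(nextPlayTime, context.currentTime);
      source.start(nextPlayTime);
      nextPlayTime += buffer.duration;
      active.add(source);
    },
    // Stop every scheduled node and restart scheduling from "now".
    stop() {
      for (const source of active) {
        source.onended = null;
        source.stop();
      }
      active.clear();
      nextPlayTime = context.currentTime;
    },
    get pending() {
      return active.size;
    },
  };
}

// In the browser (assumed wiring):
//   const queue = createPlaybackQueue(new AudioContext({ sampleRate: 24000 }));
//   const stopPlayback = () => queue.stop();
```

Resetting the schedule cursor in `stop()` matters: otherwise the next response would start only after the cancelled audio would have ended.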
Heartbeat and end
const heartbeat = setInterval(() => {
  fetch(session.heartbeat_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}, session.heartbeat_interval_ms); // 30000
// On disconnect:
async function endSession() {
  clearInterval(heartbeat);
  ws.close();
  await fetch(session.end_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}
| Direction | Format | Sample rate | Channels | Encoding |
|---|---|---|---|---|
| Client → server | PCM16 (signed 16-bit little-endian) | 16,000 Hz | mono | base64 string in JSON mediaChunks[].data |
| Server → client | PCM16 (signed 16-bit little-endian) | 24,000 Hz | mono | base64 string in JSON inlineData.data |
Note the asymmetric sample rates. The output audio must be played back as 24 kHz; treating it as 16 kHz plays the model’s voice at two-thirds speed, slowed down and lowered in pitch.
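To make the encoding column concrete, here is a small sketch of the base64 PCM16 round trip in both directions. It uses DataView so the byte order is little-endian regardless of the host CPU. The function names (`encodePcm16Base64`, `decodePcm16Base64`, `pcm16DurationMs`) are illustrative assumptions, not part of the Kyma API.

```javascript
// Encode Int16 samples as base64 PCM16, always little-endian.
function encodePcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(i * 2, samples[i], true); // true = little-endian
  }
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Decode base64 PCM16 (little-endian) back into Int16 samples.
function decodePcm16Base64(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const samples = new Int16Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) samples[i] = view.getInt16(i * 2, true);
  return samples;
}

// Milliseconds of mono PCM16 audio held in a chunk of the given byte length.
function pcm16DurationMs(byteLength, sampleRate) {
  return (byteLength / 2 / sampleRate) * 1000;
}

const encoded = encodePcm16Base64(Int16Array.of(0, 1, -1, 32767, -32768));
console.log(Array.from(decodePcm16Base64(encoded))); // [ 0, 1, -1, 32767, -32768 ]
console.log(pcm16DurationMs(48000, 24000)); // 1000
console.log(pcm16DurationMs(32000, 16000)); // 1000
```

Building the binary string in a loop avoids the argument-count limit that spreading a large array into String.fromCharCode can hit.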
Configuration
The mint request body accepts:
model
Currently only gemini-2.5-flash-native-audio-preview-12-2025 is supported.
config.speech_config.language_code
BCP-47 short code. One of: ar, bn, de, en, es, fa, fr, hi, id, it, ja, ko, nl, pl, pt, ru, sv, ta, te, th, tr, ur, vi, zh
config.speech_config.voice_config.prebuilt_voice_config.voice_name
One of 30 prebuilt voices: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi
config.system_instruction
Currently dropped on the Vertex proxy path. If you pass a custom instruction it will be ignored and the default conversational persona is used. Threading this through the token store is a planned follow-up. Contact Kyma if you need it before the upstream patch lands.
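Since unsupported languages and voices are rejected with a 400, it can help to validate before minting. The following is a minimal sketch: `buildMintBody` is a hypothetical helper, and the sets are copied from the lists above, so they would go stale if the supported values change.

```javascript
const SUPPORTED_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025";

const SUPPORTED_LANGUAGES = new Set(
  "ar bn de en es fa fr hi id it ja ko nl pl pt ru sv ta te th tr ur vi zh".split(" ")
);

const SUPPORTED_VOICES = new Set([
  "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe",
  "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux",
  "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Pulcherrima", "Puck",
  "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel",
  "Vindemiatrix", "Zephyr", "Zubenelgenubi",
]);

// Hypothetical helper: returns a request body in the documented shape, or throws early.
function buildMintBody({ languageCode, voiceName }) {
  if (!SUPPORTED_LANGUAGES.has(languageCode)) {
    throw new RangeError(`Unsupported language_code: ${languageCode}`);
  }
  if (!SUPPORTED_VOICES.has(voiceName)) {
    throw new RangeError(`Unsupported voice_name: ${voiceName}`);
  }
  return {
    model: SUPPORTED_MODEL,
    config: {
      speech_config: {
        language_code: languageCode,
        voice_config: { prebuilt_voice_config: { voice_name: voiceName } },
      },
    },
  };
}

console.log(JSON.stringify(buildMintBody({ languageCode: "vi", voiceName: "Kore" })));
```

The sketch omits system_instruction on purpose, because it is currently ignored on the proxy path.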
Response shape
200 OK from POST /v1/live/sessions:
{
  "session_token": "f3c2b1d4e5a6...",
  "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<uuid>?token=<session_token>",
  "expires_at": 1700000000,
  "kyma_session_id": "<uuid>",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "heartbeat_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/heartbeat",
  "end_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/end",
  "heartbeat_interval_ms": 30000
}
The session_token is single-use — once consumed by your WebSocket handshake it is invalidated, so a leaked token cannot be replayed. The token TTL is 5 minutes; if you don’t open the WebSocket within that window the session is reaped and you can mint a fresh one.
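A client can check the remaining token lifetime before attempting the handshake. This sketch assumes `expires_at` is a Unix timestamp in seconds, which the sample value suggests but this page does not state. The helper names are hypothetical.

```javascript
// Assumption: session.expires_at is Unix time in seconds.
function tokenSecondsRemaining(session, nowMs = Date.now()) {
  return Math.max(0, session.expires_at - Math.floor(nowMs / 1000));
}

// A token is only worth trying if it is unexpired and has not already been
// consumed by a previous WebSocket handshake (tokens are single-use).
function canOpenSocket(session, alreadyUsed, nowMs = Date.now()) {
  return !alreadyUsed && tokenSecondsRemaining(session, nowMs) > 0;
}

const example = { expires_at: 1700000000 };
const twoMinutesBefore = (1700000000 - 120) * 1000;
console.log(tokenSecondsRemaining(example, twoMinutesBefore)); // 120
console.log(canOpenSocket(example, false, twoMinutesBefore)); // true
console.log(canOpenSocket(example, true, twoMinutesBefore)); // false
```

When `canOpenSocket` returns false, the recovery path is the same in both cases: mint a fresh session rather than retrying the old ws_url.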
Pricing
| Item | Rate |
|---|---|
| Active session | $0.0389 / minute |
| Initial collateral hold | $0.20 (placed on session create) |
| Unused minutes | Refunded on session end |
Kyma marks up Google’s wholesale rate ($0.0288/min worst case) by 1.35× — same markup convention as every other Kyma model. Sessions are settled minute-by-minute; if you end after 4m30s the 5th minute is refunded.
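The arithmetic can be checked with a short sketch. It works in integer micro-dollars to avoid floating-point drift, and it assumes that only fully elapsed minutes are charged, which is one reading of the 4m30s example above rather than a documented billing rule. `estimateCharge` is a hypothetical name.

```javascript
const WHOLESALE_MICROS_PER_MINUTE = 28800; // $0.0288, Google's worst-case wholesale rate
const PUBLISHED_MICROS_PER_MINUTE = 38900; // $0.0389, the published active-session rate

// 1.35x markup computed with integers: 28800 * 135 / 100 = 38880 ($0.03888),
// which rounds to the published $0.0389.
const markedUpMicros = (WHOLESALE_MICROS_PER_MINUTE * 135) / 100;

// Assumption: a partially used minute is not charged (floor), per the 4m30s example.
function estimateCharge(elapsedSeconds) {
  const billedMinutes = Math.floor(elapsedSeconds / 60);
  const chargeMicros = billedMinutes * PUBLISHED_MICROS_PER_MINUTE;
  return { billedMinutes, chargeMicros, chargeUsd: chargeMicros / 1e6 };
}

console.log(markedUpMicros); // 38880
console.log(estimateCharge(270)); // { billedMinutes: 4, chargeMicros: 155600, chargeUsd: 0.1556 }
console.log(estimateCharge(1800).chargeUsd); // 1.167
```

Under that assumption a full 30-minute session costs about $1.17.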
Limits
| Limit | Value | Behavior on breach |
|---|---|---|
| Concurrent sessions per user | Tier 0 (free): 8. Paid tiers: uncapped (bounded by balance + per-session hold). Combined across providers. | 429 too_many_sessions (Tier 0 only) |
| Max session duration | 30 minutes (1800s) | Auto-close, unused minutes refunded |
| Session token TTL | 5 minutes | WS handshake rejected; mint a new session |
| Heartbeat interval | 30 seconds | Required to prove liveness |
| Heartbeat timeout | 90 seconds | Session reaped, unused minutes refunded |
Need higher concurrency? See the “Need higher limits?” section of Rate Limits.
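The duration and heartbeat limits combine into a single deadline a client can track locally. The sketch below assumes the 90-second timeout is measured from the last heartbeat the server received, and that the 30-minute cap is measured from session start; neither detail is spelled out in the table, and both function names are hypothetical.

```javascript
const MAX_SESSION_MS = 1800 * 1000; // 30 minutes
const HEARTBEAT_INTERVAL_MS = 30 * 1000;
const HEARTBEAT_TIMEOUT_MS = 90 * 1000;

// Earliest moment at which the server would close or reap the session.
function sessionDeadlineMs(startedAtMs, lastHeartbeatMs) {
  return Math.min(startedAtMs + MAX_SESSION_MS, lastHeartbeatMs + HEARTBEAT_TIMEOUT_MS);
}

// Consecutive heartbeats that can fail before the timeout window is exhausted.
function missedHeartbeatsTolerated(intervalMs = HEARTBEAT_INTERVAL_MS, timeoutMs = HEARTBEAT_TIMEOUT_MS) {
  return Math.ceil(timeoutMs / intervalMs) - 1;
}

console.log(sessionDeadlineMs(0, 60000)); // 150000 (heartbeat timeout is the binding limit)
console.log(sessionDeadlineMs(0, 1770000)); // 1800000 (max duration is the binding limit)
console.log(missedHeartbeatsTolerated()); // 2
```

With the default values, two consecutive failed heartbeats are survivable as long as the third one lands before the 90-second mark.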
Errors
POST /v1/live/sessions:
| Status | error.type | error.code | When |
|---|---|---|---|
| 400 | invalid_request | — | Body missing / invalid, unsupported language, unsupported voice |
| 400 | model_not_found | — | Wrong model id |
| 401 | unauthorized | — | Missing API key / session token |
| 402 | billing_error | insufficient_balance | Balance < $0.20 hold |
| 429 | rate_limit_error | too_many_sessions | Free-tier (Tier 0) user at the 8-session cap; paid tiers are uncapped |
| 500 | internal_error | — | DB insert or hold RPC failed |
| 503 | service_unavailable | — | Live sessions disabled or Vertex token mint failed |
WebSocket close codes:
| Code | Reason | When |
|---|---|---|
| 1000 | normal | Client / server closed cleanly |
| 1011 | vertex_auth_failed | Vertex SA token mint failed at proxy open |
| 1011 | upstream_construct_failed | Vertex WebSocket constructor threw |
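A client might map these statuses and close codes to a next step along the following lines. The action labels are this sketch's own interpretation of the tables, not documented Kyma guidance, and both function names are hypothetical. The one inference taken from this page is that any reconnect needs a newly minted session, because session tokens are single-use.

```javascript
// Suggested next step after a failed POST /v1/live/sessions (labels are assumptions).
function nextStepForMintError(status) {
  switch (status) {
    case 400:
      return "fix_request"; // bad body, language, voice, or model id
    case 401:
      return "check_api_key";
    case 402:
      return "add_balance"; // balance below the $0.20 hold
    case 429:
      return "wait_for_free_slot"; // Tier 0 user at the 8-session cap
    case 500:
    case 503:
      return "retry_with_backoff";
    default:
      return "unknown";
  }
}

// Suggested next step after the WebSocket closes (labels are assumptions).
function nextStepForClose(code) {
  if (code === 1000) return "done";
  if (code === 1011) return "mint_new_session"; // the old token cannot be reused
  return "unknown";
}

console.log(nextStepForMintError(402)); // add_balance
console.log(nextStepForClose(1011)); // mint_new_session
```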
Reference
POST /v1/live/sessions — mint a session token
POST /v1/live/sessions/{id}/heartbeat — keep session alive (every 30s)
POST /v1/live/sessions/{id}/end — close session and settle billing
wss://api.kymaapi.com/v1/live/proxy/{id}?token=… — WebSocket bidirectional audio
See also