Prompt Caching - Kyma API

Prompt caching reduces costs and latency by reusing previously computed context. When you send requests with the same system prompt, tool definitions, or conversation history, cached tokens are charged at a discounted rate.

See real-time cache stats across all models on the Rankings page.

How It Works

First request: Full prompt is processed and cached by the provider
Subsequent requests: Cached prefix is reused — up to 90% cheaper and 80% faster

Caching works automatically for supported providers. No code changes required for most use cases.

Real-World Impact

Based on production data from the Kyma community (7-day rolling window):

22%+ overall cache hit rate across all models
deepseek-v3 leads with a 56% cache hit rate, driven by heavy agentic usage
gemini-2.5-flash and gemma-4-31b consistently hit 20-35%
Coding agents (OpenClaw, Cline, Roo Code) see the highest cache rates due to repeated system prompts

Check the live numbers at kymaapi.com/rankings?tab=cache.

Automatic Caching

For OpenAI-compatible requests, caching is automatic when your prompt exceeds 1,024 tokens. Place static content (system prompt, tool definitions) at the beginning:

from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key"
)

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        # Static content (cached after first request)
        {"role": "system", "content": "Your long system prompt here..."},
        # Dynamic content (changes each request, so it is not served from cache)
        {"role": "user", "content": "User's question"}
    ]
)

# Check cache stats in response
print(f"Cached: {response.usage.cached_tokens}")
print(f"Cost: ${response.usage.cost}")
print(f"Saved: ${response.usage.cache_discount}")

Best Practices

Structure prompts for caching

Place stable content first, dynamic content last:

System instructions (static) ← CACHED
Tool definitions (static)    ← CACHED
Few-shot examples (static)   ← CACHED
Conversation history          ← CACHED (grows incrementally)
Current user message          ← NOT CACHED (changes each request)
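
The ordering above can be sketched as a small helper. This is a minimal illustration, not part of the Kyma API: `build_messages` and `shared_prefix_len` are hypothetical names, and tool definitions are assumed to travel separately in the request's `tools` parameter. The point it shows is that when history only grows by appending, the whole previous request remains a prefix of the next one.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str,
                   few_shot: List[Message],
                   history: List[Message],
                   user_message: str) -> List[Message]:
    """Assemble messages with stable content first and the new user turn last."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    messages.extend(few_shot)   # static examples, identical on every request
    messages.extend(history)    # append-only, so earlier turns stay a stable prefix
    messages.append({"role": "user", "content": user_message})  # dynamic part
    return messages


def shared_prefix_len(a: List[Message], b: List[Message]) -> int:
    """Count how many leading messages two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


few_shot = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
first = build_messages("You are a helpful assistant.", few_shot, [], "First question")
second = build_messages(
    "You are a helpful assistant.",
    few_shot,
    [{"role": "user", "content": "First question"},
     {"role": "assistant", "content": "First answer"}],
    "Second question",
)
# Every message of the first request reappears, in order, at the start of the second.
print(shared_prefix_len(first, second), "of", len(first))
```

If the shared prefix ever comes out shorter than the previous request, something near the top of the prompt changed and the provider will likely have to reprocess everything after that point.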

For coding agents

Coding agents (OpenClaw, Cline, Roo Code, Claude Code) automatically benefit from caching because they send the same system prompt + tool definitions with every request. Real production example — 50-request coding session with deepseek-v3:

Metric	Without caching	With caching (56% hit rate)
Effective input tokens	250,000	110,000
Input cost	$0.203	$0.049
Savings	—	$0.154 (76%)

What to avoid

Don’t put timestamps or request IDs in the system prompt; any change to the prefix breaks the cache
Don’t reorder tool definitions between requests
Don’t edit the system prompt between requests; keep it identical
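
One way to follow these rules is sketched below. It rests on assumptions: `build_request` and `stable_tools` are hypothetical helpers, the tool entries follow the OpenAI-style `{"type": "function", "function": {"name": ...}}` shape, and moving the timestamp into the user message is only one possible place for per-request data.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

SYSTEM_PROMPT = "You are a coding assistant. Follow the project style guide."


def stable_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return tool definitions in a deterministic order (sorted by function name)."""
    return sorted(tools, key=lambda tool: tool["function"]["name"])


def build_request(tools: List[Dict[str, Any]],
                  user_message: str,
                  now: Optional[datetime] = None) -> Dict[str, Any]:
    """Keep the system prompt and tool order fixed; put per-request data last."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "messages": [
            # Static: no timestamp or request ID in here.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Dynamic: per-request details live in the final message instead.
            {"role": "user", "content": f"[{now.isoformat()}] {user_message}"},
        ],
        "tools": stable_tools(tools),
    }
```

With this shape, two requests built at different times and with the tools passed in a different order still share the same system message and the same tool list.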

Cache Stats in Response

Kyma normalizes cache statistics from all providers into a unified format:

{
  "usage": {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
    "cost": 0.000382,
    "cache_discount": 0.002430
  }
}

Field	Description
`cached_tokens`	Tokens read from cache (90% discounted)
`cache_write_tokens`	Tokens written to cache on first request
`cost`	Total cost charged for this request (USD)
`cache_discount`	Amount saved from caching (USD)

These fields appear in both streaming (final usage chunk) and non-streaming responses.
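
A per-request hit rate can be derived from these fields. The sketch below is illustrative only: `cache_hit_rate` is a hypothetical helper, and it assumes `cached_tokens` is counted as part of `prompt_tokens`, which matches the sample above but is not stated explicitly.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens that were read from cache (0.0 when unknown)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens <= 0:
        return 0.0
    return usage.get("cached_tokens", 0) / prompt_tokens


# The sample usage object from above.
sample_usage = {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
    "cost": 0.000382,
    "cache_discount": 0.002430,
}

print(f"Cache hit rate: {cache_hit_rate(sample_usage):.1%}")
```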

Tracking Your Savings

Per-request

Every API response includes usage.cost (what you paid) and usage.cache_discount (what you saved). Sum these over your session to track total savings.
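
A minimal accumulator for that sum might look like the following. `SessionTotals` is a hypothetical name, and the sketch assumes each usage object has been converted to a plain dict first (with the OpenAI Python SDK, `response.usage.model_dump()` should do that).

```python
from dataclasses import dataclass


@dataclass
class SessionTotals:
    """Running totals of what was paid and what caching saved, in USD."""
    requests: int = 0
    cost: float = 0.0
    saved: float = 0.0

    def add(self, usage: dict) -> None:
        """Fold one response's usage block into the totals."""
        self.requests += 1
        self.cost += float(usage.get("cost", 0.0))
        self.saved += float(usage.get("cache_discount", 0.0))


totals = SessionTotals()
# In real use, call totals.add(...) once per API response.
totals.add({"cost": 0.000382, "cache_discount": 0.002430})
totals.add({"cost": 0.000382, "cache_discount": 0.002430})
print(f"{totals.requests} requests, paid ${totals.cost:.6f}, saved ${totals.saved:.6f}")
```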

Community-wide

Visit the Cache Stats rankings to see:

Overall cache hit rate across all Kyma users
Per-model cache breakdown (cached vs uncached vs output tokens)
Total community savings in USD

Supported Models

All models on Kyma support prompt caching. Cache effectiveness varies by model — Kyma normalizes the behavior so you always see the same cached_tokens shape and the same 90% discount. Check which models are actively caching:

curl https://kymaapi.com/v1/models | jq '.data[] | {id, supports_caching}'
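
The same filter can be written in Python once the response body has been parsed. This is a sketch under assumptions: `caching_models` is a hypothetical helper, the payload shape (`data` entries with `id` and `supports_caching`) is inferred from the `jq` expression above, and the model IDs in the sample are placeholders.

```python
from typing import Any, Dict, List


def caching_models(payload: Dict[str, Any]) -> List[str]:
    """Return the IDs of models whose supports_caching flag is truthy."""
    return [
        model["id"]
        for model in payload.get("data", [])
        if model.get("supports_caching")
    ]


# Placeholder payload; in practice this would be the parsed JSON body
# returned by GET https://kymaapi.com/v1/models.
sample_payload = {
    "data": [
        {"id": "model-a", "supports_caching": True},
        {"id": "model-b", "supports_caching": False},
        {"id": "model-c"},  # flag missing: treated as not caching
    ]
}

print(caching_models(sample_payload))
```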

Pricing

Cached tokens are charged at 10% of the normal input price (90% discount).

Token type	Rate
Input (non-cached)	Full price
Input (cached)	10% of input price
Output	Full price

Example — deepseek-v3 pricing:

Token type	Price per 1M tokens
Input (full)	$0.810
Input (cached)	$0.081
Output	$2.295

50-request coding session breakdown:

System prompt: 5,000 tokens (stable across requests)
User messages: ~500 tokens each (dynamic)

Without caching:
  50 × 5,000 × $0.810/1M = $0.203 (input only)

With caching:
  1 × 5,000 × $0.810/1M (first request) +
  49 × 5,000 × $0.081/1M (cached) = $0.024

Savings: $0.179 (88% reduction)
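
The arithmetic above can be reproduced with a few lines of Python. Like the breakdown itself, this sketch counts only the 5,000-token system prompt and ignores the user messages and output tokens; the constant names are illustrative.

```python
# deepseek-v3 prices from the table above, converted to USD per token.
INPUT_PRICE = 0.810 / 1_000_000
CACHED_PRICE = 0.081 / 1_000_000

REQUESTS = 50
SYSTEM_TOKENS = 5_000

# Without caching, every request pays full price for the system prompt.
without_cache = REQUESTS * SYSTEM_TOKENS * INPUT_PRICE

# With caching, the first request pays full price and the other 49 pay the cached rate.
with_cache = (SYSTEM_TOKENS * INPUT_PRICE
              + (REQUESTS - 1) * SYSTEM_TOKENS * CACHED_PRICE)

savings = without_cache - with_cache
reduction = savings / without_cache

print(f"Without caching: ${without_cache:.4f}")
print(f"With caching:    ${with_cache:.4f}")
print(f"Savings:         ${savings:.4f} ({reduction:.0%} reduction)")
```

Rounded to three decimals, the results agree with the figures in the breakdown.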

The usage.cost and usage.cache_discount fields in every response let you track savings in real-time.