“Pay to print a page once and photocopy it for a tenth of the cost, or retype it from scratch every time. Caching is the photocopier.”
16. cache_control breakpoints: where to put them
Where: the cache_control field on a content block in the API request. What it does: marks a prefix of your prompt as cacheable; later requests with the same prefix are charged at a fraction of the input rate. Why it matters: it's the single biggest cost lever in the API, and most people place the breakpoint wrong and get partial savings. The article reports a $340/month bill dropping to $87 after fixing the breakpoint.
The breakpoint goes at the boundary between static and dynamic content: everything before it is cached, everything after is recomputed. Put it after your stable system prompt / tools / long context, before the per-request user input.
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": LONG_STABLE_CONTEXT,
"cache_control": {"type": "ephemeral"}, # cache the prefix
},
{"type": "text", "text": per_request_question}, # recomputed
],
}
]Two TTLs are commonly available: a short ephemeral default and a longer one for system prompts that don't change between sessions. As a rule of thumb the article gives: a cached prefix pays off once you read it two or more times within the TTL window. (Confirm current cache write/read multipliers and TTLs in the API docs.)
17. inference_geo and the data-residency tax
Where: the article describes a request parameter that routes inference to a specific region (US-only, EU-only, etc.). Why it matters: it reports that regional residency adds a premium that doesn't appear on the standard pricing card; you see it on the invoice.
18. Workspace and per-feature rate limits
Where: Console → Settings → Workspaces → your workspace → rate limits. What it does: sets a rate limit per workspace (and, the article notes, per feature within a workspace), separate from your account-level limits. Why it matters: an account limit stops you going broke; a workspace limit stops a runaway batch job from eating the allowance your customer-facing chat needs and returning 429s.
- Create one workspace per surface: interactive chat, batch processing, internal tools, experimental.
- Set each workspace's limit to ~60–70% of your account tier, leaving headroom for whichever needs to burst.
- Set per-feature caps inside a workspace for anything that does batch work; the article notes a feature-level cap that defaults to unlimited and is easy to miss.
In one line each
- cache_control is the biggest API cost lever: put the breakpoint at the static/dynamic boundary so the stable prefix is cached.
- A cached prefix pays off once it's read 2+ times within the TTL; use the longer TTL for unchanging system prompts.
- Regional residency can add a premium; verify the requirement is real before pinning a region (parameter name unverified).
- Isolate surfaces into workspaces and set per-feature caps so one batch job can't starve interactive traffic.
Where to go next