API & Console: 3 cost levers

“Pay to print a page once and photocopy it for a tenth of the cost, or retype it from scratch every time. Caching is the photocopier.”

The API console manages keys, rate limits, and spend.

16. cache_control breakpoints: where to put them

Where: the cache_control field on a content block in the API request. What it does: marks a prefix of your prompt as cacheable; later requests with the same prefix are charged at a fraction of the input rate. Why it matters: it's the single biggest cost lever in the API, and most people place the breakpoint wrong and get partial savings. The article reports a $340/month bill dropping to $87 after fixing the breakpoint.

The breakpoint goes at the boundary between static and dynamic content: everything before it is cached, everything after is recomputed. Put it after your stable system prompt / tools / long context, before the per-request user input.

pythonCopy prompt

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_STABLE_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # cache the prefix
            },
            {"type": "text", "text": per_request_question},  # recomputed
        ],
    }
]

Two TTLs are commonly available: a short ephemeral default and a longer one for system prompts that don't change between sessions. As a rule of thumb the article gives: a cached prefix pays off once you read it two or more times within the TTL window. (Confirm current cache write/read multipliers and TTLs in the API docs.)

17. inference_geo and the data-residency tax

Where: the article describes a request parameter that routes inference to a specific region (US-only, EU-only, etc.). Why it matters: it reports that regional residency adds a premium that doesn't appear on the standard pricing card; you see it on the invoice.

18. Workspace and per-feature rate limits

Where: Console → Settings → Workspaces → your workspace → rate limits. What it does: sets a rate limit per workspace (and, the article notes, per feature within a workspace), separate from your account-level limits. Why it matters: an account limit stops you going broke; a workspace limit stops a runaway batch job from eating the allowance your customer-facing chat needs and returning 429s.

Create one workspace per surface: interactive chat, batch processing, internal tools, experimental.
Set each workspace's limit to ~60 to 70% of your account tier, leaving headroom for whichever needs to burst.
Set per-feature caps inside a workspace for anything that does batch work; the article notes a feature-level cap that defaults to unlimited and is easy to miss.

In one line each

cache_control is the biggest API cost lever: put the breakpoint at the static/dynamic boundary so the stable prefix is cached.
A cached prefix pays off once it's read 2+ times within the TTL; use the longer TTL for unchanging system prompts.
Regional residency can add a premium; verify the requirement is real before pinning a region (parameter name unverified).
Isolate surfaces into workspaces and set per-feature caps so one batch job can't starve interactive traffic.

Where to go next

Chapter 5: Checklist & weekly audit