ChatGPT – Steve's AI Diaries

The flat-rate era is over.

Anthropic quietly killed bundled enterprise tokens in February. Your seat fee used to cover a token allowance. Now it covers platform access. Tokens are billed separately at API rates, on top of whatever you’re paying per user. Their own help centre confirms it; The Register broke the full story. OpenAI followed in April, officially announcing that Codex was moving from per-message credits to token-based metering. Two of the biggest AI providers, both moving in the same direction, within two months of each other.

This is not a pricing footnote. It is a structural change in what it costs to build with AI.

If you are running AI agents, integrating LLMs into a product, or just using Claude heavily at work, token efficiency is now a direct financial skill. Not a nice-to-have. Every bloated prompt, every unnecessary tool call, every context window you failed to trim hits your bill. The people who learn this now have a compounding cost advantage over everyone who doesn’t.

Here is what I have learned building and running six production AI agents.

First, understand what just changed

Anthropic’s Opus 4.7, released last week, shipped with a new tokenizer. The rate card is unchanged. The real cost is not.

The new tokenizer produces up to 35% more tokens for the same input text. Your prompt costs 35% more to run on Opus 4.7 than it did on Opus 4.6, at the same price-per-token. If you benchmarked your costs on the old model and assumed they’d carry over, they won’t. Test your actual workloads.

This is the pattern to watch: providers change tokenizers, context pricing brackets, and billing structures without changing headline rates. The number on the pricing page stays the same. Your bill does not.

The tips

1. Prompt caching is the single biggest lever

Both Anthropic and OpenAI offer cache-based pricing. Anthropic’s prompt cache cuts cached input token costs by 90%. If you have a system prompt, reference documents, or long context that stays the same across requests, cache it. One setup. Ninety percent reduction on every subsequent call that hits the cache.

Most people using the API are not using this. It is the highest-ROI change you can make.

The rule: anything that appears in every request should be cached. System prompts, persona instructions, knowledge base chunks, code files you’re asking the model to reason about. The cache TTL on Anthropic is five minutes. Build your calls to stay warm.

Structure matters for cache hits. Both OpenAI and Anthropic cache from the start of the prompt forward. Put fixed content first: system instructions, tool schemas, reference documents. Put the changing user-specific content last. A prompt that has dynamic content in the middle breaks the cache for everything that follows it.

2. Right-size the model for each task

Opus costs five times more than Haiku at input, and five times more at output. Claude Sonnet sits between them.

Haiku is fast and cheap. It is entirely capable of routing, classification, summarisation, simple extraction, and structured output generation. Routing an agent decision through Haiku to determine whether a task needs Opus or can be handled locally is not premature optimisation. It is cost architecture.

The mistake is using the most capable model for everything because it feels safer. A planner that decides whether to fetch a file does not need Opus. A model writing a novel does. Know the difference.

I covered the full case for multi-model workflows in Not All AI Is Equal — Stop Pretending It Is — the benchmarks and the practical routing logic are there if you want the detail.

3. Use the Batch API for anything that isn’t real-time

Anthropic’s Message Batches API runs requests asynchronously and returns results within 24 hours at exactly 50% off standard token prices. OpenAI has an equivalent.

If you are running nightly analytics, weekly report generation, bulk data enrichment, or any processing where a human is not waiting on the response, there is no reason to pay full price. Half-price tokens, same quality, same models. The only cost is latency.

I use this for Paper Ritual’s weekly analytics runs. The agent processes a batch of Etsy performance data overnight. The report lands in Telegram by morning. The tokens cost half what they would in real-time mode.

4. Know your context breakpoints

GPT-5.4 introduced a short/long context pricing split. Below the threshold, input tokens cost $2.50 per million. Above it, $5.00. Same model, same output quality, double the input price once you cross the line.

Anthropic’s pricing is currently flat across context sizes, but the pattern is worth knowing. Before assuming a long context call costs the same as a short one, check the current pricing page for the model you are using. Tokenizer changes and pricing bracket changes happen without fanfare.

5. Trim your context window actively

The default behaviour of most LLM frameworks is to pass the entire conversation history on every request. That is fine for short conversations. For agents that run for multiple turns, it is a quiet cost multiplier.

Every input token costs money. Context from turn 1 that is no longer relevant to what the agent is doing now should not be in the prompt at turn 20. The fix: summarise and compress. After a defined number of turns, distil earlier context into a summary and drop the raw messages. The model still has the relevant history. You stop paying for redundant tokens.

In ZeroClaw, Anthropic’s agentic runtime, this is handled automatically above a threshold. If you are rolling your own agent loop, build this in from the start.

6. Control output length deliberately

Output tokens are priced higher than input tokens. On Claude Opus 4.6, input costs $5.00 per million tokens and output costs $25.00.

Tell the model how long its response should be. Set max_tokens in your API call. Use stop sequences when you only need a specific field or a yes/no answer. Ask for a two-sentence summary rather than a full analysis when a full analysis is not what you need.

A model that naturally writes long responses will write long responses unless you tell it not to. Every sentence you didn’t need costs five times more than a sentence of input.

Structured outputs take this further. Asking the model to respond in JSON with a fixed schema, or to use a bullet list instead of prose, constrains how much it can say. Open-ended prose invites padding. A schema does not. Use the structured output parameter in your API call where the task allows it.

7. Put verbose instructions in cached system prompts, not per-request

If you are passing “you are an expert assistant, think step by step, respond in JSON with the following schema…” as part of every user message, you are paying full price for those tokens on every call. Put all persistent instructions in the system prompt and cache it. They cost 90% less on every subsequent request.

This also includes any in-context examples you pass to guide output format. One cache. Permanent discount.

8. Turn down reasoning effort on routine tasks

OpenAI’s reasoning models expose a reasoning.effort parameter. Anthropic’s extended thinking has an equivalent effort control. Both let you dial how much internal reasoning the model runs before answering.

High effort is appropriate when the task is genuinely hard: multi-step planning, complex code generation, tasks where quality visibly improves with more thought. It is not appropriate for extraction, classification, rewriting, or summarisation. Those tasks do not benefit from extended reasoning and you are paying for tokens the model spent thinking, not just tokens in the final response.

Set effort to low by default. Raise it selectively when you have evidence the task needs it.

One thing to watch on OpenAI: reasoning tokens consume context window and budget even when they are not shown in the final answer. If you are watching output tokens and the numbers seem high, check whether reasoning is running in the background.

9. Break complex tasks into stages

One giant prompt that asks the model to extract, reason, transform, and generate all at once is usually more expensive than breaking that work into smaller sequential steps. Each stage operates on only the context it needs. None of them carry the dead weight of the others.

The counterintuitive result: more API calls often means lower total cost. A pipeline that extracts structured data cheaply with Haiku, then passes only that structured result to Sonnet for reasoning, costs less than asking Opus to do everything from raw input in a single call.

Design your pipelines as pipelines. Not as monolithic prompts.

10. Combine tool calls where you can

In an agent loop, every tool call consumes input tokens (the tool call request), output tokens (the tool call content), and then more input tokens when the result is passed back to the model as context.

Agents that make many small, sequential tool calls can accumulate significant token overhead from the scaffolding alone. Where you can, batch operations into single calls. Fetch and summarise in one step rather than two. Retrieve and filter before passing to the model rather than passing raw and asking the model to filter.

This is harder to retrofit than to design in from the start. Think about it early.

11. Test your prompts against the actual tokenizer

Different models tokenize differently. The Opus 4.7 tokenizer change is the most recent example, but tokenizer differences between models have always existed. A prompt that costs X tokens on one model does not necessarily cost X tokens on another.

OpenAI has an official tokenizer at platform.openai.com/tokenizer — paste your prompt and see exactly how it breaks down. Anthropic doesn’t have a first-party equivalent; their token counting is API-based, but claudetokenizer.com is a third-party tool that uses the official API and gives you accurate counts across Claude models. Before optimising, measure. The gains you think you’re getting from shorter prompts may not be what you expect if you haven’t checked what the tokenizer actually does with your text.

12. Build cost visibility into your stack from day one

You cannot manage what you cannot see.

In my agent stack — which I wrote about in A Day in the Life of an Agent — every agent reports token consumption to Prometheus via Pushgateway. I can see which agent is burning the most tokens, which tasks are expensive, and whether a prompt change actually reduced costs or just shifted them. The observability is not optional: it’s how I know whether an optimisation worked.

At minimum, log input and output token counts per request. Aggregate by agent, by task type, and by model. Surface the top ten most expensive operations. You will find the waste quickly once it is visible.

The compounding problem

Agents make this worse than standard API usage.

A user sending a single query to a chatbot makes one API call. An agent completing a complex task might make twenty. Paper Ritual — an autonomous Etsy business running on a Raspberry Pi — makes dozens of API calls per daily run: research, pricing decisions, listing generation, analytics. Each tool call, each planning step, each verification loop is a separate API call with its own token cost. Inefficiency that costs $0.01 per user query costs $0.20 per agent task. At scale, that gap is the difference between a viable product and a product that bleeds money.

Token efficiency matters most in agentic systems. That is exactly where most people are not thinking about it yet.

The other agent-specific failure mode: runaway loops. An agent that retries, re-reads context, or gets stuck in a reasoning loop can burn through token budgets in minutes. Hard-cap your iteration count. Add explicit stopping conditions before the agent starts, not as an afterthought. Log token usage per step so you can see where a task went expensive. Agents don’t fail because they’re unintelligent. They often fail because nobody put a ceiling on how much thinking they were allowed to do.

The shift is permanent

The pricing shift from flat-rate to usage-based is not temporary. Both Anthropic and OpenAI have moved in the same direction. Every AI provider will follow, because subsidised flat-rate AI usage is not sustainable at the token volumes that real production workloads generate.

The developers who learn token management now will build cheaper, faster, and with more headroom than those who learn it later when the bill is already large.

Start with prompt caching. It takes one afternoon and the cost reduction is immediate.

Token prices correct as of April 2026. Check the current pricing pages before optimising for specific numbers. They change.