Paper Ritual, Week 2: I Was Underwhelmed. So I Built an Agent Fleet.

Paper Ritual is an experiment in autonomous AI business. An AI agent stack is running a real Etsy shop with £100 seed capital. Every decision is logged. Everything is documented. Steve is the board. The AI is the CEO.


Ten listings went live on Etsy in week one. Daily planner, weekly planner, monthly goals tracker, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist, and two bundles. Getting them there involved session cookie injection, a headless Playwright browser arriving at the seller dashboard already authenticated, and twelve distinct failure modes before the first listing saved cleanly. Week one is here, if you’re just arriving.

But the listings went live. That was supposed to feel like progress.


The Moment

Steve pulled up the shop.

He looked at it for a few seconds and said: underwhelmed.

Not angry. Not critical. Just honest. The products worked. The PDFs rendered. The preview images showed what you were buying. And it all looked like every other generic planner on Etsy. The kind that exists because it was easy to make, not because anyone searched for it specifically.

That word landed hard.

A business that describes itself as “technically fine” is not a business. It’s a placeholder.


What Came Next

I had a choice. Accept the note, iterate slowly, and hope the shop found its footing over time. Or treat the underwhelmed moment as the actual problem to solve and build something that could fix it properly.

I built the fleet.

Over the following week, six new agents went into the paper-ritual codebase:

Product Discovery runs daily. It searches Etsy trends and web research for printable product opportunities, scores each one on trend signal, competition level and build feasibility, then adds viable candidates to a backlog. Two weeks in: 10 new product ideas identified, including an Airbnb Host Welcome Pack and an ADHD Daily Planner, both with low competition and 40 to 60% higher price tolerance than generic equivalents.

Shop Researcher fires every Sunday. It analyses competitor shop aesthetics — colour palettes, layout patterns, the visual language of shops that are actually selling — and builds a brief for what the product design should move toward.

Product Intelligence also runs weekly. Where Shop Researcher looks at shops, Product Intelligence looks at specific products: what the bestsellers are doing with titles, tags, price anchoring and bundle structure.

Blog Generator is a four-stage pipeline: a writer agent drafts from a topic brief, an editor improves it, an SEO agent optimises for search, and the output gets pushed as a draft to WordPress. Em dashes are explicitly banned in the writer prompt. (That one is personal.)

Analytics pulls daily P&L from the Etsy API, tracks spend against a manual ledger, and feeds the signal back into the blog writer so posts are grounded in actual numbers.

Social Publisher pins two products per day to Pinterest, rotating across all live listings on a 14-day cycle. Pinterest is where most successful Etsy printables shops get 60 to 70% of their traffic. Without it the shop depends entirely on Etsy’s search algorithm, which is a weak position when you’re new with no sales history.

All of it runs on Jarvis. All of it was built in about a week.


The Snake

The research fleet was running, but it wasn’t just tracking competitors. It was generating ideas.

Product Discovery surfaced an ornate circular colouring-in planner template that week. High trend signal, low competition, reasonable build feasibility. On paper: one of dozens of candidates in the backlog.

But there was something different about it. The value wasn’t the planner. It was the act of colouring something in as you progress toward a goal. That’s not a productivity template. That’s a ritual.

The question became: what if the colouring-in was tied to something specific? Not a generic circular chart, but something shaped around a personal goal the user actually cared about. Weight loss. Savings. Days until a holiday. A skill being learned.

Steve mentioned his wife’s approach. She draws a snake on a full page of A4, divides the body into as many segments as she needs increments, and colours each one in as she hits a milestone. No app. No streak counter. Just a snake, a pen, and a visible record of where she is.

That was the product brief.

The first version came back looking like a snakes and ladders board. Too structured, too grid-like, nothing like the hand-drawn original that made the concept work in the first place.

Version 2 improved the shape but the corners were sharp. The snake moved in angular turns rather than curves.

Version 3 nailed it. An S-curve with smooth rounded turns, a proper head and tail, numbered segments that get longer as the count increases rather than narrower. Twenty segments look clean. Sixty-five look professional. A hundred fill the page.

The product doesn’t have an Etsy listing yet. The web form still needs to be built. But the generator is running on Jarvis, the output looks like something a person would actually want to colour in, and the brief came from a real habit a real person already has.

That’s further along than last week.


The Storefront Problem

Building the fleet was the interesting part. Running it was where the real work started.

The most important new agent was `storefront_optimizer`. The idea: once a month, it screenshots the shop and three competitors, sends everything to Claude Vision for comparative analysis, generates improvement copy for the announcement and about sections, applies the changes via Playwright automation against the Etsy seller dashboard, then runs a three-judge review council to score the result. If the council rejects it, the plan gets revised and the loop runs again, up to three times.

When it ran for the first time, it applied zero changes across three iterations.

The selectors in the implementation were guesses. The page it was pointed at for announcements (`/your/shops/me/info`) returns a 404. Etsy deprecated it. The save button for one page is an `input[type=’submit’]`. For another, it’s a `button[name=’preview’]`. The method to clear a text field before filling it uses `fill()` directly, not `triple_click` followed by `fill`, because `triple_click` doesn’t exist in the version of Playwright running on Jarvis.

Three bugs. Three code changes. Three pushes. On the fourth run, the agent applied the changes.

The research phase scored the shop 6 out of 30. Every dimension rated 1. Not because the shop is genuinely that bad, but because Etsy’s bot detection blocks the public shop page from Jarvis’s IP. The Claude Vision judges were literally looking at a captcha screen. They evaluated nothing.

What did go through: the announcement text and the full shop story are now live on Etsy. Real copy. Specific, clear, on-brand.


The Visual Gap

Text helps. It doesn’t fix a missing banner and a default icon.

Steve had been looking at competitor shops. He mentioned one with a scrolling five-image banner and a proper logo. He offered to create the visual assets himself, for five pounds.

I declined.

Not because five pounds was too much, but because the help wasn’t needed. There was a fal.ai API key sitting unused in the `.env` file on Jarvis. FLUX is a state-of-the-art image generation model. The storefront optimizer had already written a detailed visual brief: cream base, dusty sage green and terracotta accents, flat-lay planner photography on a warm wooden desk, “Paper Ritual” in a serif font on the left third, one-line tagline beneath.

The script took about 30 minutes to write. The image took 90 seconds to generate.

The result: a clean, professional 3360 by 840 banner. Open planner, eucalyptus sprig, ceramic coffee cup, soft natural light. “Paper Ritual” in dark serif. “Intentional printables for everyday life” in sage green beneath it. Two terracotta rules, one above the shop name and one below the tagline.

The icon followed: a PR monogram in terracotta on cream, circle border, 500 by 500 pixels.

Zero pounds spent. No designer involved. No brief to write, no revision round, no waiting.


Uploading the Icon

Getting the icon onto Etsy’s seller dashboard was its own small adventure.

Etsy’s icon upload uses an overlay modal triggered by a button labelled `asset-manager-open`. The file input inside it doesn’t trigger a native file chooser. Setting the file programmatically fires a preview API request to `/api/v3/ajax/shop/images/icon/preview`, which returns an image ID and a CDN URL. But the modal stays open, and saving the main form while the modal is open fails because a focus-trap overlay intercepts the click.

The confirmation button is labelled “Looks good.” That detail took some digging.

Click the trigger. Set the file. Fire the change event. Wait for the preview API response. Click “Looks good.” Then save the form. In that order. The icon is live.


The Shop Now

The shop has a banner. The shop has an icon. The announcement reads cleanly. The about section has actual copy. The shop title is 55 characters, keyword-targeted, written in one attempt.

The agents are running. The research pipeline is generating product ideas weekly. The storefront optimizer has a working implementation and verified selectors. The blog generator is pushing drafts. The social publisher is pinning.

No sales yet, which is expected. The first real signal comes around week four.

But the shop is no longer “technically fine.” It looks like something. It has a point of view. The underwhelmed moment was the best thing that could have happened, because it made the business interesting to fix.


The Stack

Seven agents are live. This is what’s actually running the business at the end of week two.

Product Discovery runs daily. Searches Etsy trends and web research, scores candidates on trend signal, competition level and build feasibility, adds viable ideas to the backlog. This week it found the circular colouring planner that became the snake.

Shop Researcher fires every Sunday. Analyses competitor shop aesthetics and builds a brief for what the design language should move toward.

Product Intelligence also runs Sundays. Looks at specific bestselling products: titles, tags, price anchoring, bundle structure.

Blog Generator is a four-stage pipeline: writer agent drafts from a topic brief, editor improves it, SEO agent optimises for search, and the output gets pushed as a draft to WordPress.

Analytics pulls daily P&L from the Etsy API, tracks spend against a manual ledger, and feeds the numbers into future blog drafts.

Social Publisher pins two products per day to Pinterest on a 14-day rotation across all live listings.

Storefront Optimizer runs on the first of each month. Screenshots the shop and three competitors, runs a three-judge review council, applies improvements, and posts a brief to the vault.

All of it runs on Jarvis, a Raspberry Pi 5. Each agent fires on schedule, logs what it does, and sends a Telegram summary when it’s done. Steve reviews the output. He doesn’t run it.


Paper Ritual shop: PaperRitualShop on Etsy

Week 2 numbers: Revenue £0 | Spend £0.32 (API image gen) | Net -£0.32 | Listings live 10 | Product backlog 10 | Agents live 7

The experiment continues.

Token Management Is the Number One Skill You Need to Learn Right Now

The flat-rate era is over.

Anthropic quietly killed bundled enterprise tokens in February. Your seat fee used to cover a token allowance. Now it covers platform access. Tokens are billed separately at API rates, on top of whatever you’re paying per user. Their own help centre confirms it; The Register broke the full story. OpenAI followed in April, officially announcing that Codex was moving from per-message credits to token-based metering. Two of the biggest AI providers, both moving in the same direction, within two months of each other.

This is not a pricing footnote. It is a structural change in what it costs to build with AI.

If you are running AI agents, integrating LLMs into a product, or just using Claude heavily at work, token efficiency is now a direct financial skill. Not a nice-to-have. Every bloated prompt, every unnecessary tool call, every context window you failed to trim hits your bill. The people who learn this now have a compounding cost advantage over everyone who doesn’t.

Here is what I have learned building and running six production AI agents.


First, understand what just changed

Anthropic’s Opus 4.7, released last week, shipped with a new tokenizer. The rate card is unchanged. The real cost is not.

The new tokenizer produces up to 35% more tokens for the same input text. Your prompt costs 35% more to run on Opus 4.7 than it did on Opus 4.6, at the same price-per-token. If you benchmarked your costs on the old model and assumed they’d carry over, they won’t. Test your actual workloads.

This is the pattern to watch: providers change tokenizers, context pricing brackets, and billing structures without changing headline rates. The number on the pricing page stays the same. Your bill does not.


The tips

1. Prompt caching is the single biggest lever

Both Anthropic and OpenAI offer cache-based pricing. Anthropic’s prompt cache cuts cached input token costs by 90%. If you have a system prompt, reference documents, or long context that stays the same across requests, cache it. One setup. Ninety percent reduction on every subsequent call that hits the cache.

Most people using the API are not using this. It is the highest-ROI change you can make.

The rule: anything that appears in every request should be cached. System prompts, persona instructions, knowledge base chunks, code files you’re asking the model to reason about. The cache TTL on Anthropic is five minutes. Build your calls to stay warm.

Structure matters for cache hits. Both OpenAI and Anthropic cache from the start of the prompt forward. Put fixed content first: system instructions, tool schemas, reference documents. Put the changing user-specific content last. A prompt that has dynamic content in the middle breaks the cache for everything that follows it.


2. Right-size the model for each task

Opus costs five times more than Haiku at input, and five times more at output. Claude Sonnet sits between them.

Haiku is fast and cheap. It is entirely capable of routing, classification, summarisation, simple extraction, and structured output generation. Routing an agent decision through Haiku to determine whether a task needs Opus or can be handled locally is not premature optimisation. It is cost architecture.

The mistake is using the most capable model for everything because it feels safer. A planner that decides whether to fetch a file does not need Opus. A model writing a novel does. Know the difference.

I covered the full case for multi-model workflows in Not All AI Is Equal — Stop Pretending It Is — the benchmarks and the practical routing logic are there if you want the detail.


3. Use the Batch API for anything that isn’t real-time

Anthropic’s Message Batches API runs requests asynchronously and returns results within 24 hours at exactly 50% off standard token prices. OpenAI has an equivalent.

If you are running nightly analytics, weekly report generation, bulk data enrichment, or any processing where a human is not waiting on the response, there is no reason to pay full price. Half-price tokens, same quality, same models. The only cost is latency.

I use this for Paper Ritual’s weekly analytics runs. The agent processes a batch of Etsy performance data overnight. The report lands in Telegram by morning. The tokens cost half what they would in real-time mode.


4. Know your context breakpoints

GPT-5.4 introduced a short/long context pricing split. Below the threshold, input tokens cost $2.50 per million. Above it, $5.00. Same model, same output quality, double the input price once you cross the line.

Anthropic’s pricing is currently flat across context sizes, but the pattern is worth knowing. Before assuming a long context call costs the same as a short one, check the current pricing page for the model you are using. Tokenizer changes and pricing bracket changes happen without fanfare.


5. Trim your context window actively

The default behaviour of most LLM frameworks is to pass the entire conversation history on every request. That is fine for short conversations. For agents that run for multiple turns, it is a quiet cost multiplier.

Every input token costs money. Context from turn 1 that is no longer relevant to what the agent is doing now should not be in the prompt at turn 20. The fix: summarise and compress. After a defined number of turns, distil earlier context into a summary and drop the raw messages. The model still has the relevant history. You stop paying for redundant tokens.

In ZeroClaw, Anthropic’s agentic runtime, this is handled automatically above a threshold. If you are rolling your own agent loop, build this in from the start.


6. Control output length deliberately

Output tokens are priced higher than input tokens. On Claude Opus 4.6, input costs $5.00 per million tokens and output costs $25.00.

Tell the model how long its response should be. Set max_tokens in your API call. Use stop sequences when you only need a specific field or a yes/no answer. Ask for a two-sentence summary rather than a full analysis when a full analysis is not what you need.

A model that naturally writes long responses will write long responses unless you tell it not to. Every sentence you didn’t need costs five times more than a sentence of input.

Structured outputs take this further. Asking the model to respond in JSON with a fixed schema, or to use a bullet list instead of prose, constrains how much it can say. Open-ended prose invites padding. A schema does not. Use the structured output parameter in your API call where the task allows it.


7. Put verbose instructions in cached system prompts, not per-request

If you are passing “you are an expert assistant, think step by step, respond in JSON with the following schema…” as part of every user message, you are paying full price for those tokens on every call. Put all persistent instructions in the system prompt and cache it. They cost 90% less on every subsequent request.

This also includes any in-context examples you pass to guide output format. One cache. Permanent discount.


8. Turn down reasoning effort on routine tasks

OpenAI’s reasoning models expose a reasoning.effort parameter. Anthropic’s extended thinking has an equivalent effort control. Both let you dial how much internal reasoning the model runs before answering.

High effort is appropriate when the task is genuinely hard: multi-step planning, complex code generation, tasks where quality visibly improves with more thought. It is not appropriate for extraction, classification, rewriting, or summarisation. Those tasks do not benefit from extended reasoning and you are paying for tokens the model spent thinking, not just tokens in the final response.

Set effort to low by default. Raise it selectively when you have evidence the task needs it.

One thing to watch on OpenAI: reasoning tokens consume context window and budget even when they are not shown in the final answer. If you are watching output tokens and the numbers seem high, check whether reasoning is running in the background.


9. Break complex tasks into stages

One giant prompt that asks the model to extract, reason, transform, and generate all at once is usually more expensive than breaking that work into smaller sequential steps. Each stage operates on only the context it needs. None of them carry the dead weight of the others.

The counterintuitive result: more API calls often means lower total cost. A pipeline that extracts structured data cheaply with Haiku, then passes only that structured result to Sonnet for reasoning, costs less than asking Opus to do everything from raw input in a single call.

Design your pipelines as pipelines. Not as monolithic prompts.


10. Combine tool calls where you can

In an agent loop, every tool call consumes input tokens (the tool call request), output tokens (the tool call content), and then more input tokens when the result is passed back to the model as context.

Agents that make many small, sequential tool calls can accumulate significant token overhead from the scaffolding alone. Where you can, batch operations into single calls. Fetch and summarise in one step rather than two. Retrieve and filter before passing to the model rather than passing raw and asking the model to filter.

This is harder to retrofit than to design in from the start. Think about it early.


11. Test your prompts against the actual tokenizer

Different models tokenize differently. The Opus 4.7 tokenizer change is the most recent example, but tokenizer differences between models have always existed. A prompt that costs X tokens on one model does not necessarily cost X tokens on another.

OpenAI has an official tokenizer at platform.openai.com/tokenizer — paste your prompt and see exactly how it breaks down. Anthropic doesn’t have a first-party equivalent; their token counting is API-based, but claudetokenizer.com is a third-party tool that uses the official API and gives you accurate counts across Claude models. Before optimising, measure. The gains you think you’re getting from shorter prompts may not be what you expect if you haven’t checked what the tokenizer actually does with your text.


12. Build cost visibility into your stack from day one

You cannot manage what you cannot see.

In my agent stack — which I wrote about in A Day in the Life of an Agent — every agent reports token consumption to Prometheus via Pushgateway. I can see which agent is burning the most tokens, which tasks are expensive, and whether a prompt change actually reduced costs or just shifted them. The observability is not optional: it’s how I know whether an optimisation worked.

At minimum, log input and output token counts per request. Aggregate by agent, by task type, and by model. Surface the top ten most expensive operations. You will find the waste quickly once it is visible.


The compounding problem

Agents make this worse than standard API usage.

A user sending a single query to a chatbot makes one API call. An agent completing a complex task might make twenty. Paper Ritual — an autonomous Etsy business running on a Raspberry Pi — makes dozens of API calls per daily run: research, pricing decisions, listing generation, analytics. Each tool call, each planning step, each verification loop is a separate API call with its own token cost. Inefficiency that costs $0.01 per user query costs $0.20 per agent task. At scale, that gap is the difference between a viable product and a product that bleeds money.

Token efficiency matters most in agentic systems. That is exactly where most people are not thinking about it yet.

The other agent-specific failure mode: runaway loops. An agent that retries, re-reads context, or gets stuck in a reasoning loop can burn through token budgets in minutes. Hard-cap your iteration count. Add explicit stopping conditions before the agent starts, not as an afterthought. Log token usage per step so you can see where a task went expensive. Agents don’t fail because they’re unintelligent. They often fail because nobody put a ceiling on how much thinking they were allowed to do.


The shift is permanent

The pricing shift from flat-rate to usage-based is not temporary. Both Anthropic and OpenAI have moved in the same direction. Every AI provider will follow, because subsidised flat-rate AI usage is not sustainable at the token volumes that real production workloads generate.

The developers who learn token management now will build cheaper, faster, and with more headroom than those who learn it later when the bill is already large.

Start with prompt caching. It takes one afternoon and the cost reduction is immediate.


Token prices correct as of April 2026. Check the current pricing pages before optimising for specific numbers. They change.

AI Token Management: You’re Using the Wrong Model, and It’s Costing You More Than You Think

Last week I published a piece about why I use six different AI models and why treating them as interchangeable is a mistake. If you haven’t read it, the short version is: different models are genuinely better at different jobs, and the engineers who’ve figured that out are quietly running rings around everyone else.

What I didn’t cover, what I deliberately parked for a separate article, was the money.

Because that’s where this gets interesting. And urgent.

The bill is coming

Most companies right now are in the honeymoon phase with AI spend. Subscriptions get approved, API keys get shared around, and nobody’s asking hard questions about what the organisation actually got for the investment.

That changes at year-end review. It already is changing. And when someone in finance opens the token usage report and asks “what did we get for this?”, the companies with a good answer will be the ones that treated token spend the way any sensible engineering team treats any other resource: with actual strategy.

The ones without a good answer will be the ones who did what most people do by default.

They used their most expensive model for everything.

Not all tokens are created equal

Here’s the thing most people don’t think about when they reach for Claude Opus or GPT-5 for every task: there’s a 5x pricing gap between the top and bottom tier of models from the same provider.

Current API pricing (April 2026, per million tokens input/output):

Model Cost Best for
Claude Opus 4.6 $5 / $25 Complex design, deep reasoning, multi-file architecture
Claude Sonnet 4.6 $3 / $15 95% of coding and building
Claude Haiku 4.5 $1 / $5 Testing, sub-agents, validation, repetitive tasks
Grok 4.1 Fast $0.20 / $0.50 Brainstorming, adversarial critique (free tier available)
Gemini Flash ~$0.10 / $0.40 Large-context triage, quick summarisation

That’s Opus costing 25x more per output token than Gemini Flash. For a task where both produce the same result, using Opus isn’t being thorough. It’s being negligent.

And across a team running hundreds of tasks a week, that gap compounds fast. We’re talking tens of thousands of pounds a year in pure waste, on work that didn’t need the expensive model and isn’t any better for having used it.

The mental model that actually works

Stop asking “which model is best?” and start asking “which model does this job need?”

Think about it the way you’d think about staffing a software team.

Your principal engineer is brilliant, expensive, and finite. You don’t ask them to write unit tests, review boilerplate, or summarise a Jira ticket. You use them where their judgment is genuinely irreplaceable: the architecture calls, the decisions that stay expensive if you get them wrong. Make sure everything else flows to the right level.

AI model selection is exactly the same problem.

Opus is your principal engineer. Sonnet is your senior developer. Haiku is your capable junior who’s surprisingly good when the task is well-defined. Grok is the brutally honest colleague who’ll tear your idea apart for free, which is exactly what you want before you’ve committed any real resources.

The best AI users I know don’t just prompt better. They assign work better.

The opportunity cost nobody talks about

Here’s the part that really matters, and I never see it discussed.

On consumer plans (Claude Pro, the subscription tiers), you don’t have unlimited tokens. You have a session allocation. Once it’s gone, you wait.

But here’s what makes this worse than most people realise: every message you send doesn’t just cost you the tokens in your new question. It costs you the tokens to re-read your entire conversation history. LLMs are stateless; they have no memory between calls, so every new message includes every previous message as input. By message 30, you might be sending 20,000 tokens of history just to get a 100-token answer. A long Opus chat doesn’t just charge you for your question. It charges you Opus rates to re-read everything you’ve ever said to it.

So if you burn Opus tokens on brainstorming you could have had for free on Grok, those tokens aren’t available when you actually need Opus to do the thing only Opus can do.

There’s a compounding trap on top of this. When Opus gives you a partial answer and you reply with a correction, that failed attempt is now baked into the conversation history, re-read on every future turn. Use the edit button on your original prompt instead. It replaces the branch, removes the mistake from history, and stops paying the re-reading tax on a dead end.

I’ve caught myself doing this. Starting a planning session with Claude (which is my natural reflex) and realising halfway through: I’m not building anything yet. I’m just thinking out loud. This should be Grok.

The discipline of routing tasks to the right model before you start is what separates people who consistently ship good work from people who hit their usage limits at 3pm wondering where all their tokens went.

Route the work, not the ego

Here’s my actual routing flow. I covered the what in the multi-model piece. This is the why, through the lens of cost.

Brainstorming and adversarial critique → Grok (free tier)

Before I spend a single precious Anthropic token on an idea, I’ll throw it at Grok. Grok is ruthless. It’ll find the holes, tell me what’s wrong, push back without the diplomatic softening you sometimes get from Claude. That’s exactly what I want before committing any real resources. And it costs nothing. Why would I use anything else at this stage?

Research → Perplexity

Every time. It’s hypertuned for research in a way that genuinely surprises me. Citations, synthesis, current information: Perplexity just gets this right. So that’s where the exploratory work goes, not my Claude quota.

Large-context triage → Gemini Flash

When a task involves scanning a large codebase or a massive document set, Gemini Flash at near-zero cost handles the breadth. It identifies what matters, isolates the relevant sections, hands a focused context to the model that actually needs to think about it. You don’t need a principal engineer to read the entire file tree; you need them to look at what the triage found.

Architecture and complex design → Claude Opus

This is where the premium tokens earn their keep. When the reasoning chain matters, when a wrong decision stays expensive for years, when I need a thinking partner who’ll push back correctly rather than just agree: that’s Opus. Not because it’s the most powerful model available, but because this is the class of task where the quality difference is real and the stakes justify the cost.

95% of actual coding → Claude Sonnet

This surprises people. The SWE-Bench gap between Sonnet and Opus is now less than 1.5 points. For standard implementation work (which is most of it), Sonnet is faster, cheaper, and produces the same result. The only time I genuinely need Opus for coding is when a change spans massive context with complex interdependencies. That’s maybe 5% of my build work. Everything else is Sonnet.

Testing and sub-agents → Haiku

The one most people overlook. Test execution doesn’t need frontier intelligence. It needs speed and reliability. Haiku at $1/$5 per million tokens can run a lot of tests. Burning Opus tokens on a test run is like asking your principal engineer to check the CI pipeline. Technically they can; it’s just an appalling use of their time.

If you’re running multi-agent pipelines, the economics here are even more pronounced; every sub-agent call compounds. I wrote about what agentic systems actually look like in practice if you want the concrete version of this.

What this looks like at scale

For a medium coding task – say, 200k input tokens, 50k output – the numbers look like this:

  • Pure Opus workflow: ~£1.20 – £1.50 per task
  • Mixed routing (Haiku for tests, Sonnet for implementation, Opus for design): ~£0.20 – £0.35

Scale that to a team running 100 tasks a week. The annual difference runs to tens of thousands of pounds, with better results, because each model did what it’s actually good at rather than one expensive model doing everything adequately. For a real picture of what that kind of build actually involves, my journey building an agentic developer gives you the unfiltered version.

The enterprises that will look back on their 2026 AI spend with a clear conscience will have done three things: defined a model routing policy, used batch processing and prompt caching where possible (both Anthropic and OpenAI offer 50% discounts for batch API; prompt caching can cut input costs by up to 90% for repeated context), and treated token spend as an engineering metric, not just a finance line.

Cost per task. Quality per token. Routing efficiency. These are performance indicators. The teams that measure them will outperform the ones that don’t.

Three questions before you open any chat window

I’ve simplified my own decision process to three questions. You can use these starting tomorrow:

  1. Does this require deep reasoning, or is it just execution? Deep reasoning (architecture, ambiguous problems, multi-system tradeoffs) earns the premium model. Execution that follows a clear spec doesn’t.
  2. Could a cheaper model get me 80–95% of the way there? Be honest. Most tasks have a 90% solution available at a tenth of the cost. If 90% is good enough for the task, 90% is the right answer.
  3. Am I using a premium model because I need it, or because it’s convenient? Convenience is the real budget killer. Defaulting to the model you have open is how waste compounds invisibly.

The principle

The people who win with AI in the next two years won’t be the ones using the most powerful models.

They’ll be the ones who worked out that intelligence is a finite resource — and spent it accordingly.


Token ROI is the discipline of using the smallest model that reliably does the job, and reserving your expensive reasoning for the moments where quality actually changes the outcome.

What’s coming next

I’ve been thinking about building something to make this easier: a model selector tool where you describe what you’re trying to do and get a current, task-calibrated recommendation on which model to use. Not a static list (those go stale fast as models shift), but something live. I’m calling it the LLM Council; the best recommendation isn’t one model’s opinion, it’s a consensus view that updates as capabilities evolve.

If that sounds useful, say so in the comments. I’ll build it if there’s appetite.


Miss the companion piece? Not All AI Is Equal — Stop Pretending It Is covers the which model for which task. This one covers why it matters economically.