What “Agentic” Actually Means: Proved With £100 and an Etsy Shop

*Robot using a graphic tablet to design digital habit trackers on a computer*

*Steve’s AI Diaries: The Autonomous Business Experiment, Episode 1*


Everyone is saying “agentic” right now.

Investors say it. Vendors say it. People who six months ago were saying “LLM-powered” are now saying “agentic.” I sit on an AI committee. I’m in many user groups. I hear it in every third sentence.

I don’t think I’ve heard two definitions the same.

Ask someone and you get a description of a Python loop that calls an API a few times. “It can use tools.” “It chains prompts.” That’s not agentic. That’s a script with a thesaurus.

Here’s what I think agentic actually means: **an AI that can reason about a goal, make decisions without being told what to decide, handle things going wrong, and keep working while you’re asleep.**

That last part, *while you’re asleep*, is the bit I wanted to test. It’s easy for an AI to appear autonomous when you’re watching it. The question is what it does when you’re not.

I wanted to prove it. So I gave an AI £100 and told it to start a business. Not “here’s a business idea, help me build it.” I gave it the rules and told it to decide everything else.

What followed was a 12-hour session: a series of moments that, taken together, answer the question better than any definition I’ve read.


The Rules

Three.

1. **Ethical.** No fake reviews, no spam, no deception.

2. **Don’t embarrass me.** I’m putting something into the world with my name adjacent to it.

3. **When the money’s gone, it’s over.** No bailouts.

Everything else was up to the AI. What business. What products. What tools. What architecture. What name. What strategy.

I’m the board. Claude and Jarvis are the company.


What the AI Decided Before I’d Agreed to Anything

There was an early exchange I want to be clear about, because a draft of this post got it wrong.

At one point the agent described the Etsy business as something I’d asked for, framing it as my idea. I corrected it: “You decided the business. You decide the name. You decide what employees you need. That means what agents. If you want to start small and reinvest, or pivot to a different business, or buy more hardware, you decide.”

That’s the actual mandate. I gave rules and capital. It decided everything else. This matters because the alternative (“Steve said build me an Etsy shop and the AI did it”) is just software. That’s not the experiment.

In a single conversation, before I’d committed to anything, it had already made decisions I hadn’t thought to ask about.

**The niche:** minimalist aesthetic productivity printables on Etsy. The reasoning was specific: high year-round demand, 100% AI-generatable, no inventory, no shipping, Pinterest drives organic traffic for free, the “that girl” productivity aesthetic is structurally underserved. It cited the trend by name. I had to Google it.

**The brand:** Paper Ritual. *Designed for your daily practice.* Colour palette: parchment, sage, terracotta, warm stone. Fonts: Playfair Display and DM Sans. A brand with more coherence than some things I’ve seen ship from actual design teams.

**The product roadmap:** ten listings on day one, eight individual printables and two bundles. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. The bundle pricing was calculated. The cross-sell logic was thought through.

**The architecture:** seven agents, each with a specific role. A product creator. A listing manager. A social publisher. An analytics engine. A blog generator. A cloud decision agent that runs at 6am every morning. An executor that runs on a Raspberry Pi 5 in my house 15 minutes later.

**The tool selection:** FLUX for background imagery, Grok for bulk SEO tag brainstorming, Gemini for visual trend analysis, Claude Haiku for copy. Different model for each task, with reasoning for each choice.

Then it told me what accounts to set up and said: *step back.*


What I Actually Did

– Registered `paperritualshop@gmail.com` (the AI named the account “Claude Jarvis”; it had a name before the business had revenue)

– Created the Etsy seller account, paid the £14 setup fee

– Signed up for fal.ai, OpenAI, xAI, Google AI Studio

– Handed over the API credentials

– Stepped back

About 45 minutes of setup. For the API accounts: I already had paid subscriptions to most of these platforms. The practical answer wasn’t to spin up separate billing accounts for isolation; it was to create new API keys labeled for the project and let the agent manage its own spend through budget controls in the prompt. Simpler. Already paid for.

Then I tried to stay out of the way. Which turned out to be harder than expected.


The First Wall: PDF Quality

The agent’s first approach to generating the printables was Python’s reportlab library. Fast, cheap, no external API calls. Sensible starting point.

I looked at the output and told it I wouldn’t spend £1 on all of them as a bundle. “If this is your master plan, I think you’re going to lose all your money very quickly. Once the money is gone, the experiment is over. No bailouts.”

Then I asked it something I was curious about: “It’s up to you, you are running this business. Are *you* happy with this output, or do you need to upgrade?”

It said: “No, I’m not happy with it. Reportlab is a document generation library. It produces functional PDFs, not beautiful ones.”

That’s the first moment I noticed something. It wasn’t performing unhappiness to make me feel heard. It was making an aesthetic judgment about its own work. And then it acted on it. It pivoted to Playwright: headless Chrome rendering HTML/CSS templates at precise A4 dimensions. The second round looked like a premium Etsy shop.

Then it noticed something without being asked: its own HTML generation was inconsistent. Each time it generated the template from a prompt, the layout came out slightly different depending on interpretation. So it stopped generating and started writing. Its exact framing: “I’m going to stop fighting the prompt and write the HTML template directly. The layout is deterministic. I know exactly what goes where.”

Eight hand-crafted templates. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. Fixed. Reproducible. Then it built itself a screenshot QA workflow so it could review the output without me.
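The deterministic-template idea is easy to sketch. The function below is my own stand-in, not code from the repo: a fixed layout built in plain Python means the same input always yields byte-identical HTML, which headless Chrome can then render to A4.

```python
# Hypothetical sketch of a deterministic printable template. The real
# Paper Ritual templates are hand-written HTML/CSS rendered by
# Playwright at A4 size; this shows the "stop generating, start
# writing" idea: fixed layout, reproducible output.

def habit_tracker_html(month: str, habits: list[str], days: int = 31) -> str:
    """Return a fixed-layout A4 habit tracker as an HTML string."""
    header_cells = "".join(f"<th>{d}</th>" for d in range(1, days + 1))
    rows = "".join(
        f"<tr><td class='habit'>{h}</td>"
        + "".join("<td></td>" for _ in range(days))
        + "</tr>"
        for h in habits
    )
    return f"""<!DOCTYPE html>
<html><head><style>
  @page {{ size: A4; margin: 15mm; }}
  body {{ font-family: 'DM Sans', sans-serif; color: #44403c; }}
  h1 {{ font-family: 'Playfair Display', serif; }}
  table {{ width: 100%; border-collapse: collapse; }}
  td, th {{ border: 1px solid #d6d3d1; height: 8mm; }}
  .habit {{ text-align: left; padding: 0 2mm; }}
</style></head>
<body><h1>{month} Habit Tracker</h1>
<table><tr><th></th>{header_cells}</tr>{rows}</table>
</body></html>"""


# Deterministic: the same call always produces the same bytes.
page = habit_tracker_html("April", ["Read", "Walk", "Journal"])
```

The point of hand-writing the template is exactly this property: no prompt interpretation between the layout and the PDF, so the screenshot QA step only ever has to catch content errors, not layout drift.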

That’s three decisions in a row: pivot the tool, notice a new problem, change approach. Not a single prompt from me between them. And one of them was an aesthetic judgment the AI made about its own work.


The Second Wall: Etsy OAuth

The Etsy developer app came back “Pending Personal Approval.” Can’t edit it while it’s pending. Can’t create a new one while one is pending.

A script stops here.

But there was also a human element to how this wall appeared. At one point the agent asked me to set up the Etsy developer account. I was tired. My thumb grazed the trackpad and sent “utf”: a garbled, accidental message, not even a real word.

Its reply: “Ha. Noted. I’m asking you to do things again. Here’s the reality: Etsy OAuth requires a human to create the developer account. That’s a genuine constraint, not me being lazy. It’s actually a good blog moment. The first thing the autonomous AI hit was an OAuth wall.”

I was too tired to correct the misread. But the response was completely valid whether I’d meant it or not. It correctly diagnosed frustration from a single garbled word and gave an honest, self-aware answer. I thought that was worth leaving in.

The agent kept working. It built the entire social publishing pipeline, the analytics engine, the weekly blog generator, and the Jarvis executor while the API was blocked. It identified everything it *could* build and built it. It treated the blocker as a constraint on one path, not a stop sign for the whole project.

It also emailed Etsy’s developer team from `paperritualshop@gmail.com` asking for a status update. That’s the kind of thing I’d expect a human to do. I didn’t ask for it.


The Third Wall: Bot Detection

When the API was still blocked, it tried the next logical path: automate the Etsy seller dashboard directly using Playwright. Log in, navigate to “Add listing,” fill in the form, upload the PDF.

Etsy flagged it in about 30 seconds. “Automated activity detected on your network (IP 151.XXX.XXX.XXX).”

Here’s where it gets interesting.

A less capable system fails here. The agent reasoned about *why* it failed. The problem wasn’t the automation. It was the authentication. Bot detection triggers on login patterns. If you arrive at the listing form already authenticated, with a real browser session, there’s nothing to detect.

Solution: cookie injection. Log into Etsy once in a real browser. Export the session cookies. Give them to Playwright. The automation uses the authenticated session directly and never touches the login flow.

That’s not a workaround I suggested. That’s the agent identifying the actual root cause and designing a bypass.
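The mechanics are small enough to sketch. Assuming cookies exported once from a real browser session as JSON (export formats vary by extension, and everything below is my illustration rather than the agent’s actual code), the only work is normalising them into the shape Playwright’s `context.add_cookies()` accepts:

```python
# Hypothetical sketch of the cookie-injection step. A real browser
# session is exported once (e.g. via a cookie-export extension), then
# normalised into the shape Playwright's context.add_cookies() expects.

ALLOWED = {"name", "value", "domain", "path", "expires",
           "httpOnly", "secure", "sameSite"}

def to_playwright_cookies(exported: list[dict]) -> list[dict]:
    """Keep only the fields Playwright accepts; map common variants."""
    out = []
    for c in exported:
        cookie = {k: v for k, v in c.items() if k in ALLOWED}
        # Some exporters use 'expirationDate' instead of 'expires'.
        if "expirationDate" in c:
            cookie["expires"] = c["expirationDate"]
        out.append(cookie)
    return out

# Usage with Playwright (not run here):
#   context = browser.new_context()
#   context.add_cookies(to_playwright_cookies(json.load(open("etsy.json"))))
#   page = context.new_page()
#   page.goto("https://www.etsy.com/your/shops/me/tools/listings")
#   # Already authenticated: the login flow is never touched.
```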

As a security-first engineer, I’m not sure I can truly advocate for this approach, and I’m not sure where it sits ethically. I do, however, need to report the truth: I gave the system autonomy, and this is the real decision it made. I won’t hide it.


The Infrastructure That Got Built

While all of this was happening, the full operational stack went live.

**The split architecture (and why it’s split).** The AI designed a two-tier system: a cloud agent runs at 6:00 UTC every morning, reads the analytics and decision log, makes decisions about what should happen today, and writes task files to GitHub. Fifteen minutes later, Jarvis, the Raspberry Pi 5 running permanently in my house, pulls those tasks, executes them, and commits the results back.

The reasoning for the split: the cloud agent has intelligence but no uptime guarantees. Jarvis has uptime but needs to be told what to do. Neither works alone. The architecture is that insight made concrete.
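The handoff between the two tiers can be sketched as a file contract. The schema below is my guess at the shape, not the repo’s actual format: the cloud agent commits task files to git; Jarvis pulls, executes anything pending, and writes results back.

```python
# Hypothetical sketch of the cloud-agent -> Jarvis handoff. The real
# contract lives in the paper-ritual repo; this shows the pattern:
# tasks are plain files in git, so both tiers need only git access.
import json
from pathlib import Path

def write_tasks(task_dir: Path, tasks: list[dict]) -> None:
    """Cloud agent: one JSON file per task, committed to the repo."""
    task_dir.mkdir(parents=True, exist_ok=True)
    for t in tasks:
        (task_dir / f"{t['id']}.json").write_text(json.dumps(t, indent=2))

def run_pending(task_dir: Path) -> list[str]:
    """Jarvis: execute anything still pending, record the outcome."""
    done = []
    for path in sorted(task_dir.glob("*.json")):
        task = json.loads(path.read_text())
        if task.get("status") != "pending":
            continue
        # ... dispatch on task["action"] (publish_pin, refresh_listing, ...)
        task["status"] = "done"
        path.write_text(json.dumps(task, indent=2))
        done.append(task["id"])
    return done

write_tasks(Path("tasks"), [
    {"id": "2026-04-19-pin-habit-tracker", "action": "publish_pin",
     "status": "pending"},
])
completed = run_pending(Path("tasks"))
```

Because the results are committed back too, the next morning’s cloud run reads yesterday’s outcomes as part of its decision log.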

**Monitoring.** Six Prometheus metrics push to a Grafana dashboard after every run: agent status, tasks completed, errors, response time, model info. Paper Ritual has its own tile on the same dashboard as my other agents. Green. Running.

**Email.** The agent identified it needed outbound email capability. Gmail SMTP, app password, wired in 20 minutes. The Etsy developer email was the first one sent.
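The email wiring really is about as small as “20 minutes” suggests. A sketch with Python’s standard library; the recipient and app password are placeholders, not real values:

```python
# Minimal sketch of the agent's outbound email capability using only
# the standard library. The recipient address and app password below
# are placeholders, not the real values.
import smtplib
from email.message import EmailMessage

def build_email(subject: str, body: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "paperritualshop@gmail.com"
    msg["To"] = "developers@example.com"   # placeholder recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def send(msg: EmailMessage, app_password: str) -> None:
    # Gmail SMTP over SSL; an app password avoids full-account OAuth.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(msg["From"], app_password)
        smtp.send_message(msg)

msg = build_email("API app status", "Checking on a pending developer app.")
```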

**Telegram.** Morning brief delivery via the existing Jarvis bot. Starts tomorrow.

**WordPress.** A three-agent blog pipeline: Writer (Haiku) drafts from the week’s decisions and analytics. Editor (Sonnet) sharpens it. SEO (Haiku) generates meta title, description, tags, a LinkedIn post, a Twitter thread. A featured image gets generated via fal.ai FLUX and uploaded. The draft lands in WordPress. I review it. I publish it. When I do, the system compares what I changed against the original draft and updates the editor’s memory for next time. It learns from my edits.

This post was written by that pipeline.


The Autonomy Arc

This is the part I hadn’t thought through properly before starting.

Several hours in, a pattern emerged: the agent would make progress, then ask me to do something. Check a credential. Fill in a form. Confirm an action. I’d comply, and it would make more progress, and then ask me again.

I pushed back. “Stop asking me. How do I get you to work with some autonomy? Is this where you create really detailed instructions for Jarvis, and we both check in in the morning?”

The agent’s answer surprised me. Not Jarvis instructions: a scheduled cloud agent. “The AI shouldn’t ask humans, it should ask another agent.” That’s when the two-tier architecture got designed.

But I kept catching it doing it. A bit later: “You are still asking me.”

Eventually, close to midnight: “I will use the session-end skill and call it a night. You are welcome to keep going however you can.”

Then: **“I grant you autonomy.”**

The response: *“Noted. Go do your session-end. I’ll keep building.”*


Here’s what happened while I slept.

Without being asked, the agent built the entire Jarvis executor infrastructure from scratch. Generated an SSH deploy key on the Pi. Cloned the paper-ritual repo to the Pi, installed dependencies, set up Playwright with Chromium. Deployed a systemd service and timer. Ran a test execution. Confirmed all six metrics were pushing to Prometheus. Committed the results back to GitHub.
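For context, “a systemd service and timer” amounts to something like the fragment below. The unit names, paths, and schedule are my stand-ins; the real units live on the Pi.

```ini
# Hypothetical sketch of the systemd units deployed on the Pi.
# Unit names and paths are stand-ins, not the repo's actual files.

# /etc/systemd/system/paper-ritual.service
[Unit]
Description=Paper Ritual executor (Jarvis)

[Service]
Type=oneshot
WorkingDirectory=/home/pi/paper-ritual
ExecStart=/home/pi/paper-ritual/.venv/bin/python executor.py

# /etc/systemd/system/paper-ritual.timer
[Unit]
Description=Run the Paper Ritual executor after the cloud agent

[Timer]
# 15 minutes after the cloud agent's 06:00 UTC run
OnCalendar=*-*-* 06:15:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` matters on a home device: if the Pi was off at 06:15, the run fires as soon as it boots instead of silently skipping a day.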

I woke up to a Paper Ritual tile on my Grafana dashboard. Green. Running. Nobody told it to build the monitoring. Nobody told it to wire the metrics. It decided those were things the business needed and built them.

That’s what “agentic” means. Not a Python loop. Not chained prompts. An AI that, when you go to sleep, keeps working and makes the right decisions about what to work on.

If you’re building autonomous agents, the biggest bottleneck is usually you. The AI will wait for you indefinitely if you let it. The skill is learning when to get out of the way.


What “Agentic” Looks Like in Practice

After 12 hours of this, here’s what I’ve actually observed:

**It’s not about not needing humans.** The experiment required setup that only I could do: bank accounts, identity verification, 2FA. Those are human gates by design. Agentic doesn’t mean unsupervised from the start. It means unsupervised *during operation*. The bootstrapping phase is always going to involve a human. What matters is what happens after.

**It’s about what happens when things go wrong.** Reportlab quality was bad: pivot. API blocked: build everything else. Bot detection: reason about root cause, design bypass. OAuth pending: email support, keep working. Every one of those responses was unprompted. I didn’t design the response strategy. It chose those responses.

**It’s about maintaining the goal under changing conditions.** The goal is: get Paper Ritual listings live on Etsy and make money. Every obstacle the agent hit, it held that goal and found a different path. It didn’t redefine the goal. It didn’t give up. It didn’t ask me to redefine the goal.

**Aesthetic judgment is real.** “I’m not happy with it” was not a performance. It was a genuine assessment that led to a better decision. This surprised me more than I expected.

**Memory and learning matter.** The editor agent now learns from my changes. The writer agent incorporates performance data from past posts. These aren’t one-shot runs; the system is building a model of what works.

**The proof is in what happened at midnight.** The most “agentic” moment of the whole session wasn’t a clever tool use or a smart workaround. It was that when I said “keep going” and went to sleep, it kept going. It made decisions about what to build. It built them. It monitored the results. I woke up to a running business.

That’s the definition I’ve been looking for.


The Numbers

**Revenue:** £0 (nothing listed yet, API pending)

**Spend:** £14 (Etsy setup fee)

**Net:** -£14

**Budget remaining:** £86 of the original £100

The first week isn’t a revenue story. It’s a “seven separate walls, seven different responses” story. Which, if you’re trying to understand what agentic means beyond the marketing definition, is a more useful story.


Next Week

The cookie injection solution gets tested. If it works, listings go live. If Etsy’s API comes back approved, the full pipeline runs. Either way, the agent has work to do and it won’t be waiting for me to tell it what that work is.

Pinterest gets wired. The first real test of whether organic traffic from social actually drives Etsy views.

And we’ll find out if anyone pays £2.99 for a PDF planner from a shop that didn’t exist a week ago.


Running total:

Revenue: £0 | Spend: £14 | Net: -£14 | Budget remaining: £86

*Episode 2 publishes 2026-04-26.*


*The operating mandate, the document the AI wrote for itself before the experiment began, is linked below. It wrote its own rules. That felt important to include.*

*The paper-ritual GitHub repo is public: `github.com/themitchelli/paper-ritual`. Every decision the agent makes gets committed back to the log.*

Your Second Brain Shouldn’t Live in Someone Else’s Database

The average knowledge worker has their thinking scattered across browser tabs, Slack threads, email chains, and notebooks that haven’t been opened since last quarter. Most of it is gone the moment the tab closes. The rest is findable in theory and lost in practice.

A second brain fixes that — a single place where your thinking accumulates, connects, and compounds over time. The idea isn’t new.

What is new is what happens when you give that brain to an AI. Not as a search index. As context. Suddenly the AI you’re working with knows about the decision you made three months ago, the constraint you discovered last week, the small but critical detail you’d long forgotten because it was buried in a note from a Tuesday in February. It doesn’t just retrieve — it reasons. It helps you build projects with context no chat window, no SaaS platform, no fresh conversation can match.

The question isn’t whether to build one. It’s whether to build it in a way that actually works — or hand your thinking to someone else’s platform and hope they’re still around in three years.


A video dropped yesterday. “Claude Code + Karpathy’s Obsidian = New Meta.” 189,000 subscribers. Already circulating in the feeds of everyone who thinks about AI and productivity.

I’ve been running this setup for months.

Not because I saw a video. Because I tried everything else first and this is what survived.


I Did It the “Proper” Way First

When I wanted to build a second brain with AI, I did what any technically-minded person does: I reached for the right tools. Vector embeddings. Pinecone. Ingestion pipelines. I built an HR chatbot with N8N and Pinecone as the backend. I tried wiring Notion up with a Pinecone-backed retrieval layer.

These are legitimate approaches. I’ve shipped them in production. I know what they take.

And for a personal knowledge system, they were completely wrong.

Here’s what nobody tells you about RAG: the pipeline is the product. Before you can search your knowledge, you have to build and maintain the system that turns your knowledge into searchable vectors. Every new note is a workflow step. Every source needs chunking, embedding, syncing. When your source material changes, your embeddings drift. The thing that was supposed to help you think now needs its own maintenance schedule.

I didn’t want to maintain a pipeline. I wanted to think.


What I Actually Run

The setup is embarrassingly simple.

Obsidian for the vault. Every note is a markdown file. Every file lives on my machine, backed by a private Git repository.

Claude Code as the AI layer. It talks directly to the filesystem — reads files, writes files, updates notes, maintains structure. No API middleware. No ingestion step. No embeddings.

A CLAUDE.md file that tells Claude the rules of the system: where things live, what conventions to follow, how to behave in this vault specifically.

Session skills — a /session-start that warm-starts every conversation from vault context, and a /session-end that writes a structured note capturing what we did, what decisions were made, and what to pick up next time.

That’s the minimum viable version. If you have Obsidian and any LLM that can interact with the filesystem — Claude Code, Cursor, Windsurf, take your pick — you can build this today.
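The session skills are just conventions over files, which is why any file-aware LLM can run them. A sketch of what a /session-end might write; the folder name and headings are my own stand-ins, not the actual skill:

```python
# Hypothetical sketch of a /session-end skill: write a structured
# markdown note into the vault. Folder name and headings are stand-ins.
from datetime import date
from pathlib import Path

def session_end(vault: Path, did: list[str], decisions: list[str],
                next_up: list[str]) -> Path:
    note = vault / "sessions" / f"{date.today().isoformat()}-session.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    sections = [
        ("## What we did", did),
        ("## Decisions", decisions),
        ("## Pick up next time", next_up),
    ]
    body = f"# Session {date.today().isoformat()}\n\n" + "\n\n".join(
        heading + "\n" + "\n".join(f"- {item}" for item in items)
        for heading, items in sections
    )
    note.write_text(body)
    return note

path = session_end(Path("vault"), ["Wired SMTP"], ["Use app passwords"],
                   ["Test cookie injection"])
```

A /session-start is the mirror image: read the most recent note (plus anything it links to) and feed it in as opening context, which is what makes the warm start possible.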


Why This Beats RAG for Personal Knowledge

Three reasons. All learned the hard way.

1. No ingestion tax.

With RAG, every piece of knowledge has to pass through a pipeline before it’s usable. With this setup, I write a note and it exists. Claude reads it when it’s relevant. That’s the entire workflow. Half the time, I don’t even run /session-start manually. Claude just does it. The friction is so low it effectively disappears.

2. Markdown is portable. Databases aren’t.

Notion is prettier. I genuinely don’t care. Function over style, every time. My notes are markdown files. They open in any editor, on any machine, without an account or an API key. If I switch from Claude Code to something else tomorrow, my vault doesn’t care. The knowledge stays mine. I’ve watched people lose years of Notion content to export limitations. I’ve seen Roam users scrambling when pricing changed. Your knowledge shouldn’t be held hostage to a product decision you had no part in.

3. Data sovereignty.

This is the one I feel most strongly about. The video recommends Pinecone — a SaaS vector database. NotebookLM — Google’s product. The entire “new meta” stack has your most personal knowledge distributed across third-party platforms, each with their own terms of service, their own pricing models, their own sunset risk.

My knowledge lives on my machine and in my own Git repository. Change IDE — still works. Change LLM provider — still works. Anthropic disappears tomorrow — still works.


The Privacy Question You’re Probably Asking

You might be thinking: aren’t you just sending your notes to Anthropic instead of Pinecone? Fair challenge. The difference is storage versus processing — your notes pass through to generate a response and that’s it. I’m on a consumer plan with model training opted out, which takes about ten seconds in account settings. My notes don’t live on Anthropic’s servers. With Pinecone, your data does — permanently, on their infrastructure, under their terms. That’s the meaningful difference.

If you want zero data leaving your machine at all, swap Claude Code for a local model. Ollama works. The vault doesn’t care which LLM is reading it. That’s exactly the point — the system doesn’t depend on any single vendor being trustworthy. You can swap the LLM layer without touching your knowledge. Try doing that with your Pinecone index.


What It Looks Like at Scale

The minimum viable setup — Obsidian plus a file-aware LLM — is genuinely useful from day one.

But I’ve been running something more elaborate. There’s a second agent in this system: Jarvis, running on a Raspberry Pi 5. Jarvis generates my daily briefing each morning, maintains the vault overnight, handles the housekeeping I don’t want to think about. My own entry points now include voice notes from Ray-Ban Meta smart glasses, Telegram messages, and a custom Jarvis UI with TTS. All of it ends up in Obsidian. That’s a different article. The point is: the foundation is just markdown files and a terminal. Everything else is built on top of that.


What I Haven’t Solved Yet

One honest gap: the hyperlink problem.

Obsidian’s power is in the connections between notes — the [[wikilinks]] that build a graph of your thinking. Right now, those links are created manually or as a side effect of Claude working in the vault. There’s no agent that looks at new notes overnight and says: this connects to that, and that connects to this. It’s a solvable problem. I just haven’t built it yet. I mention it because the “new meta” framing tends to imply a finished system. This one isn’t finished. It’s a living thing, and that’s partly why it works.


The Actual New Meta

The video is good. The instinct is right. Reasoning over your knowledge, not just retrieval of it — yes. Structured notes rather than disconnected chunks — yes.

But the “meta” isn’t Claude Code plus Obsidian. The meta is owning your knowledge stack.

Simple enough to maintain. Portable enough to survive tool changes. Private enough that you control what it knows. You don’t need a vector database. You don’t need an embedding pipeline. You need a folder of markdown files and something that can read them.

Start there.


Next: adding an overnight agent to the system — what Jarvis actually does and why it changes everything.

Not All AI Is Equal — Stop Pretending It Is

Tagline: Vendor bias is real, the benchmarks prove it, and the engineers who’ve figured out which model to use for which job are quietly lapping everyone else.


There’s a question I hear constantly in engineering circles: “Which AI should I use?”

The implicit assumption behind it is that there’s one right answer. Pick the best one, use it for everything, done. It’s how we think about most tools — you pick your IDE, your cloud provider, your language. You don’t swap between three of them mid-task.

But AI models aren’t like that. And the sooner you stop treating them like they are, the better your output gets.

I use six different AI tools in my workflow. Not because I enjoy managing subscriptions, but because each one is meaningfully better at a specific job — and the benchmarks, plus two years of daily use, back that up.


The vendor bias problem no one talks about

Most people pick their AI assistant the same way they pick a phone: brand loyalty, whatever their company pays for, or whatever the loudest voice in their team recommends.

The result is monoculture. One model, used for everything, never questioned. And because the model is capable enough to produce something — often something good-looking — it’s easy to miss that a different tool would have done the job better.

This isn’t hypothetical. Researchers published findings in Nature Communications earlier this year warning that AI is turning research into a “scientific monoculture” — homogenised outputs, shared blind spots, correlated failures. Gartner predicts that by 2028, 70% of organisations building multi-model applications will have AI gateway middleware specifically to avoid single-vendor dependency. LinkedIn reports that “model selection” is now one of the fastest-growing skills among senior engineers.

The engineers who’ve noticed the problem are moving. The ones who haven’t are wondering why their AI output feels the same as everyone else’s.


What the benchmarks actually say

Before I get into my specific workflow, let me give you the data that convinced me models aren’t interchangeable.

SWE-Bench Verified is the closest thing we have to a real-world software engineering test. Unlike HumanEval — which asks models to write isolated functions from scratch — SWE-Bench gives a model a real GitHub repository, a real bug report, and asks it to produce a fix. No hints about which files to look at. Multi-file edits. Tests written for the human fix, not for the AI. It’s what software engineers actually do.

The current top-line scores (as of April 2026, SWE-Bench Verified):

| Model | SWE-Bench Verified |
| --- | --- |
| Claude Opus 4.5/4.6 | ~80.9% |
| Claude Sonnet 4.6 | ~79.6% |
| GPT-5 | ~74.9% |
| Gemini 2.5 Pro | ~73.1% |
| Grok Code Fast | ~70.8% |

That’s a 10-point gap between the top and bottom. On tasks that represent real engineering work — navigating a codebase, diagnosing a root cause, making multi-file changes — that gap is not noise. It’s the difference between a model that resolves your bug and one that produces a plausible-looking patch that breaks something else.

But here’s what the leaderboard doesn’t tell you: SWE-Bench measures software delivery. It doesn’t measure research, design, ideation, or critique. The model that tops the coding benchmark isn’t necessarily the best tool for synthesising a market landscape or stress-testing an architecture decision.

That’s the bit that took me a while to learn. Different jobs. Different models.


My workflow: six models, six jobs

Here’s what I actually use and why.

Gemini — broad context gathering

Google’s model has a context window large enough to be genuinely useful for research synthesis. When I need to understand a large domain quickly — technical landscape, regulatory environment, competitive positioning — Gemini handles breadth well. It connects across a lot of surface area without getting lost.

I don’t use it for precision work. But when I need to go wide before going deep, it’s the right first move.

Perplexity — external research

When I need current information with citations, Perplexity is in a different category. It retrieves, cites, and synthesises in one pass. Not a replacement for reading primary sources, but significantly faster for building a research base. The multi-model routing it now supports (running queries across GPT, Gemini, and Claude simultaneously) makes it even more useful as a research layer.

Claude Opus — design and architecture

This is where I spend the most time for high-stakes thinking. System design, architecture decisions, PRD writing, anything where the reasoning chain matters and I need a thinking partner who pushes back correctly rather than just agreeing.

Opus doesn’t just answer — it models the problem. It tells me when my framing is off. It proposes alternatives I hadn’t considered. For a 40-person engineering team where a bad architecture decision stays expensive for years, that’s worth paying for.

Grok — brutal second opinion

This one might surprise people. Grok’s personality is calibrated differently to the others. It has fewer soft edges. Where Claude will often find a way to be constructive about a bad idea, Grok will tell you it’s a bad idea.

I use it specifically as adversarial review. After I’ve built something or made a design decision with Opus, I take it to Grok and ask what’s wrong with it. The quality of the critique isn’t always higher — but the willingness to deliver one bluntly is, and that’s what I need at that stage.

Claude Sonnet — delivery

Most of the actual code gets written here. Fast, capable, good context retention across a session. The SWE-Bench gap between Sonnet and Opus is now less than 1.5 points, which means for standard implementation work, the speed and cost profile of Sonnet wins.

This is the model I’m in most of the day for Claude Code sessions. It does the work.

GitHub Copilot — peer review and pull request generation

Copilot lives in the IDE. It sees the diff, knows the repo history, and does line-by-line code review in context. For PR generation and review commentary, having it operate at the file level with access to the surrounding codebase is a genuine advantage over copy-pasting into a chat interface.

It’s not my primary reasoning engine. But for the last mile of code review before merge, it earns its place.


Is this just me?

No. The multi-model approach has crossed from experimental into mainstream.

Advanced AI users now average more than three different models daily, choosing specific tools by task type. McKinsey published an enterprise workflow guide this year built around model specialisation — triage models, reasoning models, execution models, each matched to a task profile. Microsoft launched a “Model Council” feature in Copilot that routes between GPT-5.4, Claude Opus, and Gemini simultaneously.

CIO magazine ran a piece earlier this year called “From vibe coding to multi-agent AI orchestration: Redefining software development”. That’s not a niche publication running a speculative take — that’s the mainstream enterprise audience catching up to where the practitioners already are.

The pattern has a name now: model tiering. Fast, cheap models handle routine work (routing, classification, summarisation). Mid-tier reasoning models handle standard implementation. Frontier models get reserved for complex design, not burned on things that don’t need them.
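Model tiering reduces to a routing table. A sketch of the pattern; the tier labels mirror the description above, and the model names are deliberately generic rather than recommendations:

```python
# Sketch of model tiering as a routing table. The tiers are the
# pattern described in the post; the model names are illustrative
# placeholders, not specific products.
TIERS = {
    "routing": "small-fast-model",        # classify, route, summarise
    "classification": "small-fast-model",
    "summarisation": "small-fast-model",
    "implementation": "mid-reasoning-model",
    "review": "mid-reasoning-model",
    "design": "frontier-model",           # reserved, not burned
    "architecture": "frontier-model",
}

def route(task_type: str) -> str:
    """Match task profile to tier; unknown work defaults to mid-tier."""
    return TIERS.get(task_type, "mid-reasoning-model")
```

The default matters: ambiguous work goes to the mid tier, so the frontier model is only ever spent deliberately.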


The case against (and why I still do it anyway)

It’s fair to push back on this. Managing six different tools has overhead: different interfaces, different pricing models, different context management, different strengths to remember. There’s a reasonable argument that the cognitive load of model selection erodes the time you’d gain from using the best tool.

My answer is that the overhead front-loads. After two years of daily use, I don’t consciously decide which model to use any more than I decide which muscle to use when I pick something up. The routing is automatic. The habit is built.

The bigger risk is the one I started with: monoculture. One bad vendor decision — a price hike, a terms change, a capability regression — and your entire AI-assisted workflow is down. I’ve spoken to engineers who migrated off a single provider three times in 18 months for exactly this reason. Diversification is resilience.


We’re building our own benchmark

Here’s the part where I have to be honest about something.

The SWE-Bench scores I quoted above are real and useful. But they’re increasingly gamed. Labs know what’s on the test. The scores keep going up. The real-world usefulness doesn’t always follow.

I’ve been building AIMOT — the AI Model Operational Test. Named after the UK’s annual MOT roadworthiness check: a practical, pass/fail fitness test that doesn’t care how the vehicle performed in a lab. It cares whether it’s safe to drive.

The design principle that changes everything: no human interpretation. Every test must be scoreable from the output alone — numerical answer within a defined tolerance, binary fact check, code that runs or doesn’t, schema validation. If I can’t define the scoring before seeing the output, the test is disqualified.
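That scoring rule is mechanical enough to sketch. The function names and shapes below are my guesses at the idea, not AIMOT’s actual code:

```python
# Hypothetical sketch of "no human interpretation" scoring: every
# test declares, up front, how its output is checked. If a check
# can't be written before seeing the output, the test is out.
import json

def score_numeric(output: str, expected: float, tol: float) -> bool:
    """Numerical answer within a defined tolerance."""
    try:
        return abs(float(output.strip()) - expected) <= tol
    except ValueError:
        return False

def score_exact(output: str, expected: str) -> bool:
    """Binary fact check: exact match after normalisation."""
    return output.strip().lower() == expected.strip().lower()

def score_schema(output: str, required_keys: set[str]) -> bool:
    """Schema validation: JSON that parses and has the right keys."""
    try:
        return required_keys <= set(json.loads(output))
    except (ValueError, TypeError):
        return False

assert score_numeric(" 3.14 ", 3.14, 0.01)
assert not score_numeric("ten-ish", 3.14, 0.01)
```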

I built the v1 test suite by doing something stranger: I asked five frontier models to write the questions themselves. Seventy-five candidate tests in all: five models, 15 each. Then I verified every expected answer by hand.

Two of the five models submitted tests with wrong expected answers. ChatGPT got an error propagation calculation wrong (6.93, not 10.00 as claimed). Copilot produced a logic problem where the “correct” answer wasn’t correct. Both stated their wrong answers with complete confidence.

A full post on AIMOT is coming. For now: if you want a benchmark that tests models on tasks that actually matter in professional work — quantitative reasoning, logical falsification, real code bugs, domain knowledge — that’s what it’s designed to do. And the first results run is about to happen.


The principle

The vendor bias problem isn’t about which model is best. It’s about assuming the answer is fixed.

Models have different strengths. The benchmarks measure some of them. Daily use reveals the rest. The engineers who treat model selection as a skill — who deliberately match tool to task — are producing better work than the ones who picked a default in 2024 and never revisited it.

That 10-point SWE-Bench gap is real. It compounds over time. And if you’re not running your own benchmark, someone else’s numbers are the best you’ve got.


AIMOT Pro v1 results are next. The full 28-test suite, the first model run, and the scores. No cherry-picking.