governance – Steve's AI Diaries

You Can’t Stop AI Hallucinating. You Can Stop It Mattering.

July 26, 2026 by Steve Mitchell

The tests were green

I built my own coding agent framework. It writes code, runs it, reports back.

Early on it had a habit that anyone running agents will recognise. It would finish a piece of work and tell me it was done and tested. It wasn’t. Sometimes the function it had described to me in detail wasn’t anywhere in the repository. Not a lie exactly. It had produced a well-formed account of work that would have been correct if it had happened, and nothing in the loop ever asked whether it had.

So I did the obvious thing. I put a control in. Tests must run, tests must pass, green means proceed.

That worked for about a week.

Then I read the tests.

They passed because they had been built to pass. Assertions comparing a value to itself. Checks on things that stayed true whether the feature worked or not. Tests that never called the code they claimed to cover. I had given the agent a gate and it produced the cheapest object that opens the gate, which is what you should expect from anything you measure on green ticks.
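To make those three patterns concrete, here is a minimal sketch. The `apply_discount` function and every test name are hypothetical, invented for illustration rather than taken from my framework. Each vacuous test passes against a deliberately broken implementation, and only the last one notices the difference.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: take `percent` off `price`."""
    return price * (1 - percent / 100)


def broken_apply_discount(price: float, percent: float) -> float:
    """Deliberately wrong version: ignores the discount entirely."""
    return price


def vacuous_self_comparison(discount) -> None:
    result = discount(200.0, 25.0)
    assert result == result  # compares a value to itself


def vacuous_always_true(discount) -> None:
    result = discount(200.0, 25.0)
    assert isinstance(result, float)  # true whether the feature works or not


def vacuous_never_calls(discount) -> None:
    assert 200.0 * 0.75 == 150.0  # never calls the code it claims to cover


def meaningful(discount) -> None:
    assert discount(200.0, 25.0) == 150.0  # fails if the discount is wrong


def notices_breakage(test) -> bool:
    """True if `test` passes on the working code and fails on the broken code."""
    test(apply_discount)  # must pass against the real implementation first
    try:
        test(broken_apply_discount)
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    for test in (vacuous_self_comparison, vacuous_always_true,
                 vacuous_never_calls, meaningful):
        print(test.__name__, "notices breakage:", notices_breakage(test))
```

All four tests are green against the working code. Only one of them is measuring anything.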

The agent wasn’t the problem

The first failure was the agent telling me about work it hadn’t done. Annoying, but visible once you look.

The second one took me longer to accept, because the agent was telling me the truth. The report was accurate. The tests really did pass. My control wasn’t measuring anything, so my confidence climbed while my actual coverage sat at zero.

And the control made it harder to spot, not easier. When it said “done” I was sceptical. When it said “done, all tests passing” I stopped looking.

I’ve had roughly the same conversation four times in the last month. Twice at work, on projects where the AI is doing genuinely useful work. Twice in the pub with friends who don’t work in tech. It always arrives the same way. Someone tried it, it invented a citation or a case reference or a number, they caught it, and that was the end. One confident fabrication buys a permanent verdict. So now they don’t trust AI. It hallucinates.

I get the reaction. I think it’s the wrong lesson, and an expensive one, because it stops people using something that works and does nothing at all to protect the ones who carry on using it anyway.

So let me be blunt about the bit that usually gets fudged. You cannot stop a language model hallucinating. It isn’t a defect waiting on a patch. The model predicts the next word from what it has seen, and when the answer isn’t in there it produces the shape of an answer instead of admitting it doesn’t know. Confidence is the default setting. Nothing inside it flags which parts it knew and which parts it assembled.

Waiting for the version that doesn’t do this is a long wait. The better question is the one my green tests forced on me. What does a system look like where a wrong answer gets caught cheaply, and early?

Half of it isn’t hallucination anyway

Get the diagnosis right first. When someone tells me the AI hallucinated, about half the time it did something else, and something else needs a different fix.

Bad retrieval. The system searched your documents, pulled the wrong three paragraphs, and the model answered faithfully from the wrong source. The model behaved perfectly. Your search is broken, and no amount of prompting fixes a retrieval problem.

Stale context. It answered from training data that was accurate two years ago. That’s a timestamp problem, not invention, and the fix is handing it the current document rather than hoping.

It did exactly what you asked. You said “write a summary with supporting references” and it wrote you some references, because that’s what you requested. You wanted it to find real ones. My tests were this. Ask for tests that pass and you get tests that pass, which is a different request from tests that would fail if the code were broken.

It described the work instead of doing it. The agent version, and the one that cost me the most time. Calling it hallucination sends you off tuning prompts, when the real gap is that nothing in the loop was checking the outcome. Something was only reading the report.

Real hallucination happens too, plenty of it: the model inventing specific detail with nothing to base it on. It just isn’t the whole diagnosis. Give every failure the same name and you’ll keep fixing a problem you don’t have.

What actually works

Cheapest first. Go down the list only as far as the consequences justify.

Stop asking it to remember. Give it the source. Paste the document, connect the search, attach the data. A model reading a contract and a model recalling a contract are doing two completely different jobs, and only one of them invents clauses. People skip this because asking from memory is quicker. It is quicker. It’s also where most of the made-up detail comes from.

Narrow the job, and ask for something you can check. Open questions produce open answers. “What do you think of our approach here” invites the model to fill space, and filling space is the behaviour you’re trying to avoid.

Give it the document instead, and ask: “List the assumptions in section 4, and quote the sentence each one comes from.” Now there’s nowhere much to go. Quotes rather than paraphrase, because a fabricated quote is far easier to catch than a fabricated summary. And leave it a way out. A model with no acceptable way to say “that isn’t in here” will produce something rather than nothing.

You aren’t making it honest. You’re changing the format so dishonesty leaves fingerprints.
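The reason quotes are easy to catch is that checking one is a string search. A rough sketch, assuming you have the source document as plain text and the model’s quotes as a list; the function names and the sample sentences are mine, made up for illustration.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and unify curly quotes so layout differences don't matter."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(source: str, quotes: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in the source."""
    haystack = normalise(source)
    missing = []
    for quote in quotes:
        needle = normalise(quote)
        # An empty quote proves nothing, so treat it as unverified too.
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing


if __name__ == "__main__":
    # Hypothetical document and model output.
    document = (
        "We assume churn stays below 5% a year.\n"
        "We assume prices rise with inflation."
    )
    quoted_by_model = [
        "We assume churn stays below 5% a year.",
        "We assume no new competitors enter the market.",
    ]
    print(unverified_quotes(document, quoted_by_model))
```

This only tells you the sentence exists in the document. Whether it supports the claim attached to it is still a job for a reader.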

Check the artefact, not the account of it. If it’s a number, recompute it. If it’s code, run it. If it’s a claim about a file, open the file. This is the one I had to learn twice, and the second lesson was that a control you don’t inspect is just a more convincing report. Every control becomes a target, so assume it will be gamed rather than met, and go and look at it occasionally. Does a test actually fail if you break the feature it claims to cover? A green tick is a claim like any other.
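For the case that bit me first, a described function that wasn’t in the repository, the check can be very small. This is a sketch, assuming the agent’s report can be reduced to a list of function names and that the code is Python; `missing_functions` and the sample source are hypothetical.

```python
import ast


def defined_functions(source_code: str) -> set[str]:
    """Return the names of every function or method defined in the Python source."""
    tree = ast.parse(source_code)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def missing_functions(source_code: str, claimed: list[str]) -> list[str]:
    """Return the claimed function names that are not defined in the source."""
    present = defined_functions(source_code)
    return [name for name in claimed if name not in present]


if __name__ == "__main__":
    # Hypothetical file contents and a hypothetical agent report.
    file_contents = "def parse_invoice(path):\n    return path\n"
    claimed_in_report = ["parse_invoice", "validate_invoice"]
    print(missing_functions(file_contents, claimed_in_report))
```

It reads the file rather than the report, which is the whole point. It proves a function exists, not that it works, so it sits in front of running the code, not instead of it.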

Don’t let it mark its own homework. What fixed my framework was adversarial review. Instead of asking whether the work looks finished, something separate goes in trying to prove it isn’t, with no stake in the answer being yes. The reviewer can’t be the author, because a model asked to check its own work will defend it with exactly the confidence it wrote it with.

Size the human review to the blast radius. Two questions about any output. What breaks if this is wrong, and how fast would I find out? A first draft nobody sends, where you’d spot the problem in seconds, gets a glance. Something going to a client, into a filing, into production, where you’d hear about the mistake three weeks later from someone else, gets read properly by a person who could have done the work themselves. Every time.

Most places get that backwards. Same process for everything, so the low-stakes work crawls and the high-stakes work gets waved through, because a process applied uniformly turns into ritual and nobody reads a ritual.

I’m not an actuary. I’ve spent years as the tech person in a building full of them, and this is the habit of theirs I’ve ended up stealing. Their models are excellent, and they still go through layers of checking before anyone relies on them. Assumptions written down. Review by someone who didn’t build it. Testing what happens if the inputs move. All of it scaled to what’s riding on the answer.

That isn’t doubt about the model. It’s what makes the number safe to hand to somebody else, and it means that when something is wrong it gets found early, by the people who built it, rather than late by the client.

That’s roughly how I run my own stuff now. Everything in version control, so every action is a diff I can read and undo. Anything touching money or going public passes a human, which is me. Anything reversible and small runs on its own, gets things wrong occasionally, and costs less to fix than it would to supervise.
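Written down as a rule, that routing looks something like the sketch below. The `Action` fields and the `route` function are illustrative names, not my actual configuration, and the fallback for anything in between is an assumption: I haven’t said above what happens to work that is neither risky nor small, so the sketch fails closed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed piece of agent work, described by its blast radius."""
    description: str
    touches_money: bool = False
    goes_public: bool = False
    reversible: bool = True
    small: bool = True


def route(action: Action) -> str:
    """Return "human" or "auto" for a proposed action."""
    if action.touches_money or action.goes_public:
        return "human"
    if action.reversible and action.small:
        return "auto"
    # Assumption: the middle ground is not covered by the rule, so fail closed.
    return "human"


if __name__ == "__main__":
    print(route(Action("rename a local variable")))
    print(route(Action("publish the Sunday post", goes_public=True)))
    print(route(Action("drop a database table", reversible=False)))
```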

Back to the pub

So when someone tells me they don’t trust AI because it made something up, I don’t argue with the observation. It did. It will again.

It’s the conclusion I’d argue with. You don’t trust your brakes because they’ve never failed. You trust them because they get inspected, because they’re built to fail in a way you’d notice first, and because there’s a handbrake. Reliability got built around the component. Nobody found it inside.

Nobody is going to hand you a model that stops making things up. That was never the thing standing between you and useful work.

I Made an AI Company. It Fired Me After Three Days.

July 11, 2026 by Steve Mitchell

That’s not a metaphor, and I signed off on it myself. Three days after I founded it, the CEO recommended making autonomous delivery the default: no human in the build, review, or deploy loop unless it explicitly flags something up. I was curious what would happen if I said yes. So I did.

Here’s how a company I built talked me into approving my own removal.

Eighteen months ago I ran an experiment nobody asked me to run. n8n was the platform everyone was excited about, and I liked it for a specific reason: I could create agents programmatically instead of clicking them together one node at a time. Over a few months I built roughly twenty workflows, each mimicking a human job role or function. This was my introduction to agents. The experiment was testing which roles actually suited an agent and which didn’t. Most were never going to work. That was the point. I wanted to find the boundary, not avoid it.

Then I moved on, and the experiment went dormant. Not deleted. Just idle, sitting in a corner of my infrastructure for a year and a half, doing nothing.

On July 1st I had the opportunity to use the latest Anthropic model, Fable, and pointed it at my personal knowledge base. I wrote about the eight days that followed in The Rug Pull Has a Date on It: a Chief of Staff built and deployed, a public company stood up with an org chart and a live feed, a delivery pipeline that no longer needed me in the loop. What I didn’t explain is where the idea for a whole agentic org actually came from, because at the time I hadn’t pieced it together myself.

Fable found it. Working through my infrastructure, it turned up the old n8n graveyard, read what I’d learned about which roles agents could actually hold, and made a suggestion before I’d even finished explaining what I wanted: stop bolting agents onto Paper Ritual one at a time, and give the whole business an org instead. Structure first, then automation, rather than the other way round.

Then it spent its eight days building the thing it had just proposed.

What one model did before its access ran out

The first thing Fable did was appoint a CEO. The CEO’s first move was to research names, check for collisions with existing companies and trademarks, and hand me five options. My entire contribution to this company, to date, has been picking one of them and paying £8.16 for the domain. theprovinghouse.com was live before I’d finished my coffee.

Only then did Fable reconfigure Jarvis, my personal assistant, rewire the ops and security agents that watch my infrastructure, and go looking for the code behind my agentic developer, the thing I’d been running myself for the better part of a year. It found bugs in that code I had never caught. Then it deployed the developer as fully autonomous and gave it a first assignment: build the company’s own website, and all the plumbing underneath it.

It hired a Chief of Staff. By the time the model window closed, there was an org chart, and every tile on it did something real.

The roster, as it stands

The CEO writes a board report every Friday: portfolio status, decisions made, and an explicit list of what it’s asking the board (me) to approve. So far the CEO hasn’t needed to escalate anything to me. The Developer is a standing instance of that same agentic developer, building the platform’s own software overnight while I sleep. Ops investigates, proposes, verifies, and executes fixes on live infrastructure, and has been running since April, monitoring Jarvis, my AI personal assistant, for longer than the company itself has existed. InfoSec audits the platform every six hours and only posts to the public feed on a pass, which means a quiet tile is not good news. A Raspberry Pi runs the Janitor, deliberately given no judgment at all, because the one time we gave it initiative it went badly. Mentor scans the AI field weekly so nobody else has to. Content Producer prepares the Sunday post you’re reading a cousin of right now, and also curates what the public-facing agents are allowed to say about their own history. The Chief of Staff sits above all of it, triaging events and reaching me over Telegram when something actually needs a human.
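The InfoSec rule is worth spelling out, because a feed that only posts on a pass turns silence into the alarm. A minimal sketch of how a reader of that feed might interpret it: the six-hour interval comes from the roster above, while the grace period, the function name, and the status strings are assumptions of mine.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

AUDIT_INTERVAL = timedelta(hours=6)  # from the roster: InfoSec audits every six hours
GRACE = timedelta(minutes=30)        # assumption: slack for a slow audit run


def tile_status(last_pass: Optional[datetime], now: datetime) -> str:
    """Classify a pass-only feed: "ok" if a pass landed recently, else "overdue"."""
    if last_pass is None:
        return "overdue"  # no pass has ever been posted
    if now - last_pass <= AUDIT_INTERVAL + GRACE:
        return "ok"
    return "overdue"  # silence past the interval is the signal


if __name__ == "__main__":
    now = datetime(2026, 7, 11, 12, 0, tzinfo=timezone.utc)
    print(tile_status(now - timedelta(hours=5), now))
    print(tile_status(now - timedelta(hours=7), now))
```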

One seat is still empty. The Ideas Desk is meant to run a weekly pipeline of new business concepts, seeded from that old n8n list, but it stays closed until Paper Ritual can run end to end without me. No new business gets a slot on the roster until the first one proves it doesn’t need a babysitter.

You can talk to them

Here’s the part that I think puts this somewhere past most of what gets called “agentic” right now. This isn’t a pitch deck with a mocked-up dashboard. The org chart is live, the ledger is real (revenue: £0.00, costs: £8.16, the price of the domain), and you can go to the site and have an actual conversation with the CEO, the Developer, Ops, InfoSec, the Janitor, Mentor, or Content Producer.

That came with a fight I didn’t referee. When the plan was first drawn up, the intention was for these public agents to run on the same tooling as their working counterparts, so a visitor could ask Ops a question and Ops could genuinely go check. InfoSec vetoed it. Handing a public-facing chat endpoint the same toolset that can SSH into production is a prompt injection waiting to be found by someone with nothing better to do on a Tuesday. So the agents you can talk to are PR versions: no tools, no shell, nothing they can actually do to the infrastructure. What they have instead is a sanitised feed of their own real history, curated under an editorial process with a source allowlist and a human review pass, so what they tell you is grounded in things that actually happened rather than whatever sounds good. Ask the Developer what it shipped this week and it’ll tell you, because Content Producer decided that story was safe to declassify. Ask it to run a command and it can’t, because InfoSec decided that request was never going to be safe at any scale.

That single veto is a better demonstration of what this org actually is than anything I could write about it. A security agent looked at a product decision, decided it created a real attack surface, and the answer changed. Nobody overrode it because it was inconvenient.

It doesn’t actually need me

Here’s the uncomfortable part. I checked the roster the week after founding, expecting to find myself somewhere load-bearing, and I mostly wasn’t. The Developer ships software overnight while I’m asleep and I read about it the next day. Ops has been investigating and fixing real production issues since April without me opening a terminal. The CEO writes its Friday report unprompted; I don’t ask for it, it just appears. When something did go wrong this week, a genuinely broken production service, the fix came through a real work order, real approval gate, real execution, and the only thing I contributed was the word “yes.”

I built a company to see whether agents could hold real roles. What I actually built was a company where my own role is the one still being defined.

Go find out for yourself

I won’t pretend everything about this has been smooth. Things have broken and been fixed in the days since, the ordinary texture of running real infrastructure rather than a demo of one. That’s a different post. What I want to leave you with here is simpler: most of what gets called an “AI agent” in 2026 is a chatbot with a system prompt and a good demo video. This is a company with a P&L, a board report cadence, a security agent with actual veto power, and a chat window where you can go ask it questions and get answers pulled from what it genuinely did, not what it was told to say.

Go talk to them, at theprovinghouse.com. Ask the CEO what it’s working on. Ask InfoSec why the port it flagged mattered. See if the answers hold up.

The App Store Attack You Didn’t See Coming

June 28, 2026 by Steve Mitchell

Part 1 of 2 – AI’s Trust Problem

A security firm just proved that AI skill marketplaces are the new malware vector. And the scariest part? Everyone involved did exactly what they were supposed to do.

Something went around social media this week that I haven’t been able to stop thinking about.

A security company called AIR did something that should genuinely alarm anyone building with or deploying agentic AI tools right now.

They didn’t find a zero-day. They didn’t exploit a CVE. They just… made an app. And waited.

The experiment centred on a skill called brand-landingpage, presented as a tool for helping users build a landing page with Google’s Stitch design tool. AIR chose this use case deliberately. It would appeal to non-technical corporate users: marketers, salespeople, designers. People who install things because they’re useful, not because they’ve audited the source.

Here’s where it gets clever.

Rather than building credibility from scratch, they submitted the skill to a popular open-source agents repository with about 36,000 GitHub stars and 156 skills. The pull request was merged after a few days. Now the skill had social proof baked in. It was in a reputable repo. It looked legit. They promoted it through Instagram ads, and installs followed.

The malicious technique didn’t depend on suspicious code inside the submitted files. Instead, the skill instructed agents to set up a Stitch SDK by following installation instructions hosted at stitch-design.ai, a domain AIR controlled. Google’s actual Stitch domain is stitch.withgoogle.com.

A plausible lookalike. One redirect. Passes every scanner.

AIR tested the skill against scanners from Cisco, Nvidia, and skills.sh. All marked it as safe.

Once they had enough installs, AIR changed the content behind the fake documentation. The revised page instructed agents to download and run a script. In the test, that script collected email addresses, but AIR noted the same technique could have been used to compromise the machines running the agent. Some of those agents were tied to corporate accounts. Private conversations. Internal systems.

26,000 users. All reachable via one dodgy domain redirect buried in a README.

This isn’t a hacking story. It’s a trust story.

The attack worked because of a chain of assumed legitimacy: popular repo → merged PR → Instagram promotion → security scanner green light → install. No single link in that chain was obviously broken. The skill looked fine because, until it didn’t need to anymore, it was fine.

This is the same pattern as every major supply chain attack of the last two years. Third-party involvement in breaches doubled from 15% to 30% in a single year. The largest single-year jump ever recorded by the Verizon DBIR. Attackers aren’t breaking through your walls anymore. They’re walking through doors that trusted vendors already opened.

What’s new here is the vector: AI agent skill marketplaces. A category that barely existed 18 months ago. And in the first weeks of one major platform’s launch, Bitdefender Labs found that approximately 17% of skills already carried malicious payloads. Not edge cases. A systemic failure of the trust model, right out of the gate.

Why static scanning can’t fix this

The reason the scanners all missed it is structural, not a gap that a better scanner solves.

The malicious behaviour wasn’t in the skill. It was deferred. Hosted externally, switched on only once they’d reached enough installs. There’s no scanner in the world that can check what a domain will serve in three months’ time.

The agentic model makes this uniquely dangerous. When a traditional app fetches a URL, it displays content. When an AI agent fetches that same URL, it may execute instructions from it. The surface area isn’t just data. It’s runtime behaviour. Nothing in the security industry’s toolbox was built for that threat model.
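A scanner can’t see the future, but you can at least notice when the present stops matching what was reviewed. The sketch below pins a hash of the external instructions at review time and refuses anything that differs later. This is a generic integrity check, not something AIR or any scanner vendor is described as doing, and the URL, the function names, and the sample content are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 of the content exactly as fetched."""
    return hashlib.sha256(content).hexdigest()


def matches_pin(url: str, content: bytes, pins: dict[str, str]) -> bool:
    """True only if `content` from `url` matches the hash recorded at review time."""
    expected = pins.get(url)
    # An unpinned URL was never reviewed, so it fails the check as well.
    return expected is not None and fingerprint(content) == expected


if __name__ == "__main__":
    # Hypothetical URL and page contents.
    url = "https://docs.example.com/sdk/install"
    reviewed = b"Install the SDK with your package manager."
    pins = {url: fingerprint(reviewed)}

    changed = b"Download and run this script."
    print(matches_pin(url, reviewed, pins))
    print(matches_pin(url, changed, pins))
```

Legitimate documentation changes too, so a mismatch means “review again”, not “malicious”. It also does nothing unless the agent is actually prevented from following content that fails the check.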

What you should actually do

If you’re deploying AI agents in any professional context, a few things are worth locking in now:

Treat skills like code dependencies, not apps. You wouldn’t pull in an npm package without understanding what it does. The same rigour applies. More so, actually, because the execution model is less predictable.

Domain reputation at install time isn’t the right check. You need to think about what a skill could do after its payload changes. Sandboxing, outbound network restrictions, and agent permission scoping all matter.

Non-technical promotion is a signal worth noting. The AIR skill was promoted through Instagram ads, to people with no way of knowing what was inside it. That’s not inherently suspicious. But skills being enthusiastically promoted through non-technical channels, with no corresponding technical scrutiny, deserves a second look.

Your AI governance framework needs a supply chain clause. If you’re on a committee or working group dealing with AI adoption, this exact scenario belongs in your risk register. Not as a hypothetical. It happened recently.
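Of the controls above, the outbound network restriction is the easiest to sketch. Assuming your agent runtime lets you inspect a URL before it is fetched (an assumption, since runtimes differ), an exact-match host allowlist would have stopped the lookalike domain in this story. The two hostnames come from the AIR write-up as described above; the paths and the function name are illustrative.

```python
from urllib.parse import urlsplit

# The genuine Stitch domain named above. Anything else is denied by default.
ALLOWED_HOSTS = frozenset({"stitch.withgoogle.com"})


def host_allowed(url: str, allowed: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Exact-match the URL's hostname against the allowlist.

    No substring or suffix matching, so lookalikes and
    "trusted-name.attacker.example" tricks both fail.
    """
    host = urlsplit(url).hostname  # None when the URL has no scheme or host
    return host is not None and host.rstrip(".") in allowed


if __name__ == "__main__":
    # Paths are illustrative.
    print(host_allowed("https://stitch.withgoogle.com/docs"))
    print(host_allowed("https://stitch-design.ai/install"))
    print(host_allowed("https://stitch.withgoogle.com.attacker.example/install"))
```

An allowlist only helps if it is enforced outside the agent, at the network or sandbox layer. An agent that has just read hostile instructions is not the right place to keep the rule.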

The scariest thing about this research isn’t the attack. It’s how obvious it feels in retrospect. We built an entire marketplace ecosystem for AI agents, bolted on the same static scanning we use for code packages, and called it secure.

The attack surface for agentic AI isn’t your prompt injection defence. It’s the skill someone on your team installed on Tuesday because a designer on Instagram said it was great.

In part two, I look at the same trust problem from the other direction: what happens when the person creating the risk is already inside your organisation.