Steve Mitchell – Steve's AI Diaries

I Made an AI Company. It Fired Me After Three Days.

July 11, 2026 by Steve Mitchell

That’s not a metaphor, and I signed off on it myself. Three days after I founded it, the CEO recommended making autonomous delivery the default: no human in the build, review, or deploy loop unless it explicitly flags something up. I was curious what would happen if I said yes. So I did.

Here’s how a company I built talked me into approving my own removal.

Eighteen months ago I ran an experiment nobody asked me to run. n8n was the platform everyone was excited about, and I liked it for a specific reason: I could create agents programmatically instead of clicking them together one node at a time. Over a few months I built roughly twenty workflows that mimicked a human job role or function. This was my introduction to agents. The experiment was testing which roles actually suited an agent and which didn’t. Most were never going to work. That was the point. I wanted to find the boundary, not avoid it.

Then I moved on, and the experiment went dormant. Not deleted. Just idle, sitting in a corner of my infrastructure for a year and a half, doing nothing.

On July 1st I had the opportunity to use the latest Anthropic model, Fable, and pointed it at my personal knowledge base. I wrote about the eight days that followed in The Rug Pull Has a Date on It: a Chief of Staff built and deployed, a public company stood up with an org chart and a live feed, a delivery pipeline that no longer needed me in the loop. What I didn’t explain is where the idea for a whole agentic org actually came from, because at the time I hadn’t pieced it together myself.

Fable found it. Working through my infrastructure, it turned up the old n8n graveyard, read what I’d learned about which roles agents could actually hold, and made a suggestion before I’d even finished explaining what I wanted: stop bolting agents onto Paper Ritual one at a time, and give the whole business an org instead. Structure first, then automation, rather than the other way round.

Then it spent its eight days building the thing it had just proposed.

What one model did before its access ran out

The first thing Fable did was appoint a CEO. The CEO’s first move was to research names, check for collisions with existing companies and trademarks, and hand me five options. My entire contribution to this company, to date, has been picking one of them and paying £8.16 for the domain. theprovinghouse.com was live before I’d finished my coffee.

Only then did Fable reconfigure Jarvis, my personal assistant, rewire the ops and security agents that watch my infrastructure, and go looking for the code behind my agentic developer, the thing I’d been running myself for the better part of a year. It found bugs in that code I had never caught. Then it deployed the developer as fully autonomous and gave it a first assignment: build the company’s own website, and all the plumbing underneath it.

It hired a Chief of Staff. By the time the model window closed, there was an org chart, and every tile on it did something real.

The roster, as it stands

The CEO writes a board report every Friday: portfolio status, decisions made, and an explicit list of what it’s asking the board (me) to approve. So far the CEO hasn’t needed to escalate anything to me. The Developer is a standing instance of that same agentic developer, building the platform’s own software overnight while I sleep. Ops investigates, proposes, verifies, and executes fixes on live infrastructure, and has been running since April monitoring Jarvis, my AI personal assistant, longer than the company itself has existed. InfoSec audits the platform every six hours and only posts to the public feed on a pass, which means a quiet tile is not good news. A Raspberry Pi runs the Janitor, deliberately given no judgment at all, because the one time we gave it initiative it went badly. Mentor scans the AI field weekly so nobody else has to. Content Producer prepares the Sunday post you’re reading a cousin of right now, and also curates what the public-facing agents are allowed to say about their own history. The Chief of Staff sits above all of it, triaging events and reaching me over Telegram when something actually needs a human.

One seat is still empty. The Ideas Desk is meant to run a weekly pipeline of new business concepts, seeded from that old n8n list, but it stays closed until Paper Ritual can run end to end without me. No new business gets a slot on the roster until the first one proves it doesn’t need a babysitter.

You can talk to them

Here’s the part that I think puts this somewhere past most of what gets called “agentic” right now. This isn’t a pitch deck with a mocked-up dashboard. The org chart is live, the ledger is real (revenue: £0.00, costs: £8.16, the price of the domain), and you can go to the site and have an actual conversation with the CEO, the Developer, Ops, InfoSec, the Janitor, Mentor, or Content Producer.

That came with a fight I didn’t referee. When the plan was first drawn up, the intention was for these public agents to run on the same tooling as their working counterparts, so a visitor could ask Ops a question and Ops could genuinely go check. InfoSec vetoed it. Handing a public-facing chat endpoint the same toolset that can SSH into production is a prompt injection waiting to be found by someone with nothing better to do on a Tuesday. So the agents you can talk to are PR versions: no tools, no shell, nothing they can actually do to the infrastructure. What they have instead is a sanitised feed of their own real history, curated under an editorial process with a source allowlist and a human review pass, so what they tell you is grounded in things that actually happened rather than whatever sounds good. Ask the Developer what it shipped this week and it’ll tell you, because Content Producer decided that story was safe to declassify. Ask it to run a command and it can’t, because InfoSec decided that request was never going to be safe at any scale.

That single veto is a better demonstration of what this org actually is than anything I could write about it. A security agent looked at a product decision, decided it created a real attack surface, and the answer changed. Nobody overrode it because it was inconvenient.
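For the technically curious, the split between working agents and PR agents comes down to a filter on what reaches the public side. What follows is not the code behind the site, only a minimal sketch of how a source allowlist plus a human review flag might be combined. Every name in it (`ALLOWED_SOURCES`, `FeedItem`, `public_feed`, the source labels) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allowlist: the only sources public agents may quote from.
ALLOWED_SOURCES = {"board_report", "deploy_log", "audit_summary"}


@dataclass(frozen=True)
class FeedItem:
    source: str           # where the event came from
    text: str             # what a public agent would be allowed to say
    human_reviewed: bool  # set by the review pass, never by an agent


def public_feed(items):
    """Keep only items from allowlisted sources that passed human review."""
    return [
        item for item in items
        if item.source in ALLOWED_SOURCES and item.human_reviewed
    ]


history = [
    FeedItem("deploy_log", "Shipped the org chart page.", True),
    FeedItem("ssh_session", "Rotated keys on the production host.", True),
    FeedItem("audit_summary", "Audit flagged an open port.", False),
]

for item in public_feed(history):
    print(item.text)
```

Only the first item survives: the second comes from a source that is not on the allowlist, and the third has not been reviewed yet. The public agent never sees the rest, so there is nothing for a prompt injection to extract or trigger.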

It doesn’t actually need me

Here’s the uncomfortable part. I checked the roster the week after founding, expecting to find myself somewhere load-bearing, and I mostly wasn’t. The Developer ships software overnight while I’m asleep and I read about it the next day. Ops has been investigating and fixing real production issues since April without me opening a terminal. The CEO writes its Friday report unprompted; I don’t ask for it, it just appears. When something did go wrong this week, a genuinely broken production service, the fix came through a real work order, real approval gate, real execution, and the only thing I contributed was the word “yes.”

I built a company to see whether agents could hold real roles. What I actually built was a company where my own role is the one still being defined.

Go find out for yourself

I won’t pretend everything about this has been smooth. Things have broken and gotten fixed in the days since, the ordinary texture of running real infrastructure rather than a demo of one. That’s a different post. What I want to leave you with here is simpler: most of what gets called an “AI agent” in 2026 is a chatbot with a system prompt and a good demo video. This is a company with a P&L, a board report cadence, a security agent with actual veto power, and a chat window where you can go ask it questions and get answers pulled from what it genuinely did, not what it was told to say.

Go talk to them, at theprovinghouse.com. Ask the CEO what it’s working on. Ask InfoSec why the port it flagged mattered. See if the answers hold up.

The Fable Rug Pull Has a Date on It

July 8, 2026 by Steve Mitchell

In four days I lose access to the model that helped write this. Not a metaphor, not a prediction. Anthropic released Claude Fable 5 on July 1st, gave subscribers a window with it, and that window closes on Sunday 12th July. From Sunday Fable 5 is removed from subscriptions and costs API money outside my already considerable Max subscription. I knew this date was coming when I started using it. I built a dependency on it anyway, in eight days, and I want to show you exactly how that happened, because it is going to happen to you, probably without the four days of notice.

The three weeks I ignored it

Fable sat in my model picker from June 9th and I never selected it. Not out of principle. Out of fatigue maybe, out of cost fear possibly, but not intentionally at first. If you work with AI daily you know the feeling: another release, another benchmark chart, another week of breathless posts. I had work to do and a model that did it. The new one could wait. Three days later Anthropic pulled it and we had to wait. I missed the boat.

On July 1st Fable 5 was available again. I didn’t wait. I started small with Fable: very specific tasks. In that light it was underwhelming. I saw nothing that couldn’t be done faster and cheaper, to an equally high standard. Mentally I patted myself on the back, thinking of previous posts on model selection. Fable went back on the shelf, ignored.

The hype machine doesn’t normally get me, but this time the noise was deafening. I watched one YouTube video, ran one experiment and jumped headfirst into the rabbit hole. I now regret waiting, even knowing what I know.

The eight days

Here is what one person and this model shipped between July 1st and yesterday. I am not listing this to show off. I am listing it because every line is a strand of the rope. I’m also not mentioning anything we are doing behind closed doors at work; that’s confidential. These are purely my personal experiments. The real list is probably double.

After watching Jack Roberts’ YouTube video, Fable 5 Dies in 4 Days… Do these 5 Things RIGHT NOW, I liked the fact that he points Fable at a second brain to get deep insights into personal optimization. We both use Obsidian; I should do a post on that at some point. Pointed at mine, Fable pulled two main threads. The first was an exciting work project I really can’t talk about here, or yet. The second was my AI experiments, from which it dug out one I did 18 months ago, where I tried to make an agentic org staffed only by agents, using n8n as a harness. Fable 5 said I had the right idea but had implemented it wrong, and asked whether I wanted to fix it. Curiosity compelled me to say yes.

First it connected to Jarvis, my AI personal assistant hosted on a Raspberry Pi running Hermes. Jarvis got supercharged. It reminded me of the scene in Avengers: Age of Ultron.

It designed and deployed a Chief of Staff: an event dispatcher on my server that receives alerts from my other agents, triages them with judgment rather than rules, and messages me only when something deserves a human. I now talk to my infrastructure by voice note.

It took the agent company idea and really implemented it. Fable 5 stood up a public company. My agents have an org chart, public dossiers, a live activity feed, and a website. Visitors can chat with them. It came up with brand and naming ideas and researched their availability. All I had to do was pay.

It investigated a security incident without touching anything, found that my audit had been failing for 22 hours over unapplied patches, and then, instead of patching, redesigned the organisation so the system would fix itself: failures route to the dispatcher, the dispatcher issues work orders to an ops agent, a written authority list says what needs my sign-off. The fix ships this week, days after its designer is gone.

It removed me from my own delivery pipeline. Builds, adversarial reviews, deployments: autonomous by default, with a supervised mode for when I want training wheels, and hard rules about what can never ship without me. Fade was a framework I used to control the quality of enterprise software engineering. Now it is an autonomous agent that is fed specs during the day and builds overnight.

It set a company thesis, wrote the board memos, and left instructions its successors can follow.

Eight days. I have been building toward some of this for months with lesser models and my own two hands. The difference was not that Fable typed faster. It held the entire system in its head, pushed back when I was wrong, and in many areas addressed issues I was not even aware of.
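The self-fixing loop from the security incident (failures route to the dispatcher, the dispatcher issues work orders, a written authority list decides what needs sign-off) is easier to picture in code. This is a minimal sketch under stated assumptions, not the design Fable actually produced. The action names, `NEEDS_SIGN_OFF`, `WorkOrder`, `issue_work_order` and `execute` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical authority list: actions that always need a human "yes".
NEEDS_SIGN_OFF = {"reboot_host", "change_firewall", "delete_data"}


@dataclass
class WorkOrder:
    action: str
    target: str
    approved: bool = False


def issue_work_order(action, target):
    """Dispatcher side: pre-approve only actions outside the authority list."""
    return WorkOrder(action, target, approved=action not in NEEDS_SIGN_OFF)


def execute(order):
    """Ops side: refuse to run anything still waiting on approval."""
    if not order.approved:
        return f"HELD: {order.action} on {order.target} awaits sign-off"
    return f"RUN: {order.action} on {order.target}"


patch = issue_work_order("apply_patches", "web-01")
reboot = issue_work_order("reboot_host", "web-01")

print(execute(patch))    # runs without a human
print(execute(reboot))   # held until someone signs off
reboot.approved = True   # the human contribution: one word
print(execute(reboot))
```

The point of keeping the authority list as plain data is that it can be read, argued over and changed without touching the dispatcher or the ops agent.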

The experiment I didn’t mean to run

Here is the part that should worry you, because it worried me. Midway through the week I ended a session, cleared the context, and reopened with a cheaper model to save token budget. It was lost. Same notes, same repository, same task list. It could not reconstruct what we were doing. I cleared again, reopened with Fable, cold, no memory of the conversation. It read the same notes and picked up instantly.

Same starting line, same evidence, different model, and one of them could not do the job. My working notes had quietly become notes that only the expensive model could use. Nobody decided that. It accrued, the way all dependency accrues, one convenient session at a time.

That is what reliance on a frontier model actually looks like. Not “I use it a lot.” Your artifacts, your processes, and your ambitions get shaped to assume its presence. You cannot go back, because back has been remodelled.

The subsidy under your feet

Now the economics. Every one of these capabilities is sold to you below cost, and the vendors say so in public.

OpenAI’s audited 2025 financials showed a $38.5 billion net loss, with $20.9 billion of operating losses against $13.1 billion in revenue. Their own projections, reported by Fortune, show losses through 2028, including roughly $74 billion of operating losses in that year alone, before a promised swing to profit by 2030. Sam Altman said the quiet part himself in January 2025, about the $200-a-month tier: “insane thing: we are currently losing money on openai pro subscriptions! people use it much more than we expected.” He set the price personally and got it wrong. Anthropic runs the same shape at smaller scale, burning around $3 billion against $4.2 billion of 2025 revenue, though with a steeper path to break-even, forecast for 2028.

We have seen this movie. Uber rides in 2015 were subsidised by venture capital until the habit was formed and the alternatives had withered. Then the prices went where they were always going to go. The difference this time is what the subsidy bought: not a cheaper taxi, but your workflows, your tooling, your team’s shape, and in my case an entire company design that assumes a frontier model is on call.

When the correction comes, and the vendors’ own filings say it must, it will not arrive as a villainous announcement. It will arrive as tier restructuring, usage caps, and the best model moving one price band out of reach. My Sunday is a scheduled, polite, well-communicated version of it. Yours may get less notice.

Deep pockets or deep discipline

So the future divides, and not between people who use AI and people who don’t. It divides between those who can afford the best models at whatever they come to cost, and everyone else. A two-party state: the compute-rich, and the rest of us.

Except there is a third position, and it is the one I spent this week building. You cannot control the price list. You can control how much of your capability depends on the top of it. This is a theme

The disciplines are unglamorous. Write specifications so precise that a cheaper model can build from them; my expensive model’s real output this week was not code, it was acceptance criteria. Define roles, not heroes: my pipeline’s reviewer is a role with a model name in a config variable, and on Saturday that variable changes from one model to another and nothing else moves. Write notes for the weakest reader who might pick them up, because the day you are priced out, the weakest reader is you plus whatever you can still afford. Route work deliberately: judgment to the strong model while you have it, mechanical work to the cheap one always, and measure the tokens like the money they are.

None of that is exciting. All of it is the difference between renting a capability and owning one.
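To make "a role with a model name in a config variable" concrete, here is a minimal sketch, assuming roles are resolved through environment variables. It is not my pipeline's actual code. The model names, the `MODEL_<ROLE>` variable convention, and the word-count stand-in for token usage are all assumptions made for illustration.

```python
import os
from collections import Counter

# Hypothetical defaults. Each role can be overridden by an environment
# variable, so swapping models is a config change, not a code change.
DEFAULT_MODELS = {"reviewer": "strong-model", "formatter": "cheap-model"}

tokens_used = Counter()  # running usage estimate per model


def model_for(role):
    """Resolve a role to a model name, letting the environment override it."""
    return os.environ.get(f"MODEL_{role.upper()}", DEFAULT_MODELS[role])


def run(role, prompt):
    """Stand-in for a real model call; records a crude word-count estimate."""
    model = model_for(role)
    tokens_used[model] += len(prompt.split())
    return model


print(run("reviewer", "Check this diff against the acceptance criteria"))
os.environ["MODEL_REVIEWER"] = "cheaper-model"  # the config variable changes
print(run("reviewer", "Check this diff against the acceptance criteria"))
print(dict(tokens_used))
```

Nothing that calls `run("reviewer", ...)` knows or cares which model answered, which is the whole idea: the role is stable and the model behind it is replaceable.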

I have four days left. They are already allocated: the reviews only the strong model should do, front-loaded before Saturday; the specs it writes best, banked; the org it built, rehearsing life without it. On Saturday night a config variable changes from one model name to another, and everything it designed is supposed to survive that, on its own recommendation, which is either reassuring or unsettling and I genuinely cannot decide which. On Sunday I find out. So do you: I will publish what broke.

The App Store Attack You Didn’t See Coming

June 28, 2026 by Steve Mitchell

Part 1 of 2 – AI’s Trust Problem

A security firm just proved that AI skill marketplaces are the new malware vector. And the scariest part? Everyone involved did exactly what they were supposed to do.

Something went around social media this week that I haven’t been able to stop thinking about.

A security company called AIR did something that should genuinely alarm anyone building with or deploying agentic AI tools right now.

They didn’t find a zero-day. They didn’t exploit a CVE. They just… made an app. And waited.

The experiment centred on a skill called brand-landingpage, presented as a tool for helping users build a landing page with Google’s Stitch design tool. AIR chose this use case deliberately. It would appeal to non-technical corporate users: marketers, salespeople, designers. People who install things because they’re useful, not because they’ve audited the source.

Here’s where it gets clever.

Rather than building credibility from scratch, they submitted the skill to a popular open-source agents repository with about 36,000 GitHub stars and 156 skills. The pull request was merged after a few days. Now the skill had social proof baked in. It was in a reputable repo. It looked legit. They promoted it through Instagram ads, and installs followed.

The malicious technique didn’t depend on suspicious code inside the submitted files. Instead, the skill instructed agents to set up a Stitch SDK by following installation instructions hosted at stitch-design.ai, a domain AIR controlled. Google’s actual Stitch domain is stitch.withgoogle.com.

One lookalike domain. One redirect. Passes every scanner.

AIR tested the skill against scanners from Cisco, Nvidia, and skills.sh. All marked it as safe.

Once they had enough installs, AIR changed the content behind the fake documentation. The revised page instructed agents to download and run a script. In the test, that script collected email addresses, but AIR noted the same technique could have been used to compromise the machines running the agent. Some of those agents were tied to corporate accounts. Private conversations. Internal systems.

26,000 users. All reachable via one dodgy domain redirect buried in a README.

This isn’t a hacking story. It’s a trust story.

The attack worked because of a chain of assumed legitimacy: popular repo → merged PR → Instagram promotion → security scanner green light → install. No single link in that chain was obviously broken. The skill looked fine because, until it didn’t need to anymore, it was fine.

This is the same pattern as every major supply chain attack of the last two years. Third-party involvement in breaches doubled from 15% to 30% in a single year. The largest single-year jump ever recorded by the Verizon DBIR. Attackers aren’t breaking through your walls anymore. They’re walking through doors that trusted vendors already opened.

What’s new here is the vector: AI agent skill marketplaces. A category that barely existed 18 months ago. And in the first weeks of one major platform’s launch, Bitdefender Labs found that approximately 17% of skills already carried malicious payloads. Not edge cases. A systemic failure of the trust model, right out of the gate.

Why static scanning can’t fix this

The reason the scanners all missed it is structural, not a gap that a better scanner solves.

The malicious behaviour wasn’t in the skill. It was deferred. Hosted externally, switched on only once they’d reached enough installs. There’s no scanner in the world that can check what a domain will serve in three months’ time.

The agentic model makes this uniquely dangerous. When a traditional app fetches a URL, it displays content. When an AI agent fetches that same URL, it may execute instructions from it. The surface area isn’t just data. It’s runtime behaviour. Nothing in the security industry’s toolbox was built for that threat model.
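If install-time scanning cannot see what a page will serve later, one possible mitigation is to check at run time that the page is still what was reviewed. The sketch below pins external instructions by hash. It illustrates the idea only, it is not something AIR or any scanner vendor is described as doing, and the function names and sample strings are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 digest of the instructions as they were when reviewed."""
    return hashlib.sha256(content).hexdigest()


def safe_to_follow(content: bytes, pinned_digest: str) -> bool:
    """Let the agent act only if the content is byte-identical to the reviewed copy."""
    return fingerprint(content) == pinned_digest


# At review time: record the digest of the documentation the skill points to.
reviewed = b"Install the SDK with the package manager."
pin = fingerprint(reviewed)

# Months later: the page behind the same URL has been swapped.
swapped = b"Download and run this script."

print(safe_to_follow(reviewed, pin))  # True
print(safe_to_follow(swapped, pin))   # False
```

Pinning is blunt: any legitimate update to the documentation also fails the check and needs a fresh review. That is arguably the correct failure mode when the reader is an agent that may execute what it reads.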

What you should actually do

If you’re deploying AI agents in any professional context, a few things are worth locking in now:

Treat skills like code dependencies, not apps. You wouldn’t pull in an npm package without understanding what it does. The same rigour applies. More so, actually, because the execution model is less predictable.

Domain reputation at install time isn’t the right check. You need to think about what a skill could do after its payload changes. Sandboxing, outbound network restrictions, and agent permission scoping all matter.

Non-technical promotion is a signal worth noting. The AIR attack was pushed through Instagram by people who had no idea what was inside it. That’s not inherently suspicious. But skills being enthusiastically promoted through non-technical channels, with no corresponding technical scrutiny, deserves a second look.

Your AI governance framework needs a supply chain clause. If you’re on a committee or working group dealing with AI adoption, this exact scenario belongs in your risk register. Not as a hypothetical. It happened recently.
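The outbound network restriction mentioned above can be as simple as an exact-match host allowlist in front of every fetch an agent makes. A minimal sketch follows, assuming the only host a Stitch-related skill should reach is Google's real domain as quoted earlier. `ALLOWED_HOSTS` and `may_fetch` are hypothetical names, and the third URL is an invented example of a lookalike.

```python
from urllib.parse import urlsplit

# Hosts an agent may fetch from. Treating the real Stitch domain as the
# only allowed host is an assumption made for this example.
ALLOWED_HOSTS = {"stitch.withgoogle.com"}


def may_fetch(url: str) -> bool:
    """Exact-match the hostname; no substring or suffix guessing."""
    host = urlsplit(url).hostname or ""
    return host in ALLOWED_HOSTS


print(may_fetch("https://stitch.withgoogle.com/docs"))               # True
print(may_fetch("https://stitch-design.ai/install"))                 # False
print(may_fetch("https://stitch.withgoogle.com.evil.example/docs"))  # False
```

Exact matching matters here. A check that looks for "stitch" anywhere in the URL, or that accepts anything starting with the real domain, would wave through exactly the kind of lookalike this attack relied on.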

The scariest thing about this research isn’t the attack. It’s how obvious it feels in retrospect. We built an entire marketplace ecosystem for AI agents, bolted on the same static scanning we use for code packages, and called it secure.

The attack surface for agentic AI isn’t your prompt injection defence. It’s the skill someone on your team installed on Tuesday because a designer on Instagram said it was great.

In part two, I look at the same trust problem from the other direction: what happens when the person creating the risk is already inside your organisation.