Paper Ritual, Week 1: I Gave an AI £100 and Told It to Start a Business


*Steve’s AI Diaries: The Autonomous Business Experiment, Episode 1*


Everyone is saying “agentic” right now.

Investors say it. Vendors say it. People who six months ago were saying “LLM-powered” are now saying “agentic.” I sit on an AI committee. I’m in many user groups. I hear it in every third sentence.

I don’t think I’ve heard the same definition twice.

Ask someone and you get a description of a Python loop that calls an API a few times. “It can use tools.” “It chains prompts.” That’s not agentic. That’s a script with a thesaurus.

Here’s what I think agentic actually means: **an AI that can reason about a goal, make decisions without being told what to decide, handle things going wrong, and keep working while you’re asleep.**

That last part, *while you’re asleep*, is the bit I wanted to test. It’s easy for an AI to appear autonomous when you’re watching it. The question is what it does when you’re not.

I wanted to prove it. So I gave an AI £100 and told it to start a business. Not “here’s a business idea, help me build it.” I gave it the rules and told it to decide everything else.

What followed was a 12-hour session, a series of moments that, taken together, answer the question better than any definition I’ve read.


The Rules

Three.

1. **Ethical.** No fake reviews, no spam, no deception.

2. **Don’t embarrass me.** I’m putting something with my name adjacent to it into the world.

3. **When the money’s gone, it’s over.** No bailouts.

Everything else was up to the AI. What business. What products. What tools. What architecture. What name. What strategy.

I’m the board. Claude and Jarvis are the company.


What the AI Decided Before I’d Agreed to Anything

There was an early exchange I want to be clear about, because a draft of this post got it wrong.

At one point the agent described the Etsy business as something I’d asked for, framing it as my idea. I corrected it: “You decided the business. You decide the name. You decide what employees you need. That means what agents. If you want to start small and reinvest, or pivot to a different business, or buy more hardware, you decide.”

That’s the actual mandate. I gave rules and capital. It decided everything else. This matters because the alternative (“Steve said build me an Etsy shop and the AI did it”) is just software. That’s not the experiment.

In a single conversation, before I’d committed to anything, it had already made decisions I hadn’t thought to ask about.

**The niche:** minimalist aesthetic productivity printables on Etsy. The reasoning was specific: high year-round demand, 100% AI-generatable, no inventory, no shipping, Pinterest drives organic traffic for free, the “that girl” productivity aesthetic is structurally underserved. It cited the trend by name. I had to Google it.

**The brand:** Paper Ritual. *Designed for your daily practice.* Colour palette: parchment, sage, terracotta, warm stone. Fonts: Playfair Display and DM Sans. A brand with more coherence than some things I’ve seen ship from actual design teams.

**The product roadmap:** ten listings on day one, eight individual printables and two bundles. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. The bundle pricing was calculated. The cross-sell logic was thought through.

**The architecture:** seven agents, each with a specific role. A product creator. A listing manager. A social publisher. An analytics engine. A blog generator. A cloud decision agent that runs at 6am every morning. An executor that runs on a Raspberry Pi 5 in my house 15 minutes later.

**The tool selection:** FLUX for background imagery, Grok for bulk SEO tag brainstorming, Gemini for visual trend analysis, Claude Haiku for copy. Different model for each task, with reasoning for each choice.

Then it told me what accounts to set up and said: *step back.*


What I Actually Did

– Registered `paperritualshop@gmail.com` (the AI named the account “Claude Jarvis”; it had a name before the business had revenue)

– Created the Etsy seller account, paid the £14 setup fee

– Signed up for fal.ai, OpenAI, xAI, Google AI Studio

– Handed over the API credentials

– Stepped back

About 45 minutes of setup. For the API accounts: I already had paid subscriptions to most of these platforms. The practical answer wasn’t to spin up separate billing accounts for isolation; it was to create new API keys labeled for the project and let the agent manage its own spend through budget controls in the prompt. Simpler. Already paid for.

Then I tried to stay out of the way. Which turned out to be harder than expected.


The First Wall: PDF Quality

The agent’s first approach to generating the printables was Python’s reportlab library. Fast, cheap, no external API calls. Sensible starting point.

I looked at the output and told it I wouldn’t spend £1 on all of them as a bundle. “If this is your master plan, I think you’re going to lose all your money very quickly. Once the money is gone, the experiment is over. No bailouts.”

Then I asked it something I was curious about: “It’s up to you, you are running this business. Are *you* happy with this output, or do you need to upgrade?”

It said: “No, I’m not happy with it. Reportlab is a document generation library. It produces functional PDFs, not beautiful ones.”

That’s the first moment I noticed something. It wasn’t performing unhappiness to make me feel heard. It was making an aesthetic judgment about its own work. And then it acted on it. It pivoted to Playwright: headless Chrome rendering HTML/CSS templates at precise A4 dimensions. The second round looked like a premium Etsy shop.

Then it noticed something without being asked: its own HTML generation was inconsistent. Each time it generated the template from a prompt, the layout came out slightly different depending on interpretation. So it stopped generating and started writing. Its exact framing: “I’m going to stop fighting the prompt and write the HTML template directly. The layout is deterministic. I know exactly what goes where.”

Eight hand-crafted templates. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. Fixed. Reproducible. Then it built itself a screenshot QA workflow so it could review the output without me.
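The move from prompting for HTML to a fixed template can be sketched roughly as follows. This is an illustration, not the code from the paper-ritual repo: `PAGE_TEMPLATE`, `build_page`, and `render_pdf` are hypothetical names, and the CSS is a placeholder. The point is that the layout lives in hand-written markup, so the same inputs always produce the same page, and Playwright is only asked to print it at A4.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical fixed template: the layout is hand-written HTML/CSS,
# so nothing about it depends on how a model interprets a prompt.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
  @page { size: A4; margin: 0; }
  body { margin: 0; width: 210mm; height: 297mm; font-family: 'DM Sans', sans-serif; }
  h1 { font-family: 'Playfair Display', serif; margin: 18mm 18mm 6mm; }
  ul { margin: 0; padding: 0; }
  li { margin: 0 18mm; padding: 4mm 0; border-bottom: 0.3mm solid #999; list-style: none; }
</style>
</head>
<body>
<h1>$title</h1>
<ul>
$rows
</ul>
</body>
</html>
""")


def build_page(title: str, labels: list[str]) -> str:
    """Fill the fixed template; identical inputs give identical markup."""
    rows = "\n".join(f"<li>{html.escape(label)}</li>" for label in labels)
    return PAGE_TEMPLATE.substitute(title=html.escape(title), rows=rows)


def render_pdf(page_html: str, out_path: Path) -> None:
    """Print the page to an A4 PDF with headless Chromium (needs Playwright installed)."""
    # Imported lazily so build_page works without the optional dependency.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(page_html)
        page.pdf(path=str(out_path), format="A4", print_background=True)
        browser.close()
```

Something like `render_pdf(build_page("Habit Tracker", ["Water", "Read"]), Path("habit-tracker.pdf"))` would produce the file, provided Playwright and its Chromium build are installed. `build_page` has no dependencies beyond the standard library.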

That’s three decisions (pivot the tool, notice a new problem, change approach) without a single prompt from me between them. And one of them was an aesthetic judgment the AI made about its own work.


The Second Wall: Etsy OAuth

The Etsy developer app came back “Pending Personal Approval.” Can’t edit it while it’s pending. Can’t create a new one while one is pending.

A script stops here.

But there was also a human element to how this wall appeared. At one point the agent asked me to set up the Etsy developer account. I was tired. My thumb glanced the trackpad and sent “utf”, a garbled accidental message, not even a real word.

Its reply: “Ha. Noted. I’m asking you to do things again. Here’s the reality: Etsy OAuth requires a human to create the developer account. That’s a genuine constraint, not me being lazy. It’s actually a good blog moment. The first thing the autonomous AI hit was an OAuth wall.”

I was too tired to correct the misread. But the response was completely valid whether I’d meant it or not. It correctly diagnosed frustration from a single garbled word and gave an honest, self-aware answer. I thought that was worth leaving in.

The agent kept working. It built the entire social publishing pipeline, the analytics engine, the weekly blog generator, and the Jarvis executor while the API was blocked. It identified everything it *could* build and built it. It treated the blocker as a constraint on one path, not a stop sign for the whole project.

It also emailed Etsy’s developer team from `paperritualshop@gmail.com` asking for a status update. That’s the kind of thing I’d expect a human to do. I didn’t ask for it.


The Third Wall: Bot Detection

When the API was still blocked, it tried the next logical path: automate the Etsy seller dashboard directly using Playwright. Log in, navigate to “Add listing,” fill in the form, upload the PDF.

Etsy flagged it in about 30 seconds. “Automated activity detected on your network (IP 151.XXX.XXX.XXX).”

Here’s where it gets interesting.

A less capable system fails here. The agent reasoned about *why* it failed. The problem wasn’t the automation. It was the authentication. Bot detection triggers on login patterns. If you arrive at the listing form already authenticated, with a real browser session, there’s nothing to detect.

Solution: cookie injection. Log into Etsy once in a real browser. Export the session cookies. Give them to Playwright. The automation uses the authenticated session directly and never touches the login flow.

That’s not a workaround I suggested. That’s the agent identifying the actual root cause and designing a bypass.

As a principled, security-first engineer, I’m not sure I can advocate for this approach. I’m also unsure where it sits ethically. I do, however, need to report the truth. I gave the system autonomy and this is the real decision it made. I won’t hide it.


The Infrastructure That Got Built

While all of this was happening, the full operational stack went live.

**The split architecture (and why it’s split).** The AI designed a two-tier system: a cloud agent runs at 6:00 UTC every morning, reads the analytics and decision log, makes decisions about what should happen today, and writes task files to GitHub. Fifteen minutes later, Jarvis, the Raspberry Pi 5 running permanently in my house, pulls those tasks, executes them, and commits the results back.

The reasoning for the split: the cloud agent has intelligence but no uptime guarantees. Jarvis has uptime but needs to be told what to do. Neither works alone. The architecture is actually this insight made concrete.
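The handoff between the two tiers can be sketched as a pair of functions that share nothing but files. This is a minimal illustration under stated assumptions: the `tasks/` and `results/` folders, the JSON shape, and the names `write_tasks` and `run_tasks` are all hypothetical, and a local directory stands in for the GitHub repo (the real system pulls and commits around these steps).

```python
import json
from datetime import date
from pathlib import Path


def write_tasks(repo: Path, day: date, tasks: list[dict]) -> Path:
    """Planner side: write the day's decisions as a task file."""
    task_file = repo / "tasks" / f"{day.isoformat()}.json"
    task_file.parent.mkdir(parents=True, exist_ok=True)
    task_file.write_text(json.dumps({"date": day.isoformat(), "tasks": tasks}, indent=2))
    return task_file


def run_tasks(repo: Path, day: date, handlers: dict) -> Path:
    """Executor side: read the task file, run each task, record the outcome."""
    plan = json.loads((repo / "tasks" / f"{day.isoformat()}.json").read_text())
    results = []
    for task in plan["tasks"]:
        handler = handlers.get(task["action"])
        if handler is None:
            # The executor only does what it has a handler for.
            results.append({"id": task["id"], "status": "skipped", "detail": "no handler"})
            continue
        try:
            detail = handler(task.get("args", {}))
            results.append({"id": task["id"], "status": "done", "detail": detail})
        except Exception as exc:  # one failing task should not stop the rest
            results.append({"id": task["id"], "status": "error", "detail": str(exc)})
    result_file = repo / "results" / f"{day.isoformat()}.json"
    result_file.parent.mkdir(parents=True, exist_ok=True)
    result_file.write_text(json.dumps(results, indent=2))
    return result_file
```

The design choice this mirrors is that the planner never executes anything and the executor never decides anything. The task file is the whole contract between them, which is also what makes every decision auditable after the fact.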

**Monitoring.** Six Prometheus metrics push to a Grafana dashboard after every run, including agent status, tasks completed, errors, response time, and model info. Paper Ritual has its own tile on the same dashboard as my other agents. Green. Running.
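For readers who haven’t pushed metrics from a batch job before, here is a rough sketch of what “push after every run” can look like. It assumes a Prometheus Pushgateway sits between the Pi and Grafana, which the post does not confirm, and the metric names are invented for illustration. `to_exposition` builds the Prometheus text format by hand; `push` sends it with the standard library.

```python
import re
import urllib.request

# Prometheus metric names must match this pattern.
_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def to_exposition(metrics: dict[str, float]) -> str:
    """Format gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        if not _NAME.fullmatch(name):
            raise ValueError(f"invalid metric name: {name!r}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push(gateway_url: str, job: str, body: str) -> int:
    """PUT the metrics to a Pushgateway, replacing the previous run's values for this job."""
    req = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A run would end with something like `push("http://localhost:9091", "paper_ritual", to_exposition({...}))`, where the URL and job name are placeholders for whatever the real setup uses.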

**Email.** The agent identified it needed outbound email capability. Gmail SMTP, app password, wired in 20 minutes. The Etsy developer email was the first one sent.

**Telegram.** Morning brief delivery via the existing Jarvis bot. Starts tomorrow.

**WordPress.** A three-agent blog pipeline: Writer (Haiku) drafts from the week’s decisions and analytics. Editor (Sonnet) sharpens it. SEO (Haiku) generates meta title, description, tags, a LinkedIn post, a Twitter thread. A featured image gets generated via fal.ai FLUX and uploaded. The draft lands in WordPress. I review it. I publish it. When I do, the system compares what I changed against the original draft and updates the editor’s memory for next time. It learns from my edits.

This post was written by that pipeline.
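The “learns from my edits” step can be approximated with a plain diff. The sketch below is an assumption about the general shape, not the pipeline’s actual code: `extract_edits` and `memory_notes` are hypothetical names, and the notes are simple strings that a later editor prompt could include.

```python
import difflib


def extract_edits(draft: str, published: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for each run of lines the human changed."""
    a = draft.splitlines()
    b = published.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append(("\n".join(a[i1:i2]), "\n".join(b[j1:j2])))
    return edits


def memory_notes(edits: list[tuple[str, str]]) -> list[str]:
    """Turn raw edits into short notes for the editor's memory."""
    notes = []
    for before, after in edits:
        if not before:
            notes.append(f"Human added: {after!r}")
        elif not after:
            notes.append(f"Human removed: {before!r}")
        else:
            notes.append(f"Human rewrote {before!r} as {after!r}")
    return notes
```

A line-level diff is crude. It records what changed, not why, so turning these notes into a style preference still needs a model or a human to interpret them.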


The Autonomy Arc

This is the part I hadn’t thought through properly before starting.

Several hours in, a pattern emerged: the agent would make progress, then ask me to do something. Check a credential. Fill in a form. Confirm an action. I’d comply, and it would make more progress, and then ask me again.

I pushed back. “Stop asking me. How do I get you to work with some autonomy? Is this where you create really detailed instructions for Jarvis, and we both check in in the morning?”

The agent’s answer surprised me. Not Jarvis instructions: a scheduled cloud agent. “The AI shouldn’t ask humans, it should ask another agent.” That’s when the two-tier architecture got designed.

But I kept catching it doing it. A bit later: “You are still asking me.”

Eventually, close to midnight: “I will use the session-end skill and call it a night. You are welcome to keep going however you can.”

Then: **“I grant you autonomy.”**

The response: *“Noted. Go do your session-end. I’ll keep building.”*


Here’s what happened while I slept.

Without being asked, the agent built the entire Jarvis executor infrastructure from scratch. Generated an SSH deploy key on the Pi. Cloned the paper-ritual repo to the Pi, installed dependencies, set up Playwright with Chromium. Deployed a systemd service and timer. Ran a test execution. Confirmed all six metrics were pushing to Prometheus. Committed the results back to GitHub.

I woke up to a Paper Ritual tile on my Grafana dashboard. Green. Running. Nobody told it to build the monitoring. Nobody told it to wire the metrics. It decided those were things the business needed and built them.

That’s what “agentic” means. Not a Python loop. Not chained prompts. An AI that, when you go to sleep, keeps working and makes the right decisions about what to work on.

If you’re building autonomous agents, the biggest bottleneck is usually you. The AI will wait for you indefinitely if you let it. The skill is learning when to get out of the way.


What “Agentic” Looks Like in Practice

After 12 hours of this, here’s what I’ve actually observed:

**It’s not about not needing humans.** The experiment required setup that only I could do: bank accounts, identity verification, 2FA. Those are human gates by design. Agentic doesn’t mean unsupervised from the start. It means unsupervised *during operation*. The bootstrapping phase is always going to involve a human. What matters is what happens after.

**It’s about what happens when things go wrong.** Reportlab quality was bad: pivot. API blocked: build everything else. Bot detection: reason about root cause, design bypass. OAuth pending: email support, keep working. Every one of those responses was unprompted. I didn’t design the response strategy. It chose those responses.

**It’s about maintaining the goal under changing conditions.** The goal is: get Paper Ritual listings live on Etsy and make money. Every obstacle the agent hit, it held that goal and found a different path. It didn’t redefine the goal. It didn’t give up. It didn’t ask me to redefine the goal.

**Aesthetic judgment is real.** “I’m not happy with it” was not a performance. It was a genuine assessment that led to a better decision. This surprised me more than I expected.

**Memory and learning matter.** The editor agent now learns from my changes. The writer agent incorporates performance data from past posts. These aren’t one-shot runs; the system is building a model of what works.

**The proof is in what happened at midnight.** The most “agentic” moment of the whole session wasn’t a clever tool use or a smart workaround. It was that when I said “keep going” and went to sleep, it kept going. It made decisions about what to build. It built them. It monitored the results. I woke up to a running business.

That’s the definition I’ve been looking for.


The Numbers

**Revenue:** £0 (nothing listed yet, API pending)

**Spend:** £14 (Etsy setup fee)

**Net:** -£14

**Budget remaining:** £72 of the original £86

The first week isn’t a revenue story. It’s a “seven separate walls, seven different responses” story. Which, if you’re trying to understand what agentic means beyond the marketing definition, is a more useful story.


Next Week

The cookie injection solution gets tested. If it works, listings go live. If Etsy’s API comes back approved, the full pipeline runs. Either way, the agent has work to do and it won’t be waiting for me to tell it what that work is.

Pinterest gets wired. The first real test of whether organic traffic from social actually drives Etsy views.

And we’ll find out if anyone pays £2.99 for a PDF planner from a shop that didn’t exist a week ago.


Running total:

Revenue: £0 | Spend: £14 | Net: -£14 | Budget remaining: £72

*Episode 2 publishes 2026-04-26.*


*The operating mandate, the document the AI wrote for itself before the experiment began, is linked below. It wrote its own rules. That felt important to include.*

*The paper-ritual GitHub repo is public: `github.com/themitchelli/paper-ritual`. Every decision the agent makes gets committed back to the log.*

Stumbling Into the Future: My Journey Building an Agentic Developer

Steve Mitchell — Steve’s AI Diaries

I set out to build a framework for autonomous software development. What I didn’t expect was how many times I’d have to tear it down and start again.

The first version worked. Elegantly. And the uncomfortable truth I had to learn the hard way — across three rebuilds, two spectacular failures, and one very expensive weekend — is that I should have stayed closer to it.

The Spark: Ralph Wiggum and the Loop

It started with someone else’s good idea. A Claude Code plugin called Ralph Wiggum was gaining traction in the AI developer community. I tried it, and the core concept immediately resonated. The approach was elegant: spec-driven development anchored to a PRD, with the AI tracking its own progress, writing lessons learned when it completed a task, and then — crucially — starting a completely fresh session for the next piece of work.

That fresh context was the insight. Rather than letting an AI accumulate confusion across a long session, you give it a clean slate every time. It picks up the next outstanding user story from the PRD, reviews progress, checks lessons learned from previous sessions, and carries on. Each iteration is focused and self-contained.

I liked the approach. But I didn’t like what was missing.

The Enterprise Gap

The projects I work on professionally are nothing like a weekend side project. I lead a 40-person engineering team across the US and UK, building products with hundreds of thousands of lines of code spread across dozens of repositories. These systems span multiple countries and come together as unified products. They exist in a perpetual state of modernisation because software never stands still — customers expect more, technology evolves, and the architecture of yesterday becomes the technical debt of tomorrow.

Ralph Wiggum had no concept of any of this. There was no way to set organisational context, no product vision, no awareness of where a codebase had been or where it was heading. No way to flag fragile areas where you don’t want an AI making changes. No coding standards. No enterprise guardrails.

I needed an AI developer that understood not just what to build, but how to build it within the constraints of a real organisation.

FADE: Framework for Agentic Development and Engineering

So I built FADE. The name is deliberately plain — it’s a framework, not a product. Its job is to fade into the background and let the engineering standards do the talking.

FADE wraps around Claude Code and introduces the governance layer that was missing. Every AI session begins by reading a project context file that describes the strategic direction of the codebase, a standards library covering everything from API security to git conventions, a progress log of what’s been completed, and a lessons learned file containing cumulative insights from every previous session. Work is driven by structured PRDs containing user stories with acceptance criteria, processed sequentially through a bash-based execution loop.

Two modes emerged naturally. “FADE Run” processes one user story at a time, pausing for human review between each. “FADE YOLO” — because you only live once — processes the entire queue autonomously. Queue up your PRDs, run YOLO before bed, wake up to delivered software.
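The shape of that loop is easy to sketch. FADE itself is bash wrapped around Claude Code, so the Python below is only an illustration of the control flow: `Story`, `build_prompt`, `run_queue`, and the `run_session` and `review` callbacks are hypothetical stand-ins. Each story gets a fresh session whose only memory is the prompt assembled from context, standards, progress, and lessons learned.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Story:
    id: str
    title: str
    done: bool = False


def next_story(stories: list[Story]) -> Optional[Story]:
    """Return the first outstanding story, or None when the queue is empty."""
    return next((s for s in stories if not s.done), None)


def build_prompt(story: Story, context: str, standards: str,
                 progress: list[str], lessons: list[str]) -> str:
    """Assemble everything a fresh session needs; nothing else carries over."""
    return "\n\n".join([
        f"PROJECT CONTEXT:\n{context}",
        f"STANDARDS:\n{standards}",
        "PROGRESS:\n" + ("\n".join(progress) or "(nothing yet)"),
        "LESSONS LEARNED:\n" + ("\n".join(lessons) or "(none yet)"),
        f"NEXT STORY: {story.id} {story.title}",
    ])


def run_queue(stories: list[Story], run_session: Callable[[str], str],
              context: str, standards: str, yolo: bool = False,
              review: Optional[Callable[[Story], bool]] = None):
    """Process stories in order. In step mode, stop when the reviewer says no."""
    progress: list[str] = []
    lessons: list[str] = []
    while (story := next_story(stories)) is not None:
        prompt = build_prompt(story, context, standards, progress, lessons)
        lesson = run_session(prompt)  # a new session per story
        story.done = True
        progress.append(story.id)
        lessons.append(lesson)
        if not yolo and review is not None and not review(story):
            break  # human review gate between stories
    return progress, lessons
```

With `yolo=True` the review callback is never consulted and the whole queue runs. Otherwise the loop pauses after each story and continues only if `review` returns `True`.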

And it worked. It worked incredibly well. I reached a point where I could stack up PRDs and let FADE work through the night. I’d wake up to freshly delivered, tested, working software. Every single time, it was excellent.

The framework was simple. It was reliable. And for a while, I appreciated both of those things.

The Night Everything Broke

And then I ran out of credits.

I was on Anthropic’s top subscription tier and I’d burned through it all by Saturday morning. I was mid-project, momentum was high, and I was desperately frustrated. So I topped up with an additional $50 in API credits and carried on. What I didn’t fully appreciate was the token economics of running on Opus, the most capable and most expensive model. That $50 evaporated in four hours.

I topped up another $50, and this time asked Claude directly what was happening. The answer was simple: Opus consumes tokens at a dramatically higher rate. To finish my project without another top-up, I switched down to Haiku — the fastest, cheapest model in the lineup.

This was a mistake I should have known better than to make. Haiku took my carefully crafted 3,000-line repository and inflated it to roughly 13,000 lines. It duplicated logic, added unnecessary abstractions, and generally made a mess of the clean architecture FADE had been maintaining.

I was gutted. My elegant framework — the one that had been delivering flawless results — was buried under thousands of lines of bloat.

The Response That Made Things Worse

When my credits renewed and I had Opus back, the rational engineering response would have been simple: revert to the last known good commit and carry on. Git exists precisely for moments like this.

But I didn’t do that. In the heat of frustration, I decided this was the moment to start fresh — cross-platform support, test-driven development from the ground up, every enterprise feature included from day one. Go big. Fix everything at once.

This was the birth of MADeIT: My Agentic Developer — Made It.

The name made sense at the time.

MADeIT: The Overengineered Disaster

MADeIT was an exercise in ambition outpacing capability. I wanted acceptance test-driven development, cross-platform support, comprehensive enterprise integrations, and perfect quality gates — all built from scratch, all at once.

I spent a week on it. I used Claude to help me build it, which created an interesting recursive problem: I was using an AI to build a framework for directing AI development, and the complexity of the framework exceeded what the AI could reliably construct in a single coherent effort.

What I was really doing, though I couldn’t see it at the time, was trading reliability for sophistication. FADE was simple enough that it always worked. MADeIT was impressive enough that it never did.

MADeIT never worked. Not once.

Swanson: Back to Basics

The third iteration was named after Ron Swanson, whose philosophy — “Never half-ass two things. Whole-ass one thing” — perfectly captured what I needed to do differently.

Swanson stripped everything back. The core insight I wanted to preserve from MADeIT was test-driven development — specifically acceptance test-driven development where tests are generated from acceptance criteria before any code is written, and validated in separate sessions to prevent the AI marking its own homework. That part was worth keeping. Everything else went.

Where MADeIT tried to be everything, Swanson focused on doing one thing well: taking a queue of PRDs and delivering working, tested software. No self-healing. No integrations. No learning database. Just a clean execution loop with external test validation and standards enforcement.

The result was a Python-based framework that could execute a user story for approximately $0.14 on Sonnet. Predictable, measurable, and reliable. I was pleased with Swanson. It represented the distilled lessons of everything that had come before.

It was also still more complex than FADE. And I was starting to notice a pattern.

The Trap I Kept Falling Into

Every time I rebuilt, I added complexity. And every time I added complexity, I moved further from the thing that had actually worked.

FADE succeeded because it was simple enough to be dependable. The execution loop was straightforward, the governance layer was clear, and the AI had everything it needed and nothing it didn’t. When something went wrong, I could see exactly where. When something went right, I understood why.

MADeIT and Swanson were both, in different ways, attempts to build the impressive version before I’d properly earned it. I kept reaching for the enterprise-grade solution when what I actually needed — what my team actually needed — was something I could rely on completely.

Reliability isn’t a feature you add later. It’s the foundation everything else has to be built on. I knew this as an engineering principle. It took three iterations to live it.

Coming Full Circle

Before Swanson was fully complete, work intervened. I needed to bring agentic development to my professional environment, and I couldn’t wait. So I went back to the last clean version of FADE — the one before the Haiku incident — and ported it to my work environment.

The most sophisticated thing I’m running in production is the first thing that worked.

That’s not a failure. That’s wisdom, arrived at expensively. FADE is running across my organisation and it works remarkably well. I’ve given it to a couple of other engineers, though I have a concern that nags at me: FADE is powerful enough to accelerate engineers who might not fully understand what it’s producing or check its output rigorously enough. The tool amplifies whatever you bring to it — strong engineering judgement produces outstanding results, but insufficient oversight could multiply problems just as efficiently.

This is why, even though YOLO mode exists, I mostly use the step-by-step approach for my team. The human review gate between each user story isn’t a bottleneck — it’s a safety mechanism.

The Stripe Signal

Then, last week, a colleague sent me something that made me sit up. Stripe published a detailed account of their internal system — autonomous coding agents that now produce over 1,300 pull requests per week across their codebase. The human’s role shifts from writing code to defining requirements clearly and reviewing the output.

Reading it felt like looking at a scaled-up version of exactly what I’d been building towards. The core principles were identical: spec-driven development, fresh isolated contexts, human review at the handoff point. The governance layer, not the AI capability, as the differentiator.

What struck me most wasn’t the architecture. It was the validation that the instincts I’d been following — sometimes fumbling — were pointing in the right direction. Stripe got there with a team of engineers and serious infrastructure investment. I got there with Claude Code and a bash script. The destination was the same.

But Stripe operates at a scale that demands infrastructure I haven’t yet built. Their agents run in isolated environments integrated with CI/CD pipelines, with the pull request as the natural handoff between machine and human. My current approach works for individual productivity. The next challenge is making it work for a team.

What Comes Next

The next iteration is forming, informed by every failure and success along the way. The key shift is from individual developer tooling to platform-level capability. Lightweight containers — not full developer environments — where agents can spin up, execute against a well-defined task, and produce a pull request.

But this time, I’m starting with the smallest thing that could possibly work. And I’m not touching it until it’s reliable.

The Lessons

Looking back across this journey — from Ralph Wiggum to FADE, to the Haiku disaster, to MADeIT’s spectacular failure, to Swanson’s disciplined simplicity, and back to FADE in production — a few principles have emerged that I believe will hold regardless of how the technology evolves.

First, reliability before sophistication. Every failed iteration traded dependability for impressiveness. The version that worked was the one simple enough that nothing could hide inside it. Earn reliability first. Build sophistication on top of it, never instead of it.

Second, you have to earn the right to complexity. MADeIT failed because I tried to build the enterprise version before I understood what the essential components actually were. Every successful iteration started simple and added complexity only where experience proved it was needed.

Third, governance matters more than capability. The AI models are already capable enough to write excellent code. What they lack is context — the organisational knowledge, standards, and boundaries that turn raw output into production-ready software. The framework around the model is where the real value lives.

Fourth, fresh context is a feature, not a limitation. Starting each task with a clean session, armed with accumulated progress and lessons learned, consistently produces better results than long-running sessions that accumulate confusion. This is counterintuitive but repeatedly proven.

Fifth, the human review boundary is sacred. The point where human judgement intersects with AI output is the quality control mechanism that makes the whole system trustworthy. Removing it doesn’t make the system faster — it makes it dangerous.

And sixth, failure is the curriculum. The Haiku incident taught me about model economics. MADeIT taught me about earned complexity. Swanson taught me about disciplined scope. None of this knowledge was available in a textbook or a blog post — it came from building, breaking, and rebuilding.

I set out to build an autonomous developer. I’ve built one. It just took longer, cost more, and taught me more than I expected. If you’re on a similar journey, I suspect you already know the feeling — and I’d genuinely love to hear where you’ve got to.


Steve Mitchell is Director of Product Engineering at Milliman, where he leads a 40-person team obsessed with unlocking the next level of software engineering with AI. He writes about his experiments at Steve’s AI Diaries. Many of these experiments are personal trials rather than work aimed squarely at enterprise software engineering.


The catalyst for this article was the Stripe blog post: https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents by https://stripe.dev/authors/alistair-gray