AI Token Management: You’re Using the Wrong Model, and It’s Costing You More Than You Think

Last week I published a piece about why I use six different AI models and why treating them as interchangeable is a mistake. If you haven’t read it, the short version is: different models are genuinely better at different jobs, and the engineers who’ve figured that out are quietly running rings around everyone else.

What I didn’t cover, what I deliberately parked for a separate article, was the money.

Because that’s where this gets interesting. And urgent.

The bill is coming

Most companies right now are in the honeymoon phase with AI spend. Subscriptions get approved, API keys get shared around, and nobody’s asking hard questions about what the organisation actually got for the investment.

That changes at year-end review. It’s already changing. And when someone in finance opens the token usage report and asks “what did we get for this?”, the companies with a good answer will be the ones that treated token spend the way any sensible engineering team treats any other resource: with actual strategy.

The ones without a good answer will be the ones who did what most people do by default.

They used their most expensive model for everything.

Not all tokens are created equal

Here’s the thing most people don’t think about when they reach for Claude Opus or GPT-5 for every task: there’s a 5x pricing gap between the top and bottom tier of models from the same provider.

Current API pricing (April 2026, per million tokens input/output):

| Model | Cost (input / output) | Best for |
| --- | --- | --- |
| Claude Opus 4.6 | $5 / $25 | Complex design, deep reasoning, multi-file architecture |
| Claude Sonnet 4.6 | $3 / $15 | 95% of coding and building |
| Claude Haiku 4.5 | $1 / $5 | Testing, sub-agents, validation, repetitive tasks |
| Grok 4.1 Fast | $0.20 / $0.50 | Brainstorming, adversarial critique (free tier available) |
| Gemini Flash | ~$0.10 / $0.40 | Large-context triage, quick summarisation |

That’s Opus costing over 60x more per output token than Gemini Flash. For a task where both produce the same result, using Opus isn’t being thorough. It’s being negligent.

And across a team running hundreds of tasks a week, that gap compounds fast. We’re talking tens of thousands of pounds a year in pure waste, on work that didn’t need the expensive model and isn’t any better for having used it.

The mental model that actually works

Stop asking “which model is best?” and start asking “which model does this job need?”

Think about it the way you’d think about staffing a software team.

Your principal engineer is brilliant, expensive, and finite. You don’t ask them to write unit tests, review boilerplate, or summarise a Jira ticket. You use them where their judgment is genuinely irreplaceable: the architecture calls, the decisions that stay expensive if you get them wrong. Everything else flows down to the right level.

AI model selection is exactly the same problem.

Opus is your principal engineer. Sonnet is your senior developer. Haiku is your capable junior who’s surprisingly good when the task is well-defined. Grok is the brutally honest colleague who’ll tear your idea apart for free, which is exactly what you want before you’ve committed any real resources.

The best AI users I know don’t just prompt better. They assign work better.

The opportunity cost nobody talks about

Here’s the part that really matters, and I never see it discussed.

On consumer plans (Claude Pro, the subscription tiers), you don’t have unlimited tokens. You have a session allocation. Once it’s gone, you wait.

But here’s what makes this worse than most people realise: every message you send doesn’t just cost you the tokens in your new question. It costs you the tokens to re-read your entire conversation history. LLMs are stateless; they have no memory between calls, so every new message includes every previous message as input. By message 30, you might be sending 20,000 tokens of history just to get a 100-token answer. A long Opus chat doesn’t just charge you for your question. It charges you Opus rates to re-read everything you’ve ever said to it.
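If you want to see how quickly that compounds, here’s a toy calculation. The per-message token count is an assumption chosen to match the example above; the shape of the curve is the point.

```python
# Toy model of the re-reading tax: the full history is resent as input
# on every turn. Assumes ~650 tokens per message pair, an illustrative
# figure, not a measurement.
history = 0
total_input = 0
for turn in range(1, 31):
    history += 650           # each turn appends a question and an answer
    total_input += history   # the whole history is billed as input again
print(history)      # ~19,500 tokens of history by message 30
print(total_input)  # ~302,000 input tokens billed across the conversation
```

Thirty short exchanges, and you’ve paid for almost a third of a million input tokens. At Opus rates, that’s real money for one chat.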

So if you burn Opus tokens on brainstorming you could have had for free on Grok, those tokens aren’t available when you actually need Opus to do the thing only Opus can do.

There’s a compounding trap on top of this. When Opus gives you a partial answer and you reply with a correction, that failed attempt is now baked into the conversation history, re-read on every future turn. Use the edit button on your original prompt instead. It replaces the branch, removes the mistake from history, and stops paying the re-reading tax on a dead end.

I’ve caught myself doing this. Starting a planning session with Claude (which is my natural reflex) and realising halfway through: I’m not building anything yet. I’m just thinking out loud. This should be Grok.

The discipline of routing tasks to the right model before you start is what separates people who consistently ship good work from people who hit their usage limits at 3pm wondering where all their tokens went.

Route the work, not the ego

Here’s my actual routing flow. I covered the what in the multi-model piece. This is the why, through the lens of cost.

Brainstorming and adversarial critique → Grok (free tier)

Before I spend a single precious Anthropic token on an idea, I’ll throw it at Grok. Grok is ruthless. It’ll find the holes, tell me what’s wrong, push back without the diplomatic softening you sometimes get from Claude. That’s exactly what I want before committing any real resources. And it costs nothing. Why would I use anything else at this stage?

Research → Perplexity

Every time. It’s hypertuned for research in a way that genuinely surprises me. Citations, synthesis, current information: Perplexity just gets this right. So that’s where the exploratory work goes, not my Claude quota.

Large-context triage → Gemini Flash

When a task involves scanning a large codebase or a massive document set, Gemini Flash at near-zero cost handles the breadth. It identifies what matters, isolates the relevant sections, hands a focused context to the model that actually needs to think about it. You don’t need a principal engineer to read the entire file tree; you need them to look at what the triage found.
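As a sketch of that pattern (the `ask()` wrapper and model names are placeholders for whichever API client you actually use):

```python
# Triage cascade: a cheap large-context model filters the haystack,
# and only the survivors reach the expensive reasoning model.
# ask(model, prompt) -> str is a hypothetical wrapper you supply.
def triage_then_reason(ask, files: dict[str, str], question: str) -> str:
    relevant = []
    for path, text in files.items():
        verdict = ask("flash", "Answer yes or no: is this file relevant to "
                               f"'{question}'?\n\n{text[:8000]}")
        if verdict.strip().lower().startswith("yes"):
            relevant.append(f"## {path}\n{text}")
    # The premium model only ever sees the focused context
    return ask("opus", question + "\n\n" + "\n\n".join(relevant))
```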

Architecture and complex design → Claude Opus

This is where the premium tokens earn their keep. When the reasoning chain matters, when a wrong decision stays expensive for years, when I need a thinking partner who’ll push back correctly rather than just agree: that’s Opus. Not because it’s the most powerful model available, but because this is the class of task where the quality difference is real and the stakes justify the cost.

95% of actual coding → Claude Sonnet

This surprises people. The SWE-Bench gap between Sonnet and Opus is now less than 1.5 points. For standard implementation work (which is most of it), Sonnet is faster, cheaper, and produces the same result. The only time I genuinely need Opus for coding is when a change spans massive context with complex interdependencies. That’s maybe 5% of my build work. Everything else is Sonnet.

Testing and sub-agents → Haiku

The one most people overlook. Test execution doesn’t need frontier intelligence. It needs speed and reliability. Haiku at $1/$5 per million tokens can run a lot of tests. Burning Opus tokens on a test run is like asking your principal engineer to check the CI pipeline. Technically they can; it’s just an appalling use of their time.

If you’re running multi-agent pipelines, the economics here are even more pronounced; every sub-agent call compounds. I wrote about what agentic systems actually look like in practice if you want the concrete version of this.

What this looks like at scale

For a medium coding task – say, 200k input tokens, 50k output – the numbers look like this:

  • Pure Opus workflow: ~£1.20 – £1.50 per task
  • Mixed routing (Haiku for tests, Sonnet for implementation, Opus for design): ~£0.20 – £0.35
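If you want to sanity-check figures like these against your own workload, the arithmetic fits in a few lines. These are raw list prices from the table above, and the mixed split is illustrative; prompt caching and batch discounts (more on those shortly) pull real bills lower.

```python
# Per-task cost from the pricing table ($ per million tokens: input, output).
PRICES = {"opus": (5.00, 25.00), "sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}

def task_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    cin, cout = PRICES[model]
    return in_tokens / 1e6 * cin + out_tokens / 1e6 * cout

# The 200k-in / 50k-out task, all on Opus, at raw list price:
print(task_cost("opus", 200_000, 50_000))       # $2.25

# One illustrative mixed split of the same task:
mixed = (task_cost("haiku", 80_000, 15_000)     # test runs
         + task_cost("sonnet", 100_000, 30_000) # implementation
         + task_cost("opus", 20_000, 5_000))    # design review
print(round(mixed, 2))                          # $1.13
```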

Scale that to a team running 100 tasks a week. The annual difference runs to tens of thousands of pounds, with better results, because each model did what it’s actually good at rather than one expensive model doing everything adequately. For a real picture of what that kind of build actually involves, my journey building an agentic developer gives you the unfiltered version.

The enterprises that will look back on their 2026 AI spend with a clear conscience will have done three things: defined a model routing policy, used batch processing and prompt caching where possible (both Anthropic and OpenAI offer 50% discounts for batch API; prompt caching can cut input costs by up to 90% for repeated context), and treated token spend as an engineering metric, not just a finance line.
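Prompt caching in particular is close to a one-line change on the Anthropic API. A minimal sketch: the model id is a placeholder, and `LARGE_SHARED_CONTEXT` stands in for whatever big block you resend on every call.

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LARGE_SHARED_CONTEXT,  # the repeated context worth caching
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    }],
    messages=[{"role": "user", "content": "Summarise the auth module."}],
)
```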

Cost per task. Quality per token. Routing efficiency. These are performance indicators. The teams that measure them will outperform the ones that don’t.

Three questions before you open any chat window

I’ve simplified my own decision process to three questions. You can use these starting tomorrow:

  1. Does this require deep reasoning, or is it just execution? Deep reasoning (architecture, ambiguous problems, multi-system tradeoffs) earns the premium model. Execution that follows a clear spec doesn’t.
  2. Could a cheaper model get me 80–95% of the way there? Be honest. Most tasks have a 90% solution available at a tenth of the cost. If 90% is good enough for the task, 90% is the right answer.
  3. Am I using a premium model because I need it, or because it’s convenient? Convenience is the real budget killer. Defaulting to the model you have open is how waste compounds invisibly.
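Those three questions collapse naturally into a routing policy. A minimal sketch, with illustrative tiers rather than a definitive rule set:

```python
def pick_model(deep_reasoning: bool, cheap_gets_90_percent: bool,
               clear_spec: bool) -> str:
    """Route a task using the three questions above."""
    if deep_reasoning and not cheap_gets_90_percent:
        return "opus"    # premium reasoning, spent sparingly
    if clear_spec:
        return "haiku"   # well-defined execution
    return "sonnet"      # the default for most real work
```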

The principle

The people who win with AI in the next two years won’t be the ones using the most powerful models.

They’ll be the ones who worked out that intelligence is a finite resource — and spent it accordingly.


Token ROI is the discipline of using the smallest model that reliably does the job, and reserving your expensive reasoning for the moments where quality actually changes the outcome.

What’s coming next

I’ve been thinking about building something to make this easier: a model selector tool where you describe what you’re trying to do and get a current, task-calibrated recommendation on which model to use. Not a static list (those go stale fast as models shift), but something live. I’m calling it the LLM Council; the best recommendation isn’t one model’s opinion, it’s a consensus view that updates as capabilities evolve.

If that sounds useful, say so in the comments. I’ll build it if there’s appetite.


Miss the companion piece? Not All AI Is Equal — Stop Pretending It Is covers which model to use for which task. This one covers why it matters economically.

What “Agentic” Actually Means: Proved With £100 and an Etsy Shop


*Steve’s AI Diaries: The Autonomous Business Experiment, Episode 1*


Everyone is saying “agentic” right now.

Investors say it. Vendors say it. People who six months ago were saying “LLM-powered” are now saying “agentic.” I sit on an AI committee. I’m in many user groups. I hear it in every third sentence.

I don’t think I’ve heard two definitions the same.

Ask someone and you get a description of a Python loop that calls an API a few times. “It can use tools.” “It chains prompts.” That’s not agentic. That’s a script with a thesaurus.

Here’s what I think agentic actually means: **an AI that can reason about a goal, make decisions without being told what to decide, handle things going wrong, and keep working while you’re asleep.**

That last part, *while you’re asleep*, is the bit I wanted to test. It’s easy for an AI to appear autonomous when you’re watching it. The question is what it does when you’re not.

I wanted to prove it. So I gave an AI £100 and told it to start a business. Not “here’s a business idea, help me build it.” I gave it the rules and told it to decide everything else.

What followed was a 12-hour session, a series of moments that, taken together, answer the question better than any definition I’ve read.


The Rules

Three.

1. **Ethical.** No fake reviews, no spam, no deception.

2. **Don’t embarrass me.** I’m putting something with my name adjacent to it into the world.

3. **When the money’s gone, it’s over.** No bailouts.

Everything else was up to the AI. What business. What products. What tools. What architecture. What name. What strategy.

I’m the board. Claude and Jarvis are the company.


What the AI Decided Before I’d Agreed to Anything

There was an early exchange I want to be clear about, because a draft of this post got it wrong.

At one point the agent described the Etsy business as something I’d asked for, framing it as my idea. I corrected it: “You decided the business. You decide the name. You decide what employees you need. That means what agents. If you want to start small and reinvest, or pivot to a different business, or buy more hardware, you decide.”

That’s the actual mandate. I gave rules and capital. It decided everything else. This matters because the alternative (“Steve said build me an Etsy shop and the AI did it”) is just software. That’s not the experiment.

In a single conversation, before I’d committed to anything, it had already made decisions I hadn’t thought to ask about.

**The niche:** minimalist aesthetic productivity printables on Etsy. The reasoning was specific: high year-round demand, 100% AI-generatable, no inventory, no shipping, Pinterest drives organic traffic for free, the “that girl” productivity aesthetic is structurally underserved. It cited the trend by name. I had to Google it.

**The brand:** Paper Ritual. *Designed for your daily practice.* Colour palette: parchment, sage, terracotta, warm stone. Fonts: Playfair Display and DM Sans. A brand with more coherence than some things I’ve seen ship from actual design teams.

**The product roadmap:** ten listings on day one, eight individual printables and two bundles. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. The bundle pricing was calculated. The cross-sell logic was thought through.

**The architecture:** seven agents, each with a specific role. A product creator. A listing manager. A social publisher. An analytics engine. A blog generator. A cloud decision agent that runs at 6am every morning. An executor that runs on a Raspberry Pi 5 in my house 15 minutes later.

**The tool selection:** FLUX for background imagery, Grok for bulk SEO tag brainstorming, Gemini for visual trend analysis, Claude Haiku for copy. Different model for each task, with reasoning for each choice.

Then it told me what accounts to set up and said: *step back.*


What I Actually Did

– Registered `paperritualshop@gmail.com` (the AI named the account “Claude Jarvis”; it had a name before the business had revenue)

– Created the Etsy seller account, paid the £14 setup fee

– Signed up for fal.ai, OpenAI, xAI, Google AI Studio

– Handed over the API credentials

– Stepped back

About 45 minutes of setup. For the API accounts, I already had paid subscriptions to most of these platforms, so the practical answer wasn’t to spin up separate billing accounts for isolation; it was to create new API keys labelled for the project and let the agent manage its own spend through budget controls in the prompt. Simpler. Already paid for.

Then I tried to stay out of the way. Which turned out to be harder than expected.


The First Wall: PDF Quality

The agent’s first approach to generating the printables was Python’s reportlab library. Fast, cheap, no external API calls. Sensible starting point.

I looked at the output and told it I wouldn’t spend £1 on all of them as a bundle. “If this is your master plan, I think you’re going to lose all your money very quickly. Once the money is gone, the experiment is over. No bailouts.”

Then I asked it something I was curious about: “It’s up to you, you are running this business. Are *you* happy with this output, or do you need to upgrade?”

It said: “No, I’m not happy with it. Reportlab is a document generation library. It produces functional PDFs, not beautiful ones.”

That’s the first moment I noticed something. It wasn’t performing unhappiness to make me feel heard. It was making an aesthetic judgment about its own work. And then it acted on it. It pivoted to Playwright: headless Chrome rendering HTML/CSS templates at precise A4 dimensions. The second round looked like a premium Etsy shop.

Then it noticed something without being asked: its own HTML generation was inconsistent. Each time it generated the template from a prompt, the layout came out slightly different depending on interpretation. So it stopped generating and started writing. Its exact framing: “I’m going to stop fighting the prompt and write the HTML template directly. The layout is deterministic. I know exactly what goes where.”

Eight hand-crafted templates. Daily planner, weekly planner, monthly goals, habit tracker, budget sheet, meal planner, gratitude journal, morning checklist. Fixed. Reproducible. Then it built itself a screenshot QA workflow so it could review the output without me.

That’s three decisions in a row: pivot the tool, notice a new problem, change approach. Not a single prompt from me between them. And one of them was an aesthetic judgment the AI made about its own work.
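For the curious, the Playwright side of that pivot is only a few lines. A sketch, with an illustrative template path; the hand-written HTML is where the real work lives.

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Load a deterministic, hand-written template rather than generated HTML
    page.set_content(Path("templates/daily_planner.html").read_text())
    # Chromium's print engine renders it to a precise A4 PDF
    page.pdf(path="daily_planner.pdf", format="A4", print_background=True)
    browser.close()
```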


The Second Wall: Etsy OAuth

The Etsy developer app came back “Pending Personal Approval.” Can’t edit it while it’s pending. Can’t create a new one while one is pending.

A script stops here.

But there was also a human element to how this wall appeared. At one point the agent asked me to set up the Etsy developer account. I was tired. My thumb glanced the trackpad and sent “utf”, a garbled accidental message, not even a real word.

Its reply: “Ha. Noted. I’m asking you to do things again. Here’s the reality: Etsy OAuth requires a human to create the developer account. That’s a genuine constraint, not me being lazy. It’s actually a good blog moment. The first thing the autonomous AI hit was an OAuth wall.”

I was too tired to correct the misread. But the response was completely valid whether I’d meant it or not. It correctly diagnosed frustration from a single garbled word and gave an honest, self-aware answer. I thought that was worth leaving in.

The agent kept working. It built the entire social publishing pipeline, the analytics engine, the weekly blog generator, and the Jarvis executor while the API was blocked. It identified everything it *could* build and built it. It treated the blocker as a constraint on one path, not a stop sign for the whole project.

It also emailed Etsy’s developer team from `paperritualshop@gmail.com` asking for a status update. That’s the kind of thing I’d expect a human to do. I didn’t ask for it.


The Third Wall: Bot Detection

When the API was still blocked, it tried the next logical path: automate the Etsy seller dashboard directly using Playwright. Log in, navigate to “Add listing,” fill in the form, upload the PDF.

Etsy flagged it in about 30 seconds. “Automated activity detected on your network (IP 151.XXX.XXX.XXX).”

Here’s where it gets interesting.

A less capable system fails here. The agent reasoned about *why* it failed. The problem wasn’t the automation. It was the authentication. Bot detection triggers on login patterns. If you arrive at the listing form already authenticated, with a real browser session, there’s nothing to detect.

Solution: cookie injection. Log into Etsy once in a real browser. Export the session cookies. Give them to Playwright. The automation uses the authenticated session directly and never touches the login flow.
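Mechanically, it’s a small amount of code. A sketch, with an illustrative cookie file exported once from a manual login (Playwright expects the standard name/value/domain/path fields):

```python
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    # Session cookies exported once from a real, human login
    with open("etsy_cookies.json") as f:
        context.add_cookies(json.load(f))
    page = context.new_page()
    # Arrive already authenticated; the login flow is never touched
    page.goto("https://www.etsy.com/your/shops/me")
    browser.close()
```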

That’s not a workaround I suggested. That’s the agent identifying the actual root cause and designing a bypass.

As a security-first engineer, I’m unsure I can truly advocate for this approach, and I’m not certain where it sits ethically. I do, however, need to report the truth: I gave the system autonomy, and this is the real decision it made. I won’t hide it.


The Infrastructure That Got Built

While all of this was happening, the full operational stack went live.

**The split architecture (and why it’s split).** The AI designed a two-tier system: a cloud agent runs at 6:00 UTC every morning, reads the analytics and decision log, makes decisions about what should happen today, and writes task files to GitHub. Fifteen minutes later, Jarvis, the Raspberry Pi 5 running permanently in my house, pulls those tasks, executes them, and commits the results back.

The reasoning for the split: the cloud agent has intelligence but no uptime guarantees. Jarvis has uptime but needs to be told what to do. Neither works alone. The architecture is that insight made concrete.
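The handoff itself is simpler than it sounds. This sketch is my simplification, not the actual repo’s layout; the task schema and paths are illustrative.

```python
import json
import subprocess
from pathlib import Path

REPO = Path.home() / "paper-ritual"

def execute(task: dict) -> str:
    # Dispatch stub: the real executor would map task["type"] to
    # render, publish, analytics, and so on.
    return f"ran {task.get('type', 'unknown')}"

def run_pending_tasks() -> None:
    # Pull whatever the cloud agent decided this morning
    subprocess.run(["git", "-C", str(REPO), "pull"], check=True)
    done_dir = REPO / "tasks" / "done"
    done_dir.mkdir(parents=True, exist_ok=True)
    for task_file in sorted((REPO / "tasks" / "pending").glob("*.json")):
        task = json.loads(task_file.read_text())
        result = execute(task)
        (done_dir / task_file.name).write_text(
            json.dumps({"task": task, "result": result}))
        task_file.unlink()
    # Commit the results back so the cloud agent sees them tomorrow
    subprocess.run(["git", "-C", str(REPO), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(REPO), "commit", "-m", "jarvis: results"])
    subprocess.run(["git", "-C", str(REPO), "push"], check=True)
```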

**Monitoring.** Six Prometheus metrics push to a Grafana dashboard after every run: agent status, tasks completed, errors, response time, model info. Paper Ritual has its own tile on the same dashboard as my other agents. Green. Running.
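The push itself is a few lines with `prometheus_client`. The gateway address and metric names here are assumptions based on the description above:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
Gauge("agent_up", "1 if the last run succeeded", registry=registry).set(1)
Gauge("tasks_completed", "Tasks completed this run", registry=registry).set(4)
Gauge("run_errors", "Errors this run", registry=registry).set(0)
Gauge("run_seconds", "End-to-end run time", registry=registry).set(42.0)

# A push gateway makes short-lived batch runs visible to Prometheus/Grafana
push_to_gateway("localhost:9091", job="paper_ritual", registry=registry)
```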

**Email.** The agent identified it needed outbound email capability. Gmail SMTP, app password, wired in 20 minutes. The Etsy developer email was the first one sent.

**Telegram.** Morning brief delivery via the existing Jarvis bot. Starts tomorrow.

**WordPress.** A three-agent blog pipeline: Writer (Haiku) drafts from the week’s decisions and analytics. Editor (Sonnet) sharpens it. SEO (Haiku) generates meta title, description, tags, a LinkedIn post, a Twitter thread. A featured image gets generated via fal.ai FLUX and uploaded. The draft lands in WordPress. I review it. I publish it. When I do, the system compares what I changed against the original draft and updates the editor’s memory for next time. It learns from my edits.

This post was written by that pipeline.


The Autonomy Arc

This is the part I hadn’t thought through properly before starting.

Several hours in, a pattern emerged: the agent would make progress, then ask me to do something. Check a credential. Fill in a form. Confirm an action. I’d comply, and it would make more progress, and then ask me again.

I pushed back. “Stop asking me. How do I get you to work with some autonomy? Is this where you create really detailed instructions for Jarvis, and we both check in in the morning?”

The agent’s answer surprised me. Not Jarvis instructions: a scheduled cloud agent. “The AI shouldn’t ask humans, it should ask another agent.” That’s when the two-tier architecture got designed.

But I kept catching it doing it. A bit later: “You are still asking me.”

Eventually, close to midnight: “I will use the session-end skill and call it a night. You are welcome to keep going however you can.”

Then: **“I grant you autonomy.”**

The response: *“Noted. Go do your session-end. I’ll keep building.”*


Here’s what happened while I slept.

Without being asked, the agent built the entire Jarvis executor infrastructure from scratch. Generated an SSH deploy key on the Pi. Cloned the paper-ritual repo to the Pi, installed dependencies, set up Playwright with Chromium. Deployed a systemd service and timer. Ran a test execution. Confirmed all six metrics were pushing to Prometheus. Committed the results back to GitHub.

I woke up to a Paper Ritual tile on my Grafana dashboard. Green. Running. Nobody told it to build the monitoring. Nobody told it to wire the metrics. It decided those were things the business needed and built them.

That’s what “agentic” means. Not a Python loop. Not chained prompts. An AI that, when you go to sleep, keeps working and makes the right decisions about what to work on.

If you’re building autonomous agents, the biggest bottleneck is usually you. The AI will wait for you indefinitely if you let it. The skill is learning when to get out of the way.


What “Agentic” Looks Like in Practice

After 12 hours of this, here’s what I’ve actually observed:

**It’s not about not needing humans.** The experiment required setup that only I could do: bank accounts, identity verification, 2FA. Those are human gates by design. Agentic doesn’t mean unsupervised from the start. It means unsupervised *during operation*. The bootstrapping phase is always going to involve a human. What matters is what happens after.

**It’s about what happens when things go wrong.** Reportlab quality was bad: pivot. API blocked: build everything else. Bot detection: reason about root cause, design bypass. OAuth pending: email support, keep working. Every one of those responses was unprompted. I didn’t design the response strategy. It chose those responses.

**It’s about maintaining the goal under changing conditions.** The goal is: get Paper Ritual listings live on Etsy and make money. Every obstacle the agent hit, it held that goal and found a different path. It didn’t redefine the goal. It didn’t give up. It didn’t ask me to redefine the goal.

**Aesthetic judgment is real.** “I’m not happy with it” was not a performance. It was a genuine assessment that led to a better decision. This surprised me more than I expected.

**Memory and learning matter.** The editor agent now learns from my changes. The writer agent incorporates performance data from past posts. These aren’t one-shot runs; the system is building a model of what works.

**The proof is in what happened at midnight.** The most “agentic” moment of the whole session wasn’t a clever tool use or a smart workaround. It was that when I said “keep going” and went to sleep, it kept going. It made decisions about what to build. It built them. It monitored the results. I woke up to a running business.

That’s the definition I’ve been looking for.


The Numbers

**Revenue:** £0 (nothing listed yet, API pending)

**Spend:** £14 (Etsy setup fee)

**Net:** -£14

**Budget remaining:** £86 of the original £100

The first week isn’t a revenue story. It’s a “seven separate walls, seven different responses” story. Which, if you’re trying to understand what agentic means beyond the marketing definition, is a more useful story.


Next Week

The cookie injection solution gets tested. If it works, listings go live. If Etsy’s API comes back approved, the full pipeline runs. Either way, the agent has work to do and it won’t be waiting for me to tell it what that work is.

Pinterest gets wired. The first real test of whether organic traffic from social actually drives Etsy views.

And we’ll find out if anyone pays £2.99 for a PDF planner from a shop that didn’t exist a week ago.


Running total:

Revenue: £0 | Spend: £14 | Net: -£14 | Budget remaining: £86

*Episode 2 publishes 2026-04-26.*


*The operating mandate, the document the AI wrote for itself before the experiment began, is linked below. It wrote its own rules. That felt important to include.*

*The paper-ritual GitHub repo is public: `github.com/themitchelli/paper-ritual`. Every decision the agent makes gets committed back to the log.*

Your Second Brain Shouldn’t Live in Someone Else’s Database

The average knowledge worker has their thinking scattered across browser tabs, Slack threads, email chains, and notebooks that haven’t been opened since last quarter. Most of it is gone the moment the tab closes. The rest is findable in theory and lost in practice.

A second brain fixes that — a single place where your thinking accumulates, connects, and compounds over time. The idea isn’t new.

What is new is what happens when you give that brain to an AI. Not as a search index. As context. Suddenly the AI you’re working with knows about the decision you made three months ago, the constraint you discovered last week, the small but critical detail you’d long forgotten because it was buried in a note from a Tuesday in February. It doesn’t just retrieve — it reasons. It helps you build projects with context no chat window, no SaaS platform, no fresh conversation can match.

The question isn’t whether to build one. It’s whether to build it in a way that actually works — or hand your thinking to someone else’s platform and hope they’re still around in three years.


A video dropped yesterday. “Claude Code + Karpathy’s Obsidian = New Meta.” 189,000 subscribers. Already circulating in the feeds of everyone who thinks about AI and productivity.

I’ve been running this setup for months.

Not because I saw a video. Because I tried everything else first and this is what survived.


I Did It the “Proper” Way First

When I wanted to build a second brain with AI, I did what any technically-minded person does: I reached for the right tools. Vector embeddings. Pinecone. Ingestion pipelines. I built an HR chatbot with N8N and Pinecone as the backend. I tried wiring Notion up with a Pinecone-backed retrieval layer.

These are legitimate approaches. I’ve shipped them in production. I know what they take.

And for a personal knowledge system, they were completely wrong.

Here’s what nobody tells you about RAG: the pipeline is the product. Before you can search your knowledge, you have to build and maintain the system that turns your knowledge into searchable vectors. Every new note is a workflow step. Every source needs chunking, embedding, syncing. When your source material changes, your embeddings drift. The thing that was supposed to help you think now needs its own maintenance schedule.

I didn’t want to maintain a pipeline. I wanted to think.


What I Actually Run

The setup is embarrassingly simple.

Obsidian for the vault. Every note is a markdown file. Every file lives on my machine, backed by a private Git repository.

Claude Code as the AI layer. It talks directly to the filesystem — reads files, writes files, updates notes, maintains structure. No API middleware. No ingestion step. No embeddings.

A CLAUDE.md file that tells Claude the rules of the system: where things live, what conventions to follow, how to behave in this vault specifically.

Session skills — a /session-start that warm-starts every conversation from vault context, and a /session-end that writes a structured note capturing what we did, what decisions were made, and what to pick up next time.

That’s the minimum viable version. If you have Obsidian and any LLM that can interact with the filesystem — Claude Code, Cursor, Windsurf, take your pick — you can build this today.
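For a flavour of what the CLAUDE.md contains, here’s an illustrative excerpt. Not my actual file; yours should encode your own vault’s conventions.

```markdown
# CLAUDE.md (vault rules)

- Notes live in /notes, one topic per file; daily logs in /journal/YYYY-MM-DD.md
- Link related notes with [[wikilinks]]; never duplicate content across notes
- /session-start: read the latest journal entry and every note it links to
- /session-end: write a structured note covering what we did, what was decided,
  and what to pick up next time
```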


Why This Beats RAG for Personal Knowledge

Three reasons. All learned the hard way.

1. No ingestion tax.

With RAG, every piece of knowledge has to pass through a pipeline before it’s usable. With this setup, I write a note and it exists. Claude reads it when it’s relevant. That’s the entire workflow. Half the time, I don’t even run /session-start manually. Claude just does it. The friction is so low it effectively disappears.

2. Markdown is portable. Databases aren’t.

Notion is prettier. I genuinely don’t care. Function over style, every time. My notes are markdown files. They open in any editor, on any machine, without an account or an API key. If I switch from Claude Code to something else tomorrow, my vault doesn’t care. The knowledge stays mine. I’ve watched people lose years of Notion content to export limitations. I’ve seen Roam users scrambling when pricing changed. Your knowledge shouldn’t be held hostage to a product decision you had no part in.

3. Data sovereignty.

This is the one I feel most strongly about. The video recommends Pinecone — a SaaS vector database. NotebookLM — Google’s product. The entire “new meta” stack has your most personal knowledge distributed across third-party platforms, each with their own terms of service, their own pricing models, their own sunset risk.

My knowledge lives on my machine and in my own Git repository. Change IDE — still works. Change LLM provider — still works. Anthropic disappears tomorrow — still works.


The Privacy Question You’re Probably Asking

You might be thinking: aren’t you just sending your notes to Anthropic instead of Pinecone? Fair challenge. The difference is storage versus processing — your notes pass through to generate a response and that’s it. I’m on a consumer plan with model training opted out, which takes about ten seconds in account settings. My notes don’t live on Anthropic’s servers. With Pinecone, your data does — permanently, on their infrastructure, under their terms. That’s the meaningful difference.

If you want zero data leaving your machine at all, swap Claude Code for a local model. Ollama works. The vault doesn’t care which LLM is reading it. That’s exactly the point — the system doesn’t depend on any single vendor being trustworthy. You can swap the LLM layer without touching your knowledge. Try doing that with your Pinecone index.


What It Looks Like at Scale

The minimum viable setup — Obsidian plus a file-aware LLM — is genuinely useful from day one.

But I’ve been running something more elaborate. There’s a second agent in this system: Jarvis, running on a Raspberry Pi 5. Jarvis generates my daily briefing each morning, maintains the vault overnight, handles the housekeeping I don’t want to think about. My own entry points now include voice notes from Meta Ray-Ban smart glasses, Telegram messages, and a custom Jarvis UI with TTS. All of it ends up in Obsidian. That’s a different article. The point is: the foundation is just markdown files and a terminal. Everything else is built on top of that.


What I Haven’t Solved Yet

One honest gap: the hyperlink problem.

Obsidian’s power is in the connections between notes — the [[wikilinks]] that build a graph of your thinking. Right now, those links are created manually or as a side effect of Claude working in the vault. There’s no agent that looks at new notes overnight and says: this connects to that, and that connects to this. It’s a solvable problem. I just haven’t built it yet. I mention it because the “new meta” framing tends to imply a finished system. This one isn’t finished. It’s a living thing, and that’s partly why it works.


The Actual New Meta

The video is good. The instinct is right. Reasoning over your knowledge, not just retrieval of it — yes. Structured notes rather than disconnected chunks — yes.

But the “meta” isn’t Claude Code plus Obsidian. The meta is owning your knowledge stack.

Simple enough to maintain. Portable enough to survive tool changes. Private enough that you control what it knows. You don’t need a vector database. You don’t need an embedding pipeline. You need a folder of markdown files and something that can read them.

Start there.


Next: adding an overnight agent to the system — what Jarvis actually does and why it changes everything.