Not All AI Is Equal — Stop Pretending It Is

Tagline: Vendor bias is real, the benchmarks prove it, and the engineers who’ve figured out which model to use for which job are quietly lapping everyone else.


There’s a question I hear constantly in engineering circles: “Which AI should I use?”

The implicit assumption behind it is that there’s one right answer. Pick the best one, use it for everything, done. It’s how we think about most tools — you pick your IDE, your cloud provider, your language. You don’t swap between three of them mid-task.

But AI models aren’t like that. And the sooner you stop treating them like they are, the better your output gets.

I use six different AI tools in my workflow. Not because I enjoy managing subscriptions, but because each one is meaningfully better at a specific job — and the benchmarks, plus two years of daily use, back that up.


The vendor bias problem no one talks about

Most people pick their AI assistant the same way they pick a phone: brand loyalty, whatever their company pays for, or whatever the loudest voice in their team recommends.

The result is monoculture. One model, used for everything, never questioned. And because the model is capable enough to produce something — often something good-looking — it’s easy to miss that a different tool would have done the job better.

This isn’t hypothetical. Researchers writing in Nature Communications published findings earlier this year warning that AI is turning research into a “scientific monoculture” — homogenised outputs, shared blind spots, correlated failures. Gartner predicts that by 2028, 70% of organisations building multi-model applications will have AI gateway middleware specifically to avoid single-vendor dependency. LinkedIn reports that “model selection” is now one of the fastest-growing skills among senior engineers.

The engineers who’ve noticed the problem are moving. The ones who haven’t are wondering why their AI output feels the same as everyone else’s.


What the benchmarks actually say

Before I get into my specific workflow, let me give you the data that convinced me models aren’t interchangeable.

SWE-Bench Verified is the closest thing we have to a real-world software engineering test. Unlike HumanEval — which asks models to write isolated functions from scratch — SWE-Bench gives a model a real GitHub repository, a real bug report, and asks it to produce a fix. No hints about which files to look at. Multi-file edits. Tests written for the human fix, not for the AI. It’s what software engineers actually do.

The current top-line scores (as of April 2026, SWE-Bench Verified):

Model                    SWE-Bench Verified
Claude Opus 4.5/4.6      ~80.9%
Claude Sonnet 4.6        ~79.6%
GPT-5                    ~74.9%
Gemini 2.5 Pro           ~73.1%
Grok Code Fast           ~70.8%

That’s a 10-point gap between the top and bottom. On tasks that represent real engineering work — navigating a codebase, diagnosing a root cause, making multi-file changes — that gap is not noise. It’s the difference between a model that resolves your bug and one that produces a plausible-looking patch that breaks something else.

But here’s what the leaderboard doesn’t tell you: SWE-Bench measures software delivery. It doesn’t measure research, design, ideation, or critique. The model that tops the coding benchmark isn’t necessarily the best tool for synthesising a market landscape or stress-testing an architecture decision.

That’s the bit that took me a while to learn. Different jobs. Different models.


My workflow: six models, six jobs

Here’s what I actually use and why.

Gemini — broad context gathering

Google’s model has a context window large enough to be genuinely useful for research synthesis. When I need to understand a large domain quickly — technical landscape, regulatory environment, competitive positioning — Gemini handles breadth well. It connects across a lot of surface area without getting lost.

I don’t use it for precision work. But when I need to go wide before going deep, it’s the right first move.

Perplexity — external research

When I need current information with citations, Perplexity is in a different category. It retrieves, cites, and synthesises in one pass. Not a replacement for reading primary sources, but significantly faster for building a research base. The multi-model routing it now supports (running queries across GPT, Gemini, and Claude simultaneously) makes it even more useful as a research layer.

Claude Opus — design and architecture

This is where I spend the most time for high-stakes thinking. System design, architecture decisions, PRD writing, anything where the reasoning chain matters and I need a thinking partner who pushes back correctly rather than just agreeing.

Opus doesn’t just answer — it models the problem. It tells me when my framing is off. It proposes alternatives I hadn’t considered. For a 40-person engineering team where a bad architecture decision stays expensive for years, that’s worth paying for.

Grok — brutal second opinion

This one might surprise people. Grok’s personality is calibrated differently to the others. It has fewer soft edges. Where Claude will often find a way to be constructive about a bad idea, Grok will tell you it’s a bad idea.

I use it specifically as adversarial review. After I’ve built something or made a design decision with Opus, I take it to Grok and ask what’s wrong with it. The quality of the critique isn’t always higher — but the willingness to deliver one bluntly is, and that’s what I need at that stage.

Claude Sonnet — delivery

Most of the actual code gets written here. Fast, capable, good context retention across a session. The SWE-Bench gap between Sonnet and Opus is now less than 1.5 points, which means for standard implementation work, the speed and cost profile of Sonnet wins.

This is the model I’m in most of the day for Claude Code sessions. It does the work.

GitHub Copilot — peer review and pull request generation

Copilot lives in the IDE. It sees the diff, knows the repo history, and does line-by-line code review in context. For PR generation and review commentary, having it operate at the file level with access to the surrounding codebase is a genuine advantage over copy-pasting into a chat interface.

It’s not my primary reasoning engine. But for the last mile of code review before merge, it earns its place.


Is this just me?

No. The multi-model approach has crossed from experimental into mainstream.

Advanced AI users now average more than three different models daily, choosing specific tools by task type. McKinsey published an enterprise workflow guide this year built around model specialisation — triage models, reasoning models, execution models, each matched to a task profile. Microsoft launched a “Model Council” feature in Copilot that routes between GPT-5.4, Claude Opus, and Gemini simultaneously.

CIO magazine ran a piece earlier this year called “From vibe coding to multi-agent AI orchestration: Redefining software development”. That’s not a niche publication running a speculative take — that’s the mainstream enterprise audience catching up to where the practitioners already are.

The pattern has a name now: model tiering. Fast, cheap models handle routine work (routing, classification, summarisation). Mid-tier reasoning models handle standard implementation. Frontier models get reserved for complex design, not burned on things that don’t need them.
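The tiering idea is simple enough to sketch in a few lines. The task labels and tier names below are illustrative placeholders, not any vendor’s real model identifiers:

```python
# Model tiering as an explicit routing table. Every name below is a
# placeholder -- substitute the actual model IDs your providers expose.
ROUTING_TABLE = {
    # routine work -> fast, cheap models
    "routing": "fast-tier",
    "classification": "fast-tier",
    "summarisation": "fast-tier",
    # standard implementation -> mid-tier reasoning models
    "implementation": "mid-tier",
    "code_review": "mid-tier",
    # complex design -> frontier models, used sparingly
    "architecture": "frontier-tier",
    "design": "frontier-tier",
}

def route(task_type: str) -> str:
    """Return the model tier for a task; default to mid-tier when unsure."""
    return ROUTING_TABLE.get(task_type, "mid-tier")
```

The point isn’t the code, it’s the discipline: which model handles which task becomes an explicit, revisable decision rather than a default you picked once and never questioned.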


The case against (and why I still do it anyway)

It’s fair to push back on this. Managing six different tools has overhead: different interfaces, different pricing models, different context management, different strengths to remember. There’s a reasonable argument that the cognitive load of model selection erodes the time you’d gain from using the best tool.

My answer is that the overhead front-loads. After two years of daily use, I don’t consciously decide which model to use any more than I decide which muscle to use when I pick something up. The routing is automatic. The habit is built.

The bigger risk is the one I started with: monoculture. One bad vendor decision — a price hike, a terms change, a capability regression — and your entire AI-assisted workflow is down. I’ve spoken to engineers who migrated off a single provider three times in 18 months for exactly this reason. Diversification is resilience.


We’re building our own benchmark

Here’s the part where I have to be honest about something.

The SWE-Bench scores I quoted above are real and useful. But they’re increasingly gamed. Labs know what’s on the test. The scores keep going up. The real-world usefulness doesn’t always follow.

I’ve been building AIMOT — the AI Model Operational Test. Named after the UK’s annual MOT roadworthiness check: a practical, pass/fail fitness test that doesn’t care how the vehicle performed in a lab. It cares whether it’s safe to drive.

The design principle that changes everything: no human interpretation. Every test must be scoreable from the output alone — numerical answer within a defined tolerance, binary fact check, code that runs or doesn’t, schema validation. If I can’t define the scoring before seeing the output, the test is disqualified.
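A minimal sketch of what that looks like in practice: two of the scoring modes (numerical tolerance and schema validation). The tolerances are illustrative, and this sketches the principle, not the actual AIMOT harness:

```python
import json
import math

def score_numeric(output: str, expected: float, tol: float = 0.01) -> bool:
    """Pass if the model's answer parses as a number within tolerance."""
    try:
        return math.isclose(float(output.strip()), expected, abs_tol=tol)
    except ValueError:
        return False  # non-numeric output is an automatic fail

def score_schema(output: str, required_keys: set) -> bool:
    """Pass if the output is valid JSON containing all required keys."""
    try:
        return required_keys <= set(json.loads(output))
    except (ValueError, TypeError):
        return False
```

`score_numeric("6.93", 6.93)` passes; `score_numeric("10.00", 6.93)` fails. No judgment call anywhere in the loop.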

I built the v1 test suite by doing something unusual: I asked five frontier models to write the questions. All 75 candidate tests, five models, 15 each. Then I verified every expected answer by hand.

Two of the five models submitted tests with wrong expected answers. ChatGPT got an error propagation calculation wrong (6.93, not 10.00 as claimed). Copilot produced a logic problem where the “correct” answer wasn’t correct. Both stated their wrong answers with complete confidence.

A full post on AIMOT is coming. For now: if you want a benchmark that tests models on tasks that actually matter in professional work — quantitative reasoning, logical falsification, real code bugs, domain knowledge — that’s what it’s designed to do. And the first results run is about to happen.


The principle

The vendor bias problem isn’t about which model is best. It’s about assuming the answer is fixed.

Models have different strengths. The benchmarks measure some of them. Daily use reveals the rest. The engineers who treat model selection as a skill — who deliberately match tool to task — are producing better work than the ones who picked a default in 2024 and never revisited it.

That 10-point SWE-Bench gap is real. It compounds over time. And if you’re not running your own benchmark, someone else’s numbers are the best you’ve got.


AIMOT Pro v1 results are next. The full 28-test suite, the first model run, and the scores. No cherry-picking.

When AI Leads You Down a Rabbit Hole

Flat illustration of unplugged cable and human head with brain, in burnt orange palette, symbolising humans regaining control from machines

By Steve Mitchell | Steve’s AI Diaries

It was supposed to be 30 minutes. Just a quick check-in on my n8n automation project after a 12-hour workday.

Instead, I got locked out of my Raspberry Pi server.

I spent the rest of that evening troubleshooting. Then I spent five more hours on Saturday going deeper down the rabbit hole—until I literally couldn’t remember what problem I was actually trying to fix anymore.

The actual issue? An expired token. Two clicks.

This is my second time learning this lesson the hard way. If you’re smarter than me, you’ll learn it from reading this instead.

What Was at Stake

This wasn’t just any server. This was the backbone of my entire Personal AI automation network:

  • The n8n workflow hub that automates my podcasts, notes, and Notion updates
  • The AI voice studio that turns my reflections into daily TTS episodes
  • The family assistant that syncs health, workouts, and journaling
  • The forex trading bot controller running live experiments
  • Unpublished projects like my J.A.R.V.I.S. personal assistant
  • All the backup scripts protecting everything above

One login error, and the whole system went dark.

No notes syncing. No podcast generator. No smart routines.
Just a dead login screen—and me, already exhausted from a full day of work.

I told myself it’d be fixed in 30 minutes. Just get it back online and call it a night.

How It Started

I was following an n8n tutorial, comparing my setup to someone’s YouTube walkthrough. My hosted version didn’t have the same features they showed onscreen.

No documentation. Nothing in the forums.

So I asked ChatGPT to help me configure it.

That should have been my first red flag. If there’s no documentation and no forum posts, there’s probably a reason.

But I trusted AI to lead the way.

A few config tweaks later, I was locked out completely. Every login attempt kicked me back to the setup screen.

And down the rabbit hole I went.

The Spiral

Here’s what the rest of my evening looked like—and then my entire Saturday:

Friday night:

  • Rebuilding containers
  • Reconfiguring OAuth settings
  • Checking permissions
  • Reviewing logs

Maybe I should just sleep on it and come back fresh…

Saturday morning:

  • Adjusting environment variables
  • Testing different authentication methods
  • Creating new instances
  • Comparing configurations

Saturday afternoon:

  • Reading Docker documentation
  • Trying completely different approaches
  • Backtracking through changes I’d made
  • Solving problems I’d created while solving other problems

By hour seven on Saturday, I had completely lost the thread. I wasn’t fixing the login issue anymore—I was fixing the fixes I’d attempted on Friday night.

I wasn’t debugging anymore—I was trying to prove I could fix it.

Why We Fall In

We don’t fall into rabbit holes because we’re careless. We fall in because we care.

We want to fix things.
We want to understand why.
We want control.

The very traits that make us effective—persistence, pride, precision—also make us vulnerable to what I call productive self-deception.

We convince ourselves we’re making progress when we’re actually just making noise.

And when you add AI to the mix? The spiral gets steeper.

When AI Becomes a Crutch

AI is extraordinary at local reasoning—pattern recognition, log analysis, generating commands.

But it lacks meta-awareness. It can’t say: “This problem isn’t worth solving right now.”

That’s our job.

AI doesn’t care about opportunity cost.
AI doesn’t feel frustration as a signal to pause.
AI doesn’t protect your time, energy, or focus—you do.

As soon as my system failed—already exhausted on a Friday night—I let AI take the wheel. I fed it errors, followed every suggestion, and outsourced my judgment.

By Saturday afternoon, I had lost the plot entirely.

The Real Cost

We think troubleshooting costs time. It doesn’t. It costs something far more valuable:

Momentum — Every hour in the weeds delays real work
Energy — You finish drained and demotivated
Perspective — You forget why you were fixing it
Trust — You doubt your tools, your instincts, yourself

I call this the Troubleshooting Tax—the hidden price of over-engineering.

The goal isn’t to fix everything. It’s to know what’s worth fixing.

How to Know You’re Looping

You’re not debugging anymore when:

  • You’ve been “almost there” for more than 45 minutes
  • You’re solving issues that weren’t part of the original goal
  • You’ve stopped documenting your changes
  • You’re chasing closure instead of progress
  • Frustration is rising faster than understanding

When that happens—stop.

You’re not learning. You’re looping.

How to Escape

After burning my entire Saturday (plus Friday evening) on a two-click fix, I built myself a system. Here’s what I do now before touching the keyboard:

1. Re-Anchor to Purpose

  • What value am I restoring by fixing this?
  • What’s my time budget?
  • What’s my rollback plan?

If the purpose feels fuzzy—stop.

2. Use a “Go/No-Go” Timer

Timebox your troubleshooting. If it’s not resolved in that window, document what you tried and move on.

Come back with fresh eyes, or escalate it.

3. Keep a Human in the Loop

Regularly ask yourself (or a colleague, or even AI):

“Are we still solving the right problem?”

If not, step back.

4. Protect Your Rollbacks

Backups and version control aren’t just technical safety nets—they’re psychological ones.

When you know you can undo, you stop being afraid to pause.

5. Review the Decision, Not Just the Bug

After you fix something, ask:

“At what point could I have realized this wasn’t worth the time?”

That reflection sharpens your intuition for next time.

The 60-Second Sanity Check

Before diving into any technical issue, I now run through this mental checklist:

Step 1 – Clarify the Why
What outcome am I protecting? Who depends on this system?

Step 2 – Bound the Effort
What’s my time budget? What’s my rollback plan?

Step 3 – Sanity Cross-Check
Has AI taken over my reasoning? Do I still understand why I’m doing this?

Step 4 – Stop or Continue
If I’m stuck or emotionally frustrated—stop. Write down what I know, walk away, revisit tomorrow.

This simple framework has saved me countless hours.

The Leadership Angle

This isn’t just a tech story—it’s a leadership story.

Teams fall into the same trap: automating, optimizing, refactoring—but losing sight of the value.

As leaders, we need to create cultures that celebrate stepping back, not just pushing through.

Reward the engineer who says, “Let’s stop here.”

Great engineers don’t just know how to solve problems. They know which ones matter.

My New Rule

After that day, I rebuilt my automation stack with one principle:

Every system must have a human circuit breaker.

For me, that means:

  • Git-based backups for all configs
  • Versioned containers
  • Daily snapshots
  • A visible note on my monitor
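The Git-based backup piece can be as small as a scheduled script. A minimal sketch, assuming the config directory is already a Git repository (the path is hypothetical, not my exact setup):

```python
import subprocess
from datetime import date
from pathlib import Path

CONFIG_DIR = Path("/home/steve/configs")  # hypothetical location

def snapshot(repo: Path = CONFIG_DIR) -> bool:
    """Commit the current state of the config repo; True if a commit was made."""
    subprocess.run(["git", "-C", str(repo), "add", "-A"], check=True)
    # `git diff --cached --quiet` exits non-zero when something is staged.
    staged = subprocess.run(
        ["git", "-C", str(repo), "diff", "--cached", "--quiet"]
    ).returncode != 0
    if staged:
        subprocess.run(
            ["git", "-C", str(repo), "commit", "-m", f"snapshot {date.today()}"],
            check=True,
        )
    return staged
```

Run it daily from cron or an n8n schedule, and every rollback is one `git checkout` away.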

That note says:

“Are you fixing the real problem—or the one you found while fixing it?”

That’s my new mantra.

Because the deeper lesson wasn’t about OAuth or Docker or expired tokens.

It was about judgment.

The Bottom Line

AI can multiply your reach.
Automation can expand your capacity.

But only you can decide what’s worth fixing—and when it’s time to stop.

The smartest command in your system isn’t sudo or git commit or docker restart.

It’s:

pause && breathe

Have you fallen down a troubleshooting rabbit hole recently? What pulled you out? I’d love to hear your stories in the comments.


Steve Mitchell | Steve’s AI Diaries
Exploring the messy, human side of building with AI

How AI Helped Me Build a Personalised Development Roadmap

AI curated reading list

You know the feeling: stacks of books, endless recommendations, but no clear path. I was stuck in that cycle — until I started experimenting with AI to build a personalised roadmap I actually use.


🚀 TL;DR: Copy the Prompt & Get Started

Want to skip the backstory?
👉 Copy this prompt into your AI tool (I used Claude) and generate your own roadmap today:

📜 Show the Full Prompt (click to expand)
You are my learning strategist and curriculum designer. I need you to create a personalized learning program based on my specific situation, not generic recommendations.

My Profile:
Role: [e.g., "Senior Engineering Manager", "Product Owner", "Tech Lead transitioning to management"]
Learning Tracks: [2-4 specific areas, e.g., "Technical Leadership", "Organizational Design", "Product Strategy", "Team Dynamics"]
Current Challenges: [e.g., "Struggling with cross-team alignment", "Need to scale engineering culture", "Moving from IC to manager"]
Time Commitment: [e.g., "1 book per week for 20 weeks", "2 books per month"]
Preferred Learning Format: [e.g., "100% books", "50/30/20 mix of books, podcasts, YouTube videos"]

Your Tasks:

Phase 1: Strategic Curation
Don't just organize existing content—curate the RIGHT resources for my situation:
- Select 15–25 resources (books, podcasts, or videos) based on my role, tracks, challenges, and preferred format
- Prioritize recent, evidence-based, and practically applicable content
- Include foundational classics only if they're still relevant
- Balance theory with actionable frameworks
- Consider diverse perspectives and avoid redundant content
- Make sure the mix of resources matches my Preferred Learning Format

Phase 2: Intelligent Sequencing
- Create an optimal learning progression (easy wins first, complexity builds)
- Ensure each resource builds on previous learnings
- Balance tracks so no area is neglected for too long
- Front-load resources that address my immediate challenges
- For each resource: one clear sentence on "What this solves for you"

Phase 3: Interactive Dashboard Creation
Build a modern, professional HTML dashboard with:

Design Requirements:
- Clean, contemporary styling with gradients and professional color scheme
- Smooth hover effects and subtle animations
- Mobile-responsive design that works on all devices
- Print-friendly styling for PDF export

Functional Features:
- Amazon Integration: All book titles must link to Amazon product pages
- Podcast Integration: Direct link to an official podcast page (Apple, Spotify, or publisher site)
- Video Integration: Direct link to YouTube playlists, lecture series, or course pages
- Format Column: Clearly show 📚 Book, 🎙️ Podcast, or ▶️ Video
- Progress Tracking: Interactive checkboxes that update completion percentage in real-time
- Visual Feedback: Completed resources get struck through with green highlighting
- Category Visualization: Checkmarks showing which tracks each resource addresses
- Week Counter: Visual week numbers for pacing
- Progress Header: Dynamic "X/Y resources completed (Z%)" in the header

Link Requirements:
- All resources must link to specific, existing content (not generic search results)
- Books → direct Amazon product pages
- Podcasts → direct official podcast pages (Apple/Spotify/publisher site)
- Videos → direct YouTube playlists, lecture series, or course pages
- Verify links are valid and relevant
- Do not provide search URLs or placeholders

Technical Specifications:
- Single HTML file with embedded CSS and JavaScript
- No external dependencies except Amazon/podcast/YouTube links
- Functional checkboxes that maintain visual state during session
- Hover animations on links and interactive elements

Output Structure:
1. Curation Rationale: Brief explanation of why these resources were chosen
2. Learning Roadmap: Markdown table with sequencing logic
3. Interactive Dashboard: Complete HTML file ready to use

Table Format:
| Week | Resource (Link) | Format | What This Solves For You | [Track 1] | [Track 2] | [Track 3] |

Quality Standards:
- All resources should be from 2015 or later unless timeless classics
- Prioritize content with practical frameworks over pure theory
- Include diverse authors and perspectives
- Ensure each resource directly addresses my stated challenges or growth tracks
- Respect my Preferred Learning Format when building the list
- All links must be validated and point to real, working resources (no placeholders or search results)

👉 Or see my live interactive dashboard here:
View Example Dashboard


What This Gives You

  • A personalised roadmap sequenced to your role and challenges
  • Clear learning tracks (e.g., Leadership, Product, AI, Team Dynamics)
  • A mix of books, podcasts, and YouTube videos — adapted to how you learn best
  • Why each resource matters, in one sentence
  • An interactive dashboard with checkboxes, pacing, and progress tracking
  • Exportable as PDF for old-school readers

How to Use It (3 steps)

  1. Paste the prompt into your AI tool.
  2. Fill in your role, challenges, tracks, time commitment, and preferred learning format.

    Example input:
    Role: Senior Engineering Manager
    Learning Tracks: Technical Leadership, Product Strategy, Team Dynamics
    Current Challenges: Struggling with cross-team alignment, need to scale engineering culture
    Time Commitment: 2 books per month
    Preferred Learning Format: 50% books, 30% podcasts, 20% YouTube videos
  3. Export your personalised learning roadmap and start working through it.

💡 Ten minutes in, you’ll have a sequenced plan instead of random guesses.


🌟 Why This Matters

Most of us consume knowledge randomly: we buy a book because someone recommended it, binge a YouTube series because it looked useful, or listen to podcasts without connecting the dots. That’s fine — but it’s rarely strategic.

This approach fixes that:

  • Strategic progression → fundamentals first, depth later.
  • Personalisation → tuned to your role, goals, and real gaps.
  • Multi-format flexibility → prefer podcasts on commutes? Tell the AI 50% podcasts, 30% YouTube, 20% books. Old-school reader? Set it to 100% books.
  • Living curriculum → as your challenges change, rerun the prompt.

It’s like having your own curriculum designer on call.


🧪 Behind the Build (The Messy Experiments)

Iteration 1 → 3: I started with ChatGPT critiquing my competencies against my role. We mapped my gaps, then I added books I’d read and recently bought. ChatGPT built a PDF roadmap — solid content, but ugly design.

Iteration 4 → 5: I took the roadmap to Claude, asking for a clean interactive dashboard. Claude nailed it on the first try. For a while, I shared a two-step process: ChatGPT for the roadmap, Claude for the interface. Eventually I simplified into one combined prompt.

Iteration 6: I shared the dual prompts with friends who are keen readers and lifelong learners. One was Leesa Drake, Head of People & Culture at Milliman, who in return passed me a powerful self-analysis prompt from Chris Broyles. Chris tested my method himself and came back with his own curated list. That was proof: the idea worked for more than just me — but the process still felt clunky. I may follow up with a self-analysis article if there’s interest.

Iteration 7 → 9: The dual prompts bothered me. ChatGPT won’t make it beautiful, but Claude can research the content. A few tweaks and we’re down to one prompt.

Iteration 10: I wanted to share this on LinkedIn. But Claude kept busting the character limits — no matter how I prompted it. That failure pushed me to start this blog instead.

Iteration 11: With ChatGPT’s help I set up the domain, HTTPS, and brand identity. Claude tuned the SEO. I trialled different WordPress themes until Claude nudged me toward a minimalist one that stuck.

Iteration 12: For the title image, I tried ChatGPT, Claude, and Gemini. ChatGPT’s version won. The SEO dashboard suggested improvements — Claude optimised the post in one pass.

Iteration 13: After the post had been live for 18 hours, I realised the most important part was missing: the story of how I got here. That’s why you’re reading this expanded version.

Iteration 14 → 17: Tested the main LLM alternatives (Grok, Gemini, Copilot, Perplexity); none were as good as Claude.


🎬 Credits

  • Inspiration → Role coaching from ChatGPT
  • Roadmap Creation → ChatGPT
  • Interactive Dashboard → Claude
  • Blog post → ChatGPT
  • Blog Image → ChatGPT
  • SEO → Claude
  • Idea validation → Corey Grigg
  • Idea champion → Leesa Drake
  • First user test → Chris Broyles

🤦 Bloopers

  • No clear plan (organic development) meant where we ended up was far from where we started.
  • Lots of time wasted tweaking the wrong WordPress theme
  • ChatGPT’s attempts to “beautify” the roadmap were clunky.
  • Claude repeatedly broke LinkedIn’s character limit.
  • Manual SSL setup on budget hosting was painful.
  • I tried 3 AIs for the header image → ChatGPT’s was best.
  • Tested on Perplexity.ai → Creates the plan but exceeds the free tier for my tests
  • Tested on Grok → Is my favourite version, but viewing the dashboard isn’t intuitive to all
  • Tested on Copilot → Creates the plan, but the dashboard is code you’d have to host.
  • Tested on Gemini → Creates the plan, but the dashboard is code you’d have to host.

🎬 Sequels: Possible Follow-Up Experiments

Every experiment sparks more experiments. Here are three directions this could go next:

  1. Centralised Learning Dashboard in NotebookLM
    • Load all curated resources (books, podcast episodes, YouTube transcripts) into NotebookLM.
    • Ask questions directly across your entire personal curriculum.
  2. Workflow Automation with n8n + Airtable
    • Each time you generate a new learning roadmap, have n8n automatically log it to Airtable.
    • Creates a searchable library of learning paths (for you or your team).
  3. Database-Connected Dashboard
    • Upgrade the static HTML dashboard into one that connects to a real database.
    • Track progress permanently, sync across devices, and generate analytics.

💡 If there’s interest, I’ll turn one of these into the next experiment and share the results here.


Ready to Try It?


Still here?

Want to see more strategic approaches to professional development? Subscribe to stevesaidiaries.com for weekly insights on leadership, technology, and intentional growth.