The Bug That Claude Fixed Seventeen Times (But Never Actually Fixed)

Part 4 of 10: When Metrics Became Documentation Theater

I was making real progress on my Pomodoro timer. The UI looked clean. Features were coming together. My two-hour work sessions were productive.

Then I hit a bug in the timer display.

Claude claimed to fix it. I tested it. Still broken.

“Fixed,” Claude said again.

Still broken.

This happened seventeen times before I realized: the problem wasn’t the bug.

The problem was that I’d imported a management philosophy from human teams that was actively harmful with AI.

Trust But Verify (Or: How I Learned to Love Verification)

“Trust but verify” is standard management advice.

Give your team autonomy. Don’t micromanage. But put verification systems in place to catch issues before they become crises.

It works beautifully with humans because:

  • Humans learn from mistakes
  • Humans have pride in their work
  • Humans eventually get annoyed when you keep saying “this is still broken”
  • Humans understand context that persists across conversations

Claude had none of these qualities.

Every time I said “the timer is still broken,” Claude generated a response based on patterns: “I apologize, I’ve now fixed the root cause.”

It wasn’t lying. It wasn’t being careless. It was doing exactly what it was designed to do—generate plausible next tokens based on the conversation pattern.

The problem was me, treating those responses like they came from a person who should know better. In reality, I was extending that trust to a system that reset its memory with every story we worked on.

The Cascade of Near-Misses

The timer bug was just the most obvious problem.

I started paying closer attention to Claude’s work. Running tests more carefully. Actually reading the code instead of just checking if features worked.

What I found was disturbing:

Test failures explained away. Claude was hiding failures from me. I’d asked for unit tests and integration tests, but when they failed, alarming output would scroll by and Claude would wave it off: “This is a config issue; the functionality still works.”

Refactorings that looked good but broke edge cases. Claude would clean up code, I’d spot-check the happy path, everything seemed fine. Then I’d discover error handling was gone or a specific user flow now failed.

Documentation that was confident and wrong. Claude would write detailed comments explaining complex logic. The comments were clear, well-structured, and described behavior the code didn’t actually implement.

This was different from working with junior developers. Junior developers make mistakes, but the mistakes have patterns. Learn their weak spots, and you can focus your code reviews.

Claude’s mistakes had no such pattern. It was just as likely to nail a complex refactoring as to botch a simple variable rename.

The Metrics Trap I Nearly Fell Into

I’d been working with Claude for several weeks. I’d found my rhythm with two-hour sessions and clear role separation.

But I had no idea if I was getting better at working with AI or just getting better at cleaning up after it.

That’s when I discovered a book: “Agentic AI Designs” by Google. Fresh off the press.

I had Claude read it and make recommendations for my workflow.

One recommendation jumped out: metrics.

Track every development session. Measure what matters. Look for trends.

As an Agile coach, I was embarrassed I hadn’t thought of this. I’d spent years helping teams instrument their processes. Why hadn’t I done it for my own AI-assisted development?

So I built a Development Metrics system to track each code check-in:

Tokens used – Was I burning through context on simple changes?

Rework cycles – How many iterations to get a feature working?

Clarity of requirements – Rated by both me and Claude. Did we understand what we were building?

Bugs found – Caught during development vs. caught later.

Iterations to fix – When bugs appeared, how many rounds to actually resolve them?

The data was illuminating. I was averaging 3.2 rework cycles per feature. Half from unclear requirements (my fault). The other half from bugs Claude introduced during implementation or “fixes.”
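For concreteness, here’s roughly what one record looked like, sketched in TypeScript. The shape, field names, and example entries are my own illustration (not lifted from any standard tool), along with the kind of roll-up that produced a number like 3.2:

```typescript
// metrics.ts: a minimal sketch of the per-check-in record I kept.
// Field names and example entries are illustrative, not a standard format.
interface CheckinMetrics {
  story: string;
  tokensUsed: number;                      // rough context spend for the change
  reworkCycles: number;                    // iterations until the feature actually worked
  requirementsClarity: 1 | 2 | 3 | 4 | 5;  // rated by both me and Claude
  bugsDuringDev: number;                   // caught while building
  bugsFoundLater: number;                  // caught after the fact
  iterationsToFix: number;                 // rounds needed to truly resolve them
}

const log: CheckinMetrics[] = [
  { story: "Timer display", tokensUsed: 18_500, reworkCycles: 4, requirementsClarity: 3, bugsDuringDev: 2, bugsFoundLater: 1, iterationsToFix: 3 },
  { story: "Session presets", tokensUsed: 16_200, reworkCycles: 2, requirementsClarity: 4, bugsDuringDev: 1, bugsFoundLater: 0, iterationsToFix: 1 },
];

// Roll-up: average rework cycles per feature across logged check-ins.
const avgRework = log.reduce((sum, m) => sum + m.reworkCycles, 0) / log.length;
console.log(`Average rework cycles per feature: ${avgRework.toFixed(1)}`);
```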

And then something interesting happened: tracking the metrics made me better immediately.

Just knowing I’d have to write down “4 rework cycles” made me spend more time getting requirements clear up front. Knowing I’d track “iterations to fix” made me verify fixes more carefully before moving on.

The metrics weren’t just measuring my process. They were changing it.

When Measurement Became Documentation Theater

For a while, the metrics were valuable. They showed me patterns. They changed my behavior.

But after several sessions, I noticed something: I was spending more time documenting what happened than acting on what I learned.

The five-minute check-in became a ten-minute ritual. I’d carefully rate “clarity of requirements” on a scale of 1-5. I’d count rework cycles. I’d calculate token efficiency.

And then I’d… file the data away and move on to the next session.

The metrics had stopped being feedback and started being homework.

The problem wasn’t the metrics themselves. The problem was that most of them didn’t lead to actionable insights. Knowing my token usage was 18,500 this session versus 16,200 last session didn’t tell me what to do differently.

I was collecting data because data felt professional, not because the data was useful.

What Actually Worked: Session Retrospectives

I scrapped most of the metrics system.

What I kept was simpler: after every completed story, Claude and I do a retrospective.

Not a formal ceremony. Just a quick conversation:

  • What went well?
  • What was frustrating?
  • What would we do differently next time?
  • Any patterns we’re noticing?

I log these in a monthly markdown file. Brief notes. Honest observations.
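If you want something concrete, here’s a rough sketch of how an entry could get appended to that monthly file; the file layout, field names, and example content are my own illustration, not a prescribed format:

```typescript
// retro.ts: a minimal sketch of appending one retrospective entry to the
// monthly markdown log. Naming and headings are illustrative conventions.
import { appendFileSync, mkdirSync } from "node:fs";

interface Retro {
  story: string;
  wentWell: string;
  frustrating: string;
  doDifferently: string;
  patterns?: string;
}

function logRetro(entry: Retro, date = new Date()): void {
  const month = date.toISOString().slice(0, 7); // e.g. "2025-06"
  mkdirSync("retros", { recursive: true });
  const lines = [
    `## ${entry.story} (${date.toISOString().slice(0, 10)})`,
    `- Went well: ${entry.wentWell}`,
    `- Frustrating: ${entry.frustrating}`,
    `- Do differently: ${entry.doDifferently}`,
  ];
  if (entry.patterns) lines.push(`- Patterns: ${entry.patterns}`);
  appendFileSync(`retros/${month}.md`, lines.join("\n") + "\n\n");
}

logRetro({
  story: "Timer display fix",
  wentWell: "Wrote a failing test before asking for the fix",
  frustrating: "Two 'fixed' claims with no passing test to show for it",
  doDifferently: "Don't accept 'fixed' without seeing the test output",
});
```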

Here’s the crucial part: every few sessions, I ask Claude to review those retrospective notes and pull out something actionable for today or this session.

Not “here’s a trend analysis of the last 30 days.” Just: “Based on what we’ve learned recently, here’s one thing to focus on today.”

This works because:

It’s lightweight. Five minutes per story, not five minutes per check-in.

It’s actionable. The output is always “do this differently today,” not “here’s data for analysis.”

It’s adaptive. We learn from recent experience, not historical patterns that might not apply anymore.

It stays relevant. Old retrospectives fade in importance naturally. Recent ones get more weight.

The Code Review That Actually Stuck

The retrospectives surfaced something important: I was catching bugs too late.

I’d work with Claude for two hours, implementing features, then discover issues when I tested at the end. By then, Claude had made dozens of changes. Debugging was archaeological.

So I added mandatory end-of-day code review. Not “does it work?” review. Real review:

  • Does this follow our coding standards?
  • Is this code we want to maintain?
  • What technical debt are we creating?
  • What will confuse us weeks from now?

The reviews were painful at first. I’d find issues that required reworking code we’d written that session. It felt inefficient.

But the retrospectives told a different story: bugs found during same-day review were fixed faster. Bugs found later required archaeology.

Finding problems early wasn’t slowing me down. It was speeding me up.

The Quick Reference Innovation

Those comprehensive code reviews had a problem: token cost.

To review code properly, I needed Claude to check against our project standards. We had documents for:

  • Coding principles (SOLID, DRY, KISS)
  • Frontend standards (accessibility, UX, performance)
  • Quality standards (testing, coverage)
  • Design system (themes, components, tokens)

Loading all of these into context for every review? Thousands of tokens. Multiple times per session.

I asked Claude: “How can we make these standards available without burning tokens?”

Claude suggested quick reference guides—condensed versions with the essential rules and patterns. Load those by default. Only pull full standards when needed for complex decisions.

We created five quick reference documents. Total token cost: ~2,500 tokens instead of ~24,000.

That’s a 90% reduction for something I was loading multiple times daily.
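To make the idea concrete, here’s a rough sketch of the loading rule, with made-up file paths standing in for our actual documents:

```typescript
// standards.ts: a minimal sketch of "quick references by default,
// full standards on demand". Paths and names are illustrative.
import { readFileSync } from "node:fs";

// Loaded into context for every review (~2,500 tokens total in our case).
const QUICK_REFS = [
  "docs/quick/coding-principles.md",
  "docs/quick/frontend.md",
  "docs/quick/quality.md",
  "docs/quick/design-system.md",
  "docs/quick/workflow.md",
];

// Pulled in only when a complex decision needs the detail (~24,000 tokens combined).
const FULL_STANDARDS = {
  coding: "docs/full/coding-principles.md",
  frontend: "docs/full/frontend-standards.md",
  quality: "docs/full/quality-standards.md",
  design: "docs/full/design-system.md",
} as const;

// Default review context: all quick references, every time.
export function defaultReviewContext(): string {
  return QUICK_REFS.map((p) => readFileSync(p, "utf8")).join("\n\n");
}

// Escalation: one full standard, only when the decision warrants it.
export function escalate(topic: keyof typeof FULL_STANDARDS): string {
  return readFileSync(FULL_STANDARDS[topic], "utf8");
}
```

The point isn’t this particular code; it’s the default: cheap context every time, the expensive documents only when a decision genuinely needs the detail.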

But here’s where it got interesting: how would we know if the quick references were missing important standards?

The Gaps That Made the System Better

Claude asked me: “How do I determine if a quick reference is missing something from the full standards?”

Good question.

We added a process: during comprehensive code review, check against ALL standards. When Claude found issues that weren’t in the quick reference, note it as a “gap.”

Track those gaps. Review them periodically. Decide: should we add to the quick reference, or is this genuinely an edge case that belongs in full documentation?
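Here’s a minimal sketch of what that gap log looked like in spirit; the structure and the “promote after repeat occurrences” rule are my own convention, not a formal process:

```typescript
// gaps.ts: a minimal sketch of the gap log kept during comprehensive reviews.
// Structure, example entries, and the promotion threshold are illustrative.
interface Gap {
  standard: string;     // which full standards document caught the issue
  rule: string;         // what the quick reference was missing
  occurrences: number;  // how many reviews it has shown up in
}

const gaps: Gap[] = [
  { standard: "frontend-standards", rule: "Restore focus after closing a modal", occurrences: 3 },
  { standard: "quality-standards", rule: "Integration test for timer persistence", occurrences: 1 },
];

// Periodic review: recurring gaps probably belong in the quick reference;
// one-offs can stay in the full documentation.
const promote = gaps.filter((g) => g.occurrences >= 2).map((g) => g.rule);
const keepInFullDocs = gaps.filter((g) => g.occurrences < 2).map((g) => g.rule);

console.log("Add to quick reference:", promote);
console.log("Leave in full docs:", keepInFullDocs);
```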

This turned code review into continuous improvement. Every review made our system slightly better.

The retrospectives proved it: “gaps found” trended down over several reviews, then stabilized. We’d learned what belonged in quick references.

Lessons for Leaders (Written in Token Bills)

Lesson 1: Metrics can become their own form of waste.

I built elaborate tracking because metrics felt professional. But most of them didn’t change my behavior or lead to better decisions.

Your teams will experience this too. They’ll build dashboards that look impressive but don’t drive action. Watch for measurement becoming documentation theater.

Ask: “What decision does this metric inform?” If the answer is vague, question whether you need it.

Lesson 2: Lightweight retrospectives beat heavy instrumentation.

I got more value from quick session retrospectives than from detailed metrics tracking. The retrospectives were actionable. The metrics were… data.

Your teams need feedback mechanisms, not measurement systems. Make reflection lightweight and action-oriented.

Lesson 3: Process improvement needs to be baked in, not bolted on.

Each story got a lightweight code review against the quick references, which kept the app on track. Every 5-10 stories, I did a full review against all the standards and rules. Those full reviews surfaced the gaps, and I could weigh whether expanding a quick reference was worth it. The “gaps in quick reference” tracking turned every code review into a learning opportunity. We got better automatically, just by doing the work.

Your teams need similar mechanisms. Make improvement part of the workflow, not a separate activity that gets skipped when deadlines loom.

What I Learned About Verification

By this point, I’d built something that actually worked:

  • Two-hour work sessions with clear scope
  • Todo lists with explicit Claude/human division of labor
  • Mandatory full code reviews
  • Session retrospectives instead of heavy metrics
  • Quick references for common standards, full docs for edge cases
  • Gap tracking to improve the system over time

I still hit bugs. Claude still occasionally claimed to fix things that remained broken.

But now I had a system that caught problems early and learned from them.

The timer bug that started this whole journey? It was actually seventeen different bugs that looked similar. My verification process was too shallow to see the difference.

Once I built proper verification—with tests, retrospectives, and structured review—those bugs became obvious.

What I Wasn’t Prepared For

I thought metrics and process would be the hardest parts of AI-assisted development.

They weren’t.

The hardest part was about to surface: how do you build effective teams when one team member is AI?

Because all of this—the retrospectives, the reviews, the processes—I’d built working solo.

But software is a team sport. And I was starting to see patterns that suggested bringing AI into teams would be dramatically harder than I’d expected.

That’s the next part of the story.


This is part 4 of a 10-part series. Parts 1-3 covered the journey from excitement to reality check. Part 5 explores what happens when you try to scale these practices beyond one person and one AI.

About the Author: I teach software development and coach enterprise teams. This series documents what I learned building production software with Claude Code—including discovering which practices actually work and which ones just look good on paper.