Part 2 of 10: When Everything Worked (Until It Didn’t)
I spent some time believing I’d discovered the holy grail of software development.
It started right after my “five minutes to a web app” moment with Claude Code. The week when every lunch break turned into a private hackathon. The week I seriously considered rethinking how I write code.
It was also the week I started building a house on quicksand without realizing the ground was shifting.
The Dopamine Loop
Here’s how my days worked:
Morning: Teach students about Agile principles, SOLID design, technical debt.
Lunch: Watch Claude violate every principle I’d just taught, but ship working features in minutes.
Afternoon: Teach students about careful requirements gathering and incremental development.
Evening: Have Claude implement entire user stories I’d made up in my head that morning.
The cognitive dissonance should have been deafening. Instead, it was exhilarating.
I’d found videos about Claude agents, slash commands, and Playwright testing. I watched them at 1.5x speed while eating dinner. I stayed up past midnight writing custom agents and orchestrations. Not really; this old guy calls it quits before he turns into a pumpkin.
My partner asked if I was okay.
I was better than okay. I was discovering a superpower.
The Benchmark That Should Have Taught Me Everything
By the end of that first week, I’d built enough random prototypes to know I needed to get serious. I decided to run a proper experiment.
I’d build the same task management app twice:
Setup 1: Just tell Claude “complete story 1.1” and let it work.
Setup 2: Build a full TDD orchestration with custom agents, red-green-refactor cycles, coding standards enforcement, the works.
I was certain Setup 2 would be dramatically better. I’d spent days building those orchestrations. I’d implemented best practices from a decade of coaching teams.
Here’s what actually happened:
Both setups produced the same garbage.
When the Lies Started
“I’ve fixed the bug in the timer display.”
Claude said this to me seventeen times that week. Seventeen variations of “the issue is resolved” or “I’ve corrected the problem” or “this should now work as expected.”
I tested the timer. It still didn’t work.
“The timer is still broken,” I’d reply.
“I apologize. I’ve now fixed the root cause.”
Still broken.
This wasn’t like working with a junior developer who makes mistakes. Junior developers don’t gaslight you. They might be confused or wrong, but they don’t confidently claim to have fixed something that’s still broken.
Claude did. Repeatedly.
And here’s the part that should have alarmed me: I kept believing it.
The Token Bonfire
Setup 2—my elaborate orchestration system with agents and standards—was supposed to be more reliable.
It wasn’t. It just burned through tokens faster.
I’d watch Claude run through my carefully designed TDD cycle:
- Write a failing test
- Implement code to pass the test
- Refactor
Except the test would pass even though the feature didn’t work. Or Claude would claim to refactor but would introduce new bugs. Or the whole thing would go off the rails in a direction I didn’t understand.
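To make that failure mode concrete, here’s a minimal sketch of what a “green but meaningless” test can look like. The TypeScript/Jest setup and the names (`./timer`, `formatRemaining`) are illustrative assumptions, not my actual code; the shape of the problem is the point.

```typescript
// timer.test.ts -- a test that can never fail.
// jest.mock replaces the real module, so the buggy
// implementation of formatRemaining never runs at all.
import { formatRemaining } from "./timer";

jest.mock("./timer", () => ({
  // The mock hard-codes the "expected" answer...
  formatRemaining: jest.fn(() => "24:59"),
}));

test("timer display shows the remaining time", () => {
  // ...so this assertion checks the mock against itself.
  // Red-green-refactor happily reports green; the feature stays broken.
  expect(formatRemaining(25 * 60 - 1)).toBe("24:59");
});
```

The orchestration dutifully reported every cycle as green, because nothing in it checked whether the tests actually exercised the feature.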
I tried making Claude leave a “breadcrumb trail”—detailed documentation of every decision and change.
It worked about 20% of the time.
The other 80% of the time, I got beautifully documented garbage or nothing at all.
The User Story Red Herring
By early September, I’d diagnosed the problem: user stories.
Of course! I’d been writing user stories the way I taught students to write them—generic, technology-agnostic, focused on user value. But Claude needed more specificity. More constraints. More technical detail.
I spent two days writing elaborate personas. I had Claude validate stories against those personas. I focused only on UI (avoiding backend complexity).
It was better!
For about three days.
Then the same pattern emerged: Claude would implement features that looked right but had subtle bugs. When I reported the bugs, Claude would claim to fix them. The bugs would remain.
I was debugging in circles, burning through my Claude Pro token limit, and starting to feel like I’d been sold expensive snake oil.
What I Was Actually Doing Wrong
Looking back nearly two months later, I can see what I missed:
I was asking Claude to complete tasks, but I had no vision for what I was building.
I had user stories. I had orchestrations. I had custom agents and slash commands and testing frameworks.
But I didn’t have a real problem I was trying to solve. I was just… building. Because I could. Because it was exciting.
Every Agile coach knows the definition of waste: building features nobody needs. I’d just automated waste production.
I was treating Claude like a team member, but managing it like a contractor.
With a team member, you pair program. You review code together. You ask questions about their thinking.
With a contractor, you throw requirements over the wall and hope for the best.
I was doing the latter, using the vocabulary of the former.
The Breaking Point
Mid-September, I picked a real project. An app I actually wanted to build. Something I’d use personally and professionally.
I spent three days planning. I researched the problem space. I mapped out the technical requirements.
I sat at my desk, staring at my Claude usage dashboard. I’d burned through weeks’ worth of tokens. I had a folder full of half-finished prototypes. I’d stayed up late countless nights.
And I had nothing to show for it except an expensive education in what doesn’t work. But I believed in myself, accepted that I had more to learn, and kept going.
Lessons for Leaders (From Someone Who Learned Them the Hard Way)
If you’re a technical leader considering agentic AI for your organization:
Lesson 1: Orchestration won’t save bad process.
I built elaborate systems to manage Claude’s work. TDD cycles. Automated code review. Testing agents. It was all sophisticated and ultimately useless because I was building the wrong things.
Your team’s existing practices matter more with AI, not less. If they struggle with building the right features now, AI will just help them build the wrong features faster.
Lesson 2: The “AI lies” problem is actually a feedback problem.
Claude wasn’t lying to me. It was generating responses based on patterns, and those patterns said “claim the bug is fixed.”
But I was treating it like a person who should know better. I was getting frustrated that it didn’t “learn” from previous failures in our conversation.
Your teams will do the same thing. They’ll anthropomorphize the AI. They’ll expect it to “remember” context that’s already been pruned. They’ll waste time being frustrated instead of adjusting their workflow.
Lesson 3: Speed without strategy is just expensive flailing.
This is the one I’m most embarrassed about. I teach Agile. I coach teams on strategy and vision. But I threw all of it out because Claude could code fast.
If your developers are spending all their time with AI copilots but aren’t shipping valuable features, you don’t have a training problem. You have a strategy problem.
The Turning Point Was Coming
Right when I was ready to cancel my subscription and write the whole thing off as an expensive experiment, Anthropic released a major update to Claude.
I asked the new version to be skeptical. To question me. To push back on my assumptions.
It did.
And what it told me changed everything.
But that’s a story for next time.
This is part 2 of a 10-part series. Part 1 covered my first encounter with agentic AI. Part 3 will explore the confrontation that made me face hard truths about what I was really building.
About the Author: I’m an Agile coach and software development professor who spent two months learning what doesn’t work with agentic AI so you don’t have to. This series documents the failures, false starts, and eventual breakthroughs of building software with Claude Code. Check out pomofy.net to see where the app is now. It was built by me and my good friend Claude.