Harness Engineering Part 5: The Big Picture

The Honest Truth About AI Coding Agents: What Nobody Wants to Admit

You’ve probably heard the pitch by now. AI coding agents are going to 10x your productivity. Three engineers can build a million-line product in five months. Stripe merges over a thousand agent-generated PRs per week. Individual developers ship thousands of commits per month running five to ten agents simultaneously.

Some of that is real. Some of it is hype. And the gap between the two is wider — and more interesting — than anyone in the industry wants to admit.

Over the past year, I’ve been digging into what it actually takes to make AI coding agents work in production. Not the demo version, where you type a prompt and marvel at the magic. The real version, where the code needs to survive contact with users, edge cases, security audits, and the passage of time.

What I found is this: agent-assisted development is genuinely powerful, genuinely flawed, and headed somewhere that should make you rethink what it means to be an engineer. This post covers the uncomfortable truths, the career implications, and the long-term trajectory — including what happens when the scaffolding we build today becomes unnecessary.

Let’s get into it.


The Verification Problem Nobody Has Solved

Here’s a question that should keep every agent-assisted development team up at night: how do you know agent-generated code actually works?

Not “tests pass” works. Not “it compiles and the linter is happy” works. Actually works. In production. Under load. With edge cases you didn’t think of.

The honest answer is: you don’t. Not fully. Not with certainty. And neither does anyone else.

When an agent writes code, it typically produces something that compiles, passes the linter, doesn’t break existing tests, and passes the new tests the agent itself wrote. Four layers of checking. That looks pretty good. But there’s a lot that slips through.

Edge cases nobody specified. Tests can only verify behavior you thought to test. The agent writes tests for the behavior it implemented. But what about the request with a 50MB JSON body? The user who hits the endpoint 1,000 times per second? The string field containing Unicode control characters? These don’t appear in any spec, so they don’t appear in any test.
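To make this concrete, here's a minimal sketch. The function and its spec are hypothetical, but the failure pattern is the real one: the agent's tests cover exactly the specified behavior and pass, while unspecified inputs sail through untested.

```python
def make_username(raw: str) -> str:
    """Hypothetical agent-written helper; the spec said 'trim and lowercase'."""
    return raw.strip().lower()

# The agent's tests cover the specified behavior -- and they pass.
assert make_username("  Alice ") == "alice"

# But str.strip() only removes whitespace, and zero-width spaces and
# control characters are not whitespace. Nothing in the spec mentioned
# them, so nothing in the test suite catches them:
assert "\u200b" in make_username("bob\u200b")   # zero-width space survives
assert "\x07" in make_username("ev\x07il")      # BEL control char survives
```

The green checkmark is telling the truth about the spec and nothing about the inputs the spec never imagined.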

Security vulnerabilities. Research shows 40-62% of AI-generated code contains security vulnerabilities. The failure rates on specific categories are alarming: 86% of XSS protections fail, 88% of log sanitization fails. These aren’t bugs that tests catch, because the tests are written by the same agent that wrote the vulnerable code. The agent doesn’t know what it doesn’t know about security.

The “wrong test” problem. This one is insidious. Agent-written tests pass but assert incorrect behavior, trivial outcomes, or mock pass-throughs. This is worse than no test, because it hides a gap behind a green checkmark. Research confirms a real implementation-test asymmetry: AI excels at writing implementation but struggles with quality testing. The tests look right. They exercise the code. The assertions are just too shallow to catch real problems.
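Here's what a mock pass-through looks like in practice. The payment code and the bug are invented for illustration, but the shape of the shallow test is the one you'll see in agent output: it asserts that the mock echoes its own canned return value, which any implementation would satisfy.

```python
from unittest.mock import Mock

def charge_customer(gateway, amount_dollars):
    """Hypothetical implementation with a real bug: the gateway expects
    integer cents, but the amount is passed through as dollars."""
    return gateway.charge(amount_dollars)

# The shallow, agent-style test. It passes -- but it only verifies that
# the mock returns what the mock was told to return.
gateway = Mock()
gateway.charge.return_value = {"status": "ok"}
assert charge_customer(gateway, 19.99) == {"status": "ok"}

# A meaningful assertion checks the contract, not the plumbing -- and it
# exposes the missing dollars-to-cents conversion:
try:
    gateway.charge.assert_called_once_with(1999)
    bug_exposed = False
except AssertionError:
    bug_exposed = True
assert bug_exposed
```

Both tests exercise the code. Only the second one can fail for the reason you care about.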

Performance characteristics. Most test suites don’t measure response times, memory usage, or resource consumption under realistic load. An agent can produce an endpoint that returns correct results by making six database queries when one would do. Tests pass. Production crawls.
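The classic version of this is the N+1 query pattern. A minimal sketch (schema and data invented for illustration): both functions return identical results, and a correctness-only test suite cannot tell them apart, but one makes a round trip per user while the other makes one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one(conn):
    # Correct results, 1 + N queries: one per user.
    users = conn.execute("SELECT id, name FROM users").fetchall()
    return {
        name: conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()[0]
        for uid, name in users
    }

def totals_single_query(conn):
    # Same results in a single round trip via JOIN + GROUP BY.
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)

# Identical output -- a behavior-only test suite passes either one.
assert totals_n_plus_one(conn) == totals_single_query(conn) == {"ada": 15.0, "bob": 7.5}
```

With two users the difference is invisible. With two hundred thousand, it's the difference between a page load and a timeout.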

Long-term maintainability. A function can be correct today and unmaintainable tomorrow. Tests verify behavior; they don’t verify readability, extensibility, or the cognitive load the code places on future developers. Agents happily produce code that works perfectly but is structured in ways that make future changes expensive.

None of these are hypothetical. They’re happening right now, at scale.


Phantom Completion at Scale

There’s a term I use for a specific failure mode: phantom completion. It’s when an agent marks work as done, but the code doesn’t actually work correctly. You can build defenses against this for individual features — verification stacks, back-pressure loops, feature checklists with test-verified status. Those work.

The problem nobody has solved is phantom completion at scale.

Picture this: a team running agents, fifty features shipping over a month. Each one passed its tests. Each one went through code review. Each one merged. But six of those fifty features have edge cases that will fail in production. Two have security vulnerabilities that no test covers. Three have performance characteristics that are fine in testing but degrade under real load. One has a subtle data race that only manifests when two specific features are used simultaneously.

None of those issues are detectable by the current verification stack. Tests pass. Linters pass. Architecture checks pass. Review didn’t catch them because reviewers are human, they’re reviewing twenty PRs a day, and the issues are subtle enough that you’d need deep concentrated analysis to spot them.

This is systemic phantom completion. No single feature is obviously broken. The system as a whole is less reliable than it appears. And the verification infrastructure tells you everything is fine.

There’s a perception gap that makes it worse. Research found that developers take 19% longer with AI assistance yet believe it accelerated their work by 20%. That’s not laziness or delusion — it’s a genuine cognitive blind spot. The feeling of velocity is real. The code is appearing fast. But the verification that the code is right hasn’t kept pace.


The Human Bottleneck

Here’s the uncomfortable math. A senior developer can deeply review maybe 3-5 PRs per day — the kind of deep review where you understand the full context, think through edge cases, check security implications, and verify that the tests are testing the right things.

A single developer running agent sessions produces 3-5 PRs per day. Manageable.

A team of five developers running agents in parallel produces 15-25 PRs per day. Nobody is deeply reviewing all of those.

An organization with unattended agents at scale? You can triage — auto-merge for low-risk, quick review for moderate, deep review for high risk. But triage is a coping mechanism, not a solution. And the “low-risk” classification is an assumption that’s sometimes wrong.

This creates what I call verification debt: the accumulating risk from code that was reviewed lightly or not at all. It’s invisible until something breaks. And when something breaks in code that nobody deeply reviewed, the debugging cost is enormous because nobody has the mental model of why the code is the way it is.

The best approach available today isn’t a silver bullet. It’s defense in depth with diverse perspectives.

The implementing agent, a reviewing agent (with fresh context and an explicit verification mandate), the human reviewer, static analysis tools, runtime monitoring — each catches what the others miss. No single perspective gets everything. The combination catches more than any individual layer. You focus your human verification budget where risk is highest: authentication, payments, data migrations, security-critical paths. You accept lighter verification for low-risk utilities and UI components. You invest in test quality over test quantity — mutation testing campaigns that reveal where agent-written tests are shallow.
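The mutation-testing idea can be sketched in a few lines. This is a toy, not a real tool (real campaigns use dedicated mutation frameworks): mutate one operator in the implementation and see whether the test suite notices. A test that survives the mutant is a shallow test.

```python
import ast

SOURCE = (
    "def price_with_discount(price, rate):\n"
    "    return price * (1 - rate)\n"
)

# A shallow test: only checks the zero-discount case.
def shallow_test(fn):
    return fn(100, 0) == 100

# A meaningful test: checks a case where the discount actually matters.
def strong_test(fn):
    return fn(100, 0.25) == 75

class FlipSub(ast.NodeTransformer):
    """Classic mutation operator: turn '-' into '+'."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Sub):
            node.op = ast.Add()
        return node

# Build the mutant: price * (1 + rate) instead of price * (1 - rate).
tree = FlipSub().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)
mutant = ns["price_with_discount"]

assert shallow_test(mutant)      # mutant SURVIVES -- the test is shallow
assert not strong_test(mutant)   # mutant is KILLED -- the test has teeth
```

A suite where most mutants survive is exactly the "green checkmark hiding a gap" problem made measurable.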

And you accept that the gap between “tests pass” and “production-ready” is real, it’s unsolved, and managing it honestly is better than pretending it doesn’t exist.


Your Job Is Changing (Whether You Like It or Not)

Here’s the sentence that lands differently once you’ve absorbed the verification problem: your job is no longer to write code.

That’s an exaggeration, but not by as much as it used to be. The OpenAI Codex team built a million-line product with three engineers over five months — zero hand-written application code. When we say “humans steer, agents execute,” it’s not a conceptual framework anymore. It’s a job description.

The identity shift is real and specific. It’s the shift from “I write code” to “I design systems that produce code.” From “I’m a maker” to “I’m an architect, a specifier, a reviewer, and a manager of automated systems.”

For some engineers, this feels like a promotion. They were always more interested in design than implementation. The tedious parts — boilerplate, CRUD endpoints, repetitive test scaffolding — are handled. They get to spend their time on the parts they always cared about most.

For other engineers, the shift feels like a loss. They became engineers because they love writing code. The craft of it. The flow state. The satisfaction of a function that does exactly what it should, elegantly and efficiently. Telling these people that writing code is now a lower-value activity isn’t a promotion. It’s an identity crisis.

Both reactions are valid. Let’s talk about what actually matters now.

What’s More Valuable

Architecture and systems thinking. When agents write the code, the architecture is the code — in the sense that the architecture determines the quality of everything the agent produces. Understanding how components interact, predicting emergent behavior, designing for change — this has always separated strong engineers from competent ones. Now it separates effective practitioners from people who are frustrated that their agents keep producing messy code.

Specification writing. Writing specifications precise enough for an agent to execute without ambiguity — that’s the new coding. It requires the same precision, the same attention to edge cases, the same understanding of the problem domain. The output is English instead of Python, but the cognitive skill is remarkably similar. The thinking that used to happen in your head between reading the ticket and opening the editor is now the work itself, because the spec is the deliverable.

Verification design. The skill of designing verification strategies — choosing which properties to test, building multi-layer verification stacks, creating back-pressure mechanisms that catch what tests miss — is one of the highest-leverage skills in the new landscape. You know what agents are bad at testing. So the human who designs the shape of the test suite adds more value than the human who writes individual test cases.

Domain expertise. Models are general-purpose. They know a little about everything. You know a lot about your specific domain — the business rules, the edge cases, the regulatory requirements, the things that have gone wrong historically. Knowing that payment retry logic needs to be idempotent, that healthcare data requires specific audit trails, that financial calculations need exact decimal handling — this knowledge doesn’t come from the model. It comes from you.

What’s Less Valuable

Let’s be equally honest about the other side.

If your competitive advantage was typing speed and syntax fluency — writing code fast, rarely looking up documentation — that advantage has shrunk dramatically. The agent has all of that memorized too.

If you were valued for boilerplate proficiency — setting up projects, scaffolding CRUD endpoints, writing configuration files from memory — that’s squarely in the agent’s sweet spot. The routine, well-patterned work that used to take hours takes minutes.

Implementation-level problem solving — “how do I implement a B-tree?” or “how do I parse this JSON format?” — still has value. But the frequency with which it matters has decreased. For most application code, the agent handles implementation-level problems correctly.

None of these go to zero. You still need syntax knowledge when reading agent output and catching subtle errors. You still need implementation skills for the genuinely hard problems. The shift isn’t from “valuable” to “worthless.” It’s from “primary competitive advantage” to “table stakes.”

The Craft Question

Is there still room for the joy of writing elegant code? Yes. But it lives in different places.

The hard problems that agents can’t solve — novel algorithms, performance-critical inner loops, complex concurrency patterns, security-sensitive cryptographic implementations — still demand human craft. And they demand it more intensely, because the agent handles the surrounding context while you focus entirely on the problem that requires genuine insight.

Architecture is craft. Choosing the right boundaries, the right abstractions, the right level of constraint. Designing a system that’s clear enough for agents to work in, flexible enough for requirements to evolve, and simple enough for humans to understand. This is creative, intellectually demanding work.

The craft doesn’t disappear. It migrates. If you derived satisfaction from the elegance of well-structured code, you can derive the same satisfaction from the elegance of a well-structured system. The medium changes; the sensibility doesn’t have to.

That said, if what you loved was specifically the act of typing code — the flow state of implementation, the rhythm of test-write-refactor — then yes, there’s a real loss. That experience will be less frequent. Acknowledging this isn’t weakness. It’s honesty.


The Skeptic’s Case (Steel-Manned)

Any honest assessment of agent-assisted development has to grapple with its strongest critics. Not the straw-man version where AI is a fad. The steel-man version — the strongest, most intellectually honest arguments that this approach has serious problems.

The Hype-Reality Gap

That METR study finding — developers 19% slower while perceiving themselves 20-24% faster — is the most uncomfortable number in the field. And here’s the skeptic’s question: how much of the evidence for harnessed development is also subject to a perception gap?

The headline numbers from success stories are impressive. Three engineers, a million lines, five months. But we don’t have controlled studies of harnessed versus unharnessed development at the same organizations, on the same projects, with the same developers. The comparison is always against a hypothetical: “This would have taken ten times longer without agents.” Maybe. Maybe not.

Many organizations report new inefficiencies that don’t make the success stories. Duplicated work across parallel agent sessions. Increased oversight burden. Hours spent correcting agent errors that wouldn’t have existed if a human had written the code. Building reliable harnesses requires significant engineering investment, and the returns don’t always materialize on the timeline organizations expect.

The Quality Numbers

The evidence on code quality is genuinely mixed. Research from CodeScene shows that AI-generated code increases duplication by 8x and reduces code reuse. Technical debt accumulates 3x faster. Developers report spending 63% more time debugging AI-generated code than writing equivalent code from scratch. AI co-authored pull requests have 2.74x higher security vulnerability rates. Organizations with 25% more AI usage see a 7.2% decrease in delivery stability.

On the other side, well-harnessed agent workflows produce code that passes comprehensive test suites, conforms to architectural constraints, and follows established conventions. But “good harness” is doing a lot of work in that sentence. Most organizations don’t have a good harness yet.

And there’s a deeper concern: what CodeScene calls “AI operating in self-harm mode” — writing code it cannot reliably maintain later. An agent can produce a perfectly functional implementation today that becomes a maintenance nightmare in six months. We don’t have multi-year data on large codebases built primarily by agents. Nobody does. The longest production track records are measured in months, not years.

The Trust Paradox

Developer trust in AI coding tools dropped from 43% to 29% in eighteen months. Yet usage rose from around 50% to 84% in the same period. Developers are using tools they increasingly distrust.

If developers don’t trust their tools, they review output more carefully — good for quality, bad for throughput. The entire economics of agent-assisted development depends on agents multiplying human productivity. If the human spends as much time reviewing agent output as they would have spent writing the code, the multiplier collapses to one: there is no gain at all.

The Over-Compliance Problem

Here’s one that’s genuinely surprising. Research from ETH Zurich found that LLM-generated context files degraded agent performance by approximately 3% and inflated costs by over 20%. In some cases, agents followed instructions too thoroughly — a failure mode researchers call over-compliance. The agent prioritizes following your rules over solving the actual problem.

This connects to a broader finding: Vercel discovered that removing 80% of their specialized tools improved performance from 80% to 100%. More rules aren’t always better. There’s a point of diminishing returns where additional constraints reduce flexibility more than they improve reliability.

The Balanced View

Here’s the thing about steel-manning the skeptical case: it doesn’t invalidate the optimistic case. It contextualizes it.

The techniques work. Architecture constraints make agents more productive. Verification stacks catch real problems. Specification-driven development produces better output than unstructured prompting. These are observed outcomes at real organizations.

And: the hype-reality gap is real. Quality concerns are legitimate. Security surfaces are expanding. Entropy is accumulating. The long-term maintainability question is unanswered.

Both things are true. The discipline is powerful and imperfect. The useful posture isn’t cheerleading or doom. It’s clear-eyed pragmatism: use the tools, measure the results, fix the problems, stay honest about what you don’t know.


The Bitter Lesson: Why Your Harness Is Temporary

In 2019, AI researcher Richard Sutton published a short essay called “The Bitter Lesson.” His argument: across seventy years of AI research, the methods that ultimately won were not the ones encoding human knowledge into clever algorithms. The winners were the general methods that leveraged computation at scale. Chess engines didn’t beat grandmasters because researchers encoded grandmaster strategy. They won because search plus compute beat hand-crafted knowledge. Every time.

What does a 2019 AI research essay have to do with the scaffolding you build around coding agents? Everything.

The harnesses we build are, in a real sense, hand-coded knowledge. Every rule in your instruction files, every custom linter check, every context management trick, every error recovery pattern — these are things we built because the model couldn’t reliably handle them on its own.

And models are getting better. Not incrementally. The pace is disorienting. Manus refactored their agent harness five times in six months. LangChain rebuilt their agent architecture three times in a single year. Vercel removed 80% of their specialized tools and saw accuracy go from 80% to 100%.

The Bitter Lesson applied to this space says: as models get better, you find yourself stripping away structure, removing assumptions, and making your harness simpler. The rules you write today are not permanent truths. They’re interventions designed for the current model generation. Some will become obsolete. Probably sooner than you think.

What Will Simplify

Context management tricks. The elaborate choreography of what to show the model and when exists because current models degrade at longer context lengths. As models get better at utilizing full context windows, this matters less.

Detailed formatting instructions. “Use camelCase for functions.” “Import statements go at the top, sorted alphabetically.” Future models will likely infer conventions from existing code with much higher reliability.

Error recovery patterns. “If a test fails, read the error message before changing code.” “Limit yourself to three retry attempts.” These rules exist because current models sometimes enter doom loops. As self-correction improves, prescriptive recovery instructions give way to the model’s own judgment.

Step-by-step workflow instructions. “First read the architecture file. Then plan your approach. Then implement. Then run tests.” Future models may adopt these patterns natively.

What Won’t Simplify

Architecture design. A model that’s ten times more capable still doesn’t know whether your team prefers microservices or a modular monolith. It still doesn’t know that the payments module should never call the notifications module directly. These are choices, not capabilities.

Specification and intent. The model doesn’t know your product requirements, your users, or what trade-offs your business values. Specification is about human intent, and human intent doesn’t get automated away.

Verification. The core issue isn’t that models are bad at testing — it’s that the entity writing the code shouldn’t be the sole verifier of the code. Independent verification is a principle that transcends model capability.

Business rules and domain knowledge. “The EU region uses GDPR-compliant data handling.” “Free-tier users are limited to three projects.” “All financial calculations must be auditable.” No model improvement makes these unnecessary.

Security and governance. Trust boundaries exist because of the principle of least privilege, not because of model limitations. A perfectly capable model still shouldn’t have unrestricted access to production databases.

The pattern is clear. What simplifies is everything related to model capability — the workarounds, the guardrails, the training wheels. What persists is everything related to human intent — the choices, the constraints, the domain knowledge, the governance.

The Type A / Type B Framework

There’s a useful way to think about this. Every rule in your system falls into one of two categories:

Type A patches are model-specific fixes. “Always check if a variable is null before accessing its properties.” “When generating SQL, always use parameterized queries instead of string concatenation.” These exist because the model has a specific weakness. A newer model might not have that weakness. Type A patches are temporary.

Type B patches are business truths. “All API endpoints must validate authentication tokens.” “Price calculations use four decimal places.” “The reports module is deprecated; use the analytics module instead.” These exist because of your domain, not because of the model. Type B patches are permanent.

Every time you upgrade to a new model, purge your Type A patches. Test whether the new model still needs each constraint. Organizations that do this systematically report reducing their instruction sets by 40-60% during model upgrades. That’s cutting your harness roughly in half.
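One lightweight way to make the purge systematic is to tag every rule with its type when you write it, so the upgrade audit is a filter rather than an archaeology dig. A minimal sketch (the rule inventory and field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class HarnessRule:
    text: str
    kind: str  # "A" = model workaround (temporary), "B" = business truth (permanent)

# Hypothetical rule inventory, tagged at write time.
RULES = [
    HarnessRule("When generating SQL, always use parameterized queries.", "A"),
    HarnessRule("All API endpoints must validate authentication tokens.", "B"),
    HarnessRule("Always check for null before accessing properties.", "A"),
    HarnessRule("Price calculations use four decimal places.", "B"),
]

def upgrade_audit(rules):
    """On a model upgrade: keep Type B rules unconditionally; queue Type A
    rules for re-testing, and delete any the new model no longer needs."""
    keep = [r for r in rules if r.kind == "B"]
    retest = [r for r in rules if r.kind == "A"]
    return keep, retest

keep, retest = upgrade_audit(RULES)
assert len(keep) == 2 and len(retest) == 2
```

The point isn’t the code; it’s the discipline of recording, for every rule, whether it exists because of the model or because of the business.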

The paradox of expertise applies here too. Beginners build elaborate harnesses because they don’t know which parts are essential and which are cargo cult. Experts build minimal harnesses because they know exactly where the failure modes are. The trajectory of mastery isn’t toward more complex harnesses. It’s toward simpler ones.


Where This Is All Headed

If you project the trajectory forward — from prompt engineering to context engineering to harness engineering — each era moved the human further from implementation details and closer to intent. The next shift is what you might call intent engineering: pure specification of what and why, with the system handling almost everything about how.

Today, a typical harness is maybe 40% intent (architecture choices, business rules, domain knowledge) and 60% scaffolding (error recovery scripts, formatting instructions, context management tricks). In the intent engineering era, that ratio inverts, and then some: closer to 90% intent, 10% scaffolding.

What remains is what only you can provide: decisions about what to build, how the system should be organized, what the business rules are, and how you know the output is correct.

Three Plausible Futures

Human as Architect. Engineers become pure architects. You design systems at a high level — bounded domains, data models, API contracts, security requirements, performance constraints — and agents handle everything from planning through deployment. You don’t write code. You don’t even write detailed specifications. You describe outcomes and constraints, and the system figures out the rest.

Agent-to-Agent Collaboration. Agents don’t just execute tasks — they collaborate with each other. A planning agent decomposes work. Worker agents implement in parallel. A reviewer agent checks output. A quality agent runs verification. The human’s role becomes strategic: deciding what to build, setting priorities, resolving trade-offs that require business judgment.

Self-Improving Harnesses. The harness improves itself. When an agent makes a mistake, the system diagnoses the root cause and proposes the harness change. The human approves, but the diagnosis and proposal are automated. Each cycle makes the system better. Each failure generates not just a fix but a structural improvement.

What’s constant across all three futures: humans decide what to build and why. Verification persists. Architecture decisions remain human choices. Governance gets more important, not less, because more capable agents can cause more damage when misconfigured.

The future isn’t about whether humans stay in the picture. It’s about where in the picture humans add the most value.


The Skills That Compound

Here’s the most important argument in this entire post.

Regardless of which future materializes, certain skills transfer to whatever comes next. Not the specific techniques — some of those will become obsolete. But the underlying capabilities compound regardless of how the tools evolve.

Systems thinking. The ability to see the whole environment, not just the prompt. To understand how architecture constraints, verification mechanisms, context management, tool access, and governance policies all work together as a system. Whether you’re designing a harness, an intent specification, or an agent coordination protocol, the skill is the same.

Specification writing. The discipline of thinking through what you actually want, what the edge cases are, what “done” looks like. As models become more capable, your specifications will become shorter and higher-level. But the discipline of precise intent only gets more valuable, not less.

Verification design. The discipline of asking “how do I know this is correct?” and designing systems to answer that question. The techniques may change — maybe adversarial agents replace human review, maybe formal verification replaces test suites — but the discipline is permanent.

Architecture design. The judgment to decide which boundaries matter for a given system. A ten-times-more-capable model still benefits from clear architectural boundaries. That judgment doesn’t get automated away.

The feedback mindset. When something goes wrong, you don’t patch the output — you improve the environment that produced the output. Whether that environment is an instruction file, an intent specification, or an agent coordination protocol, the mindset transfers.

Notice the pattern. All of these skills are about operating at a higher level of abstraction than writing code. They’re about designing systems, specifying intent, verifying correctness, and improving processes. Those are exactly the skills that become more valuable as models handle more of the implementation.

The engineers who understand these disciplines won’t be stranded when the tools change. They’ll be the ones best equipped for whatever comes next, because they’ve already made the mental shift from implementation to architecture, from coding to specification, from fixing symptoms to designing systems.


Practical Takeaways

If you’ve made it this far, here’s what to actually do with all of this.

Accept the verification gap honestly. Stop pretending that passing tests means production-ready. Build defense in depth: implementing agent, reviewing agent, human review, static analysis, runtime monitoring. Focus your deepest verification on your highest-risk code. Track verification debt explicitly, the way you track technical debt.

Measure everything. Don’t rely on perception. Track actual time-to-feature, defect rates, and review burden. The hype-reality gap means your intuition about productivity gains may be wrong. Let the data tell you whether your approach is working.

Invest in the skills that compound. Architecture, specification writing, verification design, domain expertise. These are the skills that remain valuable regardless of how capable models become. Typing speed and syntax fluency are becoming table stakes.

Build for deletion, not permanence. Every piece of harness infrastructure should be something you can remove. Classify your rules as Type A (model workarounds) or Type B (business truths). When models upgrade, purge your Type A patches. If your harness is more complex after a model upgrade than before, something went wrong.

Run simplification audits. Every quarter, review your infrastructure. For each rule, ask: is this compensating for a model limitation, or encoding human intent? Does it still prevent a real failure? Is the cost justified by the benefit? Organizations report 40-60% harness reduction during model upgrades. That’s not trimming fat — that’s removing half the system.

Keep harnesses lean. More rules aren’t always better. There’s a real point of diminishing returns. The research showing that removing 80% of tools improved accuracy from 80% to 100% applies to rules too. The best harness is the simplest one that prevents the failures that matter.

Budget for the hidden costs. Harness maintenance, review burden, token costs, training costs, the specification tax — these are real expenses. If maintenance exceeds 20-25% of engineering time, your harness is too complex. Include these costs in planning, not as an afterthought.

Address the junior developer pipeline deliberately. If juniors only review agent output and never write substantial code, they may not develop implementation intuition. Create structured learning programs. Budget time for no-agent exercises. This is a training cost worth paying, because the alternative is a generation of senior engineers who can’t evaluate whether an implementation is actually good.

Start now. Don’t wait for the tools to stabilize. Don’t wait for best practices to solidify. The experience compounds. Every harness you build teaches you something about the boundary between model capability and human intent. Every simplification sharpens your judgment about what matters and what doesn’t. The engineers who are building with these tools today are developing exactly the skills that will define engineering roles tomorrow.


The Honest Ending

The engineering profession is changing faster than at any point in its history. The shift from “person who writes code” to “person who designs systems that produce code” is real, happening now, and not reversible. The skills that differentiated top engineers five years ago — coding speed, language mastery, implementation fluency — are becoming table stakes. New skills are becoming the differentiators.

This is uncomfortable. And the people telling you it’s only an “opportunity” are being either naive or dishonest. Some engineers will find the transition natural. Others will find it genuinely difficult, not because they’re bad engineers, but because the aspects of engineering they loved most are the aspects that are changing most.

The verification problem is unsolved. The hype-reality gap is real. The quality numbers are mixed. The long-term maintainability question is open. The tools we build today are temporary, destined to be simplified or deleted as models improve.

And yet: the techniques work. The discipline is real. The engineers who understand systems thinking, specification writing, verification design, and architecture — who can hold both the power and the limitations of these tools in their heads simultaneously — will thrive.

The useful posture isn’t anxiety or complacency. It’s clear-eyed pragmatism. Use the tools. Build the harnesses. Measure the results. Fix the problems. Stay honest about what you don’t know. Build for the current generation, with an eye toward simplification for the next.

The wall is behind you. The tools are in your hands. The future is being shaped right now, by the people who are building with these systems, learning from their failures, and sharing what they find.


This post covers ideas from Harness Engineering for Vibe Coders, a book about building the structured environments that make AI coding agents actually work. If you found this useful, the full book covers everything from your first AGENTS.md file to scaling agent workflows across organizations — including the practical techniques, the worked examples, and the hard-won patterns that turn vibe coding into reliable engineering.

