Wholly Whackamoley
AI coding agents can build in hours what used to take days. But what used to take hours to debug can now take days. The relationship between human and AI is more Batman and Robin than most people think.
With apologies to the 1960s Batman series, the experience of building software with AI coding agents today is not dissimilar to the relationship between Batman and Robin. In the original series, Batman is highly capable: smart and resourceful. Robin is the sidekick, with his own unique value, but very much the supporting partner. In the “pair programming” paradigm of a human software engineer working with an AI coding agent, the dynamic is curiously similar. But who is who?
Since May 2025 the capability of frontier models has continued to advance, often in step changes that have brought startling gains in accuracy, efficiency, and scope. On 5 February this year Anthropic released Claude Opus 4.6, which despite its incremental version number has proven to be a significant step forward in how software can be written with AI assistance.
What is currently still true, however, is that despite the acceleration in software architecture, planning, and execution now possible, there are behaviours and tendencies of the models that need to be tracked carefully. Drawing on our hands-on experience building production applications with Cursor, Claude Code, and a range of frontier and open-weight models across client engagements in the US and Europe, the following “personality traits” (a term used with caution) still apply:
- Bias for action: the LLM will goal-seek for action and progress when pausing to reframe and confirm the situation with the user would be the better course of action
- Scope creep: even if the scope of the work is clear, and the immediate tasks and actions have been agreed, the LLM may get “carried away” and implement additional features, or sometimes ignore architectural choices that had been agreed
- Limited memory: this is arguably the hardest unsolved problem in AI-assisted development. Despite context windows growing, there is still a risk of “lost in the middle”, and with anything more than the smallest of demo apps, having an LLM understand and remember an entire codebase is at best prohibitively expensive and at worst impossible. With regular compaction of the conversation history, mistakes can be repeated, often surprisingly and frustratingly. “We just solved that 2 hours ago and you agreed not to repeat that mistake again!”
- Sycophantic tendencies: frontier LLMs are remarkably polite and patient. Even so, when choices are presented and human input is sought, the LLMs still err on the side of agreeing (“You’re absolutely right!”), playing the part of Robin when in reality they should assume the role of Batman
- Solving for symptoms, not architecture and design: this is quite hard to spot and very pernicious. When errors occur, the LLMs may write code that simply solves the specific problem, sometimes with hardcoded fixes, rather than addressing the underlying design issue
- Adaptive, fallback processing: this is a counter-intuitive problem. Surely adaptive software engineering that has fallbacks is a good thing? In reality, almost certainly not. This can cause structural errors in design or architecture to be masked, with “hacks” implemented quietly to enable the application to keep running. Usually, particularly when iteratively developing a software application, a noisy, hard failure is the right approach, enabling the architecture and design to be critically and intentionally examined
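The fail-fast principle above can be sketched in a few lines. In this hypothetical example (the names and the `probe` callback are illustrative, not from any real project), a broker connection raises a noisy, hard error rather than quietly degrading to an in-process queue that would mask the architectural deviation:

```python
class BrokerUnavailableError(RuntimeError):
    """Raised when the configured message broker cannot be reached."""


def connect_broker(url: str, probe) -> str:
    """Return the broker URL if reachable; otherwise fail loudly.

    `probe` is a callable that checks connectivity. The point is what
    this function does NOT do: it never falls back to an in-process
    substitute, because a silent fallback would hide a deployment or
    design problem that should be examined intentionally.
    """
    if not probe(url):
        raise BrokerUnavailableError(
            f"Cannot reach broker at {url}; refusing to fall back to an "
            "in-process queue. Fix the deployment rather than masking it."
        )
    return url
```

The design choice is deliberate: during iterative development, a hard failure at startup is far cheaper than discovering weeks later that the “working” pipeline never touched the intended infrastructure.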
The whack-a-mole
In short, these challenges may often lead to a whack-a-mole experience, expending more time and tokens solving bugs and symptoms rather than building the right software architecture from the outset. What used to take 3 days to build, and 3 hours to debug, can now take 3 hours to build, and 3 days to debug.
A recent client project illustrates this vividly. The target architecture explicitly specified Celery for task orchestration and Apache Pulsar for event streaming. Both services were running in Docker. Both were referenced in the design documentation. But the AI agent, optimising for speed, implemented the pipeline using a simpler in-process approach instead. It worked. Events streamed, the UI updated in real time. The architectural shortcut was invisible until a routine deployment killed a running pipeline, losing all progress and the API tokens spent on it. Pulsar was running but unused. Celery was configured but unused. The code was functionally correct but architecturally wrong, and the AI had no awareness that it had deviated from the agreed design.
On the same project, the AI agent was asked to add a new configuration entry to a database seed file. Rather than inserting the new row alongside the existing data, it incremented the application’s data version number, which on the next deployment triggered a “drop everything and rebuild from scratch” migration path. The entire production database was wiped at half past midnight, hours before a client demo. The project’s own documentation, in three separate places, explicitly warned against using version bumps for configuration changes. The AI had access to all three documents and ignored them.
Perhaps the most telling incident involved deployment consent. The AI agent presented a plan, the user approved the design, and the agent interpreted this as permission to build, deploy, and push to production, all without pausing for explicit implementation consent. It asked “shall I proceed?” and then proceeded before the answer arrived.
Each of these failures was recoverable. None were catastrophic. But collectively they illustrate the pattern: the tools are extraordinarily capable at generating code, and systematically poor at respecting process, architecture, and the boundaries of what they have been authorised to do.
Mitigations that work
So how can we compensate for, and mitigate the impact of, these behavioural traits?
- Develop minimal but clear ways of working that form the basis of your unofficial “contract” with your AI coding assistant. Codify these in Cursor project rules, or with Claude Code in its CLAUDE.md memory file. Less is more: the longer and more wide-ranging the rules, the harder they are for the LLM to follow and adhere to. This forms the basis of your SDLC.
- Ensure that your agreed SDLC process includes:
  - Well-defined points for git commit and push stages.
  - Database backup scripts that are run before every deployment, particularly before deploying to production.
  - An Agile backlog, in any tool and format. We use Google Sheets with an MCP connector that enables Claude Code to both read and write to the backlog, updating tickets organised as a prioritised backlog assigned to sprints.
  - An expectation that the LLM will respectfully challenge, propose multiple solutions, consider the overall architecture, and think carefully about whether the user is suggesting the right course of action given the overall project goals.
  - One change at a time, ideally under a specific backlog item. Verify carefully before proceeding.
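Pulling these points together, a minimal rules file along these lines might form the “contract” described above. The contents are a hypothetical sketch, not taken from any real project:

```markdown
# Project rules (keep short)

- One change at a time, tied to a specific backlog item; verify before proceeding.
- Do not deviate from the agreed architecture without explicit approval.
- Plan approval is not implementation consent: ask before building, deploying, or pushing.
- Fail loudly: no silent fallbacks, no hardcoded workarounds for symptoms.
- Run the database backup script before every deployment.
- Commit and push at the agreed checkpoints only.
```

The brevity is the point: each rule targets one of the behavioural traits above, and a short file is far more likely to survive context compaction than a sprawling one.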
Holy Guacamole
Despite these challenges, the reality is still remarkable, and a game changer. The human in the loop is not reviewing code for syntax errors. They are enforcing discipline that the AI does not naturally possess. The role is closer to Batman than Robin: setting direction, maintaining standards, and occasionally pulling the emergency brake.
And with further advancements expected this year, the ability of AI coding agents to build software with only minimal architectural steer is a question of when, not if. Whether Robin exclaims “Holy Guacamole!” or the human mutters something less printable, the whack-a-mole is real, but the game is still very much worth playing.