In this piece I’m going to lay out my current patterns for working with agents in software development - there’s a bunch of preamble about why I think this is important, so if you’re just here for the what, feel free to skip to the “Co-Design workflows” section.
More mulch faster was never the goal.
I’ve watched a lot of people put their foot on the gas over the last few months and steamroll out a mountain of code using the latest generation of model-assisted tools. I’ve done it myself.
I wrote recently about the burnout that comes from indulging in extreme concurrency - running a swarm of agents, producing at a pace that outstrips your capacity for comprehension - and I think it’s worth unpacking why that approach, while intoxicating, is probably a trap. It’s something I’ve changed in myself over the last month or so to try to stem the bleeding and find new, good working patterns.
The instinct to parallelise everything is the wrong instinct. I think it’s a fool’s errand to focus on concurrency as your primary workflow. You’ll still end up with unfinished projects, but this time they’ll be unfinished projects that you don’t understand. This isn’t really a new thought - we’ve long understood that focus time for software teams always wins. Because of this, over the last couple of weeks I’ve taken to preferring what I’m going to call Co-Design with agents over raw parallelism. I think I’ve probably been stumbling towards these working patterns since the end of last year, but I’ve only recently started to articulate them and understand them as a set of workflows.
This isn’t the same thing as what most people seem to be describing as “human in the loop”. Whenever I see people talk about human in the loop, I see a pattern focused on after-the-fact PR style review of machine generated code.
I suspect that model is already dying under the weight of its own volume constraints. After-the-fact review will become arduous, long-winded, and ineffective as the pace of code generation accelerates. PR workflows in organisations will probably take longer to die than we expect, because people will cling to their existing, familiar illusion of safety. Pull requests - an adversarial technique for untrusted authors to contribute to critical codebases - were never designed for the kind of workplace collaboration they’re normally used for, and were always worse than code review and pair programming. We shouldn’t lament their death.
With some sense of symmetry, traditional pair programming with the machine is also better than after-the-fact adversarial review.
Focusing on raw output and concurrency is the same mistake it ever was, because quality subsides underneath it. Even if you personally don’t care about the quality of the output, even in a world where models are generating most of the glue code, quality still matters for software that has to operate reliably in production.
Many of these assurances on quality don’t map to one-shotting consumer grade “apps”, but they absolutely matter when you operate systems.
Context Matters with Regards to Quality
It’s important to realise that much of what good looks like when it comes to adapting to model assisted development in enterprise are echoes of the lessons we learnt twenty-five years ago in the extreme programming movement. This is an adaptation of technique to new tooling. The people that were sceptical of, or ineffective at, writing tests, doing TDD, and verifying their code in automated, system-driven ways will continue to be resistant to these techniques and will end up with very poor, low quality outcomes.
The context of the kinds of change you’re working on - especially in business software - changes how effective these practices are. Models are mostly good at remixing existing ideas - which might sound limiting, but it’s largely fine in business software, where the vast majority of programming and systems integration work is remixed work to start with.
The inverse is also true: people just saying “give me code that does X” are going to receive poor quality results, because quality of specification always begets quality of implementation.
I wrote a talk about fifteen years ago about how the gulf of understanding between specification and implementer was the quality ceiling of all software. That gap defined how good the software could possibly be. This will play out en masse with low quality tool usage - the specification problem doesn’t go away just because the implementer is a machine. If anything, it gets worse, because models lack the social context and domain intuition to fill in the gaps that a human colleague would.
This isn’t a new problem. It’s common to all code generation, low-code solutions and other boilerplate-centric techniques.
Despite all this, it seems to me today that anyone that can’t get roughly 80% good outcomes from the current early 2026 frontier models is experiencing operator error. The tools are good enough. The question is the same one of technique.
The reality is that most people have never really cared about technique or code quality - this isn’t new either. The same people that achieved poor results before will continue to do so using new tooling. The accelerant doesn’t change the trajectory, it just gets you there faster.
If our quality goals aren’t changing around the software we build, we need patterns of work to support them.
Quality Begets Reliability
Reliable systems are readable systems, because readable systems can be understood and diagnosed. This has been true since the first line of code was written and it remains true when the code is written by a machine.
Context windows and managing them are currently the only tools we have to keep AIs grounded. Context windows and their token usage cost money - if you let your software design complexity explode with rote repetitive code, these tools that provide you an accelerant will diminish in effectiveness over time. The model needs to reason about your software to change it well, and if it can’t fit the relevant context into its window, or the signal is buried in noise, the quality of its contributions degrades. How useful your tools are is now directly proportional to your code quality.
This emphasis on design is reinforced by the reality that the worst time to learn what the design of your system is is when you’re diagnosing it in a production outage at 3am. Operating production systems requires excellent telemetry and verification, and usually a reasonable working model of what the software should have been doing. Without that you’re relying solely on navigational aids when you need a map.
The reason why technical managers are often more comfortable with these tools than engineers is that they’ve already outsourced their understanding of what is real and concrete to their teams. This shift is functionally no different for them - they were already operating at a level of abstraction above the code. For engineers, it’s a much more visceral change. You’re being asked to let go of something you’ve spent years learning to wield.
Greenfield vs Brownfield
Here’s a counter-intuitive observation: greenfield projects tend to be more susceptible to agent slop than brownfield ones.
AI is often maximalist in its application of patterns and enterprise bloatware. Ask a model to build you a new service from scratch and you’ll get the most enterprisey, over-architected, pattern-laden monstrosity you’ve ever seen. It doesn’t have the implicit sense of trade-off that the best software designs embody - that a design should be smaller than the problem space it inhabits to be effective.
Over-design still carries the same cost burden of maintenance - and you probably don’t want to start designing software that’s fit for organisations orders of magnitude larger than you are just because the training data says so. It learned from the internet, and the internet is full of bad software written by people that confused complexity with quality.
Brownfield codebases contain established examples that help ground the AI. The models are good at mimicry, at following established patterns, and at being constrained by the things in their context window. It makes them unusually effective at performing localised refactors that look more or less like what your own teams would write.
This is why I continue to believe that using AI to mutate and reduce the burden of existing code is a much more fruitful use of this category of tooling - AI assisted refactoring, minification, optimisation, error and vulnerability scanning.
This is because the cost of maintaining and operating software has always been the vast majority of the cost of software over its lifetime. Creating software has always been effectively “free” for large categories of programs and systems, whilst mutating it has not.
If you’re looking for the highest leverage use of these tools, it’s not in writing new code, it’s in making your existing code better. Shipping features that nobody wants or uses at a high velocity is intrinsically zero-value work.
Co-Design workflows
So what does good look like?
The best outcomes I’m seeing in early 2026 come from engaging in what I’m calling Co-Design - a set of workflows where you’re designing with model assistance, not reviewing its output after the fact. This is in part a reaction to the trade-offs that the models aren’t currently good at making without steering. Current frontier models without guardrails will often err on the side of repetition and performance optimisation over legibility. This sometimes leads to software designs where modules and internal boundaries are tightly coupled in obtuse ways, because there is no incentive for the models to optimise for human legibility.
Many of the following patterns are compensating practices that keep the model “on the straight and narrow” as it iterates - patterns that coerce the agents into writing human-friendly, quality code. This may not be your goal, but it frequently is mine.
With that in mind, here are the working patterns I’ve found most effective.
Specify Ahead
Problem: It’s complicated and exhausting to perpetually context switch between different streams of work while agents compute solutions - the same repeated context switching that prevents programmers from entering a “flow state”.
Solution: Instead of trying to multi-task different workstreams, focus on the immediate next change to keep in flow.
This is my most frequent behavioural pattern: rather than continuous task switching, I “specify ahead” of the currently monitored agent’s task. This works best when I have a list of small, incremental changes to a system or group of systems.
While the agent is implementing the first step in the sequence, I’ll be human-refining its next task.
Review While Iterating
Problem: The model is often good at making progress on a task, but it can go off the rails if left unchecked for too long. Waiting until the end of a task to review can lead to more significant course corrections later on.
Solution: A mixture of live observation with steering, and code-review per increment.
Review while iterating, rather than review on completion, is a more involved co-design process than “delegate work to an agent”. It reduces the cognitive load of context switching because, as the agent returns for review and feedback, you’re only a single increment away from its current workload.
This feels closer to traditional pair programming than async PR review - it involves a mixture of live agent monitoring, with steering and interruption, and taking live notes to feed back as you watch the agent implement.
You are effectively the navigator in a driver/navigator pair, with the machine taking live instruction. It’s vital that you focus not just on output but on structure and design during this process - if you don’t, it loses all value and you may as well one-shot and review after.
Human Directed Refactoring
Problem: Agents will frequently over-design, under-design, or import assumptions from their training data that don’t hold in your context.
Solution: After every successful change the agent makes, you inspect the design (modules, abstractions and organisation) and do interactive refactoring with the agent driven by your own taste.
This is similar to traditional code review, but without the ceremony of “PR and wait”. Never be satisfied with the first shot. Focus on what could be simplified in the design, what could be removed from the design, and what could be done to drive up cohesion in the design.
This is similar to a traditional TDD “red, green, refactor” incremental improvement cycle with the agent as a “ping-pong-pair”.
Another variant of this pattern that’s most prevalent in brownfield refactoring involves hand-executing the kind of transformation you expect the agent to follow and directing it to those examples.
A good example of this is migrating between test frameworks, or re-writing tests to conform to a pattern en masse. The agents will be much more successful if you hand-convert one or more files, then indicate that they should mimic the patterns and conventions in your example, rather than trying to describe the patterns in natural language - which will inherently be more imprecise than coded examples.
Agent Directed Refactoring
Problem: The agent paves the road with large volumes of code that contain obvious duplication and module boundary problems.
Solution: Agent self-code review and iteration.
One of the easiest and cheapest tricks is to ask the model to check its own work for human legibility and maintainability. You can design specific agents or skills to do this work, but the effectiveness isn’t significantly different from just asking “can you see any opportunities to refactor your latest change to make it more coherent and human readable?”
As a general rule the agents will come back with a list of reasonable changes that will improve their output with a second pass. This is an essential step whenever an agent produces work of any volume before investing time in human directed refactoring, if only to make the work easier to navigate.
Scaffold, Tweak, Iterate
Problem: Greenfield projects are too much of a blank page for most agents, and you often end up with designs far removed from the complexity of their problem space.
Solution: AI Scaffolding, Human Tweaking, AI Iterates
I’ve long been a fan of Alistair Cockburn’s “walking skeleton” metaphor for agile system evolution, and the associated Pragmatic Programmer “tracer bullet” system design technique where you provide the feature-free skeleton of the moving parts of your system then incrementally “flesh out” the capabilities.
It’s valuable to engage in this process with an AI, especially in greenfield projects where you need to ensure the structural stability of the design before you let a model rapidly iterate out the details.
Directing an AI to scaffold a project (or using traditional code generation and scaffolding) followed by human intervention to tweak the modules and concepts often ensures a model sticks to the patterns presented in the emerging codebase. This is a powerful way to steer the model towards the design you want, and then have it iterate outwards from there.
Hand Scaffold, AI Expand
The complementary pattern, for when you know exactly what you want the skeleton to look like, is to hand scaffold, and use the AI to expand your target design before it implements features.
You establish the patterns in code or natural language, and have the AI expand from there. This provides a greenfield project with the same basis for mimicry that it would get from a brownfield project by effectively populating the context with examples before any significant unsupervised work happens.
Surgical Preparation
Problem: You’re working in an ugly, complicated, brownfield application that the AI cannot operate inside effectively.
Solution: Fix the edges before touching the core.
Repositories are often in poor states, and it’s essential to ask AIs to review for obvious problems so you can iterate on expanding test coverage and guard rails before other work. Think of this as preparing the ground. You want the codebase to be in a state where the model can reason about it effectively before you start asking it to make significant changes.
There are a number of forms of this - you can go through several iterations with a model to construct documentation, indexes or pointers around the codebase in a /docs folder, you can use the model to refactor and normalise tests, to improve build scripts, to address toil in verification of changes, before asking it to modify the code proper.
Usually this is all the toil that already existed in your codebases, but the models are uniquely useful at being able to provide quick remediation to make later changes safer. As a Senior IC, a lot of my work involved these categories of cleanup. Fixing builds, fixing tests, making sure everything can run in memory, in isolation, on a local machine. Having a model do this category of work to “prepare the ground” for future changes is one of the most valuable applications of the technology because you are increasing verifiability.
AI Safety Checks
A partner to the patterns above. Embedding a safety check in your prompts to ask the AI for its confidence level on changes before it continues can be a good hook for human review. Something as simple as:
Before making changes to this module, assess:
1. What is your confidence level (high/medium/low) that these changes won't break existing behaviour?
2. What assumptions are you making about the codebase that you haven't verified?
3. What tests would you want to see passing before you'd be confident in this change?
If confidence is medium or low, stop and explain what additional context you need.
This acts as a circuit breaker - a point where the machine pauses and lets you decide if it should continue. It won’t catch everything, but it catches a surprising amount.
When you’re working in more fragile systems, performing these kinds of sanity check first pass investigations can save you a huge amount of rework later.
Cross-System Change
One of the step changes for me in recent months has been stepping up a level when reasoning about code so I can move “top to bottom” quicker - from system design to software design in a single session.
To do this, I built a script to git clone the entire hundreds-of-component distributed system that I attend to, so that I can have the models reason about disparate parts of the system together. This is a step change, because I can modify multiple systems at once where previously I would have had to do expensive coordination work with different teams of people to orchestrate change.
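A minimal sketch of that kind of clone-everything script, assuming a plain-text `repos.txt` listing one remote URL per line (the file name, workspace path, and shallow-clone choice are illustrative assumptions, not the author’s actual tooling):

```python
"""Clone every component of a distributed system into one workspace
so that model tooling can reason about the disparate parts together."""
import subprocess
from pathlib import Path

WORKSPACE = Path("system-workspace")  # hypothetical target directory


def repo_name(url: str) -> str:
    """Derive a directory name from a git remote URL."""
    return url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]


def clone_all(repo_list_file: str) -> None:
    """Clone (or refresh) every repository listed one-per-line in the file."""
    WORKSPACE.mkdir(exist_ok=True)
    for url in Path(repo_list_file).read_text().splitlines():
        url = url.strip()
        if not url or url.startswith("#"):  # skip blanks and comments
            continue
        target = WORKSPACE / repo_name(url)
        if target.exists():
            # Refresh rather than re-clone so repeated runs stay fast.
            subprocess.run(["git", "-C", str(target), "pull", "--ff-only"], check=True)
        else:
            # Shallow clone: we want the code as context, not the history.
            subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
```

Running `clone_all("repos.txt")` leaves every component side by side under one directory, which is what lets a single session reason about multiple systems at once.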
I think this is the single biggest accelerant in software development from these tools because it addresses a foundational “surface area” problem in team topologies.
Why is this important?
Over the last twenty years we’ve exploded the edges in software to satiate the corporate world’s desire for feature parallelism. I wrote about this before - about how every subdivision of a system has a cost.
Some of these edges we introduced were “good edges” - fault boundaries, scalability boundaries, async boundaries - but many of them trended our designs towards terrible nanoservice over-complexity, and distributed monolithic design. We made our systems actively worse because we wanted to expand the surface area so we could fit more people around it.
This was good for fitting more people around the problem, but often didn’t actually lead to any real-world advantage because the software became worse and more complex, and also incurred the cost of team coordination. So for each subdivision, the returns diminished, and the toil increased.
Reasoning about the system as a whole is a salve and partial solution to this scale-created problem. But it runs against the edges of the capabilities of this technology because it quickly exhausts context windows.
Building Maps
The solution?
Build a map.
To help work at system scale more effectively, I built a piece of software that worked through our Infrastructure as Code, did deep code scanning for service connections, and ingested other information about service interconnectivity to build a graph of the relationships between systems.
This map is presented as reference to model assisted tools so it can more effectively answer the question “if I make a change here, what else needs verifying or is in scope of this change”.
There are many ways to answer this question depending on your ecosystem - mine involved walking back from our Azure ARM API, through our deployment tool configuration and scanning code configurations to construct a text based map.
Of course, model assisted tooling can easily code-generate the kind of glue you need to build something like this for yourself. These maps are slow moving so they don’t need to be perfect, just mostly good enough to signpost which directories your model tooling should analyse while containing the “context sprawl”.
Can these workflows be automated as agents and skills?
Yes, it seems like they can.
Consider the skills as your canned prompts and your agents as the guard-rails for how they interact.
The challenge in “making everything a skill” is that you fill up your context window with a lot of low-level instructions that are situational, so the agent needs yet more instructions to do lots of narrowly scoped repetitions coordinating between different subsets of skills.
The agents are quite good at working out when to apply skills, but given the absence of taste, they often won’t highlight things that hamper human readability.
I’ve not yet witnessed an agent writing code that “looks and feels good” in the way that well-crafted human code does - they’re very good at paving rote repetition and procedural code, but the design of the thing, the form, is usually absent.
This is fine, because the Co-Design workflows above are designed to compensate for exactly this. You provide the authorial intent, the model provides the throughput.
Pair and Mob Co-Design
I suspect the future of software teams might look closer to the historical “masters and apprentices” models, where one experienced practitioner works with a small quorum of understudies who change systems together. I wrote about how teams could stay ahead recently, and I think this is where that thinking lands practically.
I suspect this might beget mob-programming style co-design sessions as teams engage in continual code review - which is really just design - refinement and specification. This is probably the “XP” of model-assisted programming in a team context. The same way that extreme programming took the best practices of the time and said “what if we did these things all the time”, focusing on co-design asks “what if we paired and mobbed all the time, but with machines as well as each other”.
The mobs will likely be smaller than the traditional “two pizza team” standard that has emerged over the last decade, and be closer to a “one pizza team”. Navigator-driver techniques remain from pair and mob programming, and they translate naturally when your “driver” is sometimes a model.
This is an effective model to keep what is essential in software - reliability of operators, shared understanding, and good design - while leveraging the accelerant of the new tools.
The through line
None of this is new - it’s all incremental. The XP movement told us twenty-five years ago that the answer to better software was tight feedback loops, continuous testing, pair programming, and a relentless focus on simplicity. The tools have changed but the direction should persist.
Quality of specification begets quality of implementation. Readability begets reliability. Focus begets understanding. These were true when we were writing C in the 90s and they’re true when driving agents in 2026.
Same as it ever was.