The Problem With Treating Every Task The Same
One thing I noticed fairly quickly when working with coding agents is that uniformity becomes expensive.
If the system treats:
- a one-line typo fix
- a small feature
- and a multi-week architectural migration
...as roughly the same category of work, things start breaking down surprisingly fast.
Tiny fixes suddenly trigger huge validation chains and endless planning overhead, while genuinely dangerous changes quietly slip through without enough structure around them.
That imbalance becomes costly in both directions.
So the solution I ended up experimenting with was a routing skill concept called run-pipeline.
The idea is intentionally simple really:
Size the work first, then send it through the appropriate level of process.
Not every engineering task deserves the same ceremony.
run-pipeline Is A Router, Not A Worker
This is probably the most important mental model.
run-pipeline doesn't really do engineering work itself.
It routes work.
You give it a task, anything from:
"rename this helper"
through to:
"replace the authentication system"
...and it moves through three phases:
- classify the task
- surface the proposed plan
- dispatch into the appropriate workflow
That middle step is the important one honestly.
The deliberate pause.
The agent announces what it thinks the task is before touching files. Then it waits for confirmation or correction.
That tiny pause saves an absurd amount of pain later.
It turns:
"Why did the AI do this?"
into:
"The AI was about to do this, and I corrected it first."
Which is vastly cheaper.
Three Tiers, Three Different Flows
The classification system itself is fairly straightforward.
Small
One narrow change.
Usually:
- single file
- no schema changes
- no migrations
- no auth
- no infrastructure
- no contracts changing
Things like:
- renaming a function
- tightening validation
- removing unused helpers
The flow here is intentionally lightweight.
Minimal reads, apply the change, run lint plus the most relevant test file, then report back in a few lines.
No review passes. No massive validation chain. No cache writes.
A typo fix shouldn't feel like launching a space shuttle.
Medium
This is the middle ground.
A feature slice across several related files, usually split into a few manageable chunks.
Examples:
- adding a route
- wiring feature flags
- refactoring one subsystem
- introducing an existing utility into a component tree
Medium tasks still get structure, but without the heavyweight orchestration layer.
This is where most normal development work lives honestly.
Large
Large is where the full process kicks in.
Broad architectural changes.
Multiple unrelated systems.
Anything touching:
- auth
- billing
- infrastructure
- migrations
- schemas
- security-sensitive paths
Or simply anything likely to expand beyond a few implementation chunks.
This flow gets the full multi-phase treatment:
- requirements intake
- shaping
- repository exploration
- architecture planning
- chunk execution
- closure verification
- cleanup validation
- second-pass AI review
The important bit though is that the flows themselves aren't bespoke systems.
They're composed from smaller reusable concepts.
Composition Over Giant Monolithic Prompts
One thing I deliberately tried to avoid was building a single gigantic "do everything" mega-agent.
Those systems become unpredictable quite quickly.
Instead the pipeline is made up of smaller atomic responsibilities with very narrow scope.
Things like:
- requirements intake
- task shaping
- chunk execution
- chunk verification
- cleanup validation
Each one does exactly one job.
The execution stage only executes one approved chunk.
The verification stage only checks whether that chunk genuinely passed.
The cleanup stage owns the final validation sweep.
The router orchestrates them together, but it doesn't duplicate their logic.
That sounds slightly boring architecturally, but boring systems are often the ones that survive contact with reality best.
It's a bit like Unix philosophy applied to AI workflows.
Small sharp tools connected together.
Prompt Design Matters More Than People Think
Honestly, one of the least glamorous parts of the system turned out to be one of the most important.
Every stage has extremely explicit instructions.
Not vague "help the user" style prompts.
More like job descriptions.
Each stage defines:
- what it is responsible for
- what it is not responsible for
- the workflow it must follow
- hard boundaries
- output expectations
For example:
"Do not silently widen scope."
or:
"Do not refactor unrelated code."
That role clarity changes behaviour massively.
The same underlying model can behave like completely different team members depending on how tightly the role boundaries are defined.
One becomes an implementer.
Another becomes a reviewer.
Another becomes a planner.
Without those boundaries the models tend to drift into trying to "help" everywhere simultaneously, which usually creates chaos eventually.
The Surprisingly Important Cache Layer
Long-running AI workflows become fragile very quickly if state only exists in the conversation context.
So I ended up experimenting with lightweight persistent cache files to preserve workflow state between sessions.
Mostly simple JSON.
Two conceptual files ended up doing most of the heavy lifting.
pipeline.json
Stores the live workflow state.
Things like:
- requirements
- implementation strategy
- chunk statuses
- validation outputs
- files changed
- closure verdicts
This means somebody can disappear halfway through a large task, come back tomorrow morning and ask:
"Where were we?"
...and the workflow reconstructs itself properly.
Without durable state, long-running sessions feel weirdly fragile.
last-gate.json
This one saves a surprising amount of time conceptually.
It stores a freshness stamp recording the last successful validation chain.
If lint, tests, builds and validation all passed 90 seconds ago, there's often no reason to immediately rerun everything again for a downstream step.
That sounds small, but across large multi-chunk flows it potentially saves genuine hours.
Especially once validation chains start becoming expensive.
The Guard Rails Matter More Than The Automation
The actual "AI coding" part is almost less interesting to me now than the guard rails around it.
Three safeguards ended up becoming particularly important.
Scope Guard
If a supposedly small task suddenly discovers schema changes or broader architectural impact mid-flight, execution stops.
The task gets reclassified instead of silently expanding.
This catches scope creep when the AI notices it, not three commits later when the human notices it.
That timing difference matters quite a lot.
Closure Verdicts
The verification stage doesn't simply say "looks good".
It produces structured outcomes:
- PASS
- PASS_WITH_NOTES
- FAIL
And importantly, those verdicts require evidence from validation output.
Not confidence.
That distinction becomes increasingly important as models get more persuasive.
Final Validation Ownership
Only the final validation stage is allowed to write the final successful validation stamp.
Not the router.
Not the executor.
This guarantees the "green light" always corresponds to a real validation chain that genuinely ran end-to-end.
Not just a vaguely optimistic:
"Yeah this probably works."
Which AI systems are unfortunately very capable of producing.
What I've Seen In Practice
The benefits ended up being more practical than theoretical.
Small fixes stay small
A tiny change no longer drags half the pipeline into existence unnecessarily.
That alone saves huge amounts of wasted time.
Medium work avoids over-engineering
Skipping heavyweight orchestration for bounded feature work saves both tokens and cognitive overhead without noticeably reducing quality.
Large work gets safer
The bigger flows now include mandatory review passes and validation ownership by default.
Broken migrations, auth mistakes and accidental guardrail bypasses get caught far earlier.
Sessions survive interruption
This one matters more than expected honestly.
Long AI workflows used to feel incredibly fragile. Lose context and everything collapsed.
Now work can pause and resume properly.
Which sounds obvious, but dramatically changes usability.
Things I'd Still Improve
The system definitely isn't "finished".
A few obvious weak spots still exist.
I don't yet track misclassification metrics properly, which means improving the routing matrix still relies too much on instinct.
The cache layer would probably eventually need proper concurrency handling once multiple agents start operating in parallel against the same repository.
Some validation behaviour should probably become declarative configuration instead of hardcoded orchestration logic.
And honestly, I think there still needs to be a proper "is this prompt even clear?" validation pass before classification begins.
A surprising amount of downstream chaos still originates from vague human requests.
Which admittedly is also true outside AI systems.
The Bigger Takeaways
I don't really think people need this exact approach specifically.
But there are a few principles here that probably generalise fairly well.
1. Classify Before Acting
Different sizes of work deserve different levels of process.
A typo fix and an auth migration should not move through the same workflow.
That sounds obvious when stated directly, but a lot of AI tooling still treats them surprisingly similarly.
2. Separate Roles Explicitly
Even if you only use one model, separating planner, implementer and reviewer roles changes behaviour massively.
Role clarity reduces drift.
The model stops approving its own assumptions quite so aggressively.
3. Durable State Matters
Long-running work needs somewhere persistent to live.
JSON. Markdown. Database rows. Doesn't really matter.
Without durable state the system keeps re-deriving context from scratch every session like somebody waking up with partial amnesia.
4. Confirmation Pauses Are Healthy
This might honestly be the biggest one.
Letting the AI announce its plan and wait before touching files feels slower initially, but usually saves enormous amounts of expensive correction later.
The pause exists exactly at the boundary between cheap mistakes and expensive mistakes.
That's where process probably matters most.
Final Thoughts
The interesting part for me isn't really the specific implementation details.
It's the idea that AI-assisted engineering probably works best when the structure lives in the process rather than blind trust in the model itself.
The prompts define behaviour.
The routing defines safety.
The validation defines confidence.
And the human still sits above the system making judgement calls when things become ambiguous or risky.
Honestly, that balance currently feels healthier to me than either extreme of:
"AI replaces engineers"
or:
"AI is useless."
The reality feels much more like orchestration.
The tools became dramatically more capable, but good engineering judgement still seems to be the thing holding the entire structure together.