
Right-Sizing AI Coding Work: Agent Skill Theory

A walk through the routing approach I have been experimenting with to stop AI coding workflows from either overreacting to tiny changes or underestimating dangerous ones.

The Problem With Treating Every Task The Same

One thing I noticed fairly quickly when working with coding agents is that uniformity becomes expensive.

If the system treats:

  • a one-line typo fix
  • a small feature
  • and a multi-week architectural migration

...as roughly the same category of work, things start breaking down surprisingly fast.

Tiny fixes suddenly trigger huge validation chains and endless planning overhead, while genuinely dangerous changes quietly slip through without enough structure around them.

That imbalance becomes costly in both directions.

So the solution I ended up experimenting with was a routing skill concept called run-pipeline.

The idea is intentionally simple:

Size the work first, then send it through the appropriate level of process.

Not every engineering task deserves the same ceremony.


run-pipeline Is A Router, Not A Worker

This is probably the most important mental model.

run-pipeline doesn't really do engineering work itself.

It routes work.

You give it a task, anything from:

"rename this helper"

through to:

"replace the authentication system"

...and it moves through three phases:

  1. classify the task
  2. surface the proposed plan
  3. dispatch into the appropriate workflow

That middle step is the important one.

The deliberate pause.

The agent announces what it thinks the task is before touching files. Then it waits for confirmation or correction.

That tiny pause saves an absurd amount of pain later.

It turns:

"Why did the AI do this?"

into:

"The AI was about to do this, and I corrected it first."

Which is vastly cheaper.
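
The three phases can be sketched roughly as below. This is a minimal illustration, not the actual skill: `Proposal`, `run_pipeline` as a Python function, and the `classify`, `confirm` and `flows` parameters are all hypothetical names, and the stand-in lambdas take the place of real model calls and a real human prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    """What the router announces before any file is touched."""
    task: str
    tier: str        # "small", "medium" or "large"
    plan: list[str]  # steps the agent intends to take


def run_pipeline(
    task: str,
    classify: Callable[[str], Proposal],
    confirm: Callable[[Proposal], Optional[Proposal]],
    flows: dict[str, Callable[[Proposal], str]],
) -> str:
    # Phase 1: classify the task.
    proposal = classify(task)
    # Phase 2: surface the plan and wait. `confirm` returns the proposal
    # unchanged, a corrected one, or None to cancel.
    approved = confirm(proposal)
    if approved is None:
        return "cancelled before any file was touched"
    # Phase 3: dispatch into the flow for the (possibly corrected) tier.
    return flows[approved.tier](approved)


# Hypothetical stand-ins for the real flows, classifier and human reply.
flows = {
    "small": lambda p: f"small flow ran {len(p.plan)} step(s)",
    "large": lambda p: f"large flow ran {len(p.plan)} step(s)",
}
guess = lambda task: Proposal(task, "small", ["edit one file"])
# The human corrects the tier during the pause.
bump = lambda p: Proposal(p.task, "large", p.plan + ["plan the migration"])

print(run_pipeline("replace the authentication system", guess, bump, flows))
# prints "large flow ran 2 step(s)"
```

The point of passing `confirm` in as a function is that the correction happens between classification and dispatch, so a wrong guess costs one reply rather than a revert.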


Three Tiers, Three Different Flows

The classification system itself is fairly straightforward.

Small

One narrow change.

Usually:

  • single file
  • no schema changes
  • no migrations
  • no auth
  • no infrastructure
  • no contracts changing

Things like:

  • renaming a function
  • tightening validation
  • removing unused helpers

The flow here is intentionally lightweight.

Minimal reads, apply the change, run lint plus the most relevant test file, then report back in a few lines.

No review passes. No massive validation chain. No cache writes.

A typo fix shouldn't feel like launching a space shuttle.


Medium

This is the middle ground.

A feature slice across several related files, usually split into a few manageable chunks.

Examples:

  • adding a route
  • wiring feature flags
  • refactoring one subsystem
  • introducing an existing utility into a component tree

Medium tasks still get structure, but without the heavyweight orchestration layer.

This is where most everyday development work lives.


Large

Large is where the full process kicks in.

Broad architectural changes.

Multiple unrelated systems.

Anything touching:

  • auth
  • billing
  • infrastructure
  • migrations
  • schemas
  • security-sensitive paths

Or simply anything likely to expand beyond a few implementation chunks.

This flow gets the full multi-phase treatment:

  • requirements intake
  • shaping
  • repository exploration
  • architecture planning
  • chunk execution
  • closure verification
  • cleanup validation
  • second-pass AI review

The important bit, though, is that the flows themselves aren't bespoke systems.

They're composed from smaller reusable concepts.
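
The tier rules above could be expressed as a small decision function. This is only a sketch under stated assumptions: `TaskFacts`, `classify_tier` and the threshold of three chunks for "a few" are invented for illustration, and a real router would likely lean on the model's judgement rather than on hard counts alone.

```python
from dataclasses import dataclass, field

# Areas listed above as automatically Large ("security" stands in for
# security-sensitive paths).
SENSITIVE_AREAS = frozenset(
    {"auth", "billing", "infrastructure", "migrations", "schemas", "security"}
)


@dataclass
class TaskFacts:
    """Hypothetical summary of what a task is expected to touch."""
    files_touched: int
    estimated_chunks: int
    areas: set[str] = field(default_factory=set)
    changes_contract: bool = False


def classify_tier(facts: TaskFacts, max_medium_chunks: int = 3) -> str:
    # Anything in a sensitive area is Large regardless of size.
    if facts.areas & SENSITIVE_AREAS:
        return "large"
    # "Likely to expand beyond a few implementation chunks" (3 is assumed).
    if facts.estimated_chunks > max_medium_chunks:
        return "large"
    # One narrow change: single file, single chunk, no contract change.
    if (facts.files_touched <= 1
            and facts.estimated_chunks <= 1
            and not facts.changes_contract):
        return "small"
    return "medium"
```

Checking the sensitive areas first is deliberate: a one-line change to an auth path should still land in the Large flow.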


Composition Over Giant Monolithic Prompts

One thing I deliberately tried to avoid was building a single gigantic "do everything" mega-agent.

Those systems become unpredictable quite quickly.

Instead the pipeline is made up of smaller atomic responsibilities with very narrow scope.

Things like:

  • requirements intake
  • task shaping
  • chunk execution
  • chunk verification
  • cleanup validation

Each one does exactly one job.

The execution stage only executes one approved chunk.

The verification stage only checks whether that chunk genuinely passed.

The cleanup stage owns the final validation sweep.

The router orchestrates them together, but it doesn't duplicate their logic.

That sounds slightly boring architecturally, but boring systems are often the ones that survive contact with reality best.

It's a bit like the Unix philosophy applied to AI workflows.

Small sharp tools connected together.
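
One way to picture that composition is to treat each stage as a function over shared state and each flow as an ordered list of stages. The sketch below is illustrative only: `stage`, `FLOWS` and `run_flow` are hypothetical names, the stages are placeholders that merely record that they ran, and the medium stage list is an assumption because it is not spelled out above.

```python
from typing import Callable

State = dict  # shared workflow state handed from stage to stage
Stage = Callable[[State], State]


def stage(name: str) -> Stage:
    """Placeholder stage: a real one would call a model with its own prompt."""
    def run(state: State) -> State:
        return {**state, "history": state.get("history", []) + [name]}
    return run


intake = stage("requirements intake")
shaping = stage("shaping")
explore = stage("repository exploration")
plan = stage("architecture planning")
execute = stage("chunk execution")
verify = stage("closure verification")
cleanup = stage("cleanup validation")
review = stage("second-pass AI review")

FLOWS: dict[str, list[Stage]] = {
    # Assumed subset: the medium stage list is a guess for illustration.
    "medium": [shaping, execute, verify, cleanup],
    # The eight phases listed for the large flow, in order.
    "large": [intake, shaping, explore, plan, execute, verify, cleanup, review],
}


def run_flow(tier: str, state: State) -> State:
    """The router only sequences stages; it never duplicates their logic."""
    for step in FLOWS[tier]:
        state = step(state)
    return state
```

Because the flows are just lists, adding or dropping a stage for one tier does not touch the stages themselves.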


Prompt Design Matters More Than People Think

Honestly, one of the least glamorous parts of the system turned out to be one of the most important.

Every stage has extremely explicit instructions.

Not vague "help the user" style prompts.

More like job descriptions.

Each stage defines:

  • what it is responsible for
  • what it is not responsible for
  • the workflow it must follow
  • hard boundaries
  • output expectations

For example:

"Do not silently widen scope."

or:

"Do not refactor unrelated code."

That role clarity changes behaviour massively.

The same underlying model can behave like completely different team members depending on how tightly the role boundaries are defined.

One becomes an implementer.

Another becomes a reviewer.

Another becomes a planner.

Without those boundaries, the models tend to drift into trying to "help" everywhere at once, which usually ends in chaos.
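
A job-description style prompt can be kept honest by generating it from a structured spec, so that no stage ships without its boundaries. This is a rough sketch: `StageSpec`, its field names and the `EXECUTOR` wording are hypothetical, apart from the two boundary rules quoted above.

```python
from dataclasses import dataclass


def _bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


@dataclass(frozen=True)
class StageSpec:
    """The five things each stage defines, as structured data."""
    name: str
    responsible_for: list[str]
    not_responsible_for: list[str]
    workflow: list[str]
    hard_boundaries: list[str]
    output_expectations: str

    def to_prompt(self) -> str:
        steps = "\n".join(f"{n}. {step}" for n, step in enumerate(self.workflow, 1))
        return "\n\n".join([
            f"ROLE: {self.name}",
            "YOU ARE RESPONSIBLE FOR:\n" + _bullets(self.responsible_for),
            "YOU ARE NOT RESPONSIBLE FOR:\n" + _bullets(self.not_responsible_for),
            "WORKFLOW:\n" + steps,
            "HARD BOUNDARIES:\n" + _bullets(self.hard_boundaries),
            "OUTPUT:\n" + self.output_expectations,
        ])


# Illustrative spec for the execution stage.
EXECUTOR = StageSpec(
    name="chunk executor",
    responsible_for=["Implementing exactly one approved chunk."],
    not_responsible_for=["Verifying the chunk.", "Running the final validation sweep."],
    workflow=["Read the approved chunk.", "Apply the change.", "Report the files changed."],
    hard_boundaries=["Do not silently widen scope.", "Do not refactor unrelated code."],
    output_expectations="A list of files changed and a short summary.",
)
```

Calling `EXECUTOR.to_prompt()` yields the role text; a reviewer or planner would be another `StageSpec` with different lists fed to the same model.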


The Surprisingly Important Cache Layer

Long-running AI workflows become fragile very quickly if state only exists in the conversation context.

So I ended up experimenting with lightweight persistent cache files to preserve workflow state between sessions.

Mostly simple JSON.

Two conceptual files ended up doing most of the heavy lifting.

pipeline.json

Stores the live workflow state.

Things like:

  • requirements
  • implementation strategy
  • chunk statuses
  • validation outputs
  • files changed
  • closure verdicts

This means somebody can disappear halfway through a large task, come back tomorrow morning and ask:

"Where were we?"

...and the workflow reconstructs itself properly.

Without durable state, long-running sessions feel weirdly fragile.
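
A minimal sketch of that kind of state file follows. The field names (`chunks`, `id`, `status`) and the helpers `save_state`, `load_state` and `resume_summary` are assumptions made for illustration, not the real schema. The write goes through a temporary file so an interrupted session cannot leave a half-written pipeline.json behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_state(path: Path, state: dict) -> None:
    """Write the state atomically: temp file first, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)


def load_state(path: Path) -> Optional[dict]:
    """Return the saved state, or None when no workflow is in progress."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def resume_summary(state: dict) -> str:
    """Answer "Where were we?" from the saved chunk statuses."""
    chunks = state.get("chunks", [])
    done = [c["id"] for c in chunks if c["status"] == "done"]
    pending = [c["id"] for c in chunks if c["status"] != "done"]
    next_up = pending[0] if pending else "cleanup validation"
    return f"{len(done)}/{len(chunks)} chunks done; next: {next_up}"
```

With two chunks saved, one of them marked `done`, `resume_summary` would report one of two chunks complete and name the pending chunk as the next step.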


last-gate.json

This one is simple conceptually, but it saves a surprising amount of time.

It stores a freshness stamp recording the last successful validation chain.

If lint, tests, builds and validation all passed 90 seconds ago and nothing has changed since, there's often no reason to rerun everything immediately for a downstream step.

That sounds small, but across large multi-chunk flows it can save hours.

Especially once validation chains start becoming expensive.
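
The freshness check might look something like this. Everything here is an assumption for the sake of the sketch: the function names, the five-minute default window, and especially the `fingerprint` argument, which stands for whatever identifies the current tree state (a commit hash, or a hash of the changed files). A time-only stamp would happily vouch for code that changed a second after the last green run.

```python
import json
import time
from pathlib import Path
from typing import Optional


def write_gate(path: Path, fingerprint: str, commands: list[str],
               now: Optional[float] = None) -> None:
    """Record that `commands` all passed for the given tree fingerprint."""
    stamp = {
        "passed_at": time.time() if now is None else now,
        "fingerprint": fingerprint,
        "commands": commands,
    }
    path.write_text(json.dumps(stamp, indent=2), encoding="utf-8")


def gate_is_fresh(path: Path, fingerprint: str, max_age_seconds: float = 300.0,
                  now: Optional[float] = None) -> bool:
    """True only if a recent stamp exists for exactly this tree state."""
    try:
        stamp = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return False  # no stamp, or an unreadable one, means rerun
    if stamp.get("fingerprint") != fingerprint:
        return False  # the code changed since the last green run
    age = (time.time() if now is None else now) - stamp.get("passed_at", 0.0)
    return 0 <= age <= max_age_seconds
```

A downstream step would call `gate_is_fresh` first and only fall back to the full validation chain when it returns False.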


The Guard Rails Matter More Than The Automation

The actual "AI coding" part is almost less interesting to me now than the guard rails around it.

Three safeguards ended up becoming particularly important.

Scope Guard

If a supposedly small task turns out, mid-flight, to involve schema changes or broader architectural impact, execution stops.

The task gets reclassified instead of silently expanding.

This catches scope creep when the AI notices it, not three commits later when the human notices it.

That timing difference matters quite a lot.
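
As a rough sketch, the guard is a check the executor runs whenever it learns something new about what the change touches. The names `guard_scope` and `ScopeExceeded` and the per-tier file limits are hypothetical; the idea is only that exceeding the tier raises an error carrying a suggested reclassification instead of letting the work continue.

```python
SENSITIVE_AREAS = frozenset(
    {"auth", "billing", "infrastructure", "migrations", "schemas", "security"}
)
MAX_FILES = {"small": 1, "medium": 8}  # assumed limits, for illustration


class ScopeExceeded(Exception):
    """Raised mid-flight so the router can reclassify instead of expanding."""

    def __init__(self, reason: str, suggested_tier: str) -> None:
        super().__init__(reason)
        self.suggested_tier = suggested_tier


def guard_scope(tier: str, touched_files: set[str], touched_areas: set[str]) -> None:
    if tier == "large":
        return  # already on the full process
    risky = sorted(touched_areas & SENSITIVE_AREAS)
    if risky:
        raise ScopeExceeded(
            f"{tier} task reached sensitive area(s): {', '.join(risky)}", "large"
        )
    if len(touched_files) > MAX_FILES[tier]:
        next_tier = "medium" if tier == "small" else "large"
        raise ScopeExceeded(
            f"{tier} task now touches {len(touched_files)} files", next_tier
        )
```

The router would catch `ScopeExceeded`, stop execution, and go back through the confirmation pause with the suggested tier.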


Closure Verdicts

The verification stage doesn't simply say "looks good".

It produces structured outcomes:

  • PASS
  • PASS_WITH_NOTES
  • FAIL

And importantly, those verdicts require evidence from validation output.

Not confidence.

That distinction becomes increasingly important as models get more persuasive.
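
One way to make "evidence, not confidence" structural is to refuse to construct a verdict without it. In this sketch the three verdict names come from the list above; `Evidence`, `Closure` and the specific consistency rules are assumptions about how such a check could be enforced.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    PASS_WITH_NOTES = "PASS_WITH_NOTES"
    FAIL = "FAIL"


@dataclass(frozen=True)
class Evidence:
    """One validation command and what it actually returned."""
    command: str
    exit_code: int
    output_excerpt: str


@dataclass(frozen=True)
class Closure:
    verdict: Verdict
    evidence: tuple[Evidence, ...]
    notes: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if not self.evidence:
            raise ValueError("a verdict needs validation output, not confidence")
        failed = [e.command for e in self.evidence if e.exit_code != 0]
        if failed and self.verdict is not Verdict.FAIL:
            raise ValueError(f"cannot pass with failing commands: {failed}")
        if self.verdict is Verdict.PASS_WITH_NOTES and not self.notes:
            raise ValueError("PASS_WITH_NOTES requires at least one note")
```

A model can still write a persuasive summary, but it cannot produce a `PASS` object while a recorded command has a non-zero exit code.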


Final Validation Ownership

Only the final validation stage is allowed to write the final successful validation stamp.

Not the router.

Not the executor.

This guarantees the "green light" always corresponds to a real validation chain that genuinely ran end-to-end.

Not just a vaguely optimistic:

"Yeah this probably works."

Which AI systems are unfortunately very capable of producing.
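
Sketched in code, that ownership rule is a single write path that checks who is calling and what actually ran. The stage name `cleanup-validation`, the function `write_final_stamp` and the stamp fields are hypothetical. This enforces a convention inside the orchestration code; it is not a security boundary, since anything with file access could still write the file directly.

```python
import json
import time
from pathlib import Path

STAMP_OWNER = "cleanup-validation"  # hypothetical name for the final stage


def write_final_stamp(path: Path, stage: str, exit_codes: dict[str, int]) -> None:
    """Record a green light, but only for the owning stage and only when
    every validation command actually ran and exited with 0."""
    if stage != STAMP_OWNER:
        raise PermissionError(f"{stage} may not write the final validation stamp")
    if not exit_codes:
        raise ValueError("no validation commands were run")
    failed = sorted(cmd for cmd, code in exit_codes.items() if code != 0)
    if failed:
        raise ValueError(f"validation failed: {', '.join(failed)}")
    stamp = {
        "stage": stage,
        "passed_at": time.time(),
        "commands": sorted(exit_codes),
    }
    path.write_text(json.dumps(stamp, indent=2), encoding="utf-8")
```

Taking exit codes rather than a boolean is the design choice that matters here: the caller has to hand over results, not an opinion.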


What I've Seen In Practice

The benefits ended up being more practical than theoretical.

Small fixes stay small

A tiny change no longer drags half the pipeline into existence unnecessarily.

That alone saves huge amounts of wasted time.

Medium work avoids over-engineering

Skipping heavyweight orchestration for bounded feature work saves both tokens and cognitive overhead without noticeably reducing quality.

Large work gets safer

The bigger flows now include mandatory review passes and validation ownership by default.

Broken migrations, auth mistakes and accidental guardrail bypasses get caught far earlier.

Sessions survive interruption

This one matters more than I expected.

Long AI workflows used to feel incredibly fragile. Lose context and everything collapsed.

Now work can pause and resume properly.

That sounds obvious, but it dramatically changes usability.


Things I'd Still Improve

The system definitely isn't "finished".

A few obvious weak spots still exist.

I don't yet track misclassification metrics properly, which means improving the routing matrix still relies too much on instinct.

The cache layer will probably need proper concurrency handling once multiple agents start operating in parallel against the same repository.

Some validation behaviour should probably become declarative configuration instead of hardcoded orchestration logic.

And honestly, I think there still needs to be a proper "is this prompt even clear?" validation pass before classification begins.

A surprising amount of downstream chaos still originates from vague human requests.

Which admittedly is also true outside AI systems.


The Bigger Takeaways

I don't think people need this exact approach.

But there are a few principles here that probably generalise fairly well.

1. Classify Before Acting

Different sizes of work deserve different levels of process.

A typo fix and an auth migration should not move through the same workflow.

That sounds obvious when stated directly, but a lot of AI tooling still treats them surprisingly similarly.


2. Separate Roles Explicitly

Even if you only use one model, separating planner, implementer and reviewer roles changes behaviour massively.

Role clarity reduces drift.

The model stops approving its own assumptions quite so aggressively.


3. Durable State Matters

Long-running work needs somewhere persistent to live.

JSON. Markdown. Database rows. Doesn't really matter.

Without durable state the system keeps re-deriving context from scratch every session like somebody waking up with partial amnesia.


4. Confirmation Pauses Are Healthy

This might honestly be the biggest one.

Letting the AI announce its plan and wait before touching files feels slower initially, but usually saves enormous amounts of expensive correction later.

The pause exists exactly at the boundary between cheap mistakes and expensive mistakes.

That's where process probably matters most.


Final Thoughts

The interesting part for me isn't really the specific implementation details.

It's the idea that AI-assisted engineering probably works best when the structure lives in the process rather than in blind trust of the model itself.

The prompts define behaviour.

The routing defines safety.

The validation defines confidence.

And the human still sits above the system making judgement calls when things become ambiguous or risky.

Honestly, that balance currently feels healthier to me than either extreme of:

"AI replaces engineers"

or:

"AI is useless."

The reality feels much more like orchestration.

The tools have become dramatically more capable, but good engineering judgement still seems to be the thing holding the entire structure together.

Filed under: essay · Permalink: /blog/right-sizing-ai-coding-work/