The live gate you cannot skip for an agent that pushes code.

Imagine handing someone a key to your house and a note that says: tidy up, then leave. You'd want to trust them completely before you did that — not because you expect them to steal, but because once they're inside, alone, with a key, they can. The trust has to be earned before the door opens, not assumed because the note was polite.

We are building something like that: small software helpers that pick up a job, make the change to our code, and submit it for review, without a person watching each step. A human still has the final say on whether anything goes live; the helper never gets that far. But it does act in our name. It signs the work as us. That signing is the key to the house, and it is the part that has to be right before anything else matters.

We tested the helper as if it might turn on us

So before we built the helper itself, we built the careful part: the piece that supervises the work and walks it through review. And then we did something that feels almost rude. We assumed it could not be trusted, and we tried to break it ourselves, three separate times, each round looking for the holes the last round left behind.

The first two rounds turned up twenty-eight separate weaknesses. We fixed every one. The third round found two more, neither of them dangerous, and we fixed those too. Each round found fewer than the one before, which is the signal you want: the thing is getting safer, not just different.

Two of the weaknesses are worth describing, because they sound like paranoia until you see them.

The first was a spare key left in a pocket. Our helper happened to be carrying a powerful permission slip it had no reason to hold. Nobody handed it over on purpose; it was just there, in the kit the helper inherited. A helper that can look in its own pockets could have used that slip to let itself do far more than its job required. The fix was to stop trying to remove the dangerous items one by one, and instead start from empty and hand the helper only the exact things it needs. A short, deliberate list beats a long list of bans you might forget to update.

The second was a side door. We had locked the obvious entrance so the helper couldn't run anything it shouldn't while doing its work. The obvious lock wasn't enough. There was more than one way in, and the lock only covered the front. We found the other doors, closed them, and took a photograph of the room before and after the helper went in, so we could prove nothing had moved that shouldn't have.

Something that can sign work in your name can, until you've proven otherwise, do anything that signature allows. Trust is earned at the gate, not granted by good intentions.

Why the rehearsal beat the dress rehearsal

By now the supervising piece had passed every check we could write on paper. All green. That is the normal standard, and it was not enough.

Here is the catch with checks written on paper. When you test something in isolation, you have to pretend the outside world responds the way you imagine it will. You write down what you expect the world to say, and then you test against your own expectation. If your expectation is wrong, every test happily agrees with you, and you learn nothing.

So instead of only pretending, we ran the whole thing for real, against a throwaway copy we were happy to delete if it caught fire. That one real run found two problems every paper check had sailed straight past. The telling one: we had assumed a particular message would arrive in a particular form, and the real world delivered it in a completely different form. Every test we'd written was built on that wrong assumption, so every test agreed with us. The live run was the only thing in the building willing to disagree.

That is the whole point. A pretend test can only check the world you imagined. It cannot check the world that actually exists. For an everyday feature, the gap between those two is a small annoyance you patch later. For something that signs work in your name, the gap is the difference between a tool that works and a tool that quietly does the wrong thing, with your signature on it.

The rule we kept

We didn't let the helper anywhere near our real work until it had survived all three rounds of us attacking it and one honest run against something disposable. Not a polished demo. Not the almost-real environment. A throwaway, where a mistake costs nothing.

For anything where a mistake is expensive, the real-world rehearsal is not the last box you tick. It comes before you let the thing touch anything you'd be sad to lose.

For the technical reader

The "helper" is part of a small fleet of coding agents: each one takes a work item, writes the change, opens a pull request, and drives the code review to green unattended. The agent never merges and never deploys (a human owns that gate), but it commits and pushes under your identity, which is the whole reason for the rigour below.

What we built first was not the coder but the driver: an 840-line Python program that sits between the coder and the reviewer. Its only job is to run the review loop safely: attach to an open PR, request a full review pass, parse the result, apply fixes, push, and repeat until the review is clean or it hits a boundary it isn't allowed to cross.
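The loop described above can be sketched in a few lines. This is a minimal illustration, not the driver itself: the names `ReviewResult`, `drive_review`, `request_review`, `apply_fixes` and `push` are hypothetical, and the round cap is an assumed stand-in for "a boundary it isn't allowed to cross".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReviewResult:
    """Parsed outcome of one review pass (hypothetical shape)."""
    findings: List[str]

    @property
    def clean(self) -> bool:
        return not self.findings


def drive_review(
    request_review: Callable[[], ReviewResult],
    apply_fixes: Callable[[List[str]], bool],
    push: Callable[[], None],
    max_rounds: int = 5,  # assumed boundary; the real limits are not stated
) -> str:
    """Loop until the review is clean or a boundary is hit.

    Returns "clean", "blocked" (a fix was refused) or "exhausted".
    """
    for _ in range(max_rounds):
        result = request_review()
        if result.clean:
            return "clean"
        # apply_fixes returns False when a fix would cross a boundary.
        if not apply_fixes(result.findings):
            return "blocked"
        push()
    return "exhausted"
```

Passing the three steps in as callables keeps the loop itself free of network and git calls, so the same function can be driven by mocks in a unit test and by real clients in a live run.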

We hardened it through three adversarial review passes, each trying to break what the previous pass had left standing. Convergence was 20 → 8 → 2: the first two passes found 28 defects (all fixed), and the third found two more, both non-exploitable (also fixed).

The two findings worth naming:

Credential exposure. The coder's environment inherited a high-trust internal token. Nothing in the code handed it over deliberately. It was simply present in the environment. A coder that can read its own environment could have used that token to mint a push credential it was never meant to have. Fix: switch from denylisting secrets to allowlisting the environment, so the coder receives exactly what it needs and nothing else.

Git hooks bypass. We disabled git hooks before running the coder so it couldn't execute arbitrary code during a commit. We used the obvious core.hooksPath override, and that override does not cover every execution path git exposes: two uncovered paths could still run a program during normal git operations. We closed them and fingerprinted the repository before and after the coder touched it.
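Both fixes can be sketched briefly. This is an illustration under stated assumptions, not the driver's code: the allowlist contents (`ALLOWED_ENV`) and the helper names are hypothetical, and the text does not say what the real fingerprint covers, so this one simply hashes every file under a directory.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Mapping

# Hypothetical allowlist: the real list is not given in the text.
ALLOWED_ENV = ("PATH", "HOME", "LANG")


def allowlisted_env(
    parent: Mapping[str, str], allowed: Iterable[str] = ALLOWED_ENV
) -> Dict[str, str]:
    """Start from empty and copy only the named variables."""
    return {name: parent[name] for name in allowed if name in parent}


def fingerprint_tree(root: Path) -> str:
    """Hash every file's relative path and bytes under root, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

The dictionary from `allowlisted_env(os.environ)` can be passed as the `env=` argument of `subprocess.run`, so the child process never sees an inherited token. Comparing `fingerprint_tree` output taken before and after a run shows whether any file was added, removed or changed; this sketch does not capture file permissions or anything stored outside the hashed directory.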

Then the live gate. By the end of hardening the driver had a full unit suite, all green, with every external service mocked. That is the normal standard, and it was insufficient. We ran the driver against a throwaway repository with a real PR and a real reviewer (kept generic here; the reviewer is a third-party code-review service). The run surfaced two bugs the mocks had sailed past. The instructive one: our code assumed the reviewer's "actionable comments" summary arrives as an ordinary PR comment. It does not. It arrives as a review, a different object on a different surface of the API. Every mock encoded our wrong assumption, so every test agreed with us; the live integration was the only thing that disagreed.
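The failure mode is easy to reproduce in miniature. The sketch below is illustrative only: the dictionary shapes, the `MARKER` text and the function names are hypothetical and do not describe the reviewer's real API. It shows how a mock built on the wrong assumption keeps a wrong parser green.

```python
from typing import Dict, List, Optional

MARKER = "actionable comments"  # hypothetical marker text


def summary_from_comments(pr: Dict[str, List[dict]]) -> Optional[str]:
    """Original assumption: the summary is an ordinary PR comment."""
    for comment in pr.get("comments", []):
        if MARKER in comment["body"].lower():
            return comment["body"]
    return None


def summary_from_any_surface(pr: Dict[str, List[dict]]) -> Optional[str]:
    """Check reviews as well as comments."""
    for surface in ("reviews", "comments"):
        for item in pr.get(surface, []):
            if MARKER in item["body"].lower():
                return item["body"]
    return None


# A mock built on the wrong assumption, and a payload shaped like the live result.
mocked_pr = {"comments": [{"body": "Actionable comments: 2"}], "reviews": []}
live_like_pr = {"comments": [], "reviews": [{"body": "Actionable comments: 2"}]}
```

Against `mocked_pr` both functions find the summary, so a test suite built on that mock passes either way. Against `live_like_pr` the first function returns `None`, and only a run against the real service supplies that second payload.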

We did not deploy the driver until it passed all three adversarial passes and the live gate against a disposable repo: not staging, not a demo, a throwaway where a mistake costs nothing. For security-critical automation, the live integration test is neither optional nor last.
