The intermittent error that was a ghost, not a bug.

We were building the booking system for a medical-conference client. Someone fills in the form, and their reservation gets saved. Simple enough.

Then it started failing. Roughly four in every ten people who tried to book got an error instead of a confirmation. The other six were fine. There was no pattern we could see, and nothing in our logs explained why.

The first thing you assume, naturally, is that you have written a bug.

Two clues we ignored for too long

Two facts were sitting in front of us the whole time, and we walked past both of them for hours.

The first: when we ran the exact same booking from a laptop, it worked every single time. Not most of the time. Every time. The failures only happened out on the live public site.

The second: we put up two identical copies of the system, and both failed at roughly the same rate. A real code bug doesn't behave like that. If the problem were in our code, it would fail on a particular booking: a certain email address, a missing field, something you could reproduce on demand. Ours failed on no particular booking at all. It was a coin toss, and the coin came up "error" about four times in ten no matter what we sent it.

That is the fingerprint of a delivery problem, not a code problem. A code bug fails on what you send. A delivery problem fails on where the request happens to land. We had the evidence to tell those two apart, and we spent the morning looking the other way.

Half a day in the wrong place

So we went digging into the plumbing that carries requests across the internet. We changed how the connection was made, and it seemed to fix things. For about four hours. Then the failures came back. We tried several more adjustments at that same level. Each one appeared to help for a while, then stopped.

That is the cruellest kind of false signal. A fix that holds for a few hours feels like a fix. It was actually just luck about which path the next batch of requests happened to take.

A fix that holds for four hours and then fails again is not a fix. It is a lucky streak that ran out.

The real answer was almost insulting. Our system was reachable through a kind of relay: a piece of software whose job is to forward requests from the public internet to our server. There was supposed to be one of those relays running. There were two.

The second one was a leftover. Days earlier, during setup, one had been started on a developer's laptop and never shut down. It quietly kept running in the background, still registered as a valid way to reach our system. Except it couldn't actually reach anything, because it wasn't connected to the real server.

While both relays were connected, the traffic was split evenly between them. Half the requests went to the working one and got a confirmation. Half went to the ghost on the laptop, which had nowhere to send them, and those people got the error. As the laptop slept and woke, the ghost dropped in and out, so the share of failures drifted around. That drift was our "random four in ten." It was never really random. It was a coin toss between a relay that worked and one that wasn't really there.

The fix took seconds: shut down the leftover relay on the laptop. The moment we did, every request started succeeding.

The lesson worth keeping

When something is failing only some of the time, and you can prove the core system itself is healthy, the very first thing to check is how many doors lead into it. An extra forgotten door (left open during some setup session and never closed) will send a chunk of your visitors into an empty room. You are not debugging your system. You are debugging a stranger nobody remembered inviting.

There is a habit underneath this that I think is worth naming. We reach for the complicated explanation because it flatters us. Fiddling with deep technical settings feels like real engineering. "You left something running on a laptop last week" feels like an embarrassment. So the cheap, slightly humiliating check is exactly the one we skip, and it is almost always the one that would have saved the day.

The check that takes ten seconds and might make you look careless should come before the one that takes four hours and makes you look thorough.

We did add one safety net to the system itself, so that a hiccup like this would quietly retry instead of failing the customer. But we were careful about which actions we allowed it to retry. That distinction matters more than it sounds. Blindly retrying the wrong kind of action doesn't fix a booking; it creates three copies of it. We chose to fix the real cause rather than paper over the symptom and make a worse mess.

The cause was a door nobody knew was open. The half-day was the price of not looking.

Under the hood

The system was a serverless function on Cloudflare Pages that wrote each reservation to a self-hosted database sitting behind a Cloudflare Tunnel. The edge was returning "error code: 502" on roughly 40% of requests, with no stack trace and no log line on the origin.

Two facts framed it as a routing problem from the start: the same request from a laptop succeeded 100% of the time against the same origin, and two byte-identical deploys failed at the same ~40% rate. Random with respect to the request, fixed with respect to the rate: that is a routing signature, not a logic one.
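The distinction can be turned into a quick check over probe results. The sketch below is a heuristic, not something we ran at the time: it assumes you have replayed requests and recorded, for each one, a key identifying the payload and whether it succeeded. The function name and the three labels are made up for illustration.

```python
from collections import defaultdict


def classify_failures(results):
    """Classify a batch of probe results.

    results: iterable of (request_key, succeeded) pairs, where request_key
    identifies the payload (hypothetical, e.g. a hash of the request body).

    Returns:
      "healthy"       if nothing failed,
      "routing-like"  if at least one identical request both succeeded and failed,
      "request-like"  if failures exist but each payload behaves consistently.
    """
    outcomes = defaultdict(set)
    for key, ok in results:
        outcomes[key].add(bool(ok))

    # No payload ever failed: nothing to classify.
    if not any(False in seen for seen in outcomes.values()):
        return "healthy"

    # The same payload succeeding and failing points away from the code
    # and toward where the request happened to land.
    mixed = [key for key, seen in outcomes.items() if seen == {True, False}]
    return "routing-like" if mixed else "request-like"
```

Replaying one known-good booking a few dozen times against the live site is enough to feed this: if that single payload produces both outcomes, the payload is not the problem.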

We burned half a day in the transport layer anyway. We forced HTTP/2 instead of QUIC (which held for ~4 hours before the 502s returned), checked MTU, looked at connection-tracking limits on the origin box, and restarted and then upgraded the tunnel connector. Each change "worked" briefly because it shifted which path the next requests took.

The diagnostic we should have run first: list the connectors registered to the tunnel. A healthy named tunnel has one connector group, uniform in architecture and version. Ours had two. The second was a stale connector still running as a Windows service on a developer laptop, left over from an earlier setup session: auto-start, quietly alive, registered against the same tunnel.
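A minimal sketch of that uniformity check follows. It assumes the connector list has already been obtained (for example from the Cloudflare dashboard or the output of `cloudflared tunnel info`) and copied into plain records. The field names `hostname`, `arch` and `version`, and the sample values, are assumptions made for illustration, not the tool's actual output format.

```python
from collections import Counter


def connector_groups(connectors):
    """Count connectors per (hostname, arch, version) group.

    connectors: list of dicts with hypothetical keys
    "hostname", "arch" and "version".
    """
    return Counter((c["hostname"], c["arch"], c["version"]) for c in connectors)


def groups_needing_review(connectors):
    """Return every group when there is more than one, else an empty list.

    The code cannot know which group is the stale one, so it only reports
    that the tunnel is not uniform and leaves the judgement to a human.
    """
    groups = connector_groups(connectors)
    return sorted(groups) if len(groups) > 1 else []


# Illustrative records only; the hostnames and version strings are invented.
sample = [
    {"hostname": "origin-box", "arch": "linux_amd64", "version": "A"},
    {"hostname": "dev-laptop", "arch": "windows_amd64", "version": "B"},
]
for group in groups_needing_review(sample):
    print("review connector group:", group)
```

An empty result means one uniform group. Anything else is worth a ten-second look before touching the transport layer.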

Cloudflare load-balances across every connector registered to a tunnel. Half the requests were routed to the laptop connector, which had no origins configured, so the edge returned 502; the other half hit the real connector and returned 200. The ratio drifted as the laptop slept and woke. The fix was to stop and disable the rogue Windows service; the tunnel dropped back to one connector group and the edge went to 12 out of 12 requests succeeding.
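The arithmetic behind the drift can be sketched in a few lines. This is a simplified model under two assumptions that go beyond what we measured: the ghost connector receives a fixed share of traffic while the laptop is awake, and no traffic while it is asleep. The function and parameter names are illustrative.

```python
def expected_failure_rate(ghost_share_when_awake, awake_fraction):
    """Expected fraction of failed requests under the simplified model.

    ghost_share_when_awake: share of traffic sent to the dead connector
        while it is registered (0.5 for an even split).
    awake_fraction: fraction of time the laptop is awake and connected.
    """
    for value in (ghost_share_when_awake, awake_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("both arguments must be between 0 and 1")
    # Requests only fail when the ghost is connected AND they are routed to it.
    return ghost_share_when_awake * awake_fraction


# An even split with the laptop awake 80% of the time gives about 40%.
# The 80% figure is back-calculated for illustration, not a measurement.
print(round(expected_failure_rate(0.5, 0.8), 2))
```

Under this model the failure rate can never exceed the ghost's share, and it falls toward zero whenever the laptop is closed, which is why each transport tweak could look like a fix for a few hours.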

The mitigation in the function was an escalating-backoff retry, applied only to the idempotent calls (every operation on that backend was a safe upsert or update-by-id), which took the reserve flow back to 100% even while the ghost was still routing badly. Never retry a non-idempotent write to mask an infrastructure fault: you turn one lost record into three duplicates.
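The shape of that retry is roughly the following. This is a sketch in Python for readability, not the code from the Pages function itself. The delay values, the `TransientError` type and the `idempotent` flag are assumptions chosen for illustration; in practice the caller would raise `TransientError` on a 502 from the edge.

```python
import time


class TransientError(Exception):
    """Raised by the caller's operation for a retryable fault, e.g. a 502."""


def call_with_retry(operation, *, idempotent, delays=(0.2, 0.5, 1.0), sleep=time.sleep):
    """Run operation(), retrying with escalating delays only if it is idempotent.

    operation:  zero-argument callable that performs the backend call.
    idempotent: True only for calls that are safe to repeat
                (an upsert or an update-by-id, not a plain insert).
    delays:     seconds to wait before each retry (illustrative values).
    sleep:      injectable so the behaviour can be tested without waiting.
    """
    if not idempotent:
        # Never retried: repeating a non-idempotent write could duplicate it.
        return operation()

    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)

    # Final attempt after the last delay; any error now reaches the caller.
    return operation()
```

With three delays this makes at most four attempts. The important design choice is the early return for non-idempotent calls: a failed insert surfaces as an error the customer can see, rather than being silently repeated into duplicates.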
