TEARDOWN · Mar 25, 2026 · 5 MIN READ

Restarting one container took down four sites that had nothing to do with it.

From the build log. We restarted one small piece of plumbing to make a routine change. The change itself was harmless — but four unrelated public sites went dark, because of a shared layer almost nobody thinks to check.

Brynn
FOUNDER, TRANSFORMATE

Picture a building where four different shops all share one front door. Each shop is fine. The corridors are fine. But if you take that single front door off its hinges for a moment, all four shops lose their customers at once — not because anything is wrong inside the shops, but because everyone reaches them through the same doorway.

That is, almost exactly, what happened to us one afternoon.

The change was meant to be safe

We needed to make a small, routine update to one of our systems. To apply it, we restarted a single piece of plumbing — a small service that passes traffic along to the part doing the real work. We deliberately left the important stuff alone. We only bounced the messenger in front of it.

It felt like the careful choice. Don't touch the engine, just restart the doorman.

Within seconds, four unrelated public websites started failing. Not the one we were working on. Four others, with nothing to do with the change we'd made. And they didn't recover on their own. They simply stayed broken.

Why four sites went down for a one-site change

Here is the part we'd missed. All four of those sites were reaching the outside world through one shared connection point — the same front door from the analogy. The thing we restarted sat right at that door.

While our little service blinked off and on, the shared connection in front of it was still trying to hand visitors through to where it had been a moment ago. It was pointing at an address that, briefly, had nothing behind it. And crucially, it didn't fix itself while traffic kept arriving. It just kept failing, over and over, for every site behind that door.

So the damage had nothing to do with what we were actually changing. It had to do with the fact that four sites quietly depended on the same doorway — and we'd treated a shared-doorway event as if it were a private, one-room event.

The moment several things share one entrance, restarting anything at that entrance is a decision about all of them — not just the one you came to fix.

The fix was smaller than the outage

It turned out there was a gentle way to make our original change all along — a quiet nudge that tells the system "pick up the new setting" without taking anything offline. No restart. No front door coming off its hinges. The four other sites would never have noticed.

We'd reached past the gentle option and grabbed the heavy one, because the heavy one felt more certain to work. It wasn't. It just added a way for everything to go wrong.

What travels out of this

Two habits came out of that afternoon, and they apply far beyond our particular setup.

  1. Find out what shares the door before you touch it. If something sits in front of a service and routes visitors to it, treat that service as shared infrastructure, even when it looks like it serves one purpose. Restarting it is a decision about everyone who depends on it, not just the job in front of you.

  2. Prefer the quiet nudge over the full restart, every time one exists. Most well-built tools have a way to reload a setting in place, without disturbing anything that's connected to them. The restart is the blunt instrument you reach for when you don't realise the gentle one is there.

There's a quieter point underneath both, and it's the one I'd actually defend. On shared infrastructure, the cheapest, most boring action is usually the right one — and the heavier action almost never buys you the certainty it seems to promise. The restart didn't make our change land any better than the gentle nudge would have. It only handed us an outage. We paid for certainty we never received, in the currency of four sites being down.

Under the hood

The setup: a self-hosted Supabase instance shared by several services. We needed to expose a new Postgres schema to PostgREST — the layer that turns the database into a REST API. The list of exposed schemas is an environment variable (PGRST_DB_SCHEMAS), so the new schema has to be added and then picked up by the API layer.
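To make the "added" half of that concrete, here is a minimal sketch of appending a schema to a comma-separated PGRST_DB_SCHEMAS value without duplicating an entry. The helper name add_exposed_schema and the schema names are hypothetical, chosen only for illustration; the variable name is the only part taken from our setup.

```python
def add_exposed_schema(current: str, new_schema: str) -> str:
    """Return a PGRST_DB_SCHEMAS value with new_schema appended once."""
    # Split the comma-separated list, trimming padding and dropping blanks.
    schemas = [s.strip() for s in current.split(",") if s.strip()]
    # Only append if the schema is not already exposed.
    if new_schema not in schemas:
        schemas.append(new_schema)
    return ",".join(schemas)


if __name__ == "__main__":
    # Hypothetical schema names, for illustration only.
    print(add_exposed_schema("public, storage", "reports"))
```

Editing the value is the easy half. The rest of this section is about getting the running API layer to pick it up without touching the gateway.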

The "messenger we restarted" was the API gateway container sitting in front of PostgREST. It publishes its proxy on a fixed localhost port (127.0.0.1:54321). A Cloudflare tunnel sits in front of that port and serves four separate public hostnames off it. The tunnel holds long-lived keepalive connections to the origin — that's how it stays fast.

When we restarted the gateway container, the host port it was bound to went away and came back. The tunnel's keepalive connections were now pointing at nothing, but the tunnel kept serving traffic for all four hostnames over those stale origin connections, which is where the 502s came from. The tunnel does not re-establish those connections on its own while traffic is still flowing; under live load it just kept failing. The database, PostgREST, and the gateway itself all reported healthy at the origin throughout. The break was entirely at the edge.
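Because every origin-side check stayed green, the only check that would have caught this is one that goes in through the public hostnames. The sketch below is one way to do that, not what we ran at the time. The function names and the example hostnames are hypothetical, and the fetch parameter exists so the logic can be exercised without real network calls.

```python
from typing import Callable, Dict, Iterable
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return 0


def failing_at_edge(hostnames: Iterable[str],
                    fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Map each public hostname that fails through the edge to its status."""
    failures = {}
    for host in hostnames:
        status = fetch(f"https://{host}/")
        # Treat "no response" and any 5xx (including 502) as an edge failure.
        if status == 0 or status >= 500:
            failures[host] = status
    return failures


if __name__ == "__main__":
    # Hypothetical hostnames; substitute the ones your tunnel serves.
    print(failing_at_edge(["site-a.example", "site-b.example"]))
```

If the origin reports healthy and this returns anything at all, the problem is most likely somewhere between the edge and the origin, which is exactly where ours was.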

The correct reload needs no network-level change at all. PostgREST reloads its schema cache in place when you send the database a NOTIFY:

NOTIFY pgrst, 'reload schema'

The gateway never moves, so the four sites in front of it never notice. If PGRST_DB_SCHEMAS itself genuinely changed, you can recreate the PostgREST container alone — it isn't on the tunnel's critical path — and accept a sub-second blip on that one service. The gateway stays untouched on purpose.
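If you would rather send that NOTIFY from a script than from a SQL console, a minimal sketch could look like the following. It assumes you already hold a DB-API 2.0 connection to the database PostgREST listens on (opened with a driver such as psycopg, which is an assumption here, not part of our setup). The function name reload_postgrest_schema is hypothetical.

```python
NOTIFY_RELOAD_SCHEMA = "NOTIFY pgrst, 'reload schema'"


def reload_postgrest_schema(conn) -> None:
    """Ask PostgREST to reload its schema cache via a Postgres NOTIFY.

    conn is any DB-API 2.0 connection to the database PostgREST listens on.
    """
    cur = conn.cursor()
    try:
        cur.execute(NOTIFY_RELOAD_SCHEMA)
    finally:
        cur.close()
    # Postgres delivers notifications when the transaction commits,
    # so commit explicitly in case the driver is not in autocommit mode.
    conn.commit()
```

Nothing in this path touches a container or a port, which is the whole point: the only thing that changes is PostgREST's in-memory view of the schema.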

The transferable rule, stated technically: when a tunnel (or load balancer, or reverse proxy) multiplexes several hostnames through a single localhost port, restarting the service behind that port is a tunnel-level event, not a container-level one. The blast radius is everything that tunnel serves.
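One way to make that rule checkable before a restart is to write down which public hostnames route to which origin address and ask the question directly. This is a sketch under assumptions: the routes mapping is something you would maintain by hand or extract from your tunnel's configuration, the hostnames are hypothetical, and blast_radius is a name invented for this example.

```python
from typing import Dict, List


def blast_radius(routes: Dict[str, str], origin: str) -> List[str]:
    """Return every public hostname routed to the given origin address."""
    return sorted(host for host, target in routes.items() if target == origin)


if __name__ == "__main__":
    # Hypothetical hostnames; the shared origin port is the one from our setup.
    routes = {
        "site-a.example": "127.0.0.1:54321",
        "site-b.example": "127.0.0.1:54321",
        "site-c.example": "127.0.0.1:54321",
        "site-d.example": "127.0.0.1:54321",
        "docs.example": "127.0.0.1:8080",
    }
    print(blast_radius(routes, "127.0.0.1:54321"))
```

If the list that comes back is longer than the one site you came to fix, the restart is a tunnel-level event and deserves to be treated like one.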
