Most of the systems we run are reached through one front door. Not a door each — one door for all of them. A visitor arrives at the address, the door lets them through, and the right room is waiting behind it. It is an ordinary, sensible arrangement, and almost nobody gives that door a second thought until the morning it won't open.
One morning, it wouldn't open. Every public site we run went dark at the same moment — not one of them, all of them at once. And every part we knew how to check insisted it was perfectly healthy.
Everything was fine, and nothing worked
This is the disorienting part, so it's worth sitting with. The machines were powered on. The programs doing the actual work were running. The databases were up. If you logged in and asked any single component how it was doing, it answered: fine, no errors, business as usual.
Yet a member of the public typing in any of our addresses got an error, every time, on every site. Healthy parts, dead whole. When the inside of a building is spotless but nobody can get through the entrance, you have stopped having a problem with the rooms and started having a problem with the door.
The error code was the whole story
There's a quiet distinction hiding in error messages that most people never need to learn, and it saved us here. Some failures mean "I reached your building's address, but the front door would not open." Others mean "I got through the door fine — but the room I walked into is broken." They look equally like "the website is down" to a visitor, and they point in opposite directions for whoever has to fix it.
We were getting the first kind, uniformly, across every address. That single fact told us where not to spend the morning. The rooms were not the problem. There was no point reading application logs or restarting services. Something was wrong at the door itself, before any request ever reached the work behind it.
Two programs wanted the same door
Here is what had actually happened, in plain terms.
That front door can only be held by one program at a time. Our web server — the doorman whose entire job is to greet arriving visitors and pass them to the right room — normally holds it. That's correct and that's how it had always been.
But a second program on the same machine had, at some point, also been configured to use that exact door. For as long as the web server kept a firm grip on it, there was no argument. The second program simply never got a turn, and nobody noticed the latent conflict.
Then, overnight, the machine updated itself. A routine security patch — the kind that lands automatically, with nobody at the keyboard — needed to refresh the web server, and to finish the job it briefly restarted it. For a fraction of a second, the doorman set the key down.
That fraction of a second was all the second program needed. It took the door. When the web server came back and reached for the key it had held for months, the key was gone. It could not get in. So it did what a tidy program does when it cannot do its job: it gave up cleanly and stayed down. And every site standing behind that one doorman stayed dark with it.
The detail that stung: nobody had done anything. We had not deployed, not changed a setting, not touched the box. The machine had quietly maintained itself, and the harmless-looking restart inside that maintenance was exactly the gap the collision had been waiting for.
The second clue we nearly walked past
One more thing nearly tripped us. As we worked, one of our addresses started answering again while another stayed dead. That mismatch is easy to read as "it's recovering on its own." It isn't. It meant a second machine had the identical fault, independently, and we'd only fixed the first.
When everything is down but every health check reads green, stop inspecting the apps. The fault is almost always in the shared layer they all pass through — the one piece no single app feels responsible for.
A failure in a shared layer is rarely in just one place, because the thing that caused it — here, an automatic update — runs on every machine set up the same way. Fix the first one and you are not finished; you are halfway. Check every box that shares the pattern before you call it solved.
The fix, and the smaller fix that prevents it
The repair itself took under a minute once we understood it. Tell the squatting program to let go of the door. Watch the door come free. Start the web server, which immediately takes its rightful place. The sites came back the instant it did.
The durable fix was smaller still, and it's the one that matters. We gave the second program its own side entrance — a separate door of its own — so that the two can never again want the same one. One door, one owner, no negotiation. The collision we'd just spent a morning on became structurally impossible rather than merely unlikely.
What travels out of this
Three habits came out of that morning, and none of them are specific to our setup.
- Decide who owns the shared thing, and make it exclusive. Anything several programs can reach for (a doorway, a lock, a single port) needs one undisputed owner on that machine. Shared-by-default is a conflict you haven't had yet.
- Automatic updates are still changes. They just happen while you sleep. On an ordinary machine that convenience is a gift. On the one machine every visitor funnels through, an unsupervised restart is an unsupervised change to production, and it will choose its own moment.
- A green health check only certifies the room, never the doorway. If your monitoring asks each component "are you alive?" but never asks a real visitor "could you actually get in?", it will report all-clear through an outage like this one. Watch the front door from the outside, the way a customer experiences it.
The honest opinion underneath all three: convenience and your single most important machine pull in opposite directions, and you have to pick. Automatic security updates are sound policy almost everywhere — keep them. But the box that every site depends on has earned the right to be updated on purpose, with someone watching, rather than on a timer in the small hours. We treat that one box differently now, and we should have from the start.
Under the hood
The shared front door is TCP port 443 — where all HTTPS traffic terminates. On the affected origins, nginx owns 443 and reverse-proxies every public hostname behind it. In front of that sits Cloudflare, proxying the domains.
The symptom was uniform Cloudflare 521 across every hostname on the origin. The distinction the post leans on is real and worth internalising: 521 means Cloudflare reached the origin's address but the origin refused the TCP connection (the front door is shut), whereas 502 means the origin answered but the upstream application failed. A dead app surfaces as 502; a downed reverse proxy surfaces as 521. Seeing 521 everywhere is the signature of a shared edge being down, not a per-app fault, which is why reading application logs would have been wasted time.
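As a rough illustration of that triage rule, the sketch below maps a status code seen at the edge to the layer worth inspecting first. The function name and the layer labels are invented for this example; only the 521 and 502 meanings come from the text above.

```shell
#!/bin/sh
# Map an HTTP status seen at the edge to the layer to inspect first.
# The labels are this sketch's own shorthand, not Cloudflare terminology.
classify_edge_status() {
  case "$1" in
    521)     echo "front-door" ;;    # origin refused the connection: look at the reverse proxy
    502)     echo "upstream-app" ;;  # proxy answered, the app behind it failed
    2??|3??) echo "healthy" ;;
    *)       echo "other" ;;         # anything else needs its own reading
  esac
}

classify_edge_status 521
classify_edge_status 502
```

The useful property is the negative one: a "front-door" result on every hostname says the application logs can wait.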
The collision: the host's automatic security-update mechanism (unattended-upgrades on Debian/Ubuntu) applied an nginx package update. The package's post-install script, run by dpkg, restarts the service. During that restart window the socket [::]:443 was momentarily free, and a second process on the box, a mesh-VPN tool whose built-in reverse-proxy feature had been pointed at 443 to expose an internal admin service, grabbed it. nginx's restart then failed to re-bind:
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
and the unit settled into failed. The trap for the unwary: nginx -t passes — the configuration is valid. This is not a config error; it is a socket-ownership race. The diagnostic that ends the guesswork is asking who actually holds the socket:
ss -ltnp | grep ':443 ' # shows the VPN daemon on 443, not nginx
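The same question can be turned into a check that fails loudly instead of relying on someone reading the output. This is a minimal sketch, assuming the usual `ss -ltnp` column layout (state first, local address fourth, process list last); the function names are made up for this example, and `ss` generally needs root to show processes owned by other users.

```shell
#!/bin/sh
# Read `ss -ltnp` output on stdin and print each distinct process name
# that is listening on the given port.
port_owners() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      # The last column looks like users:(("name",pid=N,fd=N),...);
      # splitting on double quotes leaves the names at even positions.
      n = split($NF, parts, "\"")
      for (i = 2; i <= n; i += 2)
        if (!seen[parts[i]]++) print parts[i]
    }'
}

# Succeed only if nginx is the sole listener on 443 in the ss output on stdin.
nginx_owns_443() {
  owners=$(port_owners 443)
  [ "$owners" = "nginx" ]
}
```

Run as `ss -ltnp | nginx_owns_443 || echo "443 is held by something else"`. Keeping the parsing separate from the `ss` call means it can be exercised against saved output from an incident.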
Remediation, in order: clear the VPN tool's serve mapping to release 443, systemctl start nginx, then verify through Cloudflare with a live curl that returns a normal 2xx/3xx, not just systemctl is-active. Service-state green is not the same as reachable-from-the-edge green.
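The verification step might look something like the sketch below: request the public URL the way a visitor would and accept only a 2xx or 3xx. The function names and the ten-second timeout are choices made for this example, and any URL passed in is a placeholder for a real public hostname.

```shell
#!/bin/sh
# Fetch a URL through the public edge and print only the HTTP status code.
# curl prints 000 when no HTTP response was received at all.
edge_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Treat only 2xx and 3xx as "a visitor could get in".
status_is_reachable() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Report the outcome for one public URL; non-zero exit on failure.
verify_edge() {
  code=$(edge_status "$1")
  if status_is_reachable "$code"; then
    echo "OK $code $1"
  else
    echo "FAIL $code $1"
    return 1
  fi
}
```

For example, `verify_edge https://www.example.com/` after starting nginx. A 521 here with `systemctl is-active nginx` reporting active would mean the service is up but still not reachable from the edge.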
Prevention is a single rule: nginx owns 443 exclusively on these boxes. Any localhost or container service that needs to be reachable over HTTPS on the private network is re-homed to a non-443 port, so an update-triggered restart can never hand the socket to the wrong process again.
The multi-origin tell: the per-hostname 521 split — one subdomain serving, another dead — meant a second origin had taken the identical update and the identical hit. The same automatic patch ran fleet-wide; every box fronted by nginx was a candidate, and "fixed the first one" was not "fixed it."
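One way to avoid stopping at the first repaired box is to probe every public hostname and group the failures by origin. The sketch assumes a hand-maintained inventory of "origin hostname" pairs, which is an invention of this example rather than something the incident relied on; the origin labels and hostnames shown in the usage are placeholders.

```shell
#!/bin/sh
# Read "origin hostname" pairs on stdin, probe each hostname over HTTPS,
# and print "origin hostname status" for each one.
probe_inventory() {
  while read -r origin host; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$host/" </dev/null)
    echo "$origin $host $code"
  done
}

# From "origin hostname status" lines on stdin, print each origin that
# still has at least one hostname answering 521.
origins_still_down() {
  awk '$3 == 521 && !seen[$1]++ { print $1 }'
}
```

With an inventory file of lines such as `origin-a www.example.com`, running `probe_inventory < inventory.txt | origins_still_down` lists the boxes that still need the fix. An empty result is the point at which "fixed the first one" becomes "fixed it."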