The feature was easy to describe. Every trainee who joins the HOFMI workspace should land in a space where their own AI assistant is already waiting — no command, no setup, just open it and start talking. We had it working in testing within a day.
Then we switched it on for real users, and it broke.
A filing cabinet with one folder filed wrong
Think of the system as a filing cabinet. Almost every folder in it was created the same way, by the same machine, so they all look identical: same labels, same layout, filed in the same place. Our software was written to find folders that look like that. And it did, every time we tested it.
But one folder had been made by hand, early on, before the machine that makes the rest existed. It held the same information, but it was labelled differently and filed somewhere else. When our software went looking for it in the usual place, it wasn't there. So the software assumed no folder existed yet and tried to create a new one — and the cabinet refused, because a folder for that person was already sitting there. That collision is what users saw as an error.
Nothing was actually wrong with our software on its own terms. It was wrong about reality. It had been taught what folders are supposed to look like, and it trusted that picture instead of going and looking at the real one.
The rulebook tells you what a record is allowed to look like. It does not tell you what your records actually look like. Those are different questions, and only one of them reaches a real user.
The part you only learn the hard way
The folder that broke us was the oldest and most important one — the original, hand-made account. Our tests never caught the problem because our tests only ever created the neat, machine-made kind. The messy original was the one record they could never produce, and it was exactly the one that mattered.
That is the lesson worth keeping. The records that break you in production are rarely the ones your software made. They are the old ones, made by hand before the rules were settled, the ones nobody ever went back and tidied up. They sail through every test and then fail on the one account you most wanted to work.
So the rule we now follow is simple. Before trusting that real data matches the tidy picture in our heads, we go and read the actual record first — especially anything that was set up by hand. And we check the fix by walking through the system exactly as a brand-new trainee would, on the live site, rather than on a stand-in we built to be convenient.
Under the hood
The feature gives every HOFMI trainee a default channel where their own agent replies freely.
The 500. The query that finds an agent's default channel was written against the shape the schema implies: a channel of kind='agent' with the agent's id stored on the row. Every channel created through the provisioning code looked exactly like that, so the tests were green.
The one agent that mattered had been built by hand, before the provisioning path existed. It was stored as a plain kind='channel' row with a null agent_id, and its link to the agent lived only in a separate join table (the many-to-many record binding agent to channel). The find query never looked there. It missed the existing channel, fell through to the create path, and collided with a UNIQUE constraint on a channel already in the database — hence the 500.
The fix. Make the join table the authoritative lookup — the primary path to resolve the channel, rather than the column the code had assumed. Once the channel is found, the agent is promoted to governor of that channel with its mention-only flag switched off, so it replies without being summoned.
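The collision and the fix described above, reproduced as an added example in Python against an in-memory SQLite database. This is not the HOFMI code: the table, column and channel names, the UNIQUE constraint sitting on the channel name, and the role and mention-only flag living on the join row are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: every table, column and channel name here is an assumption.
SCHEMA = """
CREATE TABLE channels (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL,
                       kind TEXT NOT NULL, agent_id INTEGER);
CREATE TABLE agent_channels (agent_id INTEGER NOT NULL, channel_id INTEGER NOT NULL,
                             role TEXT NOT NULL DEFAULT 'member',
                             mention_only INTEGER NOT NULL DEFAULT 1,
                             PRIMARY KEY (agent_id, channel_id));
"""

def seed():
    """Constructed data: one hand-made legacy channel, linked only via the join table."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO channels VALUES (1, 'agent-7', 'channel', NULL)")
    db.execute("INSERT INTO agent_channels (agent_id, channel_id) VALUES (7, 1)")
    return db

def find_by_column(db, agent_id):
    """Original lookup: trusts the shape the schema implies."""
    row = db.execute("SELECT id FROM channels WHERE kind='agent' AND agent_id=?",
                     (agent_id,)).fetchone()
    return row[0] if row else None

def find_by_binding(db, agent_id):
    """Fixed lookup: join table first (assumes the first binding is the default channel)."""
    row = db.execute("SELECT channel_id FROM agent_channels WHERE agent_id=? "
                     "ORDER BY channel_id LIMIT 1", (agent_id,)).fetchone()
    return row[0] if row else find_by_column(db, agent_id)

def ensure_default_channel(db, agent_id, find):
    """Find or create the default channel, then promote the agent to governor."""
    channel_id = find(db, agent_id)
    if channel_id is None:
        # Raises sqlite3.IntegrityError when the name is already taken (the 500).
        channel_id = db.execute(
            "INSERT INTO channels (name, kind, agent_id) VALUES (?, 'agent', ?)",
            (f"agent-{agent_id}", agent_id)).lastrowid
    db.execute("INSERT INTO agent_channels VALUES (?, ?, 'governor', 0) "
               "ON CONFLICT (agent_id, channel_id) DO UPDATE SET "
               "role='governor', mention_only=0", (agent_id, channel_id))
    return channel_id

def binding(db, agent_id, channel_id):
    """Return (role, mention_only) for the agent's binding to the channel."""
    return db.execute("SELECT role, mention_only FROM agent_channels "
                      "WHERE agent_id=? AND channel_id=?", (agent_id, channel_id)).fetchone()
```

Calling ensure_default_channel with find_by_column on the seeded database raises the integrity error; with find_by_binding it returns the legacy channel and flips the binding from member to governor.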
The second system. Free conversation needed changes on two systems that didn't know about each other. One flag lives in our agent platform and controls how a wake event is routed. But the agent runs on a different runtime, and that runtime's listener decides whether to reply based on the channel's name, independent of any database flag. Flip one and not the other and the agent either stays silent or talks in rooms it should leave alone. Both halves had to agree, and neither would tell you the other was wrong.
Verification. We didn't trust a local check. We drove the deployed endpoint over HTTPS as an actual trainee user — seed the session, log in, hit the create path — then read the live database to confirm the binding had flipped from member to governor exactly as designed. The check ran against the same surface a real first-day trainee touches, not a mock of it.