09 Dec, 2025
Here’s a common problem:
One subsystem (a service, a front end, or whatever) calls into another subsystem (another service, the back end, or whatever). There is a happy path, but things can go wrong.
For concreteness, let’s say the first subsystem is a front end for submitting hat specifications, and the second subsystem is a back end for processing the front end’s messages. If the request succeeds, the back end sends a success message. (It would probably also send information about the resulting object, but we don’t need to worry about that here.)
What if something can go wrong? Let’s say the back end needs to make sure that the hat size is between 6 and 8, inclusive. Now the back end sends either the success message or some message indicating the "invalid hat size" error state. The front end is designed to handle both cases.
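To make the shape of that exchange concrete, here is a minimal sketch in TypeScript; the names (`SubmitHatResponse`, `INVALID_HAT_SIZE`, and so on) are invented for illustration, not taken from any real system:

```typescript
// Hypothetical response shapes for the hat-submission exchange described above.
type SubmitHatResponse =
  | { status: "ok" }                                // the success message
  | { status: "error"; code: "INVALID_HAT_SIZE" };  // the one known error state

// The front end is designed to handle both cases.
function describeResponse(res: SubmitHatResponse): string {
  switch (res.status) {
    case "ok":
      return "Hat created.";
    case "error":
      return "Hat size must be between 6 and 8.";
  }
}
```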
So far, so good. Or not: already a lot can go wrong. An error message can be misunderstood as a malformed success message, or vice versa. The front end can send 7.5 or "7.5" instead of "7 1/2" (many permutations of this problem are possible). The back end can accidentally accept 77 because it’s handling everything as strings and its validation is ensuring only that the size is between 6 and 8. (If you’re scoffing at this: I’ve seen a lot worse.)
Put those to the side; I’m worried here about a whole different class of problems. They arise when more than one thing can go wrong: let’s say that the hat specifications (1) must conform to the size limitation, (2) may not exactly duplicate those of any existing hat in the system, and (3) must not contain a description field longer than 10,000 characters.
The back end now checks for these and throws an error if any validation fails. This is often done sequentially: it looks for the bad-size error state and throws the bad-size error if there’s a problem; then it looks for the duplicate-specification error state and throws that error if there’s a problem; and so on. Meanwhile, the front end (or some other module) maintains a mapping between back-end error messages and front-end error responses.
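Here is a rough sketch of what that sequential checking and the accompanying mapping tend to look like, again in TypeScript with invented names (`validateHat`, `BAD_SIZE`, etc.); it illustrates the pattern, not any particular codebase:

```typescript
interface HatSpec {
  size: number;
  description: string;
}

// Sequential validation: throw on the first failed check.
function validateHat(spec: HatSpec, existing: HatSpec[]): void {
  if (spec.size < 6 || spec.size > 8) {
    throw new Error("BAD_SIZE"); // the first check wins...
  }
  if (existing.some(h => h.size === spec.size && h.description === spec.description)) {
    throw new Error("DUPLICATE_SPEC"); // ...so this is never reported alongside BAD_SIZE
  }
  if (spec.description.length > 10_000) {
    throw new Error("DESCRIPTION_TOO_LONG");
  }
}

// Meanwhile, the front end (or some shared module) keeps a one-to-one map
// from back-end error codes to front-end responses.
const errorMessages: Record<string, string> = {
  BAD_SIZE: "Hat size must be between 6 and 8.",
  DUPLICATE_SPEC: "A hat with these specifications already exists.",
  DESCRIPTION_TOO_LONG: "Description must be 10,000 characters or fewer.",
};
```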
From this common-sensical line of development, we enter a world of pain:
1. The wrong-order problem
Let’s say the front end needs to take specific action for the duplicate-specification error state, and looks for the server response specific to that error. But the size validation fails first, so the server sends the bad-size response instead. That’s a problem.
2. The incomplete-information problem
Let’s say the front end expects to get a full analysis of the request from the back end, but the response it gets only has partial information. (And race conditions on either end could entail that it gets different partial information, nondeterministically.) That’s a problem.
3. The 2^n problem
Often, different combinations of error states call for different appropriate responses. In a paradigm where the back end returns a distinct error response for each combination, that means tracking up to 2^n such responses for n error states. That becomes infeasible quickly.
All of these are aspects of:
4. The locality problem
In the design I described, the subsystems need to know too much about each other:
a. The back end returns just one error, so the front end needs to know a lot about the details of its error-checking (e.g., the order in which the checks run).
b. The system maintains a one-to-one mapping between error messages (or error states) in two systems. This forces a lot of coupling between the structure of the states of those systems.
Putting the problem this way makes the beginning of a solution clear. Systems that detect error states should return as much information as is feasible: in our example above, a list of errors would be better than a single error. That isn’t always possible: some errors are errors because they break assumptions in a way that undermines further analysis of the input. At the very least, try not to return messages that suggest that only X is the problem if all you know is that X is a problem.
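As a sketch of that alternative, here is the same hypothetical validator reworked to accumulate every error it can detect and return the whole list (the names remain invented, as before):

```typescript
interface HatSpec {
  size: number;
  description: string;
}

type HatError = "BAD_SIZE" | "DUPLICATE_SPEC" | "DESCRIPTION_TOO_LONG";

// Collect every error state that can still be detected, rather than
// stopping at the first one.
function validateHat(spec: HatSpec, existing: HatSpec[]): HatError[] {
  const errors: HatError[] = [];
  if (spec.size < 6 || spec.size > 8) {
    errors.push("BAD_SIZE");
  }
  if (existing.some(h => h.size === spec.size && h.description === spec.description)) {
    errors.push("DUPLICATE_SPEC");
  }
  if (spec.description.length > 10_000) {
    errors.push("DESCRIPTION_TOO_LONG");
  }
  return errors; // empty list means success; otherwise the full picture, not just the first problem
}
```

With this shape, the front end can decide for itself how to respond to any combination of error states, without knowing anything about the order in which the back end happened to check them.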
It is remarkably difficult to maintain solid encapsulation in real-world engineering situations. This is often particularly true when handling errors, both for the formal reason I briefly mentioned above (that errors are the sorts of things that make it hard to guarantee complete analyses) and for cultural reasons (there just is a widespread practice of mapping error codes one-to-one across services). A bit of care, however, goes a long way. As always, subsystems should do what they say they’re going to do, return responses that are as accurate as possible, and not lie to each other.