Cascade Faliure: HTTP Status Code Best Practices

Let's walk through a sequence of HTTP illustrations of how status codes should work.

First example:

GET /docs/examples/health-check/healthy

Where /docs is just a mapping to some folder on disk, so this is really "just" a file, the contents of which are an example of the response content returned by a health check resource when the system is, in fact, healthy. When a client sends this HTTP request, and the server correctly processes that reques, copying the file contents into the body of the response, what should the HTTP status code be?

Answer: 200 - the request was successful, the response content is a representation of the target resource. This is textbook, and should be unsurprising.

Second example:

GET /docs/examples/health-check/unhealthy

Assume the same mapping as before; this is really "just" a file, in the same directory in the previous example, but the contents of the file are an example of the response content returned by a health check resource when the system is unhealthy. When a client sends this HTTP request, and the server correctly processes that request, copying the file contents into the body of the response, what should the HTTP status code be?

Answer: 200 - the request was successful, the response content is a representation of the target resource. Again, we're just copying the contents of a file, just as though we were returning a text document, or a picture of a cat, or a file full of javascript. The client asked for the file, here it is.

Third example:

GET /api/health-check

Here, we are connecting to a dynamic resource, something that actually runs the internal checks against the system, and sends back a report describing the health of the system. Let's assume that the API is currently healthy, and furthermore that it is well documented; the response content returned when the API is healthy exactly matches the corresponding example in the documentation (/docs/examples/health-check/healthy). When a client sends this HTTP request, and the server correctly processes that request while the API is healthy, therefore producing response content that describes a healthy API, what should the HTTP status code be?

Answer: 200 - the request was successful, the response content is a representation of the target resource. Everything is in the green, we're able to do what the client asked. It is, in fact, the same response that we would get from a boring web server returning the contents of a static document (as seen in the first example above).

Fourth example:

GET /api/health-check

Here, we are connecting to a dynamic resource, something that actually runs the internal checks against the system, and sends back a report describing the health of the system. Let's assume that the API is currently unhealthy, and furthermore that it is well documented; the response content returned when the API is healthy exactly matches the corresponding example in the documentation (/docs/examples/health-check/unhealthy). When a client sends this HTTP request, and the server correctly processes that request while the API is unhealthy, therefore producing response content that describes a unhealthy API, what should the HTTP status code be?

The key idea being this: that the HTTP status code is describing the semantics of the HTTP response, not the semantics of the response content. The status code is metadata of the transfer-of-documents-over-a-network domain, it tells general purpose HTTP components what is going on, how different headers in the response may be correctly interpreted. It informs, for example, HTTP caches that any previously cached responses for this same request can be invalidated.

This is the uniform interface constraint at work - the semantics of all of the self descriptive messages in our system are the same, even though the semantics of the targeted resources are different. Our API, our help docs, Pat's blog, Alex's online bookshop, Kim's archive of kitten pictures all use the same messages, and any HTTP client can understand all of them.

BUT...

We lost the war years ago.

The right way to share information with a general purpose load balancer would be to standardize some link relations, or a message schema; to encode the information that the load balancer needs into some readily standardizable form, and then have everybody code to the same standard.

But the expedient thing to do is to hijack the HTTP status code, and repurpose it to meet this specialized need.

A 429 response to a request for a health-check should mean that the client is exceeding its rate limit, a 500 response should mean that some unexpected condition has prevented the server from sending the current representation of the health check resource.

But that's not where we ended up. And we ended up here long enough ago that the expedient choice has been normalized. Do the Right Thing ™ did not have a sufficient competitive advantage to overcome the advantages of convenience.

And this in turn tells us that, at some scales, the expedient solutions are fine. We can get away with writing APIs with specialized semantics that can only be correctly understood by bespoke clients that specialize in communication with our API because, well, because it's not a problem yet, and we'll probably need to redesign our system at least once before the scale of the problem makes a difference, and in the meantime the most likely failure modes are all somewhere else; a given API may never get to a point where the concessions to convenience matter.

So... which best practices do you want?

Cascade Faliure

Saturday, June 1, 2024

HTTP Status Code Best Practices

No comments:

Post a Comment