Tuesday, January 16, 2018

Events are messages that describe state, not behavior

I have felt, for some time now, that the literature explains event sourcing poorly.

The basic plot, that current state is just a left fold over previous behaviors, is fine, so far as it goes.  But it rather assumes that the audience is clear on what "previous behaviors" means.

And I certainly haven't been.
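The shape of the fold itself is easy enough to write down. Here is a minimal sketch in Python -- the names evolve and current_state are mine, not from any particular framework, and what the folded items actually are is precisely the question:

    from functools import reduce

    def evolve(state, item):
        # Combine the current state with the next item in the history.
        # Whether the items are "behaviors" or descriptions of state is the
        # interesting question; the fold itself doesn't care.
        return {**state, **item}   # illustrative: merge a patch into a dict

    def current_state(initial_state, history):
        # Current state is just a left fold over the history.
        return reduce(evolve, history, initial_state)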

Many, perhaps even most, domain models can be thought of as state machines:

Cargo begins its life cycle when it is booked. Our delivery document changes state when the cargo is received, when an itinerary is selected, when the cargo is loaded in port, when the itinerary changes -- each of these messages arrives and changes the state of the delivery document.

So we can think of the entire process as a state machine; each message from the outside world causes the model to follow a transition edge out of one state node and into another.
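In code, each transition edge is just a function of the current state and an incoming message. A rough sketch, with state names and message types invented for the cargo example rather than taken from any real model:

    # (current state, message type) -> next state; a few edges of the machine.
    TRANSITIONS = {
        ("NOT_RECEIVED",    "CargoReceived"): "IN_PORT",
        ("IN_PORT",         "CargoLoaded"):   "ONBOARD_CARRIER",
        ("ONBOARD_CARRIER", "CargoUnloaded"): "IN_PORT",
        ("IN_PORT",         "CargoClaimed"):  "CLAIMED",
    }

    def transition(state, message):
        # Each message from the outside world follows one edge of the machine;
        # messages with no matching edge leave the state where it is.
        return TRANSITIONS.get((state, message["type"]), state)

The transition function here is playing the same role as evolve in the fold above.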

If we choose to save the state of the delivery document, so that we can resume work on it later, there are three approaches we might take:
  • We could simply save the entire delivery document as is
  • We could save the sequence of messages that we received
  • We could save a sequence of patch documents that describe the differences between states
Event sourcing is the third one.
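In other words, with a store API that I am making up purely for illustration, the three options look something like this:

    def save_snapshot(store, document):
        # Option 1: save the entire delivery document as is.
        store.write("delivery/state", document)

    def save_inputs(store, messages):
        # Option 2: save the messages we received; recovering state later
        # means replaying them through whatever model we are running *then*.
        for message in messages:
            store.append("delivery/inputs", message)

    def save_events(store, patches):
        # Option 3: save patch documents describing how the state changed;
        # recovering state means folding the patches back together.
        for patch in patches:
            store.append("delivery/events", patch)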

We call the patches "events", and they have domain specific semantics; but they are fundamentally dumb documents that decouple the representation of state from the model that generated it.

This decoupling is important, because it allows us to change the model without changing the semantics of the persisted representation.
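For example, one of those events might be nothing more than a document like this (the field names are invented):

    itinerary_selected = {
        "type": "ItinerarySelected",
        "cargoId": "ABC-123",
        "legs": ["CNHKG -> USLGB", "USLGB -> USCHI"],
    }

    def apply_itinerary_selected(document, event):
        # Any model that understands the schema -- including one written long
        # after the event was recorded -- can fold it into the delivery document.
        return {**document, "itinerary": event["legs"]}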

To demonstrate this, let's imagine a simple trade matching application.  Buy and Sell orders come in from different customers, and the model is responsible for pairing them up.  There might be elaborate rules in place for deciding how matches work, but to save the headache of working them out we'll instead focus our attention on a batch of buy and sell orders that can be paired arbitrarily -- the actual selections are going to be determined by the model's internal tiebreaker strategy.
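Here is a minimal sketch of such a matcher, just to have something concrete to point at; the class, the message handling, and the tiebreaker interface are all illustrative rather than a real trading API:

    class Matcher:
        """Pairs buy and sell orders; ties are broken by a pluggable strategy."""

        def __init__(self, tiebreaker):
            self.tiebreaker = tiebreaker      # picks a resting buy for a new sell
            self.unmatched_buys = []
            self.matches = []

        def handle_buy(self, order_id):
            self.unmatched_buys.append(order_id)

        def handle_sell(self, order_id):
            if self.unmatched_buys:
                buy = self.tiebreaker(self.unmatched_buys)
                self.unmatched_buys.remove(buy)
                self.matches.append((buy, order_id))

    def fifo(buys):
        # First in, first out: the oldest resting buy wins the tie.
        return buys[0]

    def lifo(buys):
        # Last in, first out: the newest resting buy wins the tie.
        return buys[-1]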

So we'll imagine that a new burst of messages appears, at some new price -- we don't need to worry about any earlier orders.  The burst begins...
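For the sake of having something concrete to point at, suppose it consists of four buy orders followed by a single sell order:

    Buy A
    Buy B
    Buy C
    Buy D
    Sell 1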


After things have settled down, we restart the service. That means that the in-memory state is lost, and has to be recovered from what was written to the persistent store. We now get an additional burst of messages.
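For concreteness, say this second burst brings one more buy order, E, along with sell orders 2, 3, and 4.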


Using a first in, first out tiebreaker, we would expect to see pairs (A,1), (B,2), (C,3), and (D,4). If we were using a last in, first out tiebreaker, we would presumably see (D,1), (E,2), (C,3), (B,4).

But what happens if, during the shutdown, we switch from FIFO to LIFO? During the first phase, we see (A,1) matched, as before. After that, we should see (D,2), (E,3), (C,4).

In order to achieve that outcome, the model in the second phase needs knowledge of the (A,1) match made before the shutdown. But it can only know about that match if evidence of the match was written to the persistent store. Without that knowledge, the LIFO strategy would conclude that (D,1) had already been matched, and would in turn produce (C,2), (E,3), and (A,4). The last of these conflicts with the original (A,1) match. In other words, we're in a corrupted state.
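Using the matcher sketch from above, and the illustrative first burst of buys A through D followed by sell 1, the difference is easy to see: replaying the saved inputs through today's model reconstructs a different history than the one we actually lived through, while reading back the recorded matches does not.

    phase_one_inputs = [("buy", "A"), ("buy", "B"), ("buy", "C"),
                        ("buy", "D"), ("sell", "1")]

    def replay(inputs, tiebreaker):
        # Recover state by running the saved inputs through the current model.
        matcher = Matcher(tiebreaker)
        for kind, order_id in inputs:
            if kind == "buy":
                matcher.handle_buy(order_id)
            else:
                matcher.handle_sell(order_id)
        return matcher

    # What actually happened before the shutdown, while FIFO was in effect:
    assert replay(phase_one_inputs, fifo).matches == [("A", "1")]

    # What the restarted LIFO model concludes from the very same saved inputs:
    assert replay(phase_one_inputs, lifo).matches == [("D", "1")]

    # Reading back recorded *events* has no such problem: the match is a fact,
    # not a decision to be re-made, so it says (A,1) whichever model is loaded.
    recorded_events = [{"type": "OrdersMatched", "buy": "A", "sell": "1"}]
    assert [(e["buy"], e["sell"]) for e in recorded_events] == [("A", "1")]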

Writing the entire document to the persistent store works just fine: we read back a representation which says that A and 1 are unavailable, and the domain model can proceed accordingly. Writing the sequence of patches also works, because when we fold the patches together we get the same state document. It's only the middle case, where we wrote out representations that implied a particular model, that gets us into trouble.

The middle approach is not wrong, by any means. The LMAX architecture worked this way; they would write the input messages to a journal, and in the event of a failure they would recover by starting a copy of the same model and replaying the journal into it. Replacing the model's behavior was done off hours.

Similarly, if you have the list of inputs and the old behavior, you can recover the current state in memory, and then write out a representation of that state that will allow a new model to pick up where the old one left off.
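A sketch of that migration, reusing the illustrative replay and store from the earlier sketches:

    def migrate(store, journal, old_tiebreaker):
        # Recover current state by replaying the journal through the *old* model...
        state = replay(journal, old_tiebreaker)
        # ...then write out a representation of that state, so that a model with
        # *new* behavior can pick up from here without re-deciding old matches.
        store.write("matcher/state", {
            "matches": state.matches,
            "unmatched_buys": state.unmatched_buys,
        })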

Not wrong, but different. One approach or the other might be more suitable for your unique collection of operational constraints.
