Cascade Faliure: cqretc

Showing posts with label cqretc. Show all posts

Wednesday, October 10, 2018

Event Sourcing: Lessons on failure, part one.

I've written a couple of solo projects using "event sourcing", failing miserably at them because I failed to properly understand how to properly apply that pattern to the problem I was attempting to solve.

Part 1: Fantasy Draft Automation

I came into event sourcing somewhat sideways - I had first discovered the LMAX disruptor around March of 2013. That gave me my entry into the idea that state could be message driven. I decided, after some reading and experimenting, that a message driven approach could be used to implement a tool I needed for my fantasy baseball draft.

My preparation for the draft was relatively straight forward - I would attempt to build many ranked lists of players that I was potentially interested in claiming for my team, and during the draft I would look at these lists, filtering out the players that had already been selected.

So what I needed was a simple way to track lists of all of the players that had already been drafted, so that they could be excluded from my lists. Easy.

Failure #1: Scope creep

My real ambition for this system was that it would support all of the owners, including helping them to track what was going on in the draft while they were away. So web pages, and twitter, and atom feeds, and REST, and so on.

Getting all of this support right requires being able to accurately report on all of the players who were drafted. Which in turn means managing a database of players, and keeping it up to date when somebody chooses to draft a player that I hadn't been tracking, and dealing with the variations in spellings, and the fact that players change names and so on.

But for MVP, I didn't need this grief. I had already uniquely identified all of the players that I was specifically interested in. I just needed to keep track of those players; so long as I had all of the people I was considering in the player registry, and could track which of those had been taken (no need to worry about order, and I was tracking my own choices separately anyway).

Failure #2: Where is the book of record?

A second place where I failed was in understanding that my system wasn't the book of record for the actions of the draft. I should have noticed that we had been drafting for years without this database. And over the years we've worked out protocols for unwinding duplicated picks, and resolving ambiguity.

What I was really doing was caching outcomes from the real world process into my system. In other words, I should have been thinking of my inputs as a stream of events, not commands, and arranging for the system to detect and warn about conflicts, rather than rejecting messages that would introduce a conflict.

There was no particular urgency about matching picks with identifiers of players in the registry, or in registering players who were not part of the registry. All of that math could be delayed a hundred milliseconds without anybody noticing.

Failure #3: Temporal queries

The constraints that the system with were trying to enforce the rules that only players in the player registry could be selected, and that each player in the registry could only be selected once. In addition to the fact that wasn't the responsibility of the system, it was complicated by the fact that the player registry wasn't static.

Because I was trying to track the draft faithfully (not realizing until later that doing so wasn't strictly necessary for my use case), I would stop the program when my registry had a data error. The registry itself was just dumb bytes on disk; any query I ran against the database was a query against "now". So changing the entries in the registry would change the behavior of my program during "replay".

Failure #4: Compatibility

Somewhat related to the above - I wasn't always being careful to ensure that the domain logic was backwards compatible with the app that wrote the messages, nor did my message journal have any explicit markers in it to track when message traffic should switch to the new handlers.

So old messages would break, or do something new, screwing up the replay until I went into the "immutable" journal to fix the input errors by hand.

Failure #5: Messages

My message schemas, such as they were, were just single lines of text - really just a transcript of what I was typing at the interactive shell. And my typing sucks, so I was deliberately making choices to minimize typing. Which again made it harder to support change.

Friday, September 14, 2018

On aggregates: values, references, and transactions.

Gods help me, I'm thinking about aggregates again.

I think aggregates, and the literature around aggregates, deserve a poor reputation.

Part of the problem is that the early descriptions of the concepts have implicit in them a lot of the assumptions of enterprise solutions written in Java circa 2003. And these assumptions haven't aged very well - we are writing a lot of software today that has different design considerations.

CQRS Meetup

Yesterday's meetup of the Boston DDD/CQRS/ES group was at Localytics, and featured a 101 introduction talk by James Geall, and a live coding exercise by Chris Condon.

CQRS is there to allow you to optimize the models for writing and reading separately. NOTE: unless you have a good reason to pay the overhead, you should avoid the pattern.

James also noted that good reasons to pay the overhead are common. I would have liked to hear "temporal queries" here - what did the system look like as-at?

As an illustration, he described possibilities for tracking stock levels as a append only table of changes and a roll-up/view of a cached result. I'm not so happy with that example in this context, because it implies a coupling of CQRS to "event sourcing". If I ran the zoo, I'd probably use a more innocuous example: OLTP vs OLAP, or a document store paired with a graph database.

The absolute simplest example I've been able to come up with is an event history; the write model is optimized for adding new information to the end of the data structure as it arrives. In other words, the "event stream" is typically in message order; but if we want to show a time series history of those events, we need to _sort_ them first. We might also change the underlying data structure (from a linked list to a vector) to optimize for other search patterns than "tail".

Your highly optimized model for "things your user wants to do" is unlikely to be optimized for "things your user wants to look at".

This was taken from a section of James's presentation explaining why the DDD/CQRS/ES tuple appear together so frequently. He came back to this idea subsequently in the talk, when responding to some confusion about the read and write models

You will be doing roll ups in the write model for different reasons than those which motivate the roll ups in the read model.

A lot of people don't seem to realize that, in certain styles, the write model has its own roll ups. A lot of experts don't seem to realize that there is more than one style -- I tried to give a quick calca on an alternative style at the pub afterwards, but I'm not sure how well I was able to communicate the ideas over the background noise.

The paper based contingency system that protects the business from the software screwing up is probably a good place to look for requirements.

DDD in a nut shell, right there.

That observation brings me back to a question I haven't found a good answer to just yet: why are we rolling our own business process systems, rather than looking to the existing tooling for process management (Camunda, Activiti and the players in the iBPMS Magic Quadrant)? Are getting that much competitive advantage from rolling our own?

Event sourcing gives you a way to store the ubiquitous language - you get release from the impedance mismatch for free. A domain expert can look at a sequence of events and understand what is going on.

A different spelling of the same idea - the domain expert can look at a given set of events, and tell you that the information displayed on the roll up screen is wrong. You could have a field day digging into that observation: for example, what does that say about UUID appearing in the event data?

James raised the usual warning about not leaking the "internal" event representations into the public API. I think as a community we've been explaining this poorly - "event" as a unit of information that we use to reconstitute state gets easily confused with "event" as a unit of information broadcast by the model to the world at large.

A common theme in the questions during the session was "validation"; the audience gets tangled up in questions about write model vs read model, latency, what the actual requirements of the business are, and so on.

My thinking is that we need a good vocabulary of examples of different strategies for dealing with input conflicts. A distributed network of ATM machines; both in terms of the pattern of a cash disbursement, and also reconciling the disbursements from multiple machines when updating the accounts. A seat map on airline, where multiple travelers are competing for a single seat on the plane.

Chris fired up an open source instance of Event Store, gave a quick tour of the portal, and then started a simple live coding exercise: a REPL for debits and credits, writing changes to the stream, and then then reading it back. In the finale, there were three processes sharing data - two copies of the REPL, and the event store itself.

The implementation of the logic was based on the Reactive-Domain toolkit; which reveals its lineage, as it is an evolution of ideas acquired from working with Jonathan Oliver's Common-Domain and with Yves Reynhout, who maintains AggregateSource.

It's really no longer obvious to me what the advantage of that pattern is; it always looks to me as though the patterns and the type system are getting in the way. I asked James about this later, and he remarked that no, he doesn't feel much friction there... but he writes in a slightly different style. Alas, we didn't have time to explore further what that meant.

Sunday, June 10, 2018

Extensible message schema

I had an insight about messages earlier this week, one which perhaps ought to have been obvious. But since I have been missing it, I figured that I should share.

When people talk about adding optional elements to a message, default values for those optional elements are not defined by the consumer -- they are defined by the message schema.

In other words, each consumer doesn't get to choose their own preferred default value. The consumer inherits the default value defined by the schema they are choosing to implement.

For instance, if we are adding a new optional "die roll" element to our message, then consumers need to be able to make some assumption about the value of that field when it is missing.

But simply rolling a die for themselves is the "wrong" answer, in the sense that it isn't repeatable, and different consumers will end up interpreting the evidence in the message different ways. In other words, the message isn't immutable under these rules.

Instead, we define the default value in the schema - documenting that the field is xkcd 221 compliant; just as every consumer that understands the schema agrees on the semantics of the new value, they also agree on the semantic meaning of the new value's absence.

If two consumers "need" to have different default values, that's a big hint that you may have two subtly different message elements to tease apart.

These same messaging rules hold when your "message" is really a collection of parameters in a function call. Adding a new argument is fine, but if you aren't changing all of the clients at the same time then you really should continue to support calls using the old parameter list.

In an ideal world, the default value of the new parameter won't surprise the old clients, by radically changing the outcome of the call.

To choose an example, suppose we've decided that some strategy used by an object should be configurable by the client. So we are going to add to the interface a parameter that allows the client to specify the implementation of the strategy they want.

The default value, in this case, really should be the original behavior, or it semantic equivalent.

Tuesday, January 16, 2018

Events are messages that describe state, not behavior

I have felt, for some time now, that the literature explains event sourcing poorly.

The basic plot, that current state is just a left fold over previous behaviors, is fine, so far as it goes. But it rather assumes that the audience is clear on what "previous behaviors" means.

And I certainly haven't been.

In many, perhaps even most, domain models can be thought of as state machines:

Cargo begins its life cycle when it is booked, and and our delivery document changes state when the cargo is received, and when an itinerary is selected, it gets loaded in port, the itinerary changes, each of these messages arrives and changes the state of the delivery document.

So we can think of the entire process as a state machine; each message from the outside world causes the model to follow a transition edge out of one state node and into another.

If we choose to save the state of the delivery document, so that we can resume work on it later, there are three approaches we might take

We could simply save the entire delivery document as is
We could save the sequence of messages that we received
We could save a sequence of patch documents that describe the differences between states

Event sourcing is the third one.

We call the patches "events", and they have domain specific semantics; but they are fundamentally dumb documents that decouple the representation of state from the model that generated it.

This decoupling is important, because it allows us to change the model without changing the semantics of the persisted representation.

To demonstrate this, let's imaging a simple trade matching application. Buy and Sell orders come in from different customers, and the model is responsible for pairing them up. There might be elaborate rules in place for deciding how matches work, but to save the headache of working them out we'll instead focus our attention on a batch of buy and sell orders that can be paired arbitrarily -- the actual selects are going to be determined by the model's internal tie breaker strategy.

So we'll image that a new burst of messages appear, at some new price -- we don't need to worry about any earlier orders. The burst begins...

After things have settled down, we restart the service. That means that the in memory state is lost, as has to be recovered from what was written to the persistent store. We now get an additional burst of messages.

Using a first in, first out tiebreaker, we would expect to see pairs (A,1), (B,2), (C,3), and (D,4). If we were using a last in, first out tiebreaker, we would presumably see (D,1), (E,2), (C,3), (B,4).

But what happens if, during the shutdown, we switch from FIFO to LIFO? During the first phase, we see (A,1) matched, as before. After that, we should see (D,2), (E,3), (C,4).

In order to achieve that outcome, the model in the second phase needs knowledge of the (A,1) match before the shutdown. But it can only know about that match if there is evidence of the match written to the persistent store. Without that knowledge, the LIFO strategy would expect that (D,1) were already matched, and would in turn produce (C,2), (E, 3), and (A,4). The last of these conflicts with the original (A,1) match. In other words, we're in a corrupted state.

Writing the entire document to the event store works just fine, we read a representation that suggests that A and 1 are unavailable, and the domain model can proceed accordingly. Writing the sequence of patches works, because when we fold the patches together we get the state document. It's only the middle case, where we wrote out representations that implied a particular model, that we got into trouble.

The middle approach is not wrong, by any means. The LMAX architecture worked this way; they would write the input messages to a journal, and in the event of a failure they would recover by starting a copy of the same model. The replacement of the model behavior happened off hours.

Similarly, if you have the list of inputs and the old behavior, you can recover current state in memory, and then write out a representation of that state that will allow a new model to pick up from where the old left off.

Not wrong, but different. One approach or the other might be more suitable for your unique collection of operational constraints.

Wednesday, October 4, 2017

Value Objects, Events, and Representations

Nick Chamberlain, writing at BuildPlease:

Domain Models are meant to change, so changing the behavior of a Value Object in the Domain Model shouldn’t, then, affect the history of things that have happened in the business before this change.

Absolutely right.

The Domain Model is mutable, therefore having Events take dependencies on the Domain Model means that Events must become mutable - which they are not.

Arrrg.

As your domain model evolves, you may add new invariants to be checked, or change existing ones on the Value Object that you’re serializing to the Event Store.

Fundamentally, what Chamberlain is suggesting here is that you may want to replace your existing model with another that enforces stricter post conditions.

That's a backwards incompatible change - in semantic versioning, that would call for a major version change. You want to be really careful about how you manage that, and you want to be certain that you design your solution so that the costs of doing that are in the right place.

If the Domain Model was mutable, we’d also need versioning it - having classes like RoastDate_v2… this doesn’t match the Ubiquitous Language.

Right - so that's not the right way to version the domain model. The right way is to version the namespace. The new post condition introduces a new contract, both at the point of change, and also bubbling up the hierarchy as necessary. New implementations for those new contracts are introduced.
The composition root chooses the new implementations as it wires everything together.

"Events are immutable" is a design constraint, not an excuse to bathwater the baby.

Yes, we may need to evolve our understanding of the domain in the future that is incompatible with the data artifacts that we created in the past. From this, it follows that we need to be thinking about this as a failure mode; how do we want that failure to manifest? what options can we support for recovery?

Presumably we want the system to fail safely; that probably means that we want a fail on load, an alert to a human operator, and some kind of compensating action that the operator can take to correct the fault and restore service.

For instance, perhaps the right kind of compensating action is correcting entries. If your history is tracked an an append only collection of immutable events, then the operator will need to append the compensating entries to the live end of the stream. So your history processing will need to be aware of that requirement.

Another possibility would be to copy the existing history into a new stream, fixing the fault as you go. This simplifies the processing of the events within the model, but introduces additional complexity in discovering the correct stream to load.

We might also want to give some thought to the impact of a failure; should it clobber all use cases that touch the broken stream? That maximizes your chance of discovering the problem. On the other hand, being more explicit about how data is loaded for each use case will allow you to continue to operate outside of the directly impacted areas.

My hunch is that investing early design capital to get recovery right will also ease the constraints on how we represent data within the domain model. At the boundaries, the events are just bytes; but within the domain model, where we are describing the changes in the business, the interface of events is described in the ubiquitous language, not in primitives.

ps: Versioning in an Event Sourced System (Young, 2017) is an important read when you are thinking about messages that evolve as your requirements change.

Monday, October 2, 2017

A not so simple trick

Pawel Pacana

Event Sourcing is like having two methods when previously there was one.

Noooooooo

In fairness, the literature is a mess. Let's see what we can do about separating out the different ideas.

Data Models

Let's consider a naive trade book as an example; it's responsible for matching sell orders and buy orders when the order prices match. So the "invariant", such as it is, is that we never are holding unmatched buy orders and sell orders at the same price.

Let's suppose we get three sell orders; two offering to sell 100 units at $200, and between them a third offer to sell 75 units at $201. At that point in the action, the data model might be represented this way.

The next order wants to buy 150 units at $200, and our matching algorithm goes to work. The resulting position might have this representation.

And after another buy order arrives, we might see a representation like

After each order, we can represent the current state of the trade book as a document.

There is an alternative representation of the trade book; rather than documenting the outcome of the changes, we document the changes themselves. Our original document might be represented this way

Then, when the buy order arrives, you could represent the state this way

But in our imaginary trade book business, matches are important enough that they should be explicit, rather than implicit; so instead, we would more likely see this representation

And then, after the second buy order arrives, we might see

The two different models both suffice to describe equivalent states of the same entity. There are different trade offs involved, but both approaches provide equivalent answers to the question "what is the state of the trade book right now".

Domain Models

We can wrap either of these representations into a domain model.

In either case, the core interface that the application interacts with is unchanged -- the application doesn't need to know anything about how the underlying state is represented. It just needs to know how to communicate changes to the model. Thus, the classes playing the role of the "aggregate root" have the same exposed surface. It might look like

The underlying implementations of the trade book entity is effectively the same. Using a document based representation, we would see an outline like

Each time an order is placed, the domain model updates the local copy of the document. We get the same shape if we use the event backed approach...

Same pattern, same shape. When you introduce the idea of persistence -- repositories, and copying the in memory representation of the data to a durable store, the analogs between the two hold. The document representation fits well with writing the state to a document store, and can of course be used to update a relational model; perhaps with assistance of an ORM. But you could just as easily copy the event "document" into the store, or use the ORM to transform the collection of events into some relational form. It's just data at that point.

There are some advantages to the event representation when copying state to the durable store. Because the events are immutable, you don't need to even evaluate whether the original entries in the list have changed. You don't have to PUT the entire history; you can simply PATCH the durable store with the updates. These are optimizations, but they don't change the core of the patterns in any way.

Projections

Event histories have an important property, thanks to their append only nature -- updates are non-destructive. You can create from the event history any document representation you like; you only need to have an understanding of how to represent a history of no events, and how each event type in turn causes the document representation to change.

What we have here is effectively a state machine; you load the start state and then replay each of the state transitions to determine the final state.

This is a natural approach to take when trying to produce "read models", optimized for a particular search pattern. Load the empty document, replay the available events into it, cache the result, obtain more events, replay those, cache this new result, and so on. If the output representation is lost or corrupted, just discard it and replay the complete history of the model.

There are three important facets of these projections to pay attention to

First, the motivation for the projections is that they serve queries much more efficiently than trying to work with the event history directly.

Second, that because replaying an entire event history can be time consuming, the ability to resume the projection from a previously cached state is a productivity win.

Third, that a bit of latency in the read use case is typically acceptable, because there is no risk that querying stale data will corrupt the domain.

The Tangle

Most non-trivial domain models require some queries when updating the model. For instance, when we are processing an order, we need to know which previously unmatched orders were posted with a matching price. If the domain requires first in first out processing, then you need the earliest unmatched order.

Since projections are much better for query use cases than the raw event stream, the actual implementation of our event backed model probably cheats by first creating a local copy of a suitable projection, and then using that to manage the queries

That "solves" the problem of trying to use the event history to support queries directly, but it leads directly into the second issue listed above; processing the entire event history on demand is relatively expensive. You'd much prefer to use a cached copy.

And while using a cached copy for a write is fine, using a stale copy for a write is not fine. The domain model must be working with representations that reflect the entire history when enforcing invariants. More particularly, if a transaction it going to be consistent, then the events calculated at the conclusion of the business logic must take into account earlier events in the same transaction. In other words, the projection needs to be continuously updated by the work in progress.

This leads to a design where we are using two coordinated data models to support writes: the event backed representation that will eventually be used to update the durable store, and the document backed representation this is used to support the queries needed to enforce the invariant. The trade book, in effect, becomes its own cache.

We could, of course, mutate the document directly, rather than projecting the new events into it. But that introduces a risk that the document representation we have now won't match the one we create later when we have only the events to work from. It also ensures that any projections we make for supporting reads will have the same state that was used while performing the writes.

To add to the confusion: once the document representation of the model has been rehydrated, the previously committed events don't contribute; they aren't going to be changed, the document supports the queries, updating the event store is only going to append the new information.... Consequently, the existing history gets discarded, and the use case only tracks the new events that have been discovered locally in this update.

Thursday, September 22, 2016

Set Validation

Vladimir Khorikov wrote recently about enforcing uniqueness constraints, which is the canonical example of set validation. His essay got me to thinking about validation more generally.

Uniqueness is relatively straight forward in a relational database; you include in your schema a constraint that prevents the introduction of a duplicate entry, and the constraint acts as a guard to protect the invariant in the book of record itself -- which is, after all, where it matters.

But how does it work? The constraint is effective because it blocks the write to the book of record. In the abstract, the constraint gets tested within the database while the write lock is held; the writes themselves have been serialized and each write in turn needs to be consistent with its predecessors.

If you try to check the constraint before obtaining the write lock, then you have a race; the book of record can be changed by another transaction that is in flight.

Single writer sidesteps this issue by effectively making the write lock private.

With multiple writers, each can check the constraint locally, but you can't prove that the two changes in flight don't conflict with each other. The good thing is that you don't need to - it's enough to know that the book of record hasn't changed since your checked it. Logically, each write becomes a compare and swap on the tail pointer of the model history.

Of course, the book of record has to trust that the model actually performed the check before attempting the write.

And implementing the check this way isn't particularly satisfactory. There's not generally a lot of competition for email addresses; unless your problem space is actually assignment of mail boxes, the constraint has generally been taken care of elsewhere. Introducing write contention (by locking the entire model) to ensure that no duplicate email addresses exist in the book or record isn't particularly satisfactory.

This is already demonstrated by the fact that this problem usually arises after the model has been chopped into aggregates; an aggregate, after all, is an arbitrary boundary drawn within the model in an attempt to avoid unnecessary conflict checks.

But to ensure that the aggregates you are checking haven't changed while waiting for your write to happen? That requires locking those aggregates for the duration.

To enforce a check across all email addresses, you also have to lock against the creation of new aggregates that might include an address you haven't checked. Effectively, you have to lock membership in the set.

If you are going to lock the entire set, you might as well take all of those entities and make them part of a single large aggregate.

Greg Young correctly pointed out long ago that there's not a lot of business value at the bottom of this rabbit hole. If the business will admit that mitigation, rather than prevention, is a cost effective solution, the relaxed constraint will be a lot easier to manage.