Cascade Faliure: cqrs


Wednesday, April 10, 2019

Read Models vs Write Models

At Stack Exchange, I answered a question about DDD vs SQL, which resulted in a question about CQRS that I think requires more detail than is appropriate for that setting.

The "read model" is not the domain model (or part of it)? I am not an expert on CQRS, but I always thought the command model is quite different from the classic domain model, but not the read model. So maybe you can give an example for this?

So let's lay some groundwork.

A domain model is not a particular diagram; it is the idea that the diagram is intended to convey. It is not just the knowledge in the domain expert's head; it is a rigorously organized and selective abstraction of that knowledge. -- Eric Evans, 2003.

A Domain Model creates a web of interconnected objects, where each object represents some meaningful individual, whether as large as a corporation or as small as a single line on an order form. -- Martin Fowler, 2003.

I think that Fowler's definition is a bit tight; there's no reason that we should need to use a different term when modeling with values and functions, rather than objects.

I think it is important to be sensitive to the fact that in some contexts we are talking about the abstraction of expert knowledge, and in others we are talking about an implementation that approximates that abstraction.

Discussions of "read model" and "write model" almost always refer to the implemented approximations. We take a single abstraction of domain knowledge, and divide our approximation of it into two parts - one that handles our read use cases, and another that handles our write use cases.

When we are handling a write, there are usually constraints to ensure the integrity of the information that we are modeling. That might be as simple as a constraint that we not overwrite information that was previously written, or it might mean that we need to ensure that new writes are consistent with the information already written.

So to handle a write, we will often take information from our durable store, load it into volatile memory, then create from that information a structure in memory into which the new information will be integrated. The "domain logic" calculates new information, which is written back to the durable store.

On the other hand, reads are safe; "asking the question shouldn't change the answer". In that case, we don't need the domain logic, because we aren't going to integrate new information. We can take the on disk representation of the information, and transform it directly into our query response, without passing through the intermediate representations we would use when writing.

We'll still want input sanitization, and message semantics that reflect our understanding of the domain expert's abstraction, but we aren't going to need "aggregate roots", or "locks" or the other patterns that prevent the introduction of errors when changing information. We still need the data, and the semantics that approximate our abstraction, but we don't need the rules.

We don't need the parts of our implementation that manage change.

When I answer "the query itself is unlikely to pass through the domain model", that's shorthand for the idea that we don't need to build domain specific data structures as we translate the information we retrieved from our durable store into our response message.
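A minimal sketch of that distinction, assuming a hypothetical account ledger with a plain dict standing in for the durable store (none of these names come from a real system). The write path builds a structure whose only job is to protect a rule while information changes; the read path maps the stored data straight into a response.

```python
from dataclasses import dataclass

# Stand-in for the durable store: account id -> list of signed amounts.
STORE = {"acct-1": [100, -30]}


@dataclass
class Account:
    """Write-side structure: exists only to protect a rule while changing data."""
    entries: list

    def withdraw(self, amount):
        # The "domain logic": new information must be consistent with the old.
        if amount > sum(self.entries):
            raise ValueError("insufficient funds")
        self.entries.append(-amount)


def handle_withdraw(account_id, amount):
    """Write use case: load, integrate the new information, store."""
    account = Account(list(STORE[account_id]))
    account.withdraw(amount)
    STORE[account_id] = account.entries


def handle_balance_query(account_id):
    """Read use case: transform stored data directly into the response."""
    return {"account": account_id, "balance": sum(STORE[account_id])}
```

The query handler never constructs an Account; it needs the data and the message semantics, but not the rule.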

Friday, October 12, 2018

Event Sourcing: lessons in failure, part 2

I've written a couple of solo projects using "event sourcing", failing miserably at them because I didn't properly understand how to apply that pattern to the problem I was attempting to solve.

Part 2: Fantasy League Scoring

Our fantasy league used a bespoke scoring system, so I decided to try my hand at creating a report for myself to track how every player in baseball was doing. This gave me extra insights about how I might improve my team by replacing players in the middle of the season.

And to a large extent, it worked - I was able to pick up a number of useful pieces that would otherwise have slipped under the radar, turning over my team much more aggressively than I would have otherwise.

It was still pretty much a crap shoot -- past performance does not promise future results. But it did have the benefit of keeping me more engaged.

Failure #1: Where are the events?

Again, I had a book of record issue - the events were things happening in the real world, and I didn't have direct access to them. What I had was a sort of proxy - after a game ended, a log of the game would become available. So I could take that log, transform it into events, and then proceed happily.

Well, to be honest, that approach is pretty drunk.

The problem is that the game log isn't a stream of events, it is a projection. Taking what is effectively a snapshot and decomposing it reverses cause and effect. There were two particular ways that this problem would be exposed in practice.

First, it would sometimes happen that the logs themselves would go away. Not permanently, usually, but at least for a time. Alternatively, they might be later than expected. And so it would happen that data would get lost - because a stale copy of a projection was delivered instead of a live one. Again, the projections aren't my data, they are cached copies of somebody else's data, that I might use in my own computations.

Second, when these projections appear, they aren't immutable. That's a reflection of both problems with data entry upstream (a typo needs to be fixed), and also the fact that within the domain, the interpretation of the facts can change over time -- the human official scorers will sometimes reverse an earlier decision.

In other words, for what I was doing, events didn't gain me anything over just copying the data into a RDBMS, or for that matter writing the data updates into a git repository on disk.

Failure #2: Caching

The entire data processing engine depended on representations of data that changed on a slow cadence (once a day, typically) and I wasn't tracking any meta data about how fresh the data was, how stable it ought to be, whether the new data in question was a regression from what had been seen earlier, and so on.

In an autonomous system, this is effectively a sort of background task - managing local static copies of remote data.

To make this even more embarrassing, I was of course downloading this data from the web, and the HTTP specification has a lot to say about caching that I didn't even consider.
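In hindsight, the bookkeeping might have looked something like the sketch below. The record layout and the function names are hypothetical; the only thing borrowed from HTTP is the ETag validator and the If-None-Match request header used to revalidate a stored copy.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachedCopy:
    """A local copy of a remote representation, plus the bookkeeping about it."""
    url: str
    body: bytes
    fetched_at: float            # when we obtained this copy (epoch seconds)
    etag: Optional[str] = None   # validator from the response, if any

    @property
    def digest(self):
        return hashlib.sha256(self.body).hexdigest()


def conditional_headers(copy):
    """Headers for revalidating a stored copy instead of refetching blindly."""
    return {"If-None-Match": copy.etag} if copy.etag else {}


def classify(previous, latest, seen_digests):
    """Describe how a new fetch relates to what we've seen before."""
    if latest.digest == previous.digest:
        return "unchanged"
    if latest.digest in seen_digests:
        return "regression"   # content we had already moved past
    return "updated"
```

Keeping the digests of earlier copies around is what makes it possible to notice that a "new" representation is actually an older one coming back.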

(I also failed to consider the advantages I might get from using a headless browser, rather than just an html parser. This bit me hard, and made a big contribution toward the abandoning of the project.)

Failure #3: What's missing?

The process I was working from only considered logs that were available; there was no monitoring of logs that might be missing, or that might have been removed. This introduced small errors in data comparisons.

I needed to be able to distinguish "here are Bob Shortstop's 6 scores from last week" from "here are Bob Shortstop's 5 scores from last week, and there is a game unaccounted for".
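A small sketch of that distinction, assuming I had a schedule of game identifiers to compare against (the identifiers and the shape of the report are made up for illustration):

```python
def weekly_report(scheduled_game_ids, scores_by_game):
    """Separate the scores we have from the games we cannot account for."""
    scores = [scores_by_game[g] for g in scheduled_game_ids if g in scores_by_game]
    missing = [g for g in scheduled_game_ids if g not in scores_by_game]
    return {"scores": scores, "unaccounted": missing}
```

The point is that the report starts from what should exist, not from what happened to be available.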

Again, I was thinking of events as things that happen, rather than as a way of representing state over time.

Failure #4: Process telemetry

What I wanted, when the wheels were done turning, was the collection of reports at the end. And that meant that I wasn't paying enough attention to the processes I was running. Deciding when to check for updated data, on what grain, and orchestrating the consequences of changes to the fetched representations was the real work, and instead I was thinking of that as just "update files on disk". I didn't have any reports I could look at to see if things were working.

Again, everything was flat, everything was now, absolutely no indications of when data had appeared, or which representation was the source of a particular bit of data.

Solving the Wrong Problem

In effect, what happened is that I would throw away all of my "events" every morning, then regenerate them all from the updated copies of the data in the cache. If all of your events are disposable, then something is going badly wrong.

The interesting things to keep track of were all related to the process, and discovering that I wanted to refresh the caches slowly, but in a particular priority.

What I should have been looking toward was Rinat's model of a process manager; how would I support a dashboard showing a list of decisions to be made, could I then capture the priorities of the "domain expert" and automate the work. Could I capture time as a first class concern driving the elements of the system forward?

Some of the reason that I missed this is that I had too much time -- I was deliberately hobbling the fetch of the data, which meant that the cost of redoing all of the work was lost in the noise. On the other hand, that doubly emphasizes the point that all of the value add was in the bookkeeping, which I never addressed.

Key Question:

Did I need temporal queries? No.

Wednesday, October 10, 2018

Event Sourcing: Lessons on failure, part one.

Part 1: Fantasy Draft Automation

I came into event sourcing somewhat sideways - I had first discovered the LMAX disruptor around March of 2013. That gave me my entry into the idea that state could be message driven. I decided, after some reading and experimenting, that a message driven approach could be used to implement a tool I needed for my fantasy baseball draft.

My preparation for the draft was relatively straightforward - I would attempt to build many ranked lists of players that I was potentially interested in claiming for my team, and during the draft I would look at these lists, filtering out the players that had already been selected.

So what I needed was a simple way to track lists of all of the players that had already been drafted, so that they could be excluded from my lists. Easy.

Failure #1: Scope creep

My real ambition for this system was that it would support all of the owners, including helping them to track what was going on in the draft while they were away. So web pages, and twitter, and atom feeds, and REST, and so on.

Getting all of this support right requires being able to accurately report on all of the players who were drafted. Which in turn means managing a database of players, and keeping it up to date when somebody chooses to draft a player that I hadn't been tracking, and dealing with the variations in spellings, and the fact that players change names and so on.

But for MVP, I didn't need this grief. I had already uniquely identified all of the players that I was specifically interested in. I just needed to keep track of those players; so long as I had all of the people I was considering in the player registry, and could track which of those had been taken (no need to worry about order, and I was tracking my own choices separately anyway).

Failure #2: Where is the book of record?

A second place where I failed was in understanding that my system wasn't the book of record for the actions of the draft. I should have noticed that we had been drafting for years without this database. And over the years we've worked out protocols for unwinding duplicated picks, and resolving ambiguity.

What I was really doing was caching outcomes from the real world process into my system. In other words, I should have been thinking of my inputs as a stream of events, not commands, and arranging for the system to detect and warn about conflicts, rather than rejecting messages that would introduce a conflict.
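A sketch of what "detect and warn" might look like, with hypothetical names throughout. Every reported pick is recorded, because it already happened; conflicts are collected for a human to sort out using the protocols the league already has.

```python
from dataclasses import dataclass, field


@dataclass
class DraftLog:
    """Caches outcomes reported from the real draft; it never vetoes them."""
    picks: list = field(default_factory=list)       # every reported pick, in order
    warnings: list = field(default_factory=list)    # conflicts for a human to resolve

    def record_pick(self, owner, player, registry):
        if player not in registry:
            self.warnings.append(f"unknown player: {player}")
        if any(p == player for _, p in self.picks):
            self.warnings.append(f"duplicate pick: {player}")
        # The pick happened in the real world, so it is recorded regardless.
        self.picks.append((owner, player))
```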

There was no particular urgency about matching picks with identifiers of players in the registry, or in registering players who were not part of the registry. All of that math could be delayed a hundred milliseconds without anybody noticing.

Failure #3: Temporal queries

The constraints in the system were trying to enforce the rules that only players in the player registry could be selected, and that each player in the registry could only be selected once. In addition to the fact that this wasn't the responsibility of the system, it was complicated by the fact that the player registry wasn't static.

Because I was trying to track the draft faithfully (not realizing until later that doing so wasn't strictly necessary for my use case), I would stop the program when my registry had a data error. The registry itself was just dumb bytes on disk; any query I ran against the database was a query against "now". So changing the entries in the registry would change the behavior of my program during "replay".
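One way to make replay stable is to keep every revision of a registry entry and query it "as at" the time of the message being replayed. This is only a sketch; the class and its methods are hypothetical, and it assumes revisions for a given player arrive in increasing time order.

```python
import bisect


class VersionedRegistry:
    """Keeps every revision of an entry so lookups can be made "as at" a time."""

    def __init__(self):
        self._revisions = {}   # player id -> (list of times, list of names)

    def record(self, player_id, name, at):
        # Assumes revisions for a player are recorded in increasing time order.
        times, names = self._revisions.setdefault(player_id, ([], []))
        times.append(at)
        names.append(name)

    def lookup(self, player_id, as_at):
        """Return the name that was in effect at `as_at`, or None."""
        times, names = self._revisions.get(player_id, ([], []))
        index = bisect.bisect_right(times, as_at)
        return names[index - 1] if index else None
```

With this shape, correcting an entry adds a revision instead of rewriting the past, so a message journaled before the correction still replays against the data it originally saw.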

Failure #4: Compatibility

Somewhat related to the above - I wasn't always being careful to ensure that the domain logic was backwards compatible with the app that wrote the messages, nor did my message journal have any explicit markers in it to track when message traffic should switch to the new handlers.

So old messages would break, or do something new, screwing up the replay until I went into the "immutable" journal to fix the input errors by hand.

Failure #5: Messages

My message schemas, such as they were, were just single lines of text - really just a transcript of what I was typing at the interactive shell. And my typing sucks, so I was deliberately making choices to minimize typing. Which again made it harder to support change.

Thursday, June 14, 2018

CQRS Meetup

Yesterday's meetup of the Boston DDD/CQRS/ES group was at Localytics, and featured a 101 introduction talk by James Geall, and a live coding exercise by Chris Condon.

CQRS is there to allow you to optimize the models for writing and reading separately. NOTE: unless you have a good reason to pay the overhead, you should avoid the pattern.

James also noted that good reasons to pay the overhead are common. I would have liked to hear "temporal queries" here - what did the system look like as-at?

As an illustration, he described possibilities for tracking stock levels as an append only table of changes and a roll-up/view of a cached result. I'm not so happy with that example in this context, because it implies a coupling of CQRS to "event sourcing". If I ran the zoo, I'd probably use a more innocuous example: OLTP vs OLAP, or a document store paired with a graph database.

The absolute simplest example I've been able to come up with is an event history; the write model is optimized for adding new information to the end of the data structure as it arrives. In other words, the "event stream" is typically in message order; but if we want to show a time series history of those events, we need to _sort_ them first. We might also change the underlying data structure (from a linked list to a vector) to optimize for other search patterns than "tail".
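That example fits in a few lines. In this sketch (the names and the tuple layout are illustrative only) the write side appends in arrival order and the read side sorts by the time each event occurred.

```python
from operator import itemgetter

# Write model: append in arrival order. Each entry is (occurred_at, description).
stream = []


def append(occurred_at, description):
    stream.append((occurred_at, description))


def time_series():
    """Read model: the same events, sorted by when they occurred."""
    return sorted(stream, key=itemgetter(0))
```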

Your highly optimized model for "things your user wants to do" is unlikely to be optimized for "things your user wants to look at".

This was taken from a section of James's presentation explaining why the DDD/CQRS/ES tuple appears together so frequently. He came back to this idea subsequently in the talk, when responding to some confusion about the read and write models:

You will be doing roll ups in the write model for different reasons than those which motivate the roll ups in the read model.

A lot of people don't seem to realize that, in certain styles, the write model has its own roll ups. A lot of experts don't seem to realize that there is more than one style -- I tried to give a quick calca on an alternative style at the pub afterwards, but I'm not sure how well I was able to communicate the ideas over the background noise.

The paper based contingency system that protects the business from the software screwing up is probably a good place to look for requirements.

DDD in a nutshell, right there.

That observation brings me back to a question I haven't found a good answer to just yet: why are we rolling our own business process systems, rather than looking to the existing tooling for process management (Camunda, Activiti and the players in the iBPMS Magic Quadrant)? Are we getting that much competitive advantage from rolling our own?

Event sourcing gives you a way to store the ubiquitous language - you get release from the impedance mismatch for free. A domain expert can look at a sequence of events and understand what is going on.

A different spelling of the same idea - the domain expert can look at a given set of events, and tell you that the information displayed on the roll up screen is wrong. You could have a field day digging into that observation: for example, what does that say about UUID appearing in the event data?

James raised the usual warning about not leaking the "internal" event representations into the public API. I think as a community we've been explaining this poorly - "event" as a unit of information that we use to reconstitute state gets easily confused with "event" as a unit of information broadcast by the model to the world at large.

A common theme in the questions during the session was "validation"; the audience gets tangled up in questions about write model vs read model, latency, what the actual requirements of the business are, and so on.

My thinking is that we need a good vocabulary of examples of different strategies for dealing with input conflicts. A distributed network of ATM machines; both in terms of the pattern of a cash disbursement, and also reconciling the disbursements from multiple machines when updating the accounts. A seat map on an airline, where multiple travelers are competing for a single seat on the plane.

Chris fired up an open source instance of Event Store, gave a quick tour of the portal, and then started a simple live coding exercise: a REPL for debits and credits, writing changes to the stream, and then reading it back. In the finale, there were three processes sharing data - two copies of the REPL, and the event store itself.

The implementation of the logic was based on the Reactive-Domain toolkit; which reveals its lineage, as it is an evolution of ideas acquired from working with Jonathan Oliver's Common-Domain and with Yves Reynhout, who maintains AggregateSource.

It's really no longer obvious to me what the advantage of that pattern is; it always looks to me as though the patterns and the type system are getting in the way. I asked James about this later, and he remarked that no, he doesn't feel much friction there... but he writes in a slightly different style. Alas, we didn't have time to explore further what that meant.

Monday, October 2, 2017

A not so simple trick

Pawel Pacana

Event Sourcing is like having two methods when previously there was one.

Noooooooo

In fairness, the literature is a mess. Let's see what we can do about separating out the different ideas.

Data Models

Let's consider a naive trade book as an example; it's responsible for matching sell orders and buy orders when the order prices match. So the "invariant", such as it is, is that we never are holding unmatched buy orders and sell orders at the same price.

Let's suppose we get three sell orders; two offering to sell 100 units at $200, and between them a third offer to sell 75 units at $201. At that point in the action, the data model might be represented this way.

The next order wants to buy 150 units at $200, and our matching algorithm goes to work. The resulting position might have this representation.

And after another buy order arrives, we might see a representation like

After each order, we can represent the current state of the trade book as a document.

There is an alternative representation of the trade book; rather than documenting the outcome of the changes, we document the changes themselves. Our original document might be represented this way

Then, when the buy order arrives, you could represent the state this way

But in our imaginary trade book business, matches are important enough that they should be explicit, rather than implicit; so instead, we would more likely see this representation

And then, after the second buy order arrives, we might see

The two different models both suffice to describe equivalent states of the same entity. There are different trade offs involved, but both approaches provide equivalent answers to the question "what is the state of the trade book right now".
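To make the comparison concrete, here is a sketch of both representations after the first buy order (three sell orders, then a buy of 150 units at $200). The field names and event types are assumptions chosen for illustration. The small function shows that the document can be derived from the events, which is the sense in which the two are equivalent.

```python
# Event representation: a record of the changes themselves.
events = [
    {"type": "SellPlaced", "order": 1, "qty": 100, "price": 200},
    {"type": "SellPlaced", "order": 2, "qty": 75, "price": 201},
    {"type": "SellPlaced", "order": 3, "qty": 100, "price": 200},
    {"type": "BuyPlaced", "order": 4, "qty": 150, "price": 200},
    {"type": "Matched", "buy": 4, "sell": 1, "qty": 100},
    {"type": "Matched", "buy": 4, "sell": 3, "qty": 50},
]

# Document representation: the outcome of those changes.
document = {
    "sell": {2: {"qty": 75, "price": 201}, 3: {"qty": 50, "price": 200}},
    "buy": {},
}


def to_document(history):
    """Derive the document from the events, to show the two are equivalent."""
    book = {"sell": {}, "buy": {}}
    for e in history:
        if e["type"] == "SellPlaced":
            book["sell"][e["order"]] = {"qty": e["qty"], "price": e["price"]}
        elif e["type"] == "BuyPlaced":
            book["buy"][e["order"]] = {"qty": e["qty"], "price": e["price"]}
        elif e["type"] == "Matched":
            for side in ("buy", "sell"):
                order = book[side][e[side]]
                order["qty"] -= e["qty"]
                if order["qty"] == 0:
                    del book[side][e[side]]   # fully matched orders leave the book
    return book
```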

Domain Models

We can wrap either of these representations into a domain model.

In either case, the core interface that the application interacts with is unchanged -- the application doesn't need to know anything about how the underlying state is represented. It just needs to know how to communicate changes to the model. Thus, the classes playing the role of the "aggregate root" have the same exposed surface. It might look like

The underlying implementations of the trade book entity are effectively the same. Using a document based representation, we would see an outline like

Each time an order is placed, the domain model updates the local copy of the document. We get the same shape if we use the event backed approach...

Same pattern, same shape. When you introduce the idea of persistence -- repositories, and copying the in memory representation of the data to a durable store, the analogs between the two hold. The document representation fits well with writing the state to a document store, and can of course be used to update a relational model; perhaps with assistance of an ORM. But you could just as easily copy the event "document" into the store, or use the ORM to transform the collection of events into some relational form. It's just data at that point.

There are some advantages to the event representation when copying state to the durable store. Because the events are immutable, you don't need to even evaluate whether the original entries in the list have changed. You don't have to PUT the entire history; you can simply PATCH the durable store with the updates. These are optimizations, but they don't change the core of the patterns in any way.

Projections

Event histories have an important property, thanks to their append only nature -- updates are non-destructive. You can create from the event history any document representation you like; you only need to have an understanding of how to represent a history of no events, and how each event type in turn causes the document representation to change.

What we have here is effectively a state machine; you load the start state and then replay each of the state transitions to determine the final state.

This is a natural approach to take when trying to produce "read models", optimized for a particular search pattern. Load the empty document, replay the available events into it, cache the result, obtain more events, replay those, cache this new result, and so on. If the output representation is lost or corrupted, just discard it and replay the complete history of the model.

There are three important facets of these projections to pay attention to

First, the motivation for the projections is that they serve queries much more efficiently than trying to work with the event history directly.

Second, that because replaying an entire event history can be time consuming, the ability to resume the projection from a previously cached state is a productivity win.

Third, that a bit of latency in the read use case is typically acceptable, because there is no risk that querying stale data will corrupt the domain.
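A sketch of such a projection, assuming events are simple (kind, price, quantity) tuples and the read model is the open quantity at each price; both are hypothetical choices. The function can start from the empty document or resume from a previously cached state and position.

```python
def project(events, state=None, position=0):
    """Fold events into a read model, resuming from a cached (state, position)."""
    # An empty dict is the representation of "a history of no events".
    state = dict(state) if state is not None else {}
    for event in events[position:]:
        kind, price, qty = event
        if kind == "placed":
            state[price] = state.get(price, 0) + qty
        elif kind == "filled":
            state[price] = state.get(price, 0) - qty
    return state, len(events)
```

Resuming from the cache and replaying from scratch give the same answer; the cache only saves the work of re-reading the prefix.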

The Tangle

Most non-trivial domain models require some queries when updating the model. For instance, when we are processing an order, we need to know which previously unmatched orders were posted with a matching price. If the domain requires first in first out processing, then you need the earliest unmatched order.

Since projections are much better for query use cases than the raw event stream, the actual implementation of our event backed model probably cheats by first creating a local copy of a suitable projection, and then using that to manage the queries.

That "solves" the problem of trying to use the event history to support queries directly, but it leads directly into the second issue listed above; processing the entire event history on demand is relatively expensive. You'd much prefer to use a cached copy.

And while using a cached copy for a write is fine, using a stale copy for a write is not fine. The domain model must be working with representations that reflect the entire history when enforcing invariants. More particularly, if a transaction is going to be consistent, then the events calculated at the conclusion of the business logic must take into account earlier events in the same transaction. In other words, the projection needs to be continuously updated by the work in progress.

This leads to a design where we are using two coordinated data models to support writes: the event backed representation that will eventually be used to update the durable store, and the document backed representation that is used to support the queries needed to enforce the invariant. The trade book, in effect, becomes its own cache.

We could, of course, mutate the document directly, rather than projecting the new events into it. But that introduces a risk that the document representation we have now won't match the one we create later when we have only the events to work from. Projecting the events, by contrast, ensures that any projections we make for supporting reads will have the same state that was used while performing the writes.

To add to the confusion: once the document representation of the model has been rehydrated, the previously committed events don't contribute; they aren't going to be changed, the document supports the queries, updating the event store is only going to append the new information.... Consequently, the existing history gets discarded, and the use case only tracks the new events that have been discovered locally in this update.
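Pulling those pieces together, a sketch of an event backed trade book that acts as its own cache might look like this. Everything here is hypothetical, and to keep it short only the sell side is held: an unmatched remainder of a buy order is simply dropped, which a real book would not do.

```python
class TradeBook:
    """Event backed sell side of a book that keeps a projection for its queries."""

    def __init__(self, history=()):
        self.open_sells = []     # projection: [order id, qty, price], oldest first
        self.new_events = []     # only the events discovered in this update
        for event in history:    # rehydrate; the history itself is not retained
            self._apply(event)

    def place_sell(self, order, qty, price):
        self._record(("SellPlaced", order, qty, price))

    def place_buy(self, order, qty, price):
        # Query the projection for matching sells, first in first out.
        for sell_id, available, sell_price in list(self.open_sells):
            if qty == 0:
                break
            if sell_price == price:
                matched = min(qty, available)
                self._record(("Matched", order, sell_id, matched))
                qty -= matched

    def _record(self, event):
        # Projecting immediately lets later queries in the same transaction
        # see the earlier events.
        self.new_events.append(event)
        self._apply(event)

    def _apply(self, event):
        if event[0] == "SellPlaced":
            _, order, qty, price = event
            self.open_sells.append([order, qty, price])
        elif event[0] == "Matched":
            _, _buy, sell_id, qty = event
            for entry in self.open_sells:
                if entry[0] == sell_id:
                    entry[1] -= qty
            self.open_sells = [e for e in self.open_sells if e[1] > 0]
```

Because the same _apply is used for rehydration and for work in progress, a book rebuilt later from the old history plus new_events arrives at the same projection that was used while handling the write.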

Thursday, March 2, 2017

DDD Repository Interfaces

Composed in response to Vladimir Khorikov.

One issue is that the above interface doesn’t constitute an actual abstraction. It just duplicates the concrete class’s functionality. The Principle of Reused Abstractions tells us that, in order for an interface to become one, it needs to have more than one implementation.

If we reboot the Wayback Machine, and take a look at the description provided by Eric Evans, we see at once the existence of other implementations.

Another transition that exposes technical complexity that can swamp the domain design is the transition to and from storage. This transition is the responsibility of another domain design construct, the REPOSITORY

The idea is to hide all the inner workings from the client, so that client code will be the same whether the data is stored in an object database, stored in a relational database, or simply held in memory.

I would certainly expect to see an in memory implementation, used by tests that protect me from errors in refactoring -- I'm going to burn the world as soon as the test passes anyway, so neither writing nor running integration boilerplate adds value.
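A sketch of what I have in mind, with hypothetical names; the client function is the part that stays the same whichever implementation is plugged in.

```python
from abc import ABC, abstractmethod


class OrderRepository(ABC):
    """What client code sees; it does not reveal where the data lives."""

    @abstractmethod
    def get(self, order_id): ...

    @abstractmethod
    def save(self, order_id, order): ...


class InMemoryOrderRepository(OrderRepository):
    """Implementation for tests: nothing survives the process."""

    def __init__(self):
        self._orders = {}

    def get(self, order_id):
        return self._orders.get(order_id)

    def save(self, order_id, order):
        self._orders[order_id] = order


def add_line(repository, order_id, line):
    """Client code, identical whichever implementation is supplied."""
    order = repository.get(order_id) or []
    repository.save(order_id, order + [line])
```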

But Vladimir raises an interesting point

Note that neither integration tests, nor unit tests would require seams that “abstract” the database out from the rest of the code. Unit tests just don’t involve anything other than isolated domain logic. Integration tests verify the database directly as part of the bounded context.

I love that -- it really shows that he has dug deeper into the question, to really think about the principles involved and whether or not they fit.

An application database (a database fully devoted to a single bounded context) is one of such systems. It belongs to your application only and not shared with anyone else.

An application with multiple writers is sharing. Your isolated domain logic doesn't share anything, so you can't check the behavior of conflicting writes that way. Trying to introduce conflicts, in all the paths that you need, during integration testing threatens many nightmares because of the combinatoric explosion of possibilities. If you are going to be refactoring your contingency pathways, you need a seam that discounts the overhead of checking the error to the point that you will actually pay the price. That requires a seam somewhere between the command handler and the process boundary, and the price drops as you get closer to the handler.

In addition, that seam is a natural place to introduce an in memory cache; why reload an object from the book of record when the copy that you saved is still available? Why treat that optimization as an all or nothing affair when each composition root could be making its own choice on a case by case basis?

Vladimir is absolutely right that the repository (as written here) doesn't really align properly with boundaries. That thought is worth exploring in more detail.

Wednesday, April 20, 2016

Shopping Carts and the Book of Record.

If I'm shopping for books on Amazon, I can look at my shopping list, click on a book, and have it added to my shopping cart. For some items, Amazon will decline to add the item to my cart, and inform me, perhaps, that the item is no longer available.

At the grocery store, no matter how many times I click on the shopping list, the Fruit Loops don't appear in my cart. I have to place the box in my cart by hand, next to the salad dressing that my phone says I can't put in the cart because it has been discontinued, and the milk that I can't put in my cart because it has expired.

If creating a user isn't a lot more fun than sending a command message to an aggregate, you are doing it wrong.

We often want representations of entities that we don't control, because the approximation we get by querying our representations is close enough to the answer we would get by going off to inspect the entities in question, while being much more convenient.

But if the entities aren't under our control, we have no business sending commands to the representations. Our representations don't have veto power over the book of record.

Aggregates only make sense when your domain model is the book of record.

Which means that you have no ability to enforce an invariant outside of the book of record. You can only query the available history, detect inconsistencies, and perhaps initiate a remediation process.

On Read Models

I learned something new about read models today.

Most discussions I have found so far emphasize the separation of the read model(s) from the write model.

For example, in an event sourced solution, the write model will update the event history in the book of record. Asynchronously, new events are read out of the book of record, and published. Event handlers process these new events, and produce updated projections. The read model answers queries using the most recently published projection. Because we are freed from the constraints of the write model, we can store the data we need in whatever format gives us the best performance; reads are "fast".

But by the time the read models can access the data, the data is old -- there's always going to be some latency between "then" and "now". In this example, we've had to wait for the events to be published (separately from the write), and then for the event handlers to consume them, construct the new projections, and store them.

What if we need the data to be younger?

A possible answer: have the read models pull the events from the book of record directly, then consume the events directly. It's not free to do this -- the read model has to do its own processing, which adds to its own latency. The record book is doing more work (answering more queries), which may make your writes slower, and so on.

But it's a choice you can make; selecting which parts of the system get the freshest data, and which parts of the system can trade off freshness for other benefits.

Example

In some use cases, after handling a command, you will want to refresh the client with a projection that includes the effect of the command that just completed; think Post/Redirect/Get.

In an event sourced solution, one option is to return, from the command, the version number of the aggregate that was just updated. This version number becomes part of the query used to refresh the view.

In handling the query, you compare the version number in the query with that used to generate the projection. If the projection was assembled from a history that includes that version, you're done -- just return the existing projection.

But when the query is describing a version that is still in the future (from the perspective of the projection), one option is to suspend the query until history "catches up", and a new projection is assembled. An alternative approach is to query the book of record directly, to retrieve the missing history and update a copy of the projection "by hand". More work, more contention, less latency.
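
A minimal sketch of that comparison, in Java. Everything here is hypothetical naming of my own (Projection, VersionedQueryHandler, a List of strings standing in for the book of record), and it shows only the "catch up by hand" branch; suspending the query until history catches up is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical projection: remembers the version of the last event folded in.
final class Projection {
    final long version;
    final List<String> lines;

    Projection(long version, List<String> lines) {
        this.version = version;
        this.lines = List.copyOf(lines);
    }

    // Fold one more event into a copy of this projection.
    Projection apply(String event) {
        List<String> next = new ArrayList<>(lines);
        next.add(event);
        return new Projection(version + 1, next);
    }
}

final class VersionedQueryHandler {
    // Stand-in for the book of record: the event at index i has version i + 1.
    private final List<String> bookOfRecord;
    // The most recently published projection, possibly stale.
    private final Projection published;

    VersionedQueryHandler(List<String> bookOfRecord, Projection published) {
        this.bookOfRecord = bookOfRecord;
        this.published = published;
    }

    Projection query(long requestedVersion) {
        // The projection already includes the requested version: we're done.
        if (published.version >= requestedVersion) {
            return published;
        }
        // The requested version isn't even in the book of record yet.
        if (requestedVersion > bookOfRecord.size()) {
            throw new IllegalStateException("version " + requestedVersion + " not written yet");
        }
        // Otherwise read the missing history directly and update a copy.
        Projection copy = published;
        for (int i = (int) published.version; i < bookOfRecord.size(); i++) {
            copy = copy.apply(bookOfRecord.get(i));
        }
        return copy;
    }
}
```

The published projection is never mutated here; the caught-up copy is thrown away after the query, which is where the extra work and extra contention come from.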

If the client is sophisticated enough to be carrying a local copy of the model, it can apply its own changes to that model, provisionally; reconciling the model when the next updates arrive from the server. That supports the illusion of low latency without creating additional work for the book of record (but might involve later reconciliation).

Tuesday, February 9, 2016

Event Sourcing: on Event Handlers

One of the things I've been doing in my toy "study" problem has been to implement an in memory event store. That means no persistence, per se, but all of the block and tackle of getting data to move from the "write model" to the "read model".

In particular, I've been taking pains to ensure that the asynchronous points in the data transfer are modeled that way -- I'm using a DirectExecutorService to run the asynchronous tasks, but I want to make sure that I'm getting them "right".

So, for this toy event store, I use the streamIds as keys to a hash; the object that comes out is a description of the stream, including a complete list of the events in that stream. Each write is implemented as a task submitted to the executor service, which uses a lock to ensure that only one thread writes to the event store at a time. The commit method replaces a volatile reference to the hash with a reference to an updated copy, producing an atomic commit. As the toy problem has very forgiving SLAs, writes are not merely appends to the stream, but actually check for conflicts, duplication, and so on.
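
Roughly, the store looks something like the sketch below. This is a reconstruction for illustration, not the actual code: the names (ToyEventStore, append, read) are made up, events are plain strings, the executor service is left out, and the conflict check is reduced to an expected-version comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToyEventStore {
    private final Object writeLock = new Object();

    // Readers only ever see a complete, immutable snapshot of the hash.
    private volatile Map<String, List<String>> streams = Map.of();

    List<String> read(String streamId) {
        return streams.getOrDefault(streamId, List.of());
    }

    // Returns the new length of the stream, or -1 if the caller's
    // expected version no longer matches (a conflict).
    long append(String streamId, long expectedVersion, List<String> events) {
        synchronized (writeLock) {          // one writer at a time
            List<String> current = read(streamId);
            if (current.size() != expectedVersion) {
                return -1;
            }
            List<String> updated = new ArrayList<>(current);
            updated.addAll(events);

            Map<String, List<String>> copy = new HashMap<>(streams);
            copy.put(streamId, List.copyOf(updated));

            // The commit: swap the volatile reference to the updated copy.
            streams = Map.copyOf(copy);
            return updated.size();
        }
    }
}
```

Copying the whole hash on every write is wasteful, but for a toy it keeps the commit down to a single reference assignment.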

Riddle: how to now update the read model. The transaction is the write to the volatile memory location, and if that part succeeds the client should be informed. So we really can't do any sort of synchronous notification of the read model. Instead, another task is scheduled to perform the update.

What should that task do? Pub/sub! which is right, but deceptively so. The basic idea is fine - we're going to asynchronously dispatch a message to an event queue, and all the subscribers will pick up that update and react.

What's the message, though? I had been thinking that we could just enumerate the events, or possibly the collection of events, but that makes a mess on the downstream side. The two basic issues being (a) the broadcast is asynchronous, so you really need the message handling to be idempotent, and (b) being asynchronous, the messages can arrive out of order.

Which means that simply publishing each of the domain events onto an asynchronous bus is going to make a mess of the event handlers, which all need a bunch of sequencing logic to repair the inevitable ordering edge cases.

Too much work.

The key clue is that the event sourced projections, process managers, and so on aren't really interested in a stream of events, so much as they are interested in a sequence of events. That sequence already exists in the write model, so the key idea is to not screw it up; we should be pushing/polling for updates to the sequence, rather than trying to track things at the level of the individual domain events.

The answer is to think in terms of publishing the cursor position for each stream.

In the write model, we push the events to the store as before. But we keep track of the positions in the stream that we have just written. After the transaction has been committed, we schedule an asynchronous task to push an event describing the new cursor position to the pub/sub system. Each event handler subscribes to that queue, and on each message compares the cursor position to its own high water mark; if there is further progress to be made, the handler fetches an ordered sub sequence of the events from the stream.

A potentially interesting byproduct of this idea: the write can return the cursor position to the caller, which can then use that position to rebuild its next view. A reader that knows the specific position that it is waiting for can block until the read model has been updated to that point.

Because each of the event handlers is tracking its own high water mark, the cursor update messages are trivial to handle idempotently; the incorrectly ordered update messages are trivial to recognize and drop.
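
Here is a sketch of what such a handler might look like, assuming a single stream. The names (CursorHandler, onCursor) are hypothetical, and a shared list stands in for fetching from the event store.

```java
import java.util.ArrayList;
import java.util.List;

final class CursorHandler {
    // Stand-in for the stream in the event store; append-only.
    private final List<String> stream;
    // Events this handler has processed, in stream order.
    private final List<String> seen = new ArrayList<>();
    private int highWaterMark = 0;

    CursorHandler(List<String> stream) {
        this.stream = stream;
    }

    // Called for each cursor message from the pub/sub system.
    // Returns the number of events processed by this call.
    synchronized int onCursor(int position) {
        // Duplicate or out-of-order message: nothing new, drop it.
        if (position <= highWaterMark) {
            return 0;
        }
        // Fetch the ordered sub sequence between our mark and the cursor.
        int end = Math.min(position, stream.size());
        List<String> batch = new ArrayList<>(stream.subList(highWaterMark, end));
        seen.addAll(batch);
        highWaterMark = end;
        return batch.size();
    }

    List<String> seen() {
        return List.copyOf(seen);
    }
}
```

Note that the handler never trusts the message for content, only for the hint that there is progress to be made; the ordering always comes from the stream itself.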

Friday, January 8, 2016

Coordinated Entities

Question: how do you decide if two different entities belong in the same aggregate?

I've been puzzling over this for a while now, looking for the right sorts of heuristics to apply.

The book answer is straightforward, without being useful. The aggregate boundary encompasses all of the state required to maintain the business invariant. So if you know what the business invariant is, then the problem is easy. You start with an aggregate of a single entity, then you fold in all of the business rules that reference the state of the entity, then you fold in all of the entities touched by those rules, and then fold in more rules... it's turtles until you reach a steady state. Then that aggregate, at least, is complete. You set it aside, pick a new entity, and repeat the process until all the entities in the domain have been assigned to an aggregate.

In any interesting problem space, the invariant is not so clearly defined. Most of the discussions describing the evolution of a model talk about the discovery that the model is missing some element of the Ubiquitous Language, and that inspires someone to recognize why some use case has been broken, or incredibly difficult to implement. Or that the Ubiquitous Language has actually been missing some important concept, that -- once expressed -- brings new clarity to the actual requirements of the business. Most of the refactoring exercises I have seen have described cases where entities were too tightly coupled; contention between unrelated entities was making the system harder to use.

Lesson I learned today:

Thinking about the happy path doesn't inform anything. Any composition of the objects will do when the command history never violates any business rules. The interesting cases are partial failures.

Contention, as noted previously, is a primary pressure to separate entities. Commands are being applied to different entities, where there should be no interplay between the affected states. Yet if both commands are being run through the same aggregate root, then one otherwise satisfactory command will fail because it happened to be trying to commit after a different command has already advanced the history of the aggregate. This is a failure of interference between uncoordinated commands. The inverse problem is when two coordinated commands are broadcast to separate entities, where one command succeeds and the other fails.

Thought experiment: suppose that we were to model these two entities in separate aggregates, so that they are participating in different transactions. What would this coordination failure look like in the event stream? Well, you would be watching the events go by, and you would see the history of the successful command, and then you would wait, and wait, and you would never see the history from the other aggregate.

Let's put a scope on it - we have a coordination contingency if some specified amount of time passes without seeing the missing history. That we are watching the event history, and thinking about the passage of time, announces at once that we are considering a process manager, which is an entity that implements a state machine. Within its own transactions, a process manager will emit events describing the changes to the state machine, asynchronously schedule calls to itself (a time trigger), and perhaps dispatch asynchronous commands to the domain model.

There's some block and tackle to be done at this point -- the processManager is an entity in its own right, and we need to be sure that the observed events are dispatched to "the right one". We're going to need some meta data in the events to ensure that they are all going to the right destination.

Back to our experiment; the history of the first command arrives. We load a process manager and pass the event to it. The process manager uses its SLA to schedule a message to itself at some time in the future. Time passes; the scheduled message is delivered. The process manager fires the timeout trigger into its state machine, arrives at the Contingency state, and writes that event into the log.
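
A sketch of that state machine, with made-up names (CoordinationProcess, the state names other than Contingency, the event strings). To keep it deterministic the clock is passed in as a parameter, and "scheduling a message to itself" is reduced to remembering a deadline; a real process manager would hand that to a scheduler.

```java
import java.util.ArrayList;
import java.util.List;

final class CoordinationProcess {
    enum State { WAITING_FOR_FIRST, WAITING_FOR_SECOND, COMPLETED, CONTINGENCY }

    private final long slaMillis;
    private final List<String> log = new ArrayList<>();
    private State state = State.WAITING_FOR_FIRST;
    private long deadline = -1;

    CoordinationProcess(long slaMillis) {
        this.slaMillis = slaMillis;
    }

    // The history of the first command has been observed.
    void onFirstHistory(long now) {
        if (state != State.WAITING_FOR_FIRST) return;
        state = State.WAITING_FOR_SECOND;
        deadline = now + slaMillis;     // the message to our future self
        log.add("TimeoutScheduled");
    }

    // The history from the other aggregate showed up in time.
    void onSecondHistory() {
        if (state != State.WAITING_FOR_SECOND) return;
        state = State.COMPLETED;
        log.add("Coordinated");
    }

    // The scheduled message is delivered.
    void onTimeout(long now) {
        if (state != State.WAITING_FOR_SECOND || now < deadline) return;
        state = State.CONTINGENCY;
        log.add("ContingencyDetected");
    }

    State state() { return state; }

    List<String> log() { return List.copyOf(log); }
}
```

A timeout that arrives after the second history has been seen is simply ignored, so the late trigger is harmless.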

How does that help?

It gives us something to look for in the Ubiquitous Language. If the coordinated entities really do belong in separate aggregates, then this contingency is a thing that really happens in the business, and so somebody should know about it, know the requirements for mitigating the contingency, what events should appear in the log to track the mitigation progress, and so on.

On the other hand, if the domain experts begin saying "that can't happen", "that MUST NOT happen", "that is too expensive when it happens, which is why we have you writing software to prevent it", and so forth; then that is strong evidence that the two entities in question need to be modeled as part of the same aggregate.

Monday, November 9, 2015

Domain Driven Design vs REST

For a couple of weeks now, I've been banging my head against Domain Driven Design (DDD), Command Query Responsibility Segregation (CQRS), and Representational State Transfer (REST).

I had been making a big, and probably common mistake: I started looking for nouns. The ubiquitous language gives me lovely nouns, and they seemed a natural fit for resources.

But I couldn't get the same natural feeling from the verbs. In the ubiquitous language, I've got all of these lovely expressive verbs to motivate change in my business model. In my link relations, I've got GET, PUT, POST, DELETE.

I finally tracked down Jim Webber's DDD in the Large presentation. Yeah, that helped.

Application vs Domain

Taking it very slowly: the key idea underlying REST is "Hypermedia as the Engine of Application State".

Application State.

Why am I thinking about trying to represent my aggregate roots as resources? Those are two completely different layers! The application layer talks to the domain layer, there's an interface between the two, but there's no particular reason to expect a one to one correspondence.

All of the RESTful bits are going to be over there, with the anti corruption logic.

Commands as Resources

Resources were the second bit that I had flat out gotten wrong. It had occurred to me that I could just cheat, turn the problem around, and use my commands as resources.

What Jim's talk clarified for me: that's not cheating, it's the whole damn point.

The hand wavy argument is that we are looking for nouns, and the commands, as messages, are the noun that we want. No kidding, our resources are representations of little pieces of paper, a ToDo on a post it note, that the client is passing to the server.

It still feels like a cheat to me.

But Jim in his talks points out, rightly, that if you are communicating over HTTP, then you are using a document management system to communicate your application's state. So if you aren't passing documents, you're clearly Doing It Wrong.

That's getting closer, but I needed to make one more connection to sell myself.

Stepping back from the problem; ignore the REST constraint completely. The client, over there, needs to communicate with the server here. We are crossing a boundary; if we watch on the wire, we're not going to see objects in the transfer, but data. We see the same thing when the client queries a projection - Data Transfer Objects are being exchanged.

Data Transfer Object is a synonym for document. Oh.

I find it a little bit easier to sneak up on the idea by looking at the interaction with the read model. The client sends us a query, and we send back some projection of the model. It doesn't make sense to think about caching a model (it changes in time), and caching the projection (which also changes in time) is similarly dubious, but caching a snapshot of the projection at some point in time -- that does make sense. "Get me the report as of Time.now()" sure sounds like a document to me.

Something similar happens with domain events, and the communication between the read model and the write model. "Stream of Events" might not sound like a document, but journal, ledger, log -- those certainly are.

Commands as documents? I mentioned the ToDo analog earlier, but another good fit would be orders.

Model Change

Of course, the whole point to this mess is to have an application that can interact with the model, so something needs to connect the two.

The read model, that's easily managed -- the queries arrive, the appropriate event history is loaded into the projection, and a report is generated and delivered. All of these steps are idempotent and safe - we might change some state in memory, like caching the projection data for a time in case we are about to need it, but the model and the event history are not changed at all.

The write model is more difficult - changing the model is a side effect of the arrival of the command, and the command may arrive more than once.

For instance, the client puts a command, the command is received and executed, but the acknowledgement of the command is lost in transit. As PUT is supposed to be idempotent, the client may send the command document a second time.

The model should be running the commands given to it. So you need either the command handlers in the model to be idempotent, or the anti-corruption layer to do the right thing when the duplicated command arrives.

An example scenario: Alice puts command.id:1 to the server. The server executes the command, updating the history of the model, and publishes a reply. That reply is lost in transit. Bob puts command.id:2 to the server, updating the model further. Alice times out waiting for the acknowledgment that her command arrived, and resends it. Charlie puts command.id:3 to the server, but his data is stale because he was working from a state prior to Alice's first command.

What should the responses look like?

Bob clearly delivered a command that was executed successfully, so he should see a status 201 Created message. Charlie's command should be rejected, because the preconditions under which he submitted the command were not met, which probably means a 409 or 412 response. Alice's first command should, like Bob's, get a 201. When she resubmits the same command a second time, it is still supposed to be an idempotent operation, so she should still be getting the 201 response (and not the error code seen by Charlie).

That probably means a message store: to cache the response in the application layer for a time so that it can be replayed without interacting with the model.
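
As a sketch of how that could play out for Alice, Bob, and Charlie: the class below is hypothetical (CommandGateway and its put method are names invented here), the "model" is reduced to a version counter, the precondition is a simple expected-version check, and 409 is picked arbitrarily as the rejection status.

```java
import java.util.HashMap;
import java.util.Map;

final class CommandGateway {
    // The message store: command id -> status code of the original reply.
    private final Map<String, Integer> responses = new HashMap<>();
    // Stand-in for the model: each executed command advances the version.
    private long modelVersion = 0;

    synchronized int put(String commandId, long expectedVersion) {
        // Seen this command before? Replay the reply, don't touch the model.
        Integer cached = responses.get(commandId);
        if (cached != null) {
            return cached;
        }
        int status;
        if (expectedVersion != modelVersion) {
            status = 409;               // stale precondition: reject
        } else {
            modelVersion++;             // execute the command
            status = 201;
        }
        responses.put(commandId, status);
        return status;
    }
}
```

With that in place, Alice's resend of command.id:1 gets the cached 201 even though the model has since moved on, while Charlie's command.id:3, carrying the same stale version, gets the 409. Expiring entries from the store is not shown.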

My feeling is that the command should be considered immutable by the client; a second put that doesn't agree with the first should be rejected. The delete by the client might be a way to incorporate an acknowledgement, and expire the data in the message store.