Friday, January 8, 2016

Coordinated Entities

Question: how do you decide if two different entities belong in the same aggregate?

I've been puzzling over this for a while now, looking for the right sorts of heuristics to apply.

The book answer is straightforward, without being useful. The aggregate boundary encompasses all of the state required to maintain the business invariant. So if you know what the business invariant is, then the problem is easy. You start with an aggregate of a single entity, then you fold in all of the business rules that reference the state of that entity, then you fold in all of the entities touched by those rules, and then fold in more rules... it's turtles until you reach a steady state. Then that aggregate, at least, is complete. You set it aside, pick a new entity, and repeat the process until all the entities in the domain have been assigned to an aggregate.
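
For what it's worth, that folding process can be sketched mechanically. Here is a rough illustration in TypeScript, under the generous assumption that you can enumerate, for each business rule, the entities whose state it reads; all of the names are hypothetical.

    // Illustrative sketch only: assumes every business rule can list the
    // entities whose state it references. Names are hypothetical.
    type Entity = string;

    interface BusinessRule {
      name: string;
      touches: Entity[];   // entities whose state this rule reads
    }

    // Grow an aggregate boundary to a fixed point: start from one entity,
    // pull in every rule that references something inside the boundary,
    // then every entity those rules touch, and repeat until nothing changes.
    function aggregateBoundary(seed: Entity, rules: BusinessRule[]): Set<Entity> {
      const boundary = new Set<Entity>([seed]);
      let changed = true;
      while (changed) {
        changed = false;
        for (const rule of rules) {
          if (rule.touches.some(e => boundary.has(e))) {
            for (const e of rule.touches) {
              if (!boundary.has(e)) {
                boundary.add(e);
                changed = true;
              }
            }
          }
        }
      }
      return boundary;
    }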

In any interesting problem space, the invariant is not so clearly defined. Most of the discussions describing the evolution of a model talk about the discovery that the model is missing some element of the Ubiquitous Language, and that inspires someone to recognize why some use case has been broken, or incredibly difficult to implement. Or that the Ubiquitous Language has actually been missing some important concept that -- once expressed -- brings new clarity to the actual requirements of the business. Most of the refactoring exercises I have seen have described cases where entities were too tightly coupled; contention between unrelated entities was making the system harder to use.

Lesson I learned today:

Thinking about the happy path doesn't inform the design: any composition of the objects will do when the command history never violates any business rules. The interesting cases are the partial failures.

Contention, as noted previously, is a primary pressure to separate entities. Commands are being applied to different entities, where there should be no interplay between the affected states. Yet if both commands are run through the same aggregate root, then one otherwise satisfactory command will fail because it happened to be trying to commit after a different command had already advanced the history of the aggregate. This is a failure of interference between uncoordinated commands. The inverse problem is when two coordinated commands are broadcast to separate entities, and one command succeeds while the other fails.
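
To make the interference case concrete, here is a minimal sketch, assuming an event store guarded by optimistic concurrency; the stream id, event types, and error type are all made up for illustration.

    // Two unrelated commands routed through the same (too-large) aggregate,
    // guarded by an optimistic concurrency check on the stream version.
    class ConcurrencyError extends Error {}

    class EventStore {
      private versions = new Map<string, number>();

      commit(streamId: string, expectedVersion: number, events: object[]): void {
        const current = this.versions.get(streamId) ?? 0;
        if (current !== expectedVersion) {
          throw new ConcurrencyError(
            `expected version ${expectedVersion}, stream is at ${current}`);
        }
        this.versions.set(streamId, current + events.length);
      }
    }

    const store = new EventStore();
    // Both commands loaded the aggregate at version 0...
    store.commit("aggregate-1", 0, [{ type: "ShippingAddressChanged" }]);
    try {
      // ...so the second commit is rejected, even though the state it
      // touched had nothing to do with the first command.
      store.commit("aggregate-1", 0, [{ type: "LineItemAdded" }]);
    } catch (e) {
      console.log(`unrelated command rejected: ${(e as Error).message}`);
    }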

Thought experiment: suppose that we were to model these two entities in separate aggregates, so that they are participating in different transactions. What would this coordination failure look like in the event stream? Well, you would be watching the events go by, and you would see the history of the successful command, and then you would wait, and wait, and you would never see the history from the other aggregate.

Let's put a scope on it - we have a coordination contingency if some specified amount of time passes without seeing the missing history. That we are watching the event history, and thinking about the passage of time, announces at once that we are considering a process manager, which is an entity that implements a state machine. Within its own transactions, a process manager will emit events describing the changes to the state machine, asynchronously schedule calls to itself (a time trigger), and perhaps dispatch asynchronous commands to the domain model.
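
A minimal sketch of that state machine, with hypothetical state and trigger names, kept as a pure function so the transitions are easy to reason about:

    type State = "AwaitingSecondHistory" | "Completed" | "Contingency";
    type Trigger = "SecondHistoryObserved" | "TimeoutElapsed";

    // Pure transition function: the process manager entity wraps this,
    // persisting the current state and emitting an event on each change.
    function transition(state: State, trigger: Trigger): State {
      if (state === "AwaitingSecondHistory") {
        if (trigger === "SecondHistoryObserved") return "Completed";
        if (trigger === "TimeoutElapsed") return "Contingency";
      }
      return state; // triggers arriving in a terminal state are ignored
    }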

There's some block and tackle to be done at this point -- the process manager is an entity in its own right, and we need to be sure that the observed events are dispatched to "the right one". We're going to need some metadata in the events to ensure that they all reach the right destination.
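
One possible shape for that routing concern, assuming the events carry a correlation id in their metadata; the field names here are illustrative, not prescriptive.

    interface EventEnvelope {
      type: string;
      metadata: { correlationId: string };   // assumed metadata field
      payload: unknown;
    }

    class ProcessManagerRepository<T> {
      private instances = new Map<string, T>();
      constructor(private create: (correlationId: string) => T) {}

      // The correlation id decides which process manager instance the
      // observed event is dispatched to, creating it on first sight.
      forEvent(envelope: EventEnvelope): T {
        const id = envelope.metadata.correlationId;
        let pm = this.instances.get(id);
        if (pm === undefined) {
          pm = this.create(id);
          this.instances.set(id, pm);
        }
        return pm;
      }
    }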

Back to our experiment: the history of the first command arrives. We load a process manager and pass the event to it. The process manager uses its SLA to schedule a message to itself at some time in the future. Time passes; the scheduled message is delivered. The process manager fires the timeout trigger into its state machine, arrives at the Contingency state, and writes that event into the log.
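
Putting the pieces together, here is one way that walkthrough could look in code; the in-memory scheduler stands in for real infrastructure, and the SLA value and event names are assumptions of mine, not anything the model dictates.

    type PmState = "AwaitingSecondHistory" | "Completed" | "Contingency";

    interface Scheduler {
      // Deliver the callback after `delayMs` milliseconds have passed.
      schedule(delayMs: number, deliver: () => void): void;
    }

    class CoordinationProcessManager {
      state: PmState = "AwaitingSecondHistory";
      emitted: string[] = [];   // events written to the process manager's own stream

      constructor(private slaMs: number, private scheduler: Scheduler) {}

      // The history of the successful command arrives: start the clock.
      onFirstHistoryObserved(): void {
        this.emitted.push("CoordinationStarted");
        this.scheduler.schedule(this.slaMs, () => this.onTimeout());
      }

      // The missing history shows up before the deadline after all.
      onSecondHistoryObserved(): void {
        if (this.state === "AwaitingSecondHistory") {
          this.state = "Completed";
          this.emitted.push("CoordinationCompleted");
        }
      }

      // The scheduled message is delivered and the other history never appeared.
      onTimeout(): void {
        if (this.state === "AwaitingSecondHistory") {
          this.state = "Contingency";
          this.emitted.push("CoordinationContingencyDetected");
        }
      }
    }

    // Wiring it up with a trivial scheduler, purely for illustration:
    const scheduler: Scheduler = {
      schedule: (delayMs, deliver) => { setTimeout(deliver, delayMs); },
    };
    const pm = new CoordinationProcessManager(30_000, scheduler);
    pm.onFirstHistoryObserved();   // begin waiting for the other aggregate's history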

How does that help?

It gives us something to look for in the Ubiquitous Language. If the coordinated entities really do belong in separate aggregates, then this contingency is a thing that really happens in the business, and so somebody should know about it, know the requirements for mitigating the contingency, know what events should appear in the log to track the mitigation progress, and so on.

On the other hand, if the domain experts begin saying "that can't happen", "that MUST NOT happen", "that is too expensive when it happens, which is why we have you writing software to prevent it", and so forth; then that is strong evidence that the two entities in question need to be modeled as part of the same aggregate.
