Cascade Failure: 2015

Monday, November 9, 2015

Domain Driven Design vs REST

For a couple of weeks now, I've been banging my head against Domain Driven Design (DDD), Command Query Responsibility Segregation (CQRS), and Representational State Transfer (REST).

I had been making a big, and probably common, mistake: I started looking for nouns. The ubiquitous language gives me lovely nouns, and they seemed a natural fit for resources.

But I couldn't get the same natural feeling from the verbs. In the ubiquitous language, I've got all of these lovely expressive verbs to motivate change in my business model. In my link relations, I've got GET, PUT, POST, DELETE.

I finally tracked down Jim Webber's DDD in the Large presentation. Yeah, that helped.

Application vs Domain

Taking it very slowly: the key idea underlying REST is "Hypermedia as the Engine of Application State".

Application State.

Why am I thinking about trying to represent my aggregate roots as resources? Those are two completely different layers! The application layer talks to the domain layer, and there's an interface between the two, but there's no particular reason to expect a one-to-one correspondence.

All of the RESTful bits are going to be over there, with the anti-corruption logic.

Commands as Resources

Resources were the second bit that I had flat out gotten wrong. It had occurred to me that I could just cheat, turn the problem around, and use my commands as resources.

What Jim's talk clarified for me: that's not cheating, it's the whole damn point.

The hand-wavy argument is that we are looking for nouns, and the commands, as messages, are the nouns that we want. No kidding, our resources are representations of little pieces of paper, a ToDo on a Post-it note, that the client is passing to the server.

It still feels like a cheat to me.

But Jim in his talks points out, rightly, that if you are communicating over HTTP, then you are using a document management system to communicate your application's state. So if you aren't passing documents, you're clearly Doing It Wrong.

That's getting closer, but I needed to make one more connection to sell myself.

Stepping back from the problem: ignore the REST constraint completely. The client, over there, needs to communicate with the server here. We are crossing a boundary; if we watch on the wire, we're not going to see objects in the transfer, but data. We see the same thing when the client queries a projection - Data Transfer Objects are being exchanged.

Data Transfer Object is a synonym for document. Oh.

I find it a little bit easier to sneak up on the idea by looking at the interaction with the read model. The client sends us a query, and we send back some projection of the model. It doesn't make sense to think about caching a model (it changes in time), and caching the projection (which also changes in time) is similarly dubious, but caching a snapshot of the projection at some point in time -- that does make sense. "Get me the report as of Time.now()" sure sounds like a document to me.
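To make the "report as of Time.now()" idea concrete, here is a minimal sketch. The names `ReportSnapshot` and `snapshot_report` are hypothetical, invented for this illustration; the only point is that the snapshot is frozen at a moment in time while the projection behind it keeps moving.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReportSnapshot:
    """An immutable document: the projection as it looked at `as_of`."""
    as_of: datetime
    rows: tuple


def snapshot_report(live_rows, as_of=None):
    """Freeze the current projection data into a cacheable document."""
    as_of = as_of or datetime.now(timezone.utc)
    # Copy into a tuple so later changes to the projection cannot leak in.
    return ReportSnapshot(as_of=as_of, rows=tuple(live_rows))


live = ["order-1 shipped"]
doc = snapshot_report(live, as_of=datetime(2015, 11, 9, tzinfo=timezone.utc))
live.append("order-2 shipped")  # the projection keeps changing...
# ...but the snapshot taken earlier still holds exactly one row.
```

Because the snapshot never changes, it can be cached for as long as anybody cares about that particular moment.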

Something similar happens with domain events, and the communication between the read model and the write model. "Stream of Events" might not sound like a document, but journal, ledger, log -- those certainly are.

Commands as documents? I mentioned the ToDo analog earlier, but another good fit would be orders.
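A rough sketch of an order written down as a document, assuming JSON on the wire. The `PlaceOrder` command and its fields are hypothetical; nothing here comes from Jim's talk.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PlaceOrder:
    """A command written down as a document (hypothetical fields)."""
    command_id: str
    customer_id: str
    sku: str
    quantity: int


def to_document(command):
    """Serialize the command; this is what actually crosses the wire."""
    return json.dumps(asdict(command), sort_keys=True)


def from_document(body):
    """Rebuild the command on the server side of the boundary."""
    return PlaceOrder(**json.loads(body))


order = PlaceOrder(command_id="1", customer_id="alice", sku="widget", quantity=3)
body = to_document(order)
```

What the client hands to the server is `body`, a piece of paper with an order on it, not the object.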

Model Change

Of course, the whole point to this mess is to have an application that can interact with the model, so something needs to connect the two.

The read model, that's easily managed -- the queries arrive, the appropriate event history is loaded into the projection, and a report is generated and delivered. All of these steps are idempotent and safe - we might change some state in memory, like caching the projection data for a time in case we are about to need it, but the model and the event history are not changed at all.

The write model is more difficult - changing the model is a side effect of the arrival of the command, and the command may arrive more than once.

For instance, the client puts a command, the command is received and executed, but the acknowledgement of the command is lost in transit. As PUT is supposed to be idempotent, the client may send the command document a second time.

The model should simply run the commands given to it. So either the model must handle commands idempotently, or the anti-corruption layer must do the right thing when the duplicated command arrives.

An example scenario: Alice puts command.id:1 to the server. The server executes the command, updating the history of the model, and publishes a reply. That reply is lost in transit. Bob puts command.id:2 to the server, updating the model further. Alice times out waiting for the acknowledgment that her command arrived, and resends it. Charlie puts command.id:3 to the server, but his data is stale because he was working from a state prior to Alice's first command.

What should the responses look like?

Bob, clearly, successfully delivered a command that was executed; he should see a status 201 Created message. Charlie's command should be rejected, because the preconditions under which he submitted the command were not met, which probably means a 409 (Conflict) or 412 (Precondition Failed) response. Alice's first command should, like Bob's, get a 201. When she resubmits the same command a second time, it is still supposed to be an idempotent operation, so she should still be getting the 201 response (and not the error code seen by Charlie).

That probably means a message store: to cache the response in the application layer for a time so that it can be replayed without interacting with the model.

My feeling is that the command should be considered immutable by the client; a second put that doesn't agree with the first should be rejected. The delete by the client might be a way to incorporate an acknowledgement, and expire the data in the message store.
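Here is a minimal sketch of that message store, replaying the Alice, Bob, and Charlie scenario. Everything in it is an assumption for illustration: the `CommandEndpoint` class, the integer `version` standing in for the model's history, and the use of 409 both for stale commands and for a second put that disagrees with the first.

```python
class CommandEndpoint:
    """Application-layer sketch: replays cached replies for duplicate PUTs."""

    def __init__(self):
        self.version = 0          # stand-in for the model's event history
        self.message_store = {}   # command id -> (body, status)

    def put(self, command_id, body, expected_version):
        if command_id in self.message_store:
            stored_body, status = self.message_store[command_id]
            # Same document again: replay the reply, do not touch the model.
            # A different document under the same id is rejected.
            return status if stored_body == body else 409
        if expected_version != self.version:
            status = 409  # stale precondition; the model is not changed
        else:
            self.version += 1  # "execute" the command against the model
            status = 201
        self.message_store[command_id] = (body, status)
        return status

    def delete(self, command_id):
        """Client acknowledgement: the cached reply can be expired."""
        self.message_store.pop(command_id, None)


server = CommandEndpoint()
alice_first = server.put("1", "rename to A", expected_version=0)  # reply lost
bob = server.put("2", "rename to B", expected_version=1)
alice_retry = server.put("1", "rename to A", expected_version=0)  # replayed
charlie = server.put("3", "rename to C", expected_version=0)      # stale
```

Alice's retry never reaches the model; the reply comes straight out of the store, which is why she sees the same 201 even though her expected version is stale by then.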

Friday, October 30, 2015

Domain Events and DTO Lifetimes

The client sends commands to the write model. If the write model doesn't understand the messages sent by the client, then (as far as that client is concerned), the model is effectively immutable. The effective lifetime of the command itself is very brief - we need momentary agreement.

The read model shares projections with the client. If the client doesn't understand the messages it receives, then (again, from the perspective of this client), the model is write only. The effective lifetime of the projection is again short; once the appropriate view has been updated in the client, the projection can be discarded - we need momentary agreement.

The write model shares events with the read model, but the pattern doesn't hold.

The distinction is simply this: events persist.

You might need to save off commands in a queue, to ensure that they don't stomp on each other; they may need to be scheduled. But we know that it has to be ok for commands to evaporate, because fail fast is a correct expression of congestion control when the application will not be able to meet the SLA.

Similarly, you might persist projections; but that's primarily a performance optimization -- when the cache expires the projection, it will be rebuilt. The client might want to insulate the user from the dynamic nature of the model for a time, but an eventually consistent view will eventually change. That's its nature.

Events are more than just a representation of change pushed across the boundary between the write model and the read model. They also cross the boundary between the write model of today, and the write model of the future.

In particular, that means that putting domain objects directly into the representation of the event is dangerous, because we expect to be aggressively and continuously refining the domain model as we learn more and more about it. In other words, the instability of the domain model in the scale of product lifetime cautions us against mapping our persistent messages too closely to the domain.

We need to prepare for event streams that include multiple instances of the same event emitted by different versions of the model. Which suggests that, for each message in the stream, we'll need a hint in the metadata that indicates the proper recipe for restoring the domain event -- as the model in the past would have written the event knowing when it was going to be read in the future.
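One way the metadata hint could work, as a sketch only: each stored message carries a type and a version, and a table of recipes lifts old payloads into the current shape. The `CustomerRenamed` event, its two versions, and the `UPCASTERS` table are all hypothetical.

```python
# Hypothetical recipes: each one restores an event written by an older model
# into the shape the current model expects.
def _v1_to_v2(payload):
    # Assumed change: v1 stored a single "name", v2 splits it in two.
    first, _, last = payload["name"].partition(" ")
    return {"first_name": first, "last_name": last}


UPCASTERS = {("CustomerRenamed", 1): _v1_to_v2}
CURRENT_VERSION = {"CustomerRenamed": 2}


def restore(stored):
    """Use the metadata hint to pick the recipe for this stored message."""
    kind = stored["meta"]["type"]
    version = stored["meta"]["version"]
    payload = stored["payload"]
    # Apply recipes one step at a time until the payload is current.
    while version < CURRENT_VERSION[kind]:
        payload = UPCASTERS[(kind, version)](payload)
        version += 1
    return kind, payload


stream = [
    {"meta": {"type": "CustomerRenamed", "version": 1},
     "payload": {"name": "Ada Lovelace"}},
    {"meta": {"type": "CustomerRenamed", "version": 2},
     "payload": {"first_name": "Grace", "last_name": "Hopper"}},
]
restored = [restore(message) for message in stream]
```

The stored events are never rewritten; only the reading side learns new recipes as the model moves on.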

Avro, tagging every event in the history of the model with the writer schema of that time? Thrift/ProtocolBuffers, and hope that the evolution of the events can be supported entirely by non-destructive schema changes? JSON, because you get the easy part of the answer for free? Take the hit to upgrade the immutable events in your store, so that all events are taken from the same version of the API?

My best guess today? You are going to need a schema eventually - this seems obvious to me as soon as other domains start subscribing to these events.

So the early guess is about how much value you can deliver before you take the plunge, and how expensive the first schema migration will be.

Friday, October 23, 2015

Can we enforce the transaction consistency boundaries with interfaces?

Maybe.

Let's first consider the case where we have an aggregate which includes a reference to another aggregate. That's perfectly reasonable, provided that the business is satisfied that coordinated changes between the aggregates are eventually consistent.

Now, each of the aggregates has its own commands (each changing its own state). Best practices suggest that we should only be modifying one aggregate per transaction; in other words, we should only be running command(s?) on one aggregate or the other.

Can we organize our code to enforce that?

I've been chewing on a remark from Greg Young, that getters and setters are evil. Setters, sure -- setters should instead be commands, written in the Ubiquitous Language. But getters? How on earth are you going to do anything useful with another object if you can't read it? What are you going to do with a Specification that can't read the object it is supposed to constrain?

I've chosen, for the moment, to understand his comment in this way: getters and setters have no place in the model; getters are perfectly acceptable in an immutable projection.

I'm borrowing these two ideas from Greg; ~~which I believe he lifted from an earlier generation of CQRS experts~~ [wrong - Greg is the earlier generation of CQRS experts]. Commands are sent to the model, which is optimized for validating and calculating all changes. Queries are sent to a projection -- there can be several -- which is optimized for reads, but may be stale.

So if we send a command to a model, and the execution of that command requires state from some other aggregate, then we need to hydrate the appropriate projection of the remote aggregate.

I had been blocked on this until recently, because I couldn't see past needing a getter to obtain the reference to the remote aggregate to do the hydration.

But the answer to that puzzle is to pass a DomainService as one of the arguments in the command. The root can look up the referenceId without needing to expose it, and pass that value to the service to get back an immutable projection with precisely the data that it needs.

Essentially, we are building into the signature of the command the contract that promises we won't change anybody else.
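A sketch of what that signature might look like. The `Order` aggregate, the `CreditService`, and the credit-limit rule are all invented for the example; the shape is what matters: the service is an argument to the command, and all the root ever gets back is an immutable view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditView:
    """Immutable projection of the remote aggregate: just what we need."""
    limit: int


class CreditService:
    """Hypothetical domain service; it only answers queries."""

    def __init__(self, limits):
        self._limits = dict(limits)

    def credit_for(self, customer_id):
        return CreditView(limit=self._limits[customer_id])


class Order:
    """Aggregate root; note there is no getter for _customer_id."""

    def __init__(self, customer_id):
        self._customer_id = customer_id
        self._total = 0

    def add_item(self, price, credit_service):
        # The root looks up its own reference and hands it to the service.
        credit = credit_service.credit_for(self._customer_id)
        if self._total + price > credit.limit:
            raise ValueError("credit limit exceeded")
        self._total += price  # only this aggregate's state changes


service = CreditService({"alice": 100})
order = Order("alice")
order.add_item(60, service)
try:
    order.add_item(60, service)  # would push the total past the limit
    rejected = False
except ValueError:
    rejected = True
```

Nothing in `add_item` can reach the customer aggregate itself, so the command has no way to modify it.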

Two use cases where I need more thought. The first is factory commands; calls into this aggregate to create a new instance of that aggregate. The second is a query on this aggregate to run a command on that one.

Another perspective on the problem: if the other aggregate is responsible for a business invariant, then it may throw a checked exception. I don't see how I can claim to be implementing a query that changes the model (in another aggregate), or an immutable object that throws exceptions.

My guess right now is that You Don't Do That. Instead, some hand waving happens in the Application Service fronting this mess that gets all the dancers on the correct step.

Are aggregate roots always entities?

Yes.

The critical characteristic of an aggregate root is that it acts as a transaction consistency boundary. In other words, it is responsible for changes that must always and immediately satisfy some business invariant.

That immediately rules out the possibility that it is a ValueObject, because ValueObjects are immutable. Any changes to a value reflected in the model are going to introduce a new instance of the value object.
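In code, the difference looks something like this sketch, which uses a frozen dataclass as a stand-in for a ValueObject (the `Money` type is hypothetical):

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Money:
    amount: int
    currency: str


price = Money(10, "USD")
raised = replace(price, amount=12)  # a new instance; `price` is untouched

try:
    price.amount = 12  # mutation in place is refused
    mutated = True
except FrozenInstanceError:
    mutated = False
```

There is no state here for a transaction to protect, which is why a value can't play the role of the root.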

Similarly, DomainEvents are also ruled out -- events are things that happened in the past, and we don't have time machines.

As an aside, I think the DDD Sample gets this wrong; the HandlingEvent aggregate is modeled as a DomainEvent. The description in the class header is

HandlingEvent's are sent from different Incident Logging Applications

Written in the active voice

Incident Logging Applications send HandlingEvents

and now I'm suspicious. Does the business really not track the source of these external events? Notice, also, that we never load a DomainEvent from the repository -- we only collect a history of events that match a TrackingId, which is a value object typically created within cargo.

DomainService is stateless, so there's no need for a transaction.

Sagas? I don't believe that's a fit; they are long-running business processes that span multiple transactions and potentially more than one aggregate.

ApplicationService doesn't fit because the business invariant belongs in the model.

Thursday, January 29, 2015

A surprise in graphite monitoring

Here at the office, we have been using graphite to collect runtime metrics broadcast by our JVM processes.

Some time back, we discovered that our single graphite collector (carbon-cache) could no longer keep up with the load. We replaced it with a carbon-relay, which uses consistent-hashing to distribute the work to multiple carbon-cache instances.

To keep an eye on things, I created a dashboard to monitor the activity of the carbon processes. It was comforting to see the distributed workload, as we went from a cpu bottleneck to an i/o bottleneck.

We later moved the whisper databases to an SSD, removing the i/o bottleneck.

One thing I had noticed - the load on the caches wasn't balanced; one carbon-cache instance was being asked to do more than its share of the work. I ported the hash ring logic into iPython Notebook to run some experiments, and discovered that my initial scheme for instance naming was pathologically bad. So the cache instances were renamed, and the balance of the load became a lot more even.
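The experiment looked roughly like the sketch below. To be clear, this is a generic consistent-hash ring written for illustration, not a copy of carbon's own hashing code, and the instance names and metric names are made up. The idea is simply to feed a pile of metric names through the ring and count where they land.

```python
import bisect
import hashlib
from collections import Counter


class HashRing:
    """Generic consistent-hash ring; NOT a copy of carbon's own code."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times to smooth the spread.
        self._ring = sorted(
            (self._position(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._positions = [position for position, _ in self._ring]

    @staticmethod
    def _position(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest[:8], 16)

    def node_for(self, key):
        # First ring entry at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._positions, self._position(key))
        return self._ring[index % len(self._ring)][1]


def load_per_node(ring, metric_names):
    """Count how many metrics each instance would be asked to store."""
    return Counter(ring.node_for(name) for name in metric_names)


metrics = [f"jvm.host{i}.heap.used" for i in range(10000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
load = load_per_node(ring, metrics)
```

Swapping in a different list of instance names and comparing the counts in `load` is enough to see whether a naming scheme spreads the work evenly.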

I was disappointed that my graph seemed to be a bit hit or miss on the most recent datapoint for the metrics. graphite-web will initially try to fetch metrics from disk, but it will then query the caches to look for points available that haven't been flushed to disk yet. Clearly, something was wrong with the lookup there.

A review of local_settings.py revealed that I had not updated the instance names for the CARBONLINK_HOSTS, which means that when attempting to determine which cache should hold a metric, the webapp was using the wrong hash ring. This would sometimes work (if the metric name happened to map to the same instance in both schemes, just by luck). So I scrambled off to update this setting, and restarted the server.
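A toy illustration of why the lookup was hit or miss. The `pick` function below is a simplified stand-in for a hash ring lookup, not graphite's actual code, and both sets of instance names are hypothetical. It returns the slot of the chosen instance, so the same physical cache can be compared across the two naming schemes.

```python
import hashlib


def pick(instances, metric):
    """Toy stand-in for a hash ring lookup (an assumption, not carbon's code):
    choose the instance whose hashed name is the first at or above the
    metric's hash, wrapping around to the lowest one."""
    def h(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest()[:8], 16)

    target = h(metric)
    ring = sorted((h(name), index) for index, name in enumerate(instances))
    for position, index in ring:
        if position >= target:
            return index
    return ring[0][1]  # wrap around


relay_names = ["cache-a", "cache-b", "cache-c"]   # names the relay uses
webapp_names = ["cache-1", "cache-2", "cache-3"]  # stale names in the webapp
metrics = [f"jvm.host{i}.heap.used" for i in range(1000)]

# How often do the two rings happen to agree on the slot for a metric?
agree = sum(pick(relay_names, m) == pick(webapp_names, m) for m in metrics)
# With matching names on both sides, every lookup agrees.
same = sum(pick(relay_names, m) == pick(relay_names, m) for m in metrics)
```

When the names differ, agreement is only as good as chance, and every metric where the rings disagree sends the webapp to a cache that never received the point.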

No improvement. What's going on?

My theory is this: graphite-web is using the hash ring to determine which cache should hold the metric. But the carbon-metrics are always local to the cache that produces them. So trying to find them via a hash-ring is going to fail more often than not. Which means that the webapp cannot read those metrics until they have been flushed to disk.