Monday, December 26, 2016

Observations on Write Models

I was recently involved in a protracted discussion of write models, specifically within the context of CQRS and event sourcing.  Some of my observations I was learning on the fly; I want to take some time to express them clearly.

For a write (which is to say, a change of state) to be useful, it must be visible -- without a way to retrieve the state written, the write operation itself might as well be implemented as a no-op.

Three useful ways of implementing a write

1) Write into a shared space.  Examples of this include I/O, writing to a database, writing to shared memory.  The write model makes publicly readable writes.

2) Write into a private space, and provide an interface to repeat those writes to some service provider. Changes to state are locally private, the publishing responsibility lives with the service provider.

3) Write into a private space, and provide an interface to query those writes.  Changes to state are locally private, but we also provide an interface that supports reads.


The middle of these options is just a punt -- "we can solve any problem by adding an extra layer of indirection."

The first approach couples the aggregate to the process boundary -- any write to the aggregate is necessarily tied to the write at the boundary.  This is especially visible in Java, where you are likely to have checked exceptions thrown at the boundary, and bubble up through the interface of the aggregate.

The third option leaves the aggregate decoupled from the boundary: the aggregate tracks changes to the local state, but some other service is responsible for making those changes durable.

In DDD, this other service is normally the "repository"; a simplified template of a command handler will typically look something like
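The original listing is elided here; a minimal, self-contained sketch of that shape might look like the following. All of the names (Counter, Increment, CounterRepository, IncrementHandler) are illustrative, not taken from any particular framework.

```java
import java.util.HashMap;
import java.util.Map;

// The aggregate: the business rule lives here.
class Counter {
    final String id;
    int value;
    Counter(String id, int value) { this.id = id; this.value = value; }
    void increment() { value += 1; }
}

// The command message.
class Increment {
    final String counterId;
    Increment(String counterId) { this.counterId = counterId; }
}

// The repository: stands in for the persistence layer.
class CounterRepository {
    private final Map<String, Counter> store = new HashMap<>();
    Counter getById(String id) { return store.computeIfAbsent(id, k -> new Counter(k, 0)); }
    void save(Counter counter) { store.put(counter.id, counter); }
}

// The simplified command handler template: load, mutate, save.
class IncrementHandler {
    private final CounterRepository repository;
    IncrementHandler(CounterRepository repository) { this.repository = repository; }

    void handle(Increment command) {
        Counter counter = repository.getById(command.counterId); // load the aggregate
        counter.increment();                                     // apply the business rule
        repository.save(counter);                                // make the change durable
    }
}
```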



Of course, within Repository.save() there must be logic to query the aggregate for state, or to pass to the aggregate some state writer callback. For instance, almost all of the event sourcing frameworks I've seen have all aggregates inherit some common base class that tracks changes and supports a query to fetch those changes to write them to an event store.
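A sketch of that common base class pattern, with illustrative names (no specific event sourcing framework is being quoted):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// The common base class: tracks changes so the repository can query them.
abstract class AggregateRoot {
    private final List<Object> pendingChanges = new ArrayList<>();

    protected void recordChange(Object event) { pendingChanges.add(event); }

    // The repository calls this to fetch the changes to write to the event store.
    List<Object> getPendingChanges() { return Collections.unmodifiableList(pendingChanges); }

    void markChangesCommitted() { pendingChanges.clear(); }
}

class Account extends AggregateRoot {
    private int balance = 0;

    void deposit(int amount) {
        if (amount <= 0) throw new IllegalArgumentException("amount must be positive");
        balance += amount;
        recordChange("Deposited:" + amount);  // a real framework would use an event type
    }

    int balance() { return balance; }
}
```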

I have two concerns with this design -- the first problem is that the Repository, which is supposed to be an abstraction of the persistence layer, seems to know rather a lot about the implementation details of the model.  The second problem is that the repository is supposed to
provide the illusion of an in memory collection of all objects of that type...  provide methods to add and remove objects, which will encapsulate the actual insertion or removal of data in the data store.
That illusion, I find, is not a particularly satisfactory one -- sending data across a process boundary is not like storing an object in memory.  Furthermore, in this definition, we're palming a card; we've switched the vocabulary from objects to data (state).  The conversion of one to the other is implied.

What if we wanted to make this conversion explicit -- what might that look like?  At SCNA 2012, Gary Bernhardt spoke of boundaries, and the roles that value types play.  Mark Seemann touched on a similar theme in 2011 -- noting that applications are not object oriented at the process boundary.  So what happens if we adapt our example with that principle in mind?
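The adapted example is elided here; a sketch of the idea, with illustrative names, is that the boundary exchanges State values while the aggregate stays hidden inside the model:

```java
// State: a value type, part of the model's API.
final class State {
    final int value;
    State(int value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof State && ((State) o).value == value;
    }
    @Override public int hashCode() { return value; }
}

// The aggregate: an implementation detail, never visible at the boundary.
class CounterAggregate {
    private int value;
    CounterAggregate(State state) { this.value = state.value; }
    void increment() { value += 1; }
    State toState() { return new State(value); }
}

class WriteModel {
    // Rehydrate the aggregate from a value, apply the rule, return a value.
    State increment(State current) {
        CounterAggregate aggregate = new CounterAggregate(current);
        aggregate.increment();
        return aggregate.toState();
    }
}
```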



State here is a value type - it is part of the API satisfied by the model; in particular, we want that API to be stable between old versions of the model and new versions of the model.  The aggregate -- the object which carries with it the business rules that constrain how the state can change -- is really an implementation detail within the model itself.  Which is to say that in simple cases we may be able to write
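The one-liner is elided here; a sketch of that shape, with illustrative names:

```java
// Rehydrate, apply the rule, project back to a value.
final class Aggregate {
    private final int state;
    Aggregate(int state) { this.state = state; }
    Aggregate increment() { return new Aggregate(state + 1); }
    int state() { return state; }
}

class TerseModel {
    int increment(int state) {
        return new Aggregate(state).increment().state();  // the whole use case in one expression
    }
}
```

A model test then reduces to: given state in, expected state out.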


That's a bit "cute" for my taste, but it does make it really obvious how the automated tests for the model are going to go.

Written out in this style, I tend to think of the model in a slightly different way; the model performs simulations of business use cases on demand, and the application chooses which outcomes to report to the persistence component.

A curiosity that falls out of this design is that one model could support both event persistence and snapshot persistence, leaving the choice up to the application and persistence components.  The model's contribution might look like
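The listing is elided here; a sketch of a model whose simulation result can be read out either way, with illustrative names:

```java
import java.util.List;

// The result of a simulation, readable as events or as a snapshot.
final class ChangeSet {
    final int originalState;       // the starting point, exposed as an immutable value
    final List<String> events;     // what happened during the simulation
    final int currentState;        // where the simulation ended up

    ChangeSet(int originalState, List<String> events, int currentState) {
        this.originalState = originalState;
        this.events = events;
        this.currentState = currentState;
    }

    List<String> toEvents() { return events; }    // for event persistence
    int toSnapshot() { return currentState; }     // for snapshot persistence
}

class SimulationModel {
    ChangeSet increment(int state) {
        return new ChangeSet(state, List.of("Incremented"), state + 1);
    }
}
```

The application and persistence components pick whichever representation suits them; the model doesn't care.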



Another effect of this design that I like is that the tracking of the original state is no longer hidden. If we're going to support "compare and swap persistence", or merge, then we really need to maintain some sense of the starting point, and I'm much happier having that exposed as an immutable value.

Additionally, because the current state isn't lost within the repository abstraction, we can think about resolving merge conflicts in the model, rather than in the application or the persistence component.  In other words, resolving a merge becomes another model simulation, and we can test it without the need to create conflicts via test doubles.

What does all this mean for aggregates?  Well, the persistence component is responsible for writing changes to the book of record, so the aggregate doesn't need to be writing into a public space.  You could have the model pass a writer to the aggregate, where the writer supports a query interface; I don't yet see how that's a win over having the aggregate implementation support queries directly.  So I'm still leaning towards including an interface that returns a representation of the aggregate.

Which means that we could still use the repository illusion, and allow the repository implementation to query that same interface for that representation.  At the whiteboard, I don't like the coupling between the model and persistence; but if it's simpler, and if we understand how to back out the simplification when it no longer meets our needs, then I don't think there's a problem.  It's important not to get locked into a pattern that prevents future innovation.




Sunday, November 6, 2016

A short lesson in unit testing.

I've been reviewing Uncle Bob's essay on Dijkstra's Algorithm, and Greg Young's 2011 Probability Kata.  From this, I've taken away a few sharp lessons.

Test Specifications


It absolutely doesn't matter how you write your test specifications. Ideally, you'd like to maintain flexibility, so the notion of having them in a language-agnostic representation where anybody can work on them has merit.
If you are seeing references to your current domain model objects in your test specifications, you are way too tightly coupled.

API

  • Your test specifications don't talk to your implementations
  • Your test specifications don't talk to your implemented apis
  • Your test specifications talk to your TEST api
More carefully, your test api interprets the test specifications, and adapts them to the system under test.


My current interpretation is that the test api should consist of three stages -- Given/When/Then is a pretty good approximation.  We need to load the system under test in the appropriate initial state (Given), we need to exchange a message with it (When), and we need to run checks on the message that we receive in the exchange (Then).
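A sketch of that three-stage shape: the test api adapts the specification's vocabulary to the system under test, behind a seam. All names here are illustrative.

```java
// The seam: the only thing the test api knows about the implementation.
interface SystemUnderTest {
    void load(String givenState);
    String exchange(String whenMessage);
}

class TestApi {
    private final SystemUnderTest subject;
    private String response;

    TestApi(SystemUnderTest subject) { this.subject = subject; }

    void given(String state) { subject.load(state); }                     // load initial state
    void when(String message) { response = subject.exchange(message); }   // exchange a message
    boolean then(String expected) { return expected.equals(response); }   // check the response
}
```

Swapping implementations means providing a different SystemUnderTest adapter; the given/when/then checks don't change.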

Note that last point - we don't want to be poking around in the internals of anything we are testing; we want any probes to be looking at a stable interpretation of the api we are testing.

Evolution over time

Worth repeating for emphasis - we should not need to change our checks when the implementation that we are measuring changes.  We're crossing a process boundary, so the messages connecting the test implementation to the subject should be DTOs in both directions.  For any given implementation, the DTOs should be stable: they are effectively part of the API of the implementation.

But the test design should allow us to swap out one implementation for another with an entirely different API.  That will mean finding the new way to load instances of the implementation, and a new adapter of the messages from the specification language to the implementation's API.  The adapter also brings the api language back to the language of the specification, and the checks do not change.

Think HTTP: the web server offers a stable interface against which a specification can be written.  The implementation of the resources on the web server adapt the incoming messages, creating new representations that can be exchanged with the domain's API.  The domain responds to the web server, the web server transforms the DTO from the domain model to that which conforms to the specification, and the result is sent back to the client.

So when we are looking at the execution of our specifications, we should be able to find a seam; a point in the execution where the specified inputs (Given, When) are passed to the Test API, and a response is returned that can be checked against Then.




Here in Uncle Bob's code, makePathFinder serves as the seam, although it is an imperfect one. (Editorial note: which is fine. Uncle Bob's kata was about test specifications leading to Dijkstra's algorithm. My kata is about test evolution. Horses for courses.)

The main flaw here is the appearance of the PathFinder type, which is taken from the implementation namespace, rather than the specification namespace. We can't see that from the signature alone, but it becomes clear if we drill deeper into the test implementation


This is a type built from inputs, rather than from outputs, which is a really big hint that something has gone south. Only messages should be crossing the boundary from the solution to the test, but PathFinder is a type with behavior. The abstraction barrier is leaking. We shouldn't need to change our specification execution when the name of the target changes; that's a violation of encapsulation -- we should be able to replace the connection to the real implementation with a lookup table without needing to make any changes in the execution of the specifications.
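A sketch of that lookup-table idea: if only messages cross the seam, a table of canned responses is a drop-in substitute for the real solver, and the execution of the specifications never notices. Names here are illustrative.

```java
import java.util.Map;

// The seam: plain data in, plain data out.
interface ShortestPath {
    String find(String graph, String start, String end);
}

// A lookup table standing in for the real implementation.
class LookupTable implements ShortestPath {
    private final Map<String, String> answers;
    LookupTable(Map<String, String> answers) { this.answers = answers; }

    public String find(String graph, String start, String end) {
        return answers.get(start + "->" + end);  // canned response, no behavior leaks out
    }
}
```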



Sure, you could sexy it up with an immutable message type, or even an EnumMap to ensure that you don't get confused. The point is to keep in mind that you are checking the specification against plain old data - the response crosses the barrier from the system under test to the test, and the test api aligns that response with the representations expected by your specification.

Friday, September 30, 2016

Explicit provenance

I'm working on a rewrite of some existing automation - never where you want to be.  My justification is that the solution we have in place is over fit -- it is deliciously low maintenance on the happy path, but absurd to work with in all other circumstances.  Extending it to meet my current requirements promises to absolutely suck, so I'm going the other way around: leave the existing system untouched, prove out the new approach on the new problem, migrate the old approach as circumstances demand.

One of the problems I'm trying to address: in the current implementation, it's very difficult to see the motivation behind what's going on.  In my new work, I'm making a deliberate effort to write things out long hand.

Making things more interesting is the fact that I've chosen to implement the "human operator override" use cases first.  Motivation: if a human must be involved, that makes the exercise appreciably more expensive.  So smoothing that path out -- in particular, making it as easy as possible to remember how the contingency designs work -- is a priority even though the use is rare.

In a manner of speaking, in the first use case the human operator is acting as a surrogate for the automation to follow.  As this exercise is intended to provide a surface for the user, I start with a single input field and a button with VERB written on it in large friendly letters.

I then proceed to hard code into the handler the remainder of the state required to achieve the desired result.  Test.  Green.  Moving on...  There are two problems with the hard coded state that need to be addressed.  First, it needs to be exposed to give the human operator additional control when required.  I had, in fact, already run into problems with this when I tried to use, as my input, data that was not representative of the happy path.  One simple input field and I had already managed to over fit my solution.

The second is to provide the correct provenance of the data. This begins with meaningful names, but also includes faithfully representing where in the business the data is coming from.  The argument from the input control is a String, but that String is really just a representation of a URL, which provides an address for the projection I'm working with.  But the spelling of the URL is not arbitrary: it in fact expresses within it details about the source that was used to create the projection, or more specifically about the instance of the process that created the source....

With the provenance written out long hand, it becomes much easier to see where the future seams might lie.

It also exposes that the existing process isn't particularly well thought out, that names are not preserved when crossing from one context to another, and quite honestly that they don't align particularly well with the language of the business as we speak it today.

Thursday, September 22, 2016

Set Validation

Vladimir Khorikov wrote recently about enforcing uniqueness constraints, which is the canonical example of set validation.  His essay got me to thinking about validation more generally.

Uniqueness is relatively straightforward in a relational database; you include in your schema a constraint that prevents the introduction of a duplicate entry, and the constraint acts as a guard to protect the invariant in the book of record itself -- which is, after all, where it matters.

But how does it work? The constraint is effective because it blocks the write to the book of record.  In the abstract, the constraint gets tested within the database while the write lock is held; the writes themselves have been serialized and each write in turn needs to be consistent with its predecessors.

If you try to check the constraint before obtaining the write lock, then you have a race; the book of record can be changed by another transaction that is in flight.


Single writer sidesteps this issue by effectively making the write lock private.

With multiple writers, each can check the constraint locally, but you can't prove that the two changes in flight don't conflict with each other.  The good thing is that you don't need to - it's enough to know that the book of record hasn't changed since you checked it.  Logically, each write becomes a compare and swap on the tail pointer of the model history.
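A sketch of that compare-and-swap: the write succeeds only if the book of record is still the exact history the writer validated against. Names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class BookOfRecord {
    private final AtomicReference<List<String>> history =
            new AtomicReference<>(List.of());

    List<String> read() { return history.get(); }

    // "expected" must be the same instance returned by read(); if another
    // writer committed first, the swap fails and the constraint check must
    // be redone against the fresh history.
    boolean compareAndAppend(List<String> expected, String event) {
        List<String> updated = new ArrayList<>(expected);
        updated.add(event);
        return history.compareAndSet(expected, List.copyOf(updated));
    }
}
```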

Of course, the book of record has to trust that the model actually performed the check before attempting the write.

And implementing the check this way isn't particularly satisfactory.  There's not generally a lot of competition for email addresses; unless your problem space is actually assignment of mail boxes, the constraint has generally been taken care of elsewhere.  Introducing write contention (by locking the entire model) to ensure that no duplicate email addresses exist in the book of record isn't particularly satisfactory.

This is already demonstrated by the fact that this problem usually arises after the model has been chopped into aggregates; an aggregate, after all, is an arbitrary boundary drawn within the model in an attempt to avoid unnecessary conflict checks.

But to ensure that the aggregates you are checking haven't changed while waiting for your write to happen?  That requires locking those aggregates for the duration.

To enforce a check across all email addresses, you also have to lock against the creation of new aggregates that might include an address you haven't checked.  Effectively, you have to lock membership in the set.


If you are going to lock the entire set, you might as well take all of those entities and make them part of a single large aggregate.

Greg Young correctly pointed out long ago that there's not a lot of business value at the bottom of this rabbit hole.  If the business will admit that mitigation, rather than prevention, is a cost effective solution, the relaxed constraint will be a lot easier to manage.






Wednesday, August 10, 2016

RESTBucks Client

Where are the agents?

Personally, I never understood how overgrowth can be a good thing: 12,000 APIs means 12,000 different ways of engaging in machine-to-machine interaction. -- Ruben Verborgh
Good to know that other people have trodden this ground before I start ranting.

I was introduced to Robert Zubek's needs-based AI paper recently, which returned my thinking to a problem I've struggled with in understanding hypermedia.

To wit, what does it mean to have a client that is decoupled from the server?

We know from the experience of the web that hypermedia controls are a big part of it -- allowing the server to communicate the available affordances to the client means that the server can change the how as it likes, and the client doesn't need to survey the possibility space, but can instead limit itself to the immediately relevant set.

So we take a look at a hypermedia driven example -- for instance, Jim Webber's RESTBucks.  The domain application protocol looks something like this:

Riddle: what happens to a client coded to search for these controls if the shop decides to comp your order?  What happens when we shop at Model T Coffee?
You can have your coffee any way you like, as long as it's black.
In either case, if the client is focused on following the protocol, it's screwed.  The trap is the same client/server coupling we fled to hypermedia to escape.  Instead, we should be thinking in terms of multiple protocols -- if RESTBucks happens to be offering a take-free-coffee endpoint today, clients that understand how to negotiate it should be able to take advantage of it.

What I'm beginning to see is that there is a choice; instead of teaching a client that a specific extended protocol will produce a goal, we may be able to decompose the protocol into a number of different pieces that can be composed to reach a goal -- if we can manage to describe the affordances in a way that the client can understand, then we can start building complicated protocols out of simpler ones and glue.

Maybe -- we still need a client that can snap together the legos in the right order.

Friday, July 22, 2016

Lessons of Pokémon GO

Chris Furniss
But what Pokémon GO is benefitting from right now is a emergent mentorship that almost completely replaces a traditional tutorial. I would argue that this emergent mentorship is even more powerful than an actual tutorial. Mentorship increases engagement through social bonds. You've probably experienced this already with Pokémon GO: teaching a confused player something about the game that you've figured out already not only makes you feel smart and altruistic, it assuages confusion for the person you've mentored in a way that is highly personalized, and therefore more impactful.
 Interesting to make that behavior part of the interface: "I'm not going to give you a hint, but I will connect you to a couple of nodes in your own social graph that have figured it out."


Friday, June 24, 2016

Data Science in a Nutshell


Generating Type I errors by using second hand data to model third order effects.

The problem with language


We only know how to share language with entities that are capable of pattern recognition.

Friday, June 3, 2016

Session Data

A recent question on stack exchange asked about storing session data as a resource, rather than as part of a cookie.  In the past, I had wondered if the transience of the session data is the problem, but that's really only a distraction from the real issue.

Fielding, in his thesis, wrote
Cookie interaction fails to match REST's model of application state, often resulting in confusion for the typical browser application.
 What kind of confusion is he talking about here?

The first architectural constraint of REST is Client Server (3.4.1).  The client here sends messages to the server there, and gets a message in return.

Riddle - how does the server interpret the messages that it receives?  In a stateless client server architecture, interpreting the client message is trivial, because the message itself is derived from the state of the client application.  Which is to say that the client and server are in agreement, because the client created the message from its local application state, and the message is immutable in transit, and the server has no application context to add to the mix.

When you drop the stateless architectural constraint, and introduce session data, the client and server no longer necessarily understand the message the same way: the client creates the message from its local application state, the message transits unchanged, but the server applies the application context stored in the session when interpreting the message.

In the happy path, there's no real problem: the session data applied by the server is the same data that would otherwise have been included by the client in the message, and the resulting interpretation of the message is equivalent using either architectural style.

Outside the happy path, sessions introduce the possibility that the actual state of the client and the state assumed by the server drift.  The client sends a message, expecting it to achieve desired end A, but because the server's copy of the session data is not in sync, the message is understood to request outcome A-prime.  Furthermore, with the session data "hidden" on the server, you may end up with quite a few mismatched messages going back and forth before the client begins to realize that a disaster is in the making.

The fundamental breakdown is here: the server does not know what the state of the application on the client is.  It can know what messages have been received, it can know what messages have been sent in response, but it doesn't know (without specific correlation identifiers being built into the messages) that the dispatched messages have arrived.

Furthermore
The application state is controlled and stored by the user agent and can be composed of representations from multiple servers. In addition to freeing the server from the scalability problems of storing state, this allows the user to directly manipulate the state (e.g., a Web browser's history), anticipate changes to that state (e.g., link maps and prefetching of representations), and jump from one application to another (e.g., bookmarks and URI-entry dialogs).
Which is to say, the client is explicitly allowed to rewind to any previously available state, cached or reconstructed from its own history, without consulting the server.

The server defines allowed application states, and the hypermedia controls that lead to new application states, but it doesn't control which state in the sea of revealed application states is the current one.

The client, not the server, is the book of record for application state.

Why is it OK to PUT an order for a cup of coffee?  Because in that case, all of the ambiguity is in the state of the resource, not the state of the message.  The client and the server both share the same precise understanding of what the message means, because all of the required context appears within the message itself.  The client rewinds to some previous state, and fires off an obsolete message, and the server is able to respond "I know precisely what you mean, but you aren't allowed to do that any more".  There's no confusion here; the state of the resource has simply evolved since the hypermedia control was described.

So long as everybody agrees what the message means, there is no confusion.  That's why the happy path looks OK -- if the client and the server are still sharing the same assumptions about message context, the confusion doesn't arise; the client and the server happen to mean the same thing by fortuitous accident, and it all "just works".  Until it doesn't.



 


Friday, May 27, 2016

The name of the URI is called....

I just found myself proposing the following "RESTful" URI:


/userStories?asA=emailScammer&iWantTo=mineEmailAddresses&soThat=iCanBroadcastMoreSpam 


I'm not sure I was kidding.

Followup: I used query parameters; HTML is a successful implementation of a media-type which supports hypermedia controls.  It's an ideal reference implementation for illustrating RESTful principles.

But it is not without its limitations.  URI Templates offer a lot of flexibility; web forms -- not so much.  You expose query parameters, or you force the client to traverse a graph to find the control that they want.

Digging around for URI design guidelines, I found this summary by K. Alan Bates

Hierarchical data is supposed to be represented on the path and with path parameters. Non-hierarchical data is supposed to be represented in the query. The fragment is more complicated, because its semantics depend specifically upon the media type of the representation being requested.

Moving the parameters of the story out of the query string would be preferred, because it correctly represents that this is the known identification of a resource, rather than a specification for a resource that may not have any matches.  Furthermore, doing so better conforms to the convention that path segments represent hierarchy.  For instance, it's reasonable to suppose that the URI for the story card ought to look like:


/userStories/asA=emailScammer&iWantTo=mineEmailAddresses&soThat=iCanBroadcastMoreSpam/card

The design guidelines in the RESTful Web Services Cookbook suggest an improvement on the previous design...

Use the comma (,) and semicolon (;) to indicate nonhierarchical elements in the path portion of the URI.
Richardson & Ruby, in Chapter 5 of RESTful Web Services were a bit more specific
I recommend using commas when the order of the scoping information is important, and semicolons when the order doesn't matter.
I'm not actually sure if the template of the user story should be considered ordered or not.  There are a lot of references to the Cohn template, a few that point out that maybe the business value should get pride of place.

Me?  I'm going to represent that the ordering of these elements matters, because that allows me to use the delimiter that looks like the punctuation on the story card itself

/userStories/asA=emailScammer,iWantTo=mineEmailAddresses,soThat=iCanBroadcastMoreSpam/card


Better, but I don't like equals, and none of the other sub-delims improve the outcome.  Today, I learned that RFC 3986 offers me an out -- the production rules for path segments explicitly include colon!  

/userStories/asA:emailScammer,iWantTo:mineEmailAddresses,soThat:iCanBroadcastMoreSpam/card

In all, an educational exercise.  Didn't learn whether or not I was kidding.

Wednesday, April 20, 2016

Shopping Carts and the Book of Record.

If I'm shopping for books on Amazon, I can look at my shopping list, click on a book, and have it added to my shopping cart.  For some items, Amazon will decline to add the item to my cart, and inform me, perhaps, that the item is no longer available.

At the grocery store, no matter how many times I click on the shopping list, the Fruit Loops don't appear in my cart.  I have to place the box in my cart by hand, next to the salad dressing that my phone says I can't put in the cart because it has been discontinued, and the milk that I can't put in my cart because it has expired.


If creating a user isn't a lot more fun than sending a command message to an aggregate, you are doing it wrong.


We often want representations of entities that we don't control, because the approximation we get by querying our representations is close enough to the answer we would get by going off to inspect the entities in question, while being much more convenient.

But if the entities aren't under our control, we have no business sending commands to the representations.  Our representations don't have veto power over the book of record.

Aggregates only make sense when your domain model is the book of record.





Which means that you have no ability to enforce an invariant outside of the book of record.  You can only query the available history, detect inconsistencies, and perhaps initiate a remediation process.



On Read Models

I learned something new about read models today.

Most discussions I have found so far emphasize the separation of the read model(s) from the write model. 

For example, in an event sourced solution, the write model will update the event history in the book of record.  Asynchronously, new events are read out of the book of record, and published.  Event handlers process these new events, and produce updated projections.  The read model answers queries using the most recently published projection.  Because we are freed from the constraints of the write model, we can store the data we need in whatever format gives us the best performance; reads are "fast".

But by the time the read models can access the data, the data is old -- there's always going to be some latency between "then" and "now".   In this example, we've had to wait for the events to be published (separately from the write), and then for the event handlers to consume them, construct the new projections, and store them.

What if we need the data to be younger?

A possible answer: have the read models pull the events from the book of record, and consume them directly.  It's not free to do this -- the read model has to do its own processing, which adds to its own latency.  The book of record is doing more work (answering more queries), which may make your writes slower, and so on.

But it's a choice you can make; selecting which parts of the system get the freshest data, and which parts of the system can trade off freshness for other benefits.

Example

In some use cases, after handling a command, you will want to refresh the client with a projection that includes the effect of the command that just completed; think Post/Redirect/Get.

In an event sourced solution, one option is to return, from the command, the version number of the aggregate that was just updated.  This version number becomes part of the query used to refresh the view.

In handling the query, you compare the version number in the query with that used to generate the projection.  If the projection was assembled from a history that includes that version, you're done -- just return the existing projection.

But when the query is describing a version that is still in the future (from the perspective of the projection), one option is to suspend the query until history "catches up", and a new projection is assembled.  An alternative approach is to query the book of record directly, to retrieve the missing history and update a copy of the projection "by hand".  More work, more contention, less latency.
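The version check described above can be sketched as follows; the names and shapes are illustrative, and a real system would suspend, poll, or patch the projection rather than simply returning null.

```java
// The read side remembers which version of history its projection was built from.
class ProjectionStore {
    private String projection = "empty";
    private long builtFromVersion = 0;

    void publish(String newProjection, long version) {
        projection = newProjection;
        builtFromVersion = version;
    }

    // The query carries the version the client just wrote; return the
    // projection only if history has caught up to it.
    String query(long requiredVersion) {
        return builtFromVersion >= requiredVersion ? projection : null;
    }
}
```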

If the client is sophisticated enough to be carrying a local copy of the model, it can apply its own changes to that model, provisionally; reconciling the model when the next updates arrive from the server.  That supports the illusion of low latency without creating additional work for the book of record (but might involve later reconciliation).

Tuesday, February 9, 2016

Event Sourcing: on Event Handlers

One of the things I've been doing in my toy "study" problem, has been to implement an in memory event store.  That means no persistence, per se, but all of the block and tackle of getting data to move from the "write model" to the "read model".

In particular, I've been taking pains to ensure that the asynchronous points in the data transfer are modeled that way -- I'm using a DirectExecutorService to run the asynchronous tasks, but I want to make sure that I'm getting them "right".

So, for this toy event store, I use the streamIds as keys to a hash; the object that comes out is a description of the stream, including a complete list of the events in that stream.  Each write is implemented as a task submitted to the executor service, which uses a lock to ensure that only one thread writes to the event store at a time.  The commit method replaces a volatile reference to the hash with a reference to an updated copy, producing an atomic commit.  As the toy problem has very forgiving SLAs, writes are not merely appends to the stream, but actually check for conflicts, duplication, and so on.
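Condensed to its essentials, that commit mechanism might look like the following sketch. The class and method names are inventions, and the executor plumbing and the richer conflict checks are elided:

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

// Toy in-memory event store: a volatile reference to an immutable map,
// swapped on commit. One writer at a time; readers never take the lock.
class ToyEventStore {
    private final ReentrantLock writeLock = new ReentrantLock();
    private volatile Map<String, List<String>> streams = new HashMap<>();

    void append(String streamId, List<String> events, long expectedVersion) {
        writeLock.lock();   // only one thread writes to the store at a time
        try {
            List<String> history = streams.getOrDefault(streamId, Collections.emptyList());
            if (history.size() != expectedVersion) {
                throw new IllegalStateException("conflict on " + streamId);
            }
            // copy-on-write: readers never observe a partially updated map
            Map<String, List<String>> next = new HashMap<>(streams);
            List<String> updated = new ArrayList<>(history);
            updated.addAll(events);
            next.put(streamId, Collections.unmodifiableList(updated));
            streams = next;   // the atomic commit: a single volatile write
        } finally {
            writeLock.unlock();
        }
    }

    List<String> read(String streamId) {
        return streams.getOrDefault(streamId, Collections.emptyList());
    }
}
```

Readers see either the old map or the new one, never an intermediate state, because the volatile write is the commit point.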

Riddle: how to now update the read model.  The transaction is the write to the volatile memory location, and if that part succeeds the client should be informed.  So we really can't do any sort of synchronous notification of the read model.  Instead, another task is scheduled to perform the update.

What should that task do?  Pub/sub!  Which is right, but deceptively so.  The basic idea is fine - we're going to asynchronously dispatch a message to an event queue, and all the subscribers will pick up that update and react.

What's the message, though?  I had been thinking that we could just enumerate the events, or possibly the collection of events, but that makes a mess on the downstream side.  The two basic issues being (a) the broadcast is asynchronous, so you really need the message handling to be idempotent, and (b) being asynchronous, the messages can arrive out of order.

Which means that simply publishing each of the domain events onto an asynchronous bus is going to make a mess of the event handlers, which all need a bunch of sequencing logic to repair the inevitable ordering edge cases.

Too much work.

The key clue is that the event sourced projections, process managers, and so on aren't really interested in a stream of events, so much as they are interested in a sequence of events.  That sequence already exists in the write model, so the key idea is to not screw it up; we should be pushing/polling for updates to the sequence, rather than trying to track things at the level of the individual domain events.

The answer is to think in terms of publishing the cursor position for each stream.

In the write model, we push the events to the store as before.  But we keep track of the positions in the stream that we have just written.  After the transaction has been committed, we schedule an asynchronous task to push an event describing the new cursor position to the pub/sub system.  Each event handler subscribes to that queue, and on each message compares the cursor position to its own high water mark; if there is further progress to be made, the handler fetches an ordered subsequence of the events from the stream.
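The handler side of that scheme is small enough to sketch directly. The names here (CursorHandler, onCursorAdvanced) are invented, and the reader function stands in for a real query against the event stream:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of an event handler driven by cursor-position messages rather
// than by individual domain events.
class CursorHandler {
    private long highWaterMark = 0;   // position this handler has processed

    // fetches the ordered events in (fromExclusive, toInclusive] from the stream
    private final BiFunction<Long, Long, List<String>> reader;

    CursorHandler(BiFunction<Long, Long, List<String>> reader) {
        this.reader = reader;
    }

    // Stale or duplicate cursor updates fall at or below the high water
    // mark and are simply dropped, so handling is naturally idempotent
    // and tolerant of out-of-order delivery.
    List<String> onCursorAdvanced(long newPosition) {
        if (newPosition <= highWaterMark) {
            return Collections.emptyList();   // already caught up; drop it
        }
        List<String> missing = reader.apply(highWaterMark, newPosition);
        highWaterMark = newPosition;
        return missing;   // the ordered subsequence to project
    }
}
```

A real handler would persist its high water mark alongside its projection, so that replay after a crash resumes from the right place.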

A potentially interesting byproduct of this idea: the write can return the cursor position to the caller, which can then use that position to rebuild its next view.   A reader that knows the specific position that it is waiting for can block until the read model has been updated to that point.

Because each of the event handlers is tracking its own high water mark, the cursor update messages are trivial to handle idempotently; the incorrectly ordered update messages are trivial to recognize and drop.


Monday, February 8, 2016

More thoughts on REST

Continuing to explore REST...

Lesson: REST is not about building nice little web applications like blogs, or Amazon, or Google.  REST is about building web scale applications like...

the Web.

Lesson: the Web is already web scale.  If you are building your application on the back of HTTP, all you need to do is not screw it up...

for example, by replacing text/html with some other media type that doesn't include controls.  Oops.

Lesson: a friend of mine teaches his students that, when they are asked to implement a protocol, their first step should be to obtain the appropriate state machine compiler.  In reverse, if an expert writes a thesis on Representational STATE Transfer, you should be thinking about how that maps to your application protocols...

and maybe not so much mapping it to your data model.  Or your persistence layer.

Lesson: HTML is a perfectly cromulent media-type.  GET and POST will serve as hypermedia controls.  application/x-www-form-urlencoded is a little bit clumsy for hierarchical data, but can be made to serve.

So it should be straightforward to use a browser to navigate your application protocol?

Lesson: the modern web works.  You point your browser at a url, and the browser downloads a bunch of JavaScript that renders a single page application, and starts pinging some JSON API and using the responses to update the DOM....  That's REST.  Even if those JSON endpoints aren't actually providing hypermedia controls - neither do images in html documents.


Friday, January 22, 2016

REST: on Resources

I've been trying to make more progress with REST by reviewing the questions that other programmers are asking about it.

Sidestepping for the moment the whole riddle of hypermedia representations, the most common issue seems to come about when the programmer assumes that an entity in the domain must have a single URI that does everything with nothing but Four Verbs and the Truth.

The internal mapping seems to be that entities are nouns, and resources are nouns, and therefore entities are resources -- since the URI is "the" identifier for the resource, it must also be "the" identifier for the entity, and suddenly everything looks like a nail.

Clearly, identifiers should be 1-to-1 with resources -- you really need to pervert the definition of "uniform resource identifier" to reach any other conclusion.  Although, it turns out that is something that we had to learn over time.

For instance, in 1999, the definition of the PUT method described resources this way:
A single resource MAY be identified by many different URIs. For example, an article might have a URI for identifying "the current version" which is separate from the URI identifying each particular version. In this case, a PUT request on a general URI might result in several other URIs being defined by the origin server.
In 2014, the definition of the PUT method changed:
A PUT request applied to the target resource can have side effects on other resources.  For example, an article might have a URI for identifying "the current version" (a resource) that is separate from the URIs identifying each particular version (different resources that at one point shared the same state as the current version resource).  A successful PUT request on "the current version" URI might therefore create a new version resource in addition to changing the state of the target resource, and might also cause links to be added between the related resources.

The latter interpretation allows you to sidestep one of the complicating issues with interpreting a PUT method -- the body is supposed to be a replacement for the resource, but nothing in "the rules" prevents you from inventing a resource with the specific purpose of being replaced.

Jim Webber hinted at the same thing:
You should expect to have many many more resources in your integration domain than you do business objects in your business domain.
I think the CQRS language helps here -- from our event history, we build lots of different projections that are each especially suitable to a particular use case.

Another idea that helped clear things up for me is that a view of an immutable resource can be cached.  Trying to cache a projection of a mutable entity brings you face to face with one of the two hard problems, but caching the immutable history of a mutable entity is fine.  In other words, we have different resources that describe the state of an entity at different times.  Couple that with the idea that there are many ways of expressing a moment in time, each of which maps to a different resource, and you get an explosion of possible resources that you can exploit.
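One way to act on that observation is to let the shape of the URI decide the caching policy: a URI that pins a specific version names an immutable resource and can be cached indefinitely, while the "current version" URI cannot. The URI scheme below is invented purely for illustration:

```java
// Hypothetical cache policy keyed on whether the URI pins a version.
// A version-pinned resource never changes, so clients and intermediaries
// may cache it forever; the "current" resource must be revalidated.
class CachePolicy {
    static String cacheControlFor(String uri) {
        // e.g. /articles/42/versions/7 -> immutable; /articles/42 -> mutable
        if (uri.matches(".*/versions/\\d+$")) {
            return "max-age=31536000, immutable";
        }
        return "no-cache";
    }
}
```

The same projection code can serve both URIs; only the cache headers differ.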



Friday, January 8, 2016

Coordinated Entities

Question: how do you decide if two different entities belong in the same aggregate?

I've been puzzling over this for a while now, looking for the right sorts of heuristics to apply.

The book answer is straightforward, without being useful. The aggregate boundary encompasses all of the state required to maintain the business invariant. So if you know what the business invariant is, then the problem is easy. You start with an aggregate of a single entity, then you fold in all of the business rules that reference the state of the entity, then you fold in all of the entities touched by those rules, and then fold in more rules... it's turtles until you reach a steady state. Then that aggregate, at least, is complete. You set it aside, pick a new entity, and repeat the process until all the entities in the domain have been assigned to an aggregate.

In any interesting problem space, the invariant is not so clearly defined. Most of the discussions describing the evolution of a model talk about the discovery that the model is missing some element of the Ubiquitous Language, and that inspires someone to recognize why some use case has been broken, or incredibly difficult to implement. Or that the Ubiquitous Language has actually been missing some important concept that -- once expressed -- brings new clarity to the actual requirements of the business. Most of the refactoring exercises I have seen have described cases where entities were too tightly coupled; contention between unrelated entities was making the system harder to use.

Lesson I learned today:

Thinking about the happy path doesn't inform anything. Any composition of the objects will do when the command history never violates any business rules. The interesting cases are partial failures.

Contention, as noted previously, is a primary pressure to separate entities. Commands are being applied to different entities, where there should be no interplay between the affected states. Yet if both commands are being run through the same aggregate root, then one otherwise satisfactory command will fail because it happened to be trying to commit after a different command has already advanced the history of the aggregate. This is a failure of interference between uncoordinated commands. The inverse problem arises when two coordinated commands are broadcast to separate entities, and one command succeeds while the other fails.

Thought experiment: suppose that we were to model these two entities in separate aggregates, so that they are participating in different transactions. What would this coordination failure look like in the event stream? Well, you would be watching the events go by, and you would see the history of the successful command, and then you would wait, and wait, and you would never see the history from the other aggregate.

Let's put a scope on it: we have a coordination contingency if some specified amount of time passes without seeing the missing history. That we are watching the event history, and thinking about the passage of time, announces at once that we are considering a process manager, which is an entity that implements a state machine. Within its own transactions, a process manager will emit events describing the changes to the state machine, asynchronously schedule calls to itself (a time trigger), and perhaps dispatch asynchronous commands to the domain model.

There's some block and tackle to be done at this point -- the process manager is an entity in its own right, and we need to be sure that the observed events are dispatched to "the right one". We're going to need some metadata in the events to ensure that they are all going to the right destination.

Back to our experiment; the history of the first command arrives. We load a process manager and pass the event to it. The process manager uses its SLA to schedule a message to itself at some time in the future. Time passes; the scheduled message is delivered. The process manager fires the timeout trigger into its state machine, arrives at the Contingency state, and writes that event into the log.
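The state machine in this experiment is small enough to sketch. The state names and method names below are inventions; in a real system the timeout would be a message scheduled through the infrastructure, and each transition would be recorded as an event in the process manager's own stream:

```java
// Sketch of the coordination process manager described above.
class CoordinationManager {
    enum State { AWAITING_BOTH, AWAITING_SECOND, COMPLETE, CONTINGENCY }

    private State state = State.AWAITING_BOTH;

    State state() { return state; }

    // The history of the first coordinated command arrives.
    void onFirstHistoryObserved() {
        if (state == State.AWAITING_BOTH) {
            state = State.AWAITING_SECOND;
            // here we would schedule a message to ourselves at now + SLA
        }
    }

    // The history of the second coordinated command arrives in time.
    void onSecondHistoryObserved() {
        if (state == State.AWAITING_SECOND) state = State.COMPLETE;
    }

    // The scheduled message fires after the SLA has elapsed.
    void onTimeout() {
        if (state == State.AWAITING_SECOND) {
            state = State.CONTINGENCY;   // emit a Contingency event here
        }
    }
}
```

Note that a late-arriving timeout is a no-op once the second history has been observed, which keeps the scheduled message safe to deliver unconditionally.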

How does that help?

It gives us something to look for in the Ubiquitous Language. If the coordinated entities really do belong in separate aggregates, then this contingency is a thing that really happens in the business, and so somebody should know about it, know the requirements for mitigating the contingency, what events should appear in the log to track the mitigation progress, and so on.

On the other hand, if the domain experts begin saying "that can't happen", "that MUST NOT happen", "that is too expensive when it happens, which is why we have you writing software to prevent it", and so forth, then that is strong evidence that the two entities in question need to be modeled as part of the same aggregate.