Cascade Faliure: Event Sourcing: Lessons on failure, part one.

I've written a couple of solo projects using "event sourcing", failing miserably at them because I failed to properly understand how to properly apply that pattern to the problem I was attempting to solve.

Part 1: Fantasy Draft Automation

I came into event sourcing somewhat sideways - I had first discovered the LMAX disruptor around March of 2013. That gave me my entry into the idea that state could be message driven. I decided, after some reading and experimenting, that a message driven approach could be used to implement a tool I needed for my fantasy baseball draft.

My preparation for the draft was relatively straight forward - I would attempt to build many ranked lists of players that I was potentially interested in claiming for my team, and during the draft I would look at these lists, filtering out the players that had already been selected.

So what I needed was a simple way to track lists of all of the players that had already been drafted, so that they could be excluded from my lists. Easy.

Failure #1: Scope creep

My real ambition for this system was that it would support all of the owners, including helping them to track what was going on in the draft while they were away. So web pages, and twitter, and atom feeds, and REST, and so on.

Getting all of this support right requires being able to accurately report on all of the players who were drafted. Which in turn means managing a database of players, and keeping it up to date when somebody chooses to draft a player that I hadn't been tracking, and dealing with the variations in spellings, and the fact that players change names and so on.

But for MVP, I didn't need this grief. I had already uniquely identified all of the players that I was specifically interested in. I just needed to keep track of those players; so long as I had all of the people I was considering in the player registry, and could track which of those had been taken (no need to worry about order, and I was tracking my own choices separately anyway).

Failure #2: Where is the book of record?

A second place where I failed was in understanding that my system wasn't the book of record for the actions of the draft. I should have noticed that we had been drafting for years without this database. And over the years we've worked out protocols for unwinding duplicated picks, and resolving ambiguity.

What I was really doing was caching outcomes from the real world process into my system. In other words, I should have been thinking of my inputs as a stream of events, not commands, and arranging for the system to detect and warn about conflicts, rather than rejecting messages that would introduce a conflict.

There was no particular urgency about matching picks with identifiers of players in the registry, or in registering players who were not part of the registry. All of that math could be delayed a hundred milliseconds without anybody noticing.

Failure #3: Temporal queries

The constraints that the system with were trying to enforce the rules that only players in the player registry could be selected, and that each player in the registry could only be selected once. In addition to the fact that wasn't the responsibility of the system, it was complicated by the fact that the player registry wasn't static.

Because I was trying to track the draft faithfully (not realizing until later that doing so wasn't strictly necessary for my use case), I would stop the program when my registry had a data error. The registry itself was just dumb bytes on disk; any query I ran against the database was a query against "now". So changing the entries in the registry would change the behavior of my program during "replay".

Failure #4: Compatibility

Somewhat related to the above - I wasn't always being careful to ensure that the domain logic was backwards compatible with the app that wrote the messages, nor did my message journal have any explicit markers in it to track when message traffic should switch to the new handlers.

So old messages would break, or do something new, screwing up the replay until I went into the "immutable" journal to fix the input errors by hand.

Failure #5: Messages

My message schemas, such as they were, were just single lines of text - really just a transcript of what I was typing at the interactive shell. And my typing sucks, so I was deliberately making choices to minimize typing. Which again made it harder to support change.

Cascade Faliure

Wednesday, October 10, 2018

Event Sourcing: Lessons on failure, part one.