Friday, October 12, 2018

Event Sourcing: lessons in failure, part 2

I've written a couple of solo projects using "event sourcing", failing miserably at them because I didn't properly understand how to apply that pattern to the problem I was attempting to solve.

Part 2: Fantasy League Scoring

Our fantasy league used a bespoke scoring system, so I decided to try my hand at creating a report for myself to track how every player in baseball was doing.  This gave me extra insights about how I might improve my team by replacing players in the middle of the season.

And to a large extent, it worked - I was able to pick up a number of useful pieces that would otherwise have slipped under the radar, and I turned over my team much more aggressively than I would have otherwise.

It was still pretty much a crap shoot -- past performance does not promise future results.  But it did have the benefit of keeping me more engaged.

Failure #1: Where are the events?

Again, I had a book of record issue - the events were things happening in the real world, and I didn't have direct access to them.  What I had was a sort of proxy - after a game ended, a log of the game would become available.  So I could take that log, transform it into events, and then proceed happily.

Well, to be honest, that approach is pretty drunk.

The problem is that the game log isn't a stream of events, it is a projection.  Taking what is effectively a snapshot and decomposing it reverses cause and effect.  There were two particular ways that this problem would be exposed in practice.
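
To make that concrete, here is roughly the kind of decomposition I was doing.  The payload shape and the names are invented for illustration, not lifted from the actual code:

    # A snapshot (the game log) decomposed into derived "events".  These
    # records are re-computable artifacts of somebody else's projection,
    # not facts that I observed directly.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ScoringEvent:
        game_id: str
        player_id: str
        stat: str      # e.g. "HR", "SB"
        value: int

    def derive_events(game_log: dict) -> list:
        """Decompose a game-log snapshot into per-player scoring records."""
        events = []
        for line in game_log["batting_lines"]:
            for stat, value in line["stats"].items():
                if value:
                    events.append(ScoringEvent(
                        game_id=game_log["game_id"],
                        player_id=line["player_id"],
                        stat=stat,
                        value=value,
                    ))
        return events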

First, it would sometimes happen that the logs themselves would go away.  Not permanently, usually, but at least for a time.  Alternatively, they might arrive later than expected.  And so it would happen that data would get lost - because a stale copy of a projection was delivered instead of a live one.  Again, the projections aren't my data; they are cached copies of somebody else's data that I might use in my own computations.

Second, when these projections appear, they aren't immutable.  That's a reflection of both problems with data entry upstream (a typo needs to be fixed), and also the fact that within the domain, the interpretation of the facts can change over time -- the human official scorers will sometimes reverse an earlier decision.
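
One way to cope with that - sketched here with invented names, not the code I actually had - is to never overwrite a fetched log, and instead keep every version along with when it was retrieved:

    import hashlib
    import json
    from datetime import datetime, timezone

    def store_log_version(store: dict, game_id: str, payload: dict) -> bool:
        """Append a new version of a game log only when its content changes.

        Returns True when the fetched payload differs from the latest stored
        version, signalling that scores derived from it need revisiting.
        """
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        versions = store.setdefault(game_id, [])
        if versions and versions[-1]["digest"] == digest:
            return False  # unchanged; the official record is stable so far
        versions.append({
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "digest": digest,
            "payload": payload,
        })
        return True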

In other words, for what I was doing, events didn't gain me anything over just copying the data into an RDBMS, or for that matter writing the data updates into a git repository on disk.

Failure #2: Caching

The entire data processing engine depended on representations of data that changed on a slow cadence (once a day, typically), and I wasn't tracking any metadata about how fresh the data was, how stable it ought to be, whether the new data in question was a regression from what had been seen earlier, and so on.
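
The metadata I should have been keeping is not complicated; even something along these lines (a sketch, with invented names) would have been a start:

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone
    from typing import Optional

    @dataclass
    class CachedRepresentation:
        url: str
        body: bytes
        fetched_at: datetime
        etag: Optional[str] = None
        last_modified: Optional[str] = None

        def is_stale(self, max_age: timedelta = timedelta(hours=24)) -> bool:
            """A once-a-day source is treated as stale after roughly a day."""
            return datetime.now(timezone.utc) - self.fetched_at > max_age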

In an autonomous system, this is effectively a sort of background task - managing local static copies of remote data.

To make this even more embarrassing: I was of course downloading this data from the web, and the HTTP specification has a lot to say about caching that I didn't even consider.
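
For instance, a conditional GET that honours the validators the server already provides would have saved a lot of redundant downloading.  A sketch (using the requests library; the cache entry shape is invented), not what I actually wrote:

    from datetime import datetime, timezone
    from typing import Optional

    import requests

    def fetch_if_changed(url: str, cached: Optional[dict] = None) -> dict:
        """Re-fetch a page only if the origin says our copy is out of date.

        `cached` is a dict like {"body": ..., "etag": ..., "last_modified": ...}
        from a previous fetch, or None on the first request.
        """
        headers = {}
        if cached:
            if cached.get("etag"):
                headers["If-None-Match"] = cached["etag"]
            if cached.get("last_modified"):
                headers["If-Modified-Since"] = cached["last_modified"]
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 304:
            return cached  # origin confirms our copy is still current
        response.raise_for_status()
        return {
            "body": response.content,
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
            "fetched_at": datetime.now(timezone.utc).isoformat(),
        }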

(I also failed to consider the advantages I might get from using a headless browser, rather than just an HTML parser.  This bit me hard, and contributed significantly to my abandoning the project.)

Failure #3: What's missing?

The process I was working from only considered logs that were available; there was no monitoring of logs that might be missing, or that might have been removed.  This introduced small errors in data comparisons.

I needed to be able to distinguish "here are Bob Shortstop's 6 scores from last week" from "here are Bob Shortstop's 5 scores from last week, and there is a game unaccounted for".
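
The fix is to carry the schedule alongside the logs, so that a missing game is represented explicitly rather than silently absent.  A sketch, with invented names:

    def weekly_report(scheduled_games: set, logged_games: set, scores: dict) -> dict:
        """Report the scores we have, plus the games still unaccounted for.

        scheduled_games: game ids the player was expected to appear in
        logged_games:    game ids for which a log has actually been processed
        scores:          game id -> fantasy score derived from that log
        """
        missing = scheduled_games - logged_games
        return {
            "scores": [scores[g] for g in sorted(scheduled_games & logged_games)],
            "missing_games": sorted(missing),
            "complete": not missing,
        }

With something like that in place, "five scores and a game unaccounted for" shows up as a report with one entry in missing_games, instead of looking identical to an ordinary five-game week.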

Again, I was thinking of events as things that happen, rather than as a way of representing state over time.

Failure #4: Process telemetry

What I wanted, when the wheels were done turning, was the collection of reports at the end.  And that meant that I wasn't paying enough attention to the processes I was running.  Deciding when to check for updated data, on what grain, and orchestrating the consequences of changes to the fetched representations was the real work, and instead I was thinking of that as just "update files on disk".  I didn't have any reports I could look at to see if things were working.

Again, everything was flat, everything was now; there was no indication of when data had appeared, or of which representation was the source of a particular bit of data.
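
Even a small provenance record per refresh would have gone a long way.  A sketch, with invented names:

    import logging
    from datetime import datetime, timezone

    logger = logging.getLogger("fantasy.refresh")

    def record_refresh(source_url: str, changed: bool, derived_rows: int) -> dict:
        """Capture one small telemetry record per refresh of a cached page."""
        entry = {
            "source_url": source_url,
            "observed_at": datetime.now(timezone.utc).isoformat(),
            "changed": changed,
            "derived_rows": derived_rows,
        }
        logger.info("refresh %s changed=%s rows=%s",
                    source_url, changed, derived_rows)
        return entry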

Solving the Wrong Problem

In effect, what happened is that I would throw away all of my "events" every morning, then regenerate them all from the updated copies of the data in the cache.  If all of your events are disposable, then something is going badly wrong.

The interesting things to keep track of were all related to the process, and to the discovery that I wanted to refresh the caches slowly, but in a particular priority order.

What I should have been looking toward was Rinat's model of a process manager: how would I support a dashboard showing a list of decisions to be made?  Could I then capture the priorities of the "domain expert" and automate the work?  Could I capture time as a first-class concern driving the elements of the system forward?
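
The heart of that is small: a list of pending refresh decisions, each with a priority and a due time, surfaced on a dashboard instead of buried in "update files on disk".  A sketch, with invented names:

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import List, Optional

    @dataclass
    class RefreshDecision:
        source_url: str
        due_at: datetime
        priority: int      # lower number = more urgent when several are due

    def pending_decisions(decisions: List[RefreshDecision],
                          now: Optional[datetime] = None) -> List[RefreshDecision]:
        """What a dashboard would show: refreshes that are due, most urgent first."""
        now = now or datetime.now(timezone.utc)
        due = [d for d in decisions if d.due_at <= now]
        return sorted(due, key=lambda d: (d.priority, d.due_at))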

Part of the reason I missed this is that I had too much time -- I was deliberately hobbling the fetch of the data, which meant that the cost of redoing all of the work was lost in the noise.  On the other hand, that doubly emphasizes the point that all of the value-add was in the bookkeeping, which I never addressed.

Key Question:

Did I need temporal queries?  No.
