Saturday, October 20, 2018

Wumpus RNG and the functional core

I've been experimenting with refactoring the wumpus, and observed an interesting interaction with the functional core.

For illustration, let's consider the shooting of an arrow.  After fighting a bit with the console, the player submits a firing solution.  This, can pass along to the game module, but there are two additional twists to consider. 

First: if the firing solution requires a tunnel between rooms that doesn't exist, the arrow instead travels randomly -- and might end up bouncing several more times. 

Second: if the arrow should miss, the wumpus may move to another room.  Again, randomly.

In my refactorings, the basic shape of the method is a function of three arguments: the firing solution, and some source of "random" numbers.  At this abstraction level, shooting the arrow looks to be atomic, but in some scenarios the random numbers are consumed.

I took some time today to sketch the state machine, and a slightly different picture emerges.  The game is Hunting, and invoking the shootArrow method takes the game to some other state; an incorrectly specified firing solution takes us to a BouncingArrow state, and invoking the bounceArrow method.  When the arrow has finished bouncing, we may end up in the ArrowMissed state, in which case we wakeWumpus, and discover whether or not the hunter has met his end.



The internal core knows nothing about sources of data - no command line, no random numbers, just computing transitions from one state to the next. This means that it is very easy to test, everything just sits in memory. The orchestration logic gets to worry about where the data comes from, but is completely decoupled from the core logic of the game.

Wednesday, October 17, 2018

Thought on Mocks

Based on a discussion with @rglover.
Any veterans have a clear reason why not to use mocks?
"Mocks" are fun, because without a shared context you can't always be certain that the other side of the debate understands your arguments the way that you intend them.  Is the conversation specifically about behavior verification ? about all flavors of test doubles ? Are we talking about the test scaffolding that developers use to drive their designs, or the tests we run at choke points to prevent mistakes from propagating to the next stage of the pipeline?

Both Katrina Owen and Sandi Metz  share a model of what should be asserted by tests; incoming commands and queries are checked by asserting the result returned by a query of the system under test; but checking outgoing commands is tricker.

So let's assume for the moment that last problem is the one we are trying to solve.


When I was learning poker, John Finn and Frank Wallace introduced me to the concept of investment odds: the key idea being that our play should be based on a calculus derived from
  • the payoff
  • the probability of success
  • the cost to play
and a comparison of these values against the alternative plays available.


Now, let's look to the Three Laws of TDD, as described by Robert Martin.
You are not allowed to write any production code unless it is to make a failing unit test pass.
If we were to accept this discipline, then we would absolutely need to be able to satisfy a failing check.  That would mean that either we would need to include in the composition of the system under tests the real, production-grade collaborator, or we need to fake it.  Or we can tell Uncle Bob to get stuffed, and write the code we need.

Does the mock give us the best investment odds?

That's the first question, and the answer depends on circumstance.  Simply writing the code requires no additional investment, but does nothing to mitigate risk.

Using the real collaborator gives the most accurate results.  But that's not without cost - you'll pay for latency and instability in the collaborator.  For collaborators within the functional core, that may be acceptable - but the odds of it becoming unacceptable increase if the collaborator is coupled to the imperative shell.  Latency at an automated choke point may be acceptable, but paying that tax during the development cycle is not ideal.  Unstable tests incur investigation costs if the results are taken seriously, and are a distraction if no one can take the initiative to delete them.

There can also be additional taxes if the setup of the collaborator is complicated enough that it distracts from the information encoded into the test itself.  And of course the test becomes sensitive to changes in the setup of the collaborator.

One can reasonably posit circumstances in which the mock is a better play than either of all or nothing.

But there is another candidate: what about a null object?

A null object alone doesn't give you much beyond not testing at all.  It does suggest a line of inquiry; what are you getting from the mock that you aren't getting from the null object... and is the added benefit of the mock something you should be getting from the production code instead?

In other words, are the tests trying to direct your attention to a instrumentation requirement that should be part of your production code?

Especially in the case where you are making off a command that would normally cross a process boundary, having some counters in place to track how many calls were made, or something about the distribution of the arguments passed, could easily have value to the operators of the system.

And, having introduced that idea into the system, do the investment odds still favor the mock?

Even away from the boundary, I think the question is an interesting one; suppose the system under test is collaborating with another module within the functional core.  Is the element of the test that is prejudicing us toward using a mock actually trying to signal a different design concern?  Is the test trying to tell us that the composition is too much work, or that the code is slow, and we we need smarter logic at the boundary of slow?

I don't think the answer is "never mock"; but rather that there is a decision tree, and the exercise of questioning the choice at each fork is good practice.



Red Blue Refactor

As far as I can tell, the mantra "Red Green Refactor" is derived from the colors of the UI controls in the early GuiTestRunners, which were influenced by the conventions of traffic control system, which perhaps were influenced by railroad control systems.

(Image from JSquaredZ).

As labels, `Red` and  `Green` are fine - they are after all just hints at a richer semantic meaning. 

As colors, they suck as a discriminator for the percentage of the population with Deuteranopia.

Looking for another color pair to use, I'm currently thinking that past and future are reasonable metaphors for code improvement, so I'm leaning toward a red/blue pairing.

(Image from Wikipedia)

Of course, in English, "past" and "passed" are homophones, so this perhaps only trades one problem for another. 


Friday, October 12, 2018

Event Sourcing: lessons in failure, part 2

I've written a couple of solo projects using "event sourcing", failing miserably at them because I failed to properly understand how to properly apply that pattern to the problem I was attempting to solve.

Part 2: Fantasy League Scoring

Our fantasy league used a bespoke scoring system, so I decided to try my hand at creating a report for myself to track how every player in baseball was doing.  This gave me extra insights about how I might improve my team by replacing players in the middle of the season.

And to a large extent, it worked - I was able to pick up a number of useful pieces that would otherwise have slipped under the radar, turning over my team much more aggressively than I would have otherwise.

It was still pretty much a crap shoot -- past performance does not promise future results.  But it did have the benefit of keeping me more engaged.

Failure #1: Where are the events?

Again, I had a book of record issue - the events were things happening in the real world, and I didn't have direct access to them.  What I had was a sort of proxy - after a game ended, a log of the game would become available.  So I could take that log, transform it into events, and the proceed happily.

Well, to be honest, that approach is pretty drunk.

The problem is that the game log isn't a stream of events, it is a projection.  Taking what is effectively a snapshot and decomposing it reverses cause and effect.  There were two particular ways that this problem would be exposed in practice.

First, it would sometimes happen that the logs themselves would go away.  Not permanently, usually, but at least for a time.  Alternatively, they might be later than expected.  And so it would happen that data would get lost - because a stale copy of a projection was delivered instead of a live one.  Again, the projections aren't my data, they are cached copies of somebody else's data, that I might use in my own computations.

Second, when these projections appear, they aren't immutable.  That's a reflection of both problems with data entry upstream (a typo needs to be fixed), and also the fact that within the domain, the interpretation of the facts can change over time -- the human official scorers will sometimes reverse an earlier decision.

In other words, for what I was doing, events didn't gain me anything over just copying the data into a RDBMS, or for that matter writing the data updates into a git repository on disk.

Failure #2: Caching

The entire data processing engine depended on representations of data that changed on a slow cadence (once a day, typically) and I wasn't tracking any meta data about how fresh the data was, how stable it ought to be, whether the new data in question was a regression from what had been seen earlier, and so on.

In an autonomous system, this is effectively a sort of  background task - managing local static copies of remote data.

To make this even more embarrassing; I was of course downloading this data from the web, and the HTTP specification has a lot to say about caching, that I didn't even consider.

(I also failed to consider the advantages I might get from using a headless browser, rather than just an html parser.  This bit me hard, and made a big contribution toward the abandoning of the project.)

Failure #3: What's missing?

The process I was working from only considered logs that were available; there was no monitoring of logs that might be missing, or that might have been removed.  This introduced small errors in data comparisons.

I needed to be able to distinguish "here are Bob Shortstop's 6 scores from last week" from "here are Bob Shortstop's 5 scores from last week, and there is a game unaccounted for".

Again, I was thinking of events as things that happen, rather than as events as a way of representing state over time.

Failure #4: Process telemetry

What I wanted, when the wheels were done turning, was the collection of reports at the end.  And that meant that I wasn't paying enough attention to the processes I was running.  Deciding when to check for updated data, on what grain, and orchestrating the consequences of changes to the fetched representations was the real work, and instead I was thinking of that as just "update files on disk".  I didn't have any reports I could look at to see if things were working.

Again, everything was flat, everything was now, absolutely no indications of when data had appeared, or which representation was the source of a particular bit of data.

Solving the Wrong Problem

In effect, what happened is that I would throw away all of my "events" every morning, then regenerate them all from the updated copies of the data in the cache.  If all of your events are disposable, then something is going badly wrong.

The interesting things to keep track of were all related to the process, and discovering that I wanted to refresh the caches slowly, but in a particular priority.

What I should have been looking toward was Rinat's model of a process manager; how would I support a dashboard showing a list of decisions to be made, could I then capture the priorities of the "domain expert" and automate the work.  Could I capture time as a first class concern driving the elements of the system forward?

Some of the reason that I missed this is that I had too much time -- I was deliberately hobbling the fetch of the data, which meant that the cost of redoing all of the work was lost in the noise.  On the other hand, that doubly emphasizes the point that all of the value add was in the bookkeeping, which I never addressed.

Key Question:

Did I need temporal queries?  No.






Wednesday, October 10, 2018

Event Sourcing: Lessons on failure, part one.

I've written a couple of solo projects using "event sourcing", failing miserably at them because I failed to properly understand how to properly apply that pattern to the problem I was attempting to solve.

Part 1: Fantasy Draft Automation 

I came into event sourcing somewhat sideways - I had first discovered the LMAX disruptor around March of 2013.  That gave me my entry into the idea that state could be message driven.  I decided, after some reading and experimenting, that a message driven approach could be used to implement a tool I needed for my fantasy baseball draft.

My preparation for the draft was relatively straight forward - I would attempt to build many ranked lists of players that I was potentially interested in claiming for my team, and during the draft I would look at these lists, filtering out the players that had already been selected.

So what I needed was a simple way to track lists of all of the players that had already been drafted, so that they could be excluded from my lists.  Easy.

Failure #1: Scope creep

My real ambition for this system was that it would support all of the owners, including helping them to track what was going on in the draft while they were away.  So web pages, and twitter, and atom feeds, and REST, and so on.

Getting all of this support right requires being able to accurately report on all of the players who were drafted.  Which in turn means managing a database of players, and keeping it up to date when somebody chooses to draft a player that I hadn't been tracking, and dealing with the variations in spellings, and the fact that players change names and so on.

But for MVP, I didn't need this grief.  I had already uniquely identified all of the players that I was specifically interested in.  I just needed to keep track of those players; so long as I had all of the people I was considering in the player registry, and could track which of those had been taken (no need to worry about order, and I was tracking my own choices separately anyway).

Failure #2: Where is the book of record?

A second place where I failed was in understanding that my system wasn't the book of record for the actions of the draft.  I should have noticed that we had been drafting for years without this database.  And over the years we've worked out protocols for unwinding duplicated picks, and resolving ambiguity.

What I was really doing was caching outcomes from the real world process into my system.  In other words, I should have been thinking of my inputs as a stream of events, not commands, and arranging for the system to detect and warn about conflicts, rather than rejecting messages that would introduce a conflict.

There was no particular urgency about matching picks with identifiers of players in the registry, or in registering players who were not part of the registry.  All of that math could be delayed a hundred milliseconds without anybody noticing.

Failure #3: Temporal queries

The constraints that the system with were trying to enforce the rules that only players in the player registry could be selected, and that each player in the registry could only be selected once.  In addition to the fact that wasn't the responsibility of the system, it was complicated by the fact that the player registry wasn't static.

Because I was trying to track the draft faithfully (not realizing until later that doing so wasn't strictly necessary for my use case), I would stop the program when my registry had a data error.  The registry itself was just dumb bytes on disk; any query I ran against the database was a query against "now".  So changing the entries in the registry would change the behavior of my program during "replay".

Failure #4: Compatibility

Somewhat related to the above - I wasn't always being careful to ensure that the domain logic was backwards compatible with the app that wrote the messages, nor did my message journal have any explicit markers in it to track when message traffic should switch to the new handlers.

So old messages would break, or do something new, screwing up the replay until I went into the "immutable" journal to fix the input errors by hand.

Failure #5: Messages

My message schemas, such as they were, were just single lines of text - really just a transcript of what I was typing at the interactive shell.  And my typing sucks, so I was deliberately making choices to minimize typing.  Which again made it harder to support change.


Refactoring toward stateless systems?

The other day, I skimmed the video's of J.B. Rainsberger's Point of Sale Exercise.

In the videos, he talks about the design dynamo, and made a point in passing that removing duplication pushes the data toward the test.

For testing pure functions, that's clearly true - once you map the outputs to the inputs, then the function is fully general, and the specific examples to be tried, along with the answer crib, lives in the test code.  Fine.

When testing a stateful system, it's a similar idea.  The system under test is composed in its ground state, and then the test pushes additional data in to drive the system to the target state, and then we ask the big question.  But looked at from the high level, we're still dealing with a function.

But there are a number of cases where it feels natural to keep internal state within the object; especially if that state is immutable, or deliberately excluded from the API.  Wumpus has examples of each of these.  99 Bottles has duplicated immutable state, in the form of the verse templates.  Horses for courses, I suppose.  Plus the ratchet that teaches us that test data should not flow toward the production code.

But it kicked an idea loose...

If we are moving the state out of the production code, then we are largely moving it toward a database.  The composition root is responsible for wiring up the system to the correct database; in our tests, the test itself takes on this responsibility.

That in turn got me thinking about objects versus "APIs".  When we ported our systems to the web, sessions became a lot shorter - the REST architectural constraint calls for sessions that are but a single request long.

So testing such a system, where our domain starts in its default state, and then we "arrange" the preconditions of our test; this is analogous to a sequence of sessions, rather than one single session that handles multiple messages.

If you were to reflect that honestly in your test, you would have a lot of code in the test reading state out of the "objects" and copying it to the database, then pulling it out again for the next session, and so on.

I wonder if that would break us out of the object framing?

Kata: Refactor the Wumpus

I've shared a Java port of Hunt the Wumpus on Github.

https://github.com/DanilSuits/refactor-the-wumpus

The port is deliberately dreadful -- I tried to simulate a legacy code base by adhering as closely as I could manage, both in structure and in style, to the original.

Java doesn't have a useful goto, and I needed line number hints to keep track of where I was in the source code, so I've introduced a few awful names as well.

But there is a battery of tests available: record-and-playback of my implementation through a number of potentially interesting transcripts.
The key thing is that correct behavior is defined by what the set of classes did yesterday -- Michael Feathers
Producing stable, repeatable behaviors took a bit of thinking, but I was able to work out (eventually) that feature flags were the right approach.  By accident, I got part of the way there early; I had elected to make the record-and-playback tests an optional part of the build via maven profiles.

The argument for a feature flag goes something like this: what I'm really doing is introducing a change in behavior that should not be visible at run time - a dark deploy, so to speak.  Therefore, all of the techniques for doing that are in bounds.

It took a bit of trial and error before I hit on the right mechanism for implementing the changed behavior.  The code in the repository is only a sketch (the object of this exercise is _wumpus_, not feature flags), but if you squint you may be able to see Mark Seemann's ideas taking form.

With the tests in place to manage expectations, you can then engage the Simple Design Dynamo and get to work. I think in practice this particular exercise is biased more toward improve names than it is toward remove duplication, because of the style of the original.
Make the change easy, then make the easy change.  -- Kent Beck
My guess is that rather than trying to attack the code base as a whole, that it may be more effective to work toward particular goals.  Parnas taught us to limit the visibility of design decisions, so that we might more easily change them.  So look through the code for decisions that we might want to change.
  • The existing code implements its own interactive shell; how would we change the code to replace it with a library?
  • The interface for making a move or shooting an arrow is a bit clumsy, can it be replaced?
  • What changes do we need to support a web version, where each input from the player occurs in its own session.
  • Knowing the layout of the tunnel system gives the hunter a significant advantage when shooting arrows.  We could disguise the hunter's location by using randomized names.
  • Can we change the system to support mazes of different size? Mazes with more tunnels?  Mazes where the rooms are not symmetric, with "missing" tunnels? Mazes with more/fewer hazards?
  • Does the language need to be English?