Saturday, September 7, 2019

TDD: On Fake Code

This past spring, David Tanzer published a short essay on transitioning from fake implementations to real ones.
When you do TDD, “fake implementations” or “wrong code” are OK, as long as they pass all the tests you have so far
But when do you stop to fake? When do you start writing “real code”?
Tanzer is using this as a stepping stone to introduce Uncle Bob's heuristic: as the tests get more specific, the code gets more generic.

But there is another answer, which I eventually learned from a comment written by Kent Beck:
Do you have some refactoring to do first?
Here is Tanzer's passing implementation:
And that's fine for our test calibration; we have successfully demonstrated that the test can distinguish the correct behavior from an incorrect behaviod in this specific case.

But... the current implementation implicitly describes two pieces of domain knowledge that we can make explicit.
  • The length of the hint should be the same as the length of the secret word.
  • The initial representation of the hint should conceal all of the letters in the secret word, which is to say it should be entirely composed of the unrevealed letter token "_".
We don't have to wait for permission to introduce these ideas; they are always going to appear in a refactoring step, so we can cut to the chase and introduce them immediately.

From there, we might notice that the secretWord we are using in the hint method is the same that was passed to the constructor, and extract that duplication. Or we might decide that the creation of the hint of the correct length is a single idea that can be extracted into another function, and do that.

You can start writing the real code as soon as you have a green bar.

Because I was reviewing Saff and Boshernitsan today, I have been thinking about Beck's Money demonstration.  Translated into Python, Beck's first test looks like

Riddle: what's the simplest implementation that will pass this test? There are probably several different answers, but the simplest I can come up with looks like:

No implementation, no variable names. Just 10. It's clear to me that this is "wrong code", in Tanzer's sense. But we don't need more tests to make it better, we can immediately refactor (in Beck's sense) to restore sanity to the implementation.

If we were being very small and deliberate in our refactoring, the refactoring sequence might look like:
Like "triangulation", small and deliberate steps are not required - they are a technique to practice so that you can get small when larger steps aren't working.

Monday, September 2, 2019

Thoughts on an Acceptance Tests

In my recent experiments with Hunt the Wumpus, I started thinking about what an "acceptance" test might look like.

To get started, I reviewed the walking skeleton example in Growing Object Oriented Software.  Freeman and Pryce wrote that the initial iteration should include delivery of a completely automated checkout/build/deploy/test pipeline, front loading the work of solving a number of critical system and political issues.  The acceptance test, in their example, launches the application and uses the user interface to probe and measure the app.

For an interactive shell app like Wumpus, the test harness is relatively straight forward; we control stdin to pass data to the app and control stdout to read data from the app.

What I struggle with, at this point in the narrative, is the amount of work required to create stable acceptance tests.

A point of view: automated checks are mistake detectors.  They don't provide value to the user - you can delete all of your automated checks and the behavior of your production code doesn't change.  Economically, the justification for the tests is that they reduce the costs of future work.  More precisely, we adopt processes that shutdown when a mistake is detected, ensuring that the mistakes cannot be overlooked, and that we don't expose our test subjects to expensive evaluation when the more cost effective checks have already detected problems.

There's another potential benefit to checks, which the TDD ritual seeks to exploit: thinking about how the checks will validate the behavior of your application creates space to discover important ideas in your app before you start coding it.




The acceptance tests, what with all the work we need to do to set them up, are expensive relative to other mechanisms for checking the correctness of the program.  In the case of the Auction Sniper, those tests included measuring that the app could talk to other processes.

In the case of Wumpus, there really aren't other processes to talk to unless we choose a particularly contrived design.  Only the interface to the user is interesting.  So there isn't a lot of complexity that
needs to be evaluated from the outside.

Which is good, because that evaluation is painful.

Wumpus has three awkward aspects to it; hidden information, non-deterministic behavior, and message schema.

The hidden information aspect is what introduces uncertainty in the game - with complete knowledge of the hazards in the maze, the game can be won trivially by shooting the wumpus in its lair.  But without that hidden information, one cannot know the correct outcome of any action by the hunter.


The location of the hazards in the game is non-deterministic - that's part of the mechanism for hiding that information from the player.  In addition, each of the hunters actions can induce random behavior by the hazards in the game.  These random effects mean that any given action by the player can have multiple candidate responses, depending on how the dice fall.

The feedback from the game to the player is all via messages written to the console.  Those messages were designed (such as it is) for human readability, rather than machine readability.  Understanding the semantics of those messages requires introducing a parser into the acceptance test.

What this means is that we have some work cut out for us if we want anything more than a trivial verification that some message was written to standard out.

One possibility is that we can introduce the idea of specifying a seed for the non-deterministic behavior from outside the program.  The acceptance test can fix the seed, then perform a domain agnostic comparison of the output to some golden master that we specify.  This is somewhat brittle: the current mapping of random values to representations is arbitrary, and the domain agnostic match over fits the representation of the messages.

Another possibility is to introduce an affordance that allows the specification of a message schema to use; the acceptance test simply switches the application into a mode where the responses are easy to parse, much like an http request might distinguish between text/plain and application/json.  Even without fixing the seed, our acceptance test can still easily identify that all of the messages are well formed.

The schema approach, while straight forward, feels like a lot of work that will not pay off.  I think the issue here is that, while wumpus is a more interesting toy exercise than the bowling game or a Fibonacci calculator, it is still fundamentally a toy problem -- one with an arbitrary and limited scope.

My null-design port of Wumpus from basic to Java is only 375 lines long; it's hard to envision that project having a lifetime that justifies heavy upfront investment in acceptance tests.

What we can do, from the outset, is decide that the behaviors that the acceptance test needs to control - the random seed, the interpretation of the random values, the message schema - can be controlled from the outside, and that the idiom for changing those behaviors in the future is to extend the application with new selectable behaviors, rather than replacing the existing behaviors.

Saturday, August 10, 2019

Purchase Approval

One problem I've had with Domain Driven Design is coming up with good realistic examples that exhibit the sorts of complexity we need to be thinking about, without getting lost in the weeds.

My latest idea is to try working through a purchase approval in an analog office.

Bob wants the company to pay for something.  So he gets a blank form, and fills in the details, and drops off the form with Alice.  Alice does the work of comparing the details to the current policies, and approving / rejecting the request.  The resolved request is returned to Bob, so that he can act on the decision that has been made.

There are a lot of requests, and checking the details is a lot of work.  So Alice has become a bottleneck.  She offloads some of the work to Terry the intern; Terry does the legwork for requests when the approval doesn't require Alice's domain expertise.

As a proxy for easy, we'll use a trivial condition like "amount less than 100 USD".

The form acts as a sort of lock; an actor in this protocol can only change the paper when they have physical control of it.  So the process is serial, only one person can record information at a time.

That means we need to think more precisely about how the requests are shared by Alice and Terry.  Perhaps all requests go first to Alice, and she passes the easy requests to Terry; or perhaps the requests all go to Terry, and the hard cases are forwarded to Alice.

We can think of the request as a single form, that gets modified.  Alternatively, we can think of an envelope filled with "immutable" documents; each actor adds new paperwork to the envelope.

The process is asynchronous, in this sense - the request can be in Alice's office even though Alice herself is out at lunch, or home sick.  The movement of paper allows the people to communicate, even though they aren't necessarily in the office at the same time.

The paperwork is anemic - all of the domain knowledge is locked in the heads of Alice and Terry (and, to some degree, Bob).  The paperwork is just the bookkeeping.

It's also worth noting that the paper is immutable, in this sense: once the paperwork has left Bob's control, he cannot correct errors until the paperwork is returned to him.

Bob's "view" of this process is the stack of blank forms, and his collection of resolved requests.  Alice and Terry have similar views: stacks of pending requests.

Exercise 1: what changes when we take this process digital?  So instead of physical paperwork moving from place to place, we now have information being copied from one place to another.

Exercise 2: what changes when we extend the digital process to automate Terry's work?

Saturday, August 3, 2019

TDD: Random is Arbitrary

I happened across a Yahtzee kata today. Although I didn't work that exercise today, it got me thinking again about random behaviors.
Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.
If your goal is to write fast, deterministic tests that you can use to detect unintended changes during a refactoring, then you need to treat a random number generator the way you would a clock.  The test itself has to decide which random sequence to provide, and pass it to the test subject.

But there's a second problem, which is this -- if a random choice from a list is an acceptable behavior, then so to must be the same random choice from every permutation of that list.

If we can map a sequence of random numbers to [HEADS, TAILS], and produce from that an acceptable behavior, then it must also be true that mapping the same sequence of random numbers to [TAILS, HEADS] must also be acceptable.  You can replace one with the other any time you want.

But doing that isn't a refactoring; when we make this change, the same inputs produce different outputs, which are measured by the test harness.

The choice of which permutation to use is an implementation detail; any tests that depend on the specific requirements are overfit.

Can you beat this by passing in an ordered list of choices?  Not really, for the same reason - if the permutation of items that you pass in produces an acceptable behavior, then it sill also be acceptable for the code under test to re-order those items before using them.

Anoher related problem is that there are different ways to interact with the random number generator that produce equally acceptable results.  If we want to randomly toss three coins, we can pull three random numbers and project each onto [0,1], and encode the result, or we can pull a single random number, project onto [0,7], and then use a squashed encoding to interpret the result.  If one is valid, then so to is the other -- but they leave the RNG in different states, and therefore we have additional risks of overfitting.

What does this mean?  That simply decoupling the random number generator isn't enough.  We need to be passing the result of the random choice - pass in the coin after it has been flipped, pass in the dice after they have been rolled, pass in the deck after it has been shuffled.

It's not enough to inject the random number generator into the test; you need to leave the arbitrary mapping of the random number to some result value in the imperative shell as well.

Sunday, June 30, 2019

Usage Kata

The usage kata is intended as an experiment in applying Test Driven Development at the program boundary.

Create a command line application with the following behavior:

The command supports a single option: --help

When the command is invoked with the help option, a usage message is written to STDOUT, and the program exits successfully.

When the command is invoked without the help option, a diagnostic is written to STDERR, and the program exits unsuccessfully.

Example:

Sunday, June 23, 2019

Variations on the Simplest Thing That Could Possibly Work

Simplest Thing that Could Possibly Work is a unblocking technique

Once we get something on the screen, we can look at it. If it needs to be more we can make it more. Our problem is we've got nothing.

Hunt The Wumpus begins with a simple prompt, which asks the human operator whether she would like to review the game instructions before play begins.


That's a decent approximation of the legacy behavior.

But other behaviors may also be of interest, either as replacements for the legacy behavior, or supported alternatives.

For instance, we might discover that the prompt should behave like a diagnostic message, rather than like data output, in which case we'd be interested in something like

Or we might decide that UPPERCASE lacks readability

Or that the input hints should appear in a different order


And so on.

The point here being that even this simple line of code spans many different decisions.  If those decisions aren't stable, then neither is the behavior of the whole.

When we have many tests evaluating the behavior of a unstable decision, the result it a brittle test suite.

A challenging aspect to this: within the scope of a single design session, behaviors tend to be stable.  "This is the required behavior, today."  If we are disposing of the tests at the end of our design session, then there's no great problem to solve here.


On the other hand, if the tests are expected to be viable for many design sessions, then protecting the tests from the unstable decision graph constrains our design still further.

One way to achieve a stable decision graph is to enforce a constraint that new behaviors are added by extension, and that new behaviors will be delivered beside the old, and that clients can choose which behavior they prefer.  There's some additional overhead compared with making the change in "one" place.

Another approach is to create bulkheads within the design, so that only single elements are tightly coupled to a specific decision, and the behavior of compositions is evaluated in comparison to their simpler elements.  James Shore describes this approach in more detail within his Testing Without Mocks pattern language.

What I haven't seen yet: a good discussion of when.  Do we apply YAGNI, and defend against the brittleness on demand?  Do we speculate in advance, and invest more in the design to insure against an uncertain future?  Is there a checklist that we can work, to reduce the risk that we foul our process for reducing the risk?

Thursday, June 20, 2019

Design Decisions After the First Unit Test

Recently, I turned my attention back to one of my early "unit" tests in Hunt The Wumpus.

This is an "outside-in" test, because I'm still curious to learn different ways that the tests can drive the designs in our underlying implementations.

Getting this first test to pass is about Hello World difficulty level -- just write out the expected string.

At that point, we have this odd code smell - the same string literal is appearing in two different places. What does that mean, and what do we need to do about it?

One answer, of course, is to ignore it, and move on to some more interesting behavior.

Another point of attack is to go after the duplication directly. If you can see it, then you can tease it apart and start naming it. JBrains has described the design dynamo that we can use to weave a better design out of the available parts and a bit of imagination.

To be honest, I find this duplication a bit difficult to attack that way. I need another tool for bootstrapping.

It's unlikely to be a surprise that my tool of choice is Parnas. Can we identify the decision, or chain of decisions, that contribute to this particular string being the right answer?

In this case, the UPPERCASE spelling of the prompt helps me to discover some decisions; what if, instead of shouting, we were to use mixed case?

This hints that, perhaps, somewhere in the design is a string table, that defines what "the" correct representation of this prompt might be.

Given such a table, we can then take this test and divide it evenly into two parts - a sociable test that verifies that the interactive shell behaves like the string table, and a second solitary test that the string table is correct.

If we squint a bit, we might see that the prompt itself is composed of a number of decisions -- the new line terminator for the prompt, the format for displaying the hints about acceptable responses, the ordering of those responses, the ordering of the prompt elements (which we might want to change if the displayed language were right to left instead of left to right).

The prompt itself looks like a single opaque string, but there is duplication between the spelling of the hints and the checks that the shell will perform against the user input.

Only a single line of output, and already there is a lot of candidate variation that we will want to have under control.

Do we need to capture all of these variations in our tests? I believe that depends on the stability of the behavior. If what we are creating is a clone of the original behavior -- well, that behavior has been stable for forty years. It is pretty well understood, and the risk that we will need to change it after all this time is pretty low. On the other hand, if we are intending an internationalized implementation, then the English spellings are only the first increment, and we will want a test design that doesn't require a massive rewrite after each change.