Sunday, June 30, 2019

Usage Kata

The usage kata is intended as an experiment in applying Test Driven Development at the program boundary.

Create a command line application with the following behavior:

The command supports a single option: --help

When the command is invoked with the help option, a usage message is written to STDOUT, and the program exits successfully.

When the command is invoked without the help option, a diagnostic is written to STDERR, and the program exits unsuccessfully.

Example:
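A minimal Java sketch with the described behavior might look like the following (the usage text and the exit codes are arbitrary choices, not part of the kata):

    import java.util.Arrays;

    public class UsageKata {
        public static void main(String[] args) {
            if (Arrays.asList(args).contains("--help")) {
                // Help requested: usage goes to STDOUT, exit successfully.
                System.out.println("usage: usage-kata [--help]");
                System.exit(0);
            }
            // No help option: diagnostic goes to STDERR, exit unsuccessfully.
            System.err.println("usage-kata: unrecognized invocation; try --help");
            System.exit(64);
        }
    }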

Sunday, June 23, 2019

Variations on the Simplest Thing That Could Possibly Work

Simplest Thing That Could Possibly Work is an unblocking technique.

Once we get something on the screen, we can look at it. If it needs to be more we can make it more. Our problem is we've got nothing.

Hunt The Wumpus begins with a simple prompt, which asks the human operator whether she would like to review the game instructions before play begins.
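In Java, a sketch of that prompt might be nothing more than a line of output followed by a read (the spelling here follows the original BASIC game, as far as I can tell):

    import java.util.Scanner;

    class Wumpus {
        public static void main(String[] args) {
            System.out.println("INSTRUCTIONS (Y-N)");
            String reply = new Scanner(System.in).nextLine();
            // ... the rest of the game hangs off the reply
        }
    }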


That's a decent approximation of the legacy behavior.

But other behaviors may also be of interest, either as replacements for the legacy behavior, or supported alternatives.

For instance, we might discover that the prompt should behave like a diagnostic message, rather than like data output, in which case we'd be interested in something like
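a sketch where the prompt line from above moves to the error stream:

    System.err.println("INSTRUCTIONS (Y-N)");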

Or we might decide that UPPERCASE lacks readability
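and prefer a mixed-case sketch such as

    System.out.println("Instructions? (y-n)");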

Or that the input hints should appear in a different order
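as in this sketch:

    System.out.println("INSTRUCTIONS (N-Y)");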


And so on.

The point here is that even this simple line of code spans many different decisions.  If those decisions aren't stable, then neither is the behavior of the whole.

When we have many tests evaluating the behavior of an unstable decision, the result is a brittle test suite.

A challenging aspect to this: within the scope of a single design session, behaviors tend to be stable.  "This is the required behavior, today."  If we are disposing of the tests at the end of our design session, then there's no great problem to solve here.


On the other hand, if the tests are expected to be viable for many design sessions, then protecting the tests from the unstable decision graph constrains our design still further.

One way to achieve a stable decision graph is to enforce a constraint that new behaviors are added by extension: the new behavior is delivered beside the old, and clients can choose which behavior they prefer.  There's some additional overhead compared with making the change in "one" place.
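As a sketch (the names here are invented), the old behavior stays put, the new behavior is delivered beside it, and each client chooses which one to call:

    class Prompts {
        static String instructionsPrompt() {
            return "INSTRUCTIONS (Y-N)";      // legacy behavior, unchanged
        }

        static String instructionsPromptMixedCase() {
            return "Instructions? (y-n)";     // new behavior, added by extension
        }
    }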

Another approach is to create bulkheads within the design, so that only single elements are tightly coupled to a specific decision, and the behavior of compositions is evaluated in comparison to their simpler elements.  James Shore describes this approach in more detail within his Testing Without Mocks pattern language.

What I haven't seen yet: a good discussion of when.  Do we apply YAGNI, and defend against the brittleness on demand?  Do we speculate in advance, and invest more in the design to insure against an uncertain future?  Is there a checklist that we can work, to reduce the risk that we foul our process for reducing the risk?

Thursday, June 20, 2019

Design Decisions After the First Unit Test

Recently, I turned my attention back to one of my early "unit" tests in Hunt The Wumpus.

This is an "outside-in" test, because I'm still curious to learn different ways that the tests can drive the designs in our underlying implementations.

Getting this first test to pass is about Hello World difficulty level -- just write out the expected string.
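As a sketch - the names and the exact assertion are mine, not the ones from the original session - the test and its Hello World pass might look like:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;
    import org.junit.jupiter.api.Test;

    class WumpusShellTest {
        @Test
        void promptsForInstructions() {
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            new WumpusShell(new PrintStream(output, true)).start();

            assertEquals("INSTRUCTIONS (Y-N)" + System.lineSeparator(), output.toString());
        }
    }

    class WumpusShell {
        private final PrintStream out;

        WumpusShell(PrintStream out) {
            this.out = out;
        }

        void start() {
            // Hello World difficulty: just write out the expected string.
            out.println("INSTRUCTIONS (Y-N)");
        }
    }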

At that point, we have this odd code smell - the same string literal is appearing in two different places. What does that mean, and what do we need to do about it?

One answer, of course, is to ignore it, and move on to some more interesting behavior.

Another point of attack is to go after the duplication directly. If you can see it, then you can tease it apart and start naming it. JBrains has described the design dynamo that we can use to weave a better design out of the available parts and a bit of imagination.

To be honest, I find this duplication a bit difficult to attack that way. I need another tool for bootstrapping.

It's unlikely to be a surprise that my tool of choice is Parnas. Can we identify the decision, or chain of decisions, that contribute to this particular string being the right answer?

In this case, the UPPERCASE spelling of the prompt helps me to discover some decisions; what if, instead of shouting, we were to use mixed case?

This hints that, perhaps, somewhere in the design is a string table that defines what "the" correct representation of this prompt might be.

Given such a table, we can then divide this test into two parts - a sociable test that verifies that the interactive shell behaves like the string table, and a solitary test that verifies that the string table itself is correct.
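A sketch of that division, where the shell from the earlier sketch now reads its prompt from the table (StringTable and its field name are guesses of mine):

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;
    import org.junit.jupiter.api.Test;

    class StringTable {
        static final String INSTRUCTIONS_PROMPT = "INSTRUCTIONS (Y-N)";
    }

    class WumpusShell {
        private final PrintStream out;

        WumpusShell(PrintStream out) {
            this.out = out;
        }

        void start() {
            out.println(StringTable.INSTRUCTIONS_PROMPT);
        }
    }

    class PromptTest {
        @Test
        void shellBehavesLikeTheStringTable() {        // sociable
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            new WumpusShell(new PrintStream(output, true)).start();

            assertEquals(StringTable.INSTRUCTIONS_PROMPT + System.lineSeparator(),
                    output.toString());
        }

        @Test
        void stringTableSpellsThePromptCorrectly() {   // solitary
            assertEquals("INSTRUCTIONS (Y-N)", StringTable.INSTRUCTIONS_PROMPT);
        }
    }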

If we squint a bit, we might see that the prompt itself is composed of a number of decisions -- the new line terminator for the prompt, the format for displaying the hints about acceptable responses, the ordering of those responses, the ordering of the prompt elements (which we might want to change if the displayed language were right to left instead of left to right).

The prompt itself looks like a single opaque string, but there is duplication between the spelling of the hints and the checks that the shell will perform against the user input.

Only a single line of output, and already there is a lot of candidate variation that we will want to have under control.

Do we need to capture all of these variations in our tests? I believe that depends on the stability of the behavior. If what we are creating is a clone of the original behavior -- well, that behavior has been stable for forty years. It is pretty well understood, and the risk that we will need to change it after all this time is pretty low. On the other hand, if we are intending an internationalized implementation, then the English spellings are only the first increment, and we will want a test design that doesn't require a massive rewrite after each change.

Sunday, May 5, 2019

Testing at the seams

A seam is a place where you can alter behaviour in your program without editing in that place -- Michael Feathers, Working Effectively with Legacy Code
When I'm practicing Test Driven Development, I prefer to begin on the outside of the problem, and work my way inwards.  This gives me the illusion that I am discovering the pieces that I need; no abstraction is introduced in the code without at least one consumer, and the "ease of use" concern gets an immediate evaluation.

As an example, let's consider the case of an interactive shell.  We can implement a simple shell using java.lang.System, which gives us access to System.in, System.out, System::getenv, System::currentTimeMillis, and so on.

We probably don't want our test subjects to be coupled to System, however, because that's a shared resource.  Developer tests should be embarrassingly parallel; shared mutable resources get in the way of that.

By introducing a seam that we can plug System into, we get the decoupling that we need in our tests.
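A sketch of one such seam (the interface and the names are guesses of mine):

    import java.io.InputStream;
    import java.io.PrintStream;

    // The narrow interface that our test subjects depend on.
    interface Console {
        InputStream in();
        PrintStream out();
    }

    // The production implementation plugs System into the seam.
    class SystemConsole implements Console {
        @Override public InputStream in() { return System.in; }
        @Override public PrintStream out() { return System.out; }
    }

    class InteractiveShell {
        private final Console console;

        InteractiveShell(Console console) {
            this.console = console;
        }

        void start() {
            console.out().println("INSTRUCTIONS (Y-N)");
        }
    }

    // Composition root: new InteractiveShell(new SystemConsole()).start()
    // Tests supply their own Console backed by in-memory streams.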


If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. -- JB Rainsberger, 2013
My thinking is that this is correctly understood as two different steps; we introduce the indirection, and we also try to discover the better abstraction.

But I am deliberately trying to avoid committing to an abstraction prematurely.  In particular, I don't want to invest in an abstraction without first accumulating evidence that it is a good one.  I don't want to make changes expensive when change is still likely - the investment odds are all wrong.

Saturday, April 20, 2019

Sketching Evil Hangman, featuring GeePaw

GeePaw Hill christened his Twitch channel this week with a presentation of his approach to TDD, featuring an implementation of Evil Hangman.

Evil Hangman is a mimic of the traditional word guessing game, with a twist -- evil hangman doesn't commit to a solution immediately.  It's a good mimic - the observable behavior of the game is entirely consistent with a fair implementation that has committed to some word in the corpus.  But it will beat you unfairly if it can.

So, as a greenfield problem, how do you get started?

From what I can see, there are three approaches that you might take:
  • You can start with a walking skeleton, making some arbitrary choices about I/O, and work your way inward.
  • You can start with the functional core, and work your way inward.
  • You can start with an element from a candidate design, and work your way outward.
GeePaw, it seems, is a lot more comfortable when he has pieces that he can microtest.  I got badly spooked on that approach years ago when I looked into the sudoku exercise.  So my preference is to choose an observable behavior that I understand, fake it, remove the duplication, and then add additional tests to the suite that force me to extend the solution's design.

Ultimately, if you examine the complexity of the test subjects, you might decide that I'm writing "integrated" or "acceptance" tests.  From my perspective, I don't care - the tests are fast, decoupled from the environment, and embarrassingly parallel.  Furthermore, the purpose of the tests is to facilitate my design work, not to prove correctness.

What this produces, if I do it right, is tests that are resilient to changes in the design, but which may be brittle to changes in requirements.

My early tests, especially in greenfield work, tend to be primitive obsessed.  All I have in the beginning are domain agnostic constructs to work with, so how could they be anything but?  I don't view this as a failing, because I'm working outside in -- which is to say that my tests are describing the boundary of my system, where things aren't object oriented.  Primitives are the simplest thing that could possibly work, and allow me to move past my writer's block into having arguments with the code.

As a practice exercise, I would normally choose to start from the boundary of the functional core -- we aren't being asked to integrate with anything in particular, and my experiences so far haven't suggested that there is a lot of novelty there.
One should not ride in the buggy all the time. One has the fun of it and then gets out.
So, where to begin?

I'm looking for a function - something that will accept some domain agnostic arguments and return a domain agnostic value that I can measure.

Here, we know that the basic currency is that the player will be making guesses, and the game will be responding with clues.  So we could think in terms of a list of string inputs and a list of string outputs.  But the game also has hidden state, and I know from hard lessons that making that state an input to the test function will make certain kinds of verification easier.

The tricky bit, of course, is that I won't always know what that hidden state is until I get into the details.  I may end up discovering that my initial behaviors depend on some hidden variable that I hadn't considered as part of the API, and I'll need to remediate that later.

In this case, one of the bits of state is the corpus - the list of words that the game has to choose from.  Using a restricted word list makes it easier to specify the behavior of the implementation.  For instance, if all of the words in the corpus are the same length, then we know exactly how many dashes are going to be displayed in the initial hint.  If there is only a single word in the corpus, then we know exactly how the app will respond to any guess.

Making the corpus an input is the affordance that we need to specify degenerate cases.

Another place where degeneracy might be helpful is allowing the test to control the player's mistake allotment.  Giving the human player no tolerance for errors allows us to explore endgame behavior much more easily.
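Put together, my guess at the boundary might look like this sketch (the signature and the names are mine, and will likely need remediation later):

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.List;
    import org.junit.jupiter.api.Test;

    class EvilHangman {
        // Domain agnostic in, domain agnostic out: the corpus, the mistake
        // allotment, and the guesses so far produce the clue the player sees.
        static String clue(List<String> corpus, int mistakesAllowed, List<Character> guesses) {
            // Degenerate implementation, just enough for the first test:
            // with no guesses yet, the clue is one dash per letter.
            return "-".repeat(corpus.get(0).length());
        }
    }

    class EvilHangmanTest {
        @Test
        void singleWordCorpusShowsOneDashPerLetter() {
            assertEquals("----", EvilHangman.clue(List.of("cake"), 0, List.of()));
        }
    }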

And if we don't think of these affordances in our initial guess?  That's fine - we introduce a new entry point with an "extract method refactoring", eliminating duplication by having the original test subject delegate its behavior to our improved API, deprecating the original, and eliminating it when it is no longer required.
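A sketch of that remediation, continuing the invented example above:

    import java.util.List;

    class EvilHangman {
        // The improved entry point, with the affordance we failed to anticipate.
        static String clue(List<String> corpus, int mistakesAllowed, List<Character> guesses) {
            return "-".repeat(corpus.get(0).length());
        }

        // The original entry point has no behavior of its own any more; it just
        // delegates, and it disappears once nothing depends on it.
        @Deprecated
        static String clue(List<String> corpus, List<Character> guesses) {
            return clue(corpus, Integer.MAX_VALUE, guesses);
        }
    }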

Simplest thing that can possibly work is a proposal, not a promise.

For the most part, that's what my sketches tend to look like: some exploration of the problem space, a guess at the boundary, and very little early speculation about the internals.

Friday, April 19, 2019

TDD and incremental delivery

I spend a lot of time thinking about breaking tests, and what that means about TDD as a development ritual.

I recently found a 2012 essay by Steven Thomas reviewing Jeff Patton's 2007 Mona Lisa analogy.  This in turn got me thinking about iteration specifically.

A lot of the early reports of Extreme Programming came out of Chrysler Comprehensive Compensation, and there's a very interesting remark in the post mortem
Subsequent launches of additional pay populations were wanted by top management within a year.
To me, that sounds like shorthand for the idea that the same (very broad) use case was to be extended to cover a larger and more interesting range of inputs with only minor changes to the behaviors already delivered.

The tests that we leave behind serve to describe the constraints necessary to harvest the low hanging fruit, as best we understood them at the time, and to identify regressions when the next layer of complexity is introduced to the mix.

We're writing more living documentation because we are expecting to come back to this code next year, or next month, or next sprint.

I envision something like a ten case switch statement -- we'll implement the first two cases now, to cover perhaps a third of the traffic, and then defer the rest of the work until "later", defined as far enough away that the context has been evicted from our short term memory.
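As a sketch (the pay populations and the arithmetic are invented for illustration):

    class PayCalculator {
        // Two of the ten populations are implemented in the first increment;
        // the rest are deferred until later.
        static double biweeklyPay(Employee employee) {
            switch (employee.population) {
                case "SALARIED":
                    return employee.annualSalary / 26.0;
                case "HOURLY":
                    return employee.hourlyRate * employee.hoursWorked;
                default:
                    throw new UnsupportedOperationException(
                            "pay population not yet supported: " + employee.population);
            }
        }
    }

    class Employee {
        String population;
        double annualSalary;
        double hourlyRate;
        double hoursWorked;
    }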

If the requirements for the behaviors that you implemented in the first increment are not stable, then there is non-trivial risk that you'll need several iterations to get those requirements right.  Decisions change, and the implications of those changes are going to ripple to the nearest bulkhead, in which case we may need a finer grained testing strategy than we would if the requirements were stable.

At the other extreme, I'm working on an old legacy code base; this code base has a lot of corners that are "done" -- modules that haven't changed in many years.  Are we still profiting by running those tests?

This is something we should keep in mind as we kata.  If we want to be preparing for increments with significant time intervals between them, then we need bigger input spaces with stable requirements.

A number of the kata are decent on the stable requirements bit -- Roman numerals haven't changed in a long time -- but tend to be too small to justify not solving the whole thing in one go.  Having done that, you can thank your tests and let them go.

The best of the kata I'm familiar with for this approach would be the Gilded Rose - we have a potentially unlimited catalog of pricing rules for items, so we'll incrementally adapt the pricing logic until the entire product catalog is covered.

But - to do that without breaking tests, we need stable pricing rules, and we need to know in advance which products follow which rules.  If we were to naively assume, for example, that Sulfuras is a normal item, and we used it as part of our early test suite, then the updated behavior would break those tests.  (Not an expensive break, in this case -- we'd likely be able to replace Sulfuras with some other normal item, and get on with it).
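The kind of early test I have in mind might look like this sketch, where Item is a simplified stand-in for the kata's item and the update rule is still the naive, normal-item-only version:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class Item {
        String name; int sellIn; int quality;

        Item(String name, int sellIn, int quality) {
            this.name = name; this.sellIn = sellIn; this.quality = quality;
        }
    }

    class GildedRose {
        // First increment: every item is treated as a normal item.
        static void updateQuality(Item item) {
            item.sellIn -= 1;
            if (item.quality > 0) {
                item.quality -= 1;
            }
        }
    }

    class NaiveSulfurasTest {
        @Test
        void sulfurasDegradesLikeANormalItem() {
            Item sulfuras = new Item("Sulfuras, Hand of Ragnaros", 10, 20);

            GildedRose.updateQuality(sulfuras);

            // Passes today; breaks in the increment that teaches the code
            // that Sulfuras never degrades.
            assertEquals(19, sulfuras.quality);
        }
    }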

Expressing the same idea somewhat differently: in an iterative approach, we might assume that Sulfuras is priced normally, and then make adjustments to the tests until they finally describe the right pricing constraints; in an incremental approach, Sulfuras would be out of scope until we were ready to address it.

I think the scheduling of refactoring gets interesting in an incremental approach - how much refactoring do you do now, when it is still uncertain which increment of work you will address next?  Is Shameless Green the right conclusion for a design session?

The Sudoku problem is one that I struggle to classify.  On the one hand, the I/O is simple, and the requirements are stable, so it ought to hit the sweet spot for TDD.  You can readily imagine partitioning the space of sudoku problems into trivial, easy, medium, hard, and diabolical sets, working on one grouping at a time, and delivering each increment in turn.

On the other hand, Dr Norvig showed that, if you understand the problem, you can simply draw the rest of the fucking owl. The boundaries between the categories are not inherent to the problem space.