Sunday, September 23, 2018

TDD: What do tests describe?

"The second thing I want to highlight is that refactoring does not change the observable behavior of the software. The software still carries out the same function that it did before. Any user, whether an end user or another programmer, cannot tell that things have changed." -- Martin Fowler
Most TDD katas feature functions as the system under test. Sure, the underlying implementation might be classes, or monads, or whatever the flavor of the month happens to be. But in the Bowling Game, or Fizz Buzz, or Mars Rover, the inputs completely determine the output.

"1", "2", "Fizz"... each question has one and only one right answer. There is precisely one observable behavior that satisfies the requirements.

But that's not generally true -- there are many ways in which the requirements may not completely constrain the system. For instance, in the Fractions kata, unless you introduce a constraint that further restricts the behavior in some way (say, requiring lowest terms), adding two fractions can produce any of a number of distinguishable denominators.
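
A sketch of what that looks like, with two hypothetical implementations (the names add_unreduced and add_reduced are mine): both answers below are arithmetically correct, so the requirements alone don't pick between them.

    from math import gcd

    def add_unreduced(a, b):
        # fractions as (numerator, denominator) pairs; cross-multiply
        # and keep whatever denominator falls out
        return (a[0] * b[1] + b[0] * a[1], a[1] * b[1])

    def add_reduced(a, b):
        # same arithmetic, then normalize to lowest terms
        n, d = add_unreduced(a, b)
        g = gcd(n, d)
        return (n // g, d // g)

    # both are correct values for one half plus one half; a test that
    # asserts a specific denominator is choosing between acceptable
    # behaviors, not checking the requirements
    assert add_unreduced((1, 2), (1, 2)) == (4, 4)
    assert add_reduced((1, 2), (1, 2)) == (1, 1)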

Systems with "random" behaviors, or hidden information, face this complexity. My approach to the Fischer Chess kata usually involves isolating the random number generator from the function that computes the layout -- but there are still 960 different layouts that could count as row(0).
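
Roughly what that isolation might look like (a sketch: row0, LAYOUTS, and the injected source are my names, and the two layouts shown are merely legal-looking placements, not a claim about any canonical numbering):

    import random

    # hidden state: an assignment of the 960 legal layouts to indexes
    # (only two of the entries are shown here)
    LAYOUTS = {
        0: "BBQNNRKR",
        959: "RKRNNQBB",
    }

    def row0(source=random.randrange):
        # the randomness is injected; once the index is fixed,
        # the layout is a pure lookup
        return LAYOUTS[source(960)]

    # in a test, replace the random source with a deterministic one
    assert row0(source=lambda n: 0) == "BBQNNRKR"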

So -- what's really going on?

Let's step back a moment -- what are the tests for? In the case of TDD, we are normally restricting our focus to developer tests; and developer tests are being written in support of red, green, refactor. That third word is a big hint, in light of Fowler's comments above. The tests are run to detect instances where our attempt to refactor the code actually changed the observable behavior.

We run our tests, and they pass. We refactor the solution. We run our tests, and they fail. That tells us immediately that our refactoring went wrong, and that we need to revert and try again.

How do the different styles of testing achieve the sensitivity we need for this to be true?

Property-based tests achieve this collectively. Each test constrains the test subject along one dimension; to pass all of the tests, the implementation must lie somewhere within the intersection of the acceptable behaviors. Apply enough of these constraints, and you end up with a solution that is unique.

Thus, the test suite describes the requirements of the system.
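
For instance, a few properties for fraction addition, hand-rolled here to stay self-contained (a library such as Hypothesis would normally generate the cases), collectively corner the add_reduced sketch from earlier:

    import random
    from math import gcd

    def add_reduced(a, b):
        # (numerator, denominator) pairs, as sketched above
        n, d = a[0] * b[1] + b[0] * a[1], a[1] * b[1]
        g = gcd(n, d)
        return (n // g, d // g)

    def check_properties(add, trials=1000):
        rnd = random.Random(0)  # fixed seed keeps the run deterministic
        for _ in range(trials):
            a = (rnd.randint(-99, 99), rnd.randint(1, 99))
            b = (rnd.randint(-99, 99), rnd.randint(1, 99))
            n, d = add(a, b)
            # each property constrains the behavior along one dimension:
            # the value is correct...
            assert n * a[1] * b[1] == (a[0] * b[1] + b[0] * a[1]) * d
            # ...the denominator is positive...
            assert d > 0
            # ...and the result is in lowest terms
            assert gcd(n, d) == 1

    check_properties(add_reduced)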

Example-based tests are typically much sparser than property tests (we don't exercise the input space as rigorously), and they constrain the solution to a specific observed behavior.

The mindset is different.

"I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence." -- Kent Beck

I'm reminded of curve fitting. Here are a number of points that the solution has to pass through, and there are infinitely many curves to choose from. But only a few of those curves are simple, where "simple" is an implicit heuristic of the problem.

With example tests, the requirements are in the head of the developer, along with the goal of producing the appropriate solution for those requirements. The tests themselves are scaffolding that assist in the refactoring effort.

Example tests don't describe the requirements; they describe a candidate behavior that is consistent with the requirements.

And that is the piece I've been missing; a recognition that in some cases there is more than one candidate behavior that is satisfactory, and that when that happens we just choose one.

For example, suppose that we need a random stream of coin flips. We have a random stream of uniformly distributed integers. So we can create a lookup table that takes each integer and maps it to heads or tails. Provided that we exhibit some care in the design of the table, we get a perfectly satisfactory representation of a true coin. In our tests, we replace the random stream with a deterministic equivalent, and that gives us a fixed set of behaviors.
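
Here is one way the pieces might fit together (a sketch; the names and the even-split table design are my assumptions, not quoted from anywhere):

    import random

    # "some care in the design of the table": an even split between
    # the two faces keeps the coin fair
    FACES = ["heads", "tails"]

    def flips(stream):
        # stream is any iterable of uniformly distributed integers
        return (FACES[n % 2] for n in stream)

    def random_stream(rng=random.Random()):
        while True:
            yield rng.randrange(2 ** 32)

    # production wiring: flips(random_stream())
    # test wiring: a deterministic stream gives a fixed set of behaviors
    assert list(flips([0, 1, 2, 3])) == ["heads", "tails", "heads", "tails"]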

What happens if we rework the lookup table, swapping the index of the first heads entry with the index of the first tails entry? The observable behavior changes to a different behavior that is also compatible with the requirements.
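
Continuing the sketch above: after the swap, the function still behaves like a fair coin, so the requirements are still met -- but the assertion that pinned the old behavior now fails.

    # swap the first heads entry with the first tails entry
    FACES = ["tails", "heads"]

    # the earlier test, asserting ["heads", "tails", "heads", "tails"],
    # would now fail, even though this is an equally satisfactory coin
    assert list(flips([0, 1, 2, 3])) == ["tails", "heads", "tails", "heads"]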

I think I first ran across this when working on the Fischer Chess kata. Once you hit on the idea of using a lookup table, you have to choose which outcomes are assigned to which indexes. At the time, I happened to know how I expected the implementation to work; with many behaviors to choose from, I selected the behavior that would refactor most easily into the design I had in mind.

Generalizing: the lookup table was an example of hidden state -- an input provided by something other than the public interface, one that cannot be derived from the test inputs.

And there's a form of analysis paralysis that can set in at this point -- which hidden state do you choose?

My answer today: what's the simplest thing that could possibly work? There isn't a rule, or a constraint, that says the hidden state must depend on something else. An’ ye harm none....

The examples, then, become a description of the behaviors that follow from that hidden state. The observable effects of that hidden state are what you are protecting when you refactor.
