Saturday, April 18, 2020

TDD Test Bulk Heads

I wanted to illustrate some of the ideas in my previous essay with some code examples, in an attempt to spell out more clearly what I think can, and perhaps even should, be going on.

So let's imagine that we are working on a Mars Rover exercise, and we want to start from a trivial behavior: if we don't send any instructions to the rover, then it will report back its initial position.

Such a use case could be spelled in a lot of different ways.  But if we are free of the constraints of legacy code, then we might reasonably begin with a primitive-obsessed representation of this trivial behavior, for the simple reason that such a representation is easy to type.
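
Something like this, say (the Rover class, its run method, and the "0:0:N" report format are placeholder spellings for the sake of the sketch):

def test_empty_program_reports_initial_position(self):
    rover = Rover(x=0, y=0, orientation="N")

    # No instructions at all; the rover should just report where it started.
    report = rover.run("")

    self.assertEqual("0:0:N", report)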

The fastest way I can see to get from RED to GREEN is to simply wire in the correct response.
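
In the sketch, that's nothing more than:

class Rover:
    def __init__(self, x, y, orientation):
        self.x = x
        self.y = y
        self.orientation = orientation

    def run(self, program):
        # Wire in the correct response; nothing is calculated yet.
        return "0:0:N"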

At this point in most TDD demonstrations, a second test will be introduced, so that one can begin triangulating on a solution. Alternatively, one could begin removing duplication right away.

For my purposes here, it is convenient to attack this problem in a very specific way - which is to recognize that the representation of the answer is a separate concern from the calculation of the answer. I assert that this sort of "extract method" refactoring should occur to the programmer at some point during the exercise.

I'm going to accelerate that refactoring to appear at the beginning of the exercise, before the function becomes too cluttered. The motivation here is to keep the example clear of distractions; please forgive that I am willing to give up some verisimilitude to achieve that.

That means that we can produce an equivalent behavior (and therefore pass the test) with an implementation that extracts the formatting into its own method.
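
Continuing the sketch:

class Rover:
    def __init__(self, x, y, orientation):
        self.x = x
        self.y = y
        self.orientation = orientation

    def extracted_method(self, x, y, orientation):
        # Presentation: turn a position into the report string.
        return "{0}:{1}:{2}".format(x, y, orientation)

    def run(self, program):
        # Calculation (still hard coded), handed off to the presentation step.
        return self.extracted_method(0, 0, "N")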

The naming is dreadful, of course, but we've taken a step forward in separating calculation from presentation.

If we happen to notice that we've committed this refactoring, then an interesting possibility appears; we can decompose our existing assertion into two distinct pieces.
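
Again in sketch form:

def test_empty_program_reports_initial_position(self):
    rover = Rover(x=0, y=0, orientation="N")

    report = rover.run("")

    # Two constraints where there was one: the formatting, and the full behavior.
    self.assertEqual("0:0:N", rover.extracted_method(0, 0, "N"))
    self.assertEqual("0:0:N", report)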

This version, with the double assertion, doesn't improve our real circumstances at all - but it begins to pave the way to a third version, which does have some interest.
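
In the sketch, the third version splits the single test into two:

def test_report_format(self):
    rover = Rover(x=0, y=0, orientation="N")

    # This is the test that constrains the formatting logic.
    self.assertEqual("0:0:N", rover.extracted_method(0, 0, "N"))

def test_empty_program_reports_initial_position(self):
    rover = Rover(x=0, y=0, orientation="N")

    # This test only insists that run("") produces the same report that
    # the formatter produces for the initial position, whatever that is.
    self.assertEqual(rover.extracted_method(0, 0, "N"), rover.run(""))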

With this change, we're constraining our behavior exactly as before... except that the new arrangement is somewhat more resilient when the required behaviors change in the future. For example, if we needed to replace this somewhat ad hoc string representation with a JSON document, only one of the two tests -- that test which constrains the behavior of the formatting logic -- fails and needs to be rewritten/retired.

I tend to use the pronunciation "behaves like" when considering tests of this form; the test doesn't assert that rover.run calls extracted_method, only that it returns the same thing you would get by calling that method with those arguments. For instance, if we were to inline that particular call (leaving the extracted method in place), the tests would still correctly pass.
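
In other words, this variation of the sketch still satisfies both tests:

    def run(self, program):
        # extracted_method is still defined (and still constrained by its
        # own test); run just happens not to call it any more.
        return "{0}:{1}:{2}".format(0, 0, "N")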

That sounds pretty good. Where's the catch?

As best I can tell, the main catch is that we have to be confident that the signature of the new method is stable. By expressing the intended behavior in this way, we're increasing the coupling between the tests and the implementation, making that part of the implementation harder to change than if it were a hidden detail.

TDD: Return to Mars

The Boston Software Crafters meetup this week featured a variant on the Mars Rover exercise.  I decided to take another swing at it on my own; Paul Reilly had introduced a clean notation for obstacles into the problem, and I wanted to try my hand at introducing that innovation at a late stage.

As is common in "Classic TDD", all of the usual I/O and network concerns are abstracted away in the pre-game, leaving only raw calculation and logic; given this input, produce the corresponding output.

My preferred approach is to work with tests that are representations of the complete behavior, and then introduce implementation elements as I need them.  Another way of saying the same thing: rather than trying to guess where my "units" should be right out of the gate, I guess only at a single candidate interface that allows me to describe the problem in a way that the machine can understand, and then flesh out further details as they become clear.

So for this exercise session, I began by creating a test checklist.  The "acceptance" tests provided by Paul went to the top of the list; these help to drive questions like "what should the API look like?".  Rather than diving directly into trying to solve those cases, I took the time to add to the checklist a bunch of simple behaviors to write: what should the solution output when the input is an empty program? When the program contains only turn commands? When the program describes an orbit around the origin (which, in this exercise, meant worming into each corner of the finite grid)?

I came up with the empty program and the turns right away; the orbit of the origin occurred to me later when I was thinking about how to properly test the relation between move and turn.  If the problem had included cases of rovers starting away from the origin, I might not have addressed the coordinate wrapping effects quite so soon.

Along the way, one "extra" acceptance test occurred to me.

The empty program obviously had the simplest behavior, and the reporting format was separated from the calculation right away.  After passing a test with two left turns, there was a period of refactoring to introduce the notion of working through program instructions (which are just primitive strings at this point).  The remaining lefts were then trivial to add, and rights were painless because they are simply a reflection of the lefts.

It happened that the handling of rights and lefts suggested a similar pattern for handling moves in X and Y, so those four tests went fine, with only one bobble where I had inadvertently transposed X and Y.  And then TA-DA: put in all of the acceptance tests that ignore the question of obstacles, and they are all passing.
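
To give a flavor of the shape - this is an after-the-fact sketch rather than the code from the session, and the single-letter instructions and 10x10 grid are guesses:

LEFT = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {after: before for before, after in LEFT.items()}   # rights mirror lefts
MOVE = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def report(x, y, facing):
    # Reporting was separated from calculation right at the start.
    return "{0}:{1}:{2}".format(x, y, facing)

def run(program, x=0, y=0, facing="N", width=10, height=10):
    for instruction in program:
        if instruction == "L":
            facing = LEFT[facing]
        elif instruction == "R":
            facing = RIGHT[facing]
        elif instruction == "M":
            dx, dy = MOVE[facing]
            # Wrap at the edges of the finite grid.
            x, y = (x + dx) % width, (y + dy) % height
    return report(x, y, facing)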

At this point, what do things look like?  I've got about 20 "high level" tests and a representation of the solution that the machine understands, but no particular fondness for the human properties of the solution - the design is easy to work in (primarily, I suspect, because it is still familiar), but it doesn't describe the domain very well, or support other alternative interfaces or problems.  In short, it's a function monolith.

I had expected the obstacle to add a challenge; but a quirk in the Java language actually made the API change I needed trivial; with that, adding a bit more to the monolith solved the whole mess.

In short, the code I've got most resembles the Gilded Rose; not quite such a rats' nest, but that's more a reflection of the fact that the rules I'm working in are a lot more regular.

The good news is that I have test coverage; so I've no concerns about making changes to the implementation.

The disappointing news - the modules that I want to have aren't really teased out.  Borrowing from the language of Parnas, the tests as written span quite a few decisions that might change later.  As a consequence, the tests as written are brittle.

For instance, we might imagine that there had been a misunderstanding of the output message format; our code reports x:y:orientation, but the requirement is orientation:x:y.  This isn't a difficult fix to make -- in my implementation, this is one line of code to change.  But the tests as written are all tightly coupled to the wrong output representation, so those tests are all going to require some modification to reflect the changed requirements.
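
In terms of that sketch (again, not the session code itself), the fix lives entirely in the report function:

def report(x, y, facing):
    # Orientation first now, then the coordinates.
    return "{0}:{1}:{2}".format(facing, x, y)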

This can certainly be fixed - the scope of work is small, we've caught the problem in time, and so on.

But what I do want to emphasize is that this is, in a sense, extra work.  The design that I want didn't organically appear during the refactoring steps.

Why is that?  My thinking is that I was "optimizing" my design for wall clock time of getting all of the constraints in place; having the constraints means that other design work is safe.  But if I had caught the design error sooner, I would have taken slightly longer to finish implementing the constraints, but would also have less work to do to be "finished", in the sense of having good bulkheads to guard against future change.

Short exercises don't, in my experience, express changing requirements very well.  Can we decouple the logic from the input representation? from the output representation? Can we decouple the algorithm from the data representations?  Is it even clear where these boundaries are, so that we can approach the problem 6 months later?

And most importantly - do you effortlessly produce these boundaries on your first pass through the code?

Saturday, April 11, 2020

Testing in the Time of Time

Programmer tests - meaning those tests that we run between edits to estimate how many new mistakes we have added to the code - run eye blink fast.

Which means that the lifetime of an entity within our test scenario is an instant.

But the scenario that motivates our simulation often has measurable duration.

If this is to be reflected in our design -- if the test is to communicate this duration to its future maintainer -- then the design of the test needs to imply measurable time.
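
One way to do that is to hand the test subject a controllable clock, and visibly advance it in the middle of the scenario; FakeClock, Subscription, and the method names below are invented purely for illustration.

class FakeClock:
    def __init__(self, now=0):
        self.now = now

    def advance(self, seconds):
        self.now += seconds


def test_subscription_lapses_after_thirty_days(self):
    clock = FakeClock()
    subscription = Subscription(clock, lifetime_in_seconds=30 * 24 * 3600)

    # The duration of the scenario is stated right here in the test.
    clock.advance(31 * 24 * 3600)

    self.assertFalse(subscription.is_active())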

Thursday, April 9, 2020

TDD: Roman Bowling Discussion

I discovered the Roman Bowling concept because I was looking for a familiar example to draw upon when discussing classifications of tests.

Let's start with a trivial test - we want to demonstrate that our code produces the correct output when the input is a representation of a perfect game. Many TDD exercises like to start with a trivial example, but my preference is to skip ahead to something with a bit of meat on it. Certain kinds of mistakes are easier to detect when working with real data.

def test_perfect_game(self):
    from roman import bowling

    pinfalls = ["X", "X", "X", "X", "X", "X", "X", "X", "X", "X",
                "X", "X"]
    score = bowling.score(pinfalls)
    self.assertEqual("CCC", score)

There's nothing here that's particularly beautiful, or enduring. We have a straightforward representation of the behavior that we want, written at a scaffolding level of quality. It's enough to get started, and certainly enough for this demonstration; the machine understands exactly the check we want to make, but a human being coming into this code cold might want a more informative prose style.

Is this a "unit test"?

The motivation for this exercise is to give some extra thought to what that question means, and which parts of that meaning are actually important.

To get to green from red, we can lean on the simplest thing - simply hard coding the desired output.

def score(pinfalls):
    return "CCC"

This arrangement has a lot of the properties we want in a test to be run after each refactoring: the test shares no mutable state with any other code that might be executing, and it has no I/O or network dependencies. The test is eye blink quick to run, and the behavior is stable - if the test reports a problem, we really have introduced a mistake in the code somewhere.

Let's try a quick bit of refactoring: we're going to change this implementation so that it is just slightly better aligned with our domain model. It's going to be a bit more convenient to work internally with integer values, and treat the romanization of that value as a presentation problem. So what I want to do here is introduce a 300, somehow, and then use that 300 to craft the desired representation.

def score(pinfalls):
    return "".join(["C" for _ in range(300 // 100)])
I find the syntax for joins in python a bit clumsy (I'm not fluent, and certainly don't think in the language), but I don't believe that the code written here would startle a maintainer with working knowledge of the language. Executing the test shows that the machine still understands.

Is it a "unit test"?

Now I want to perform a simple "extract method" refactoring. It may be obvious what's coming, but for this demonstration let's take exaggeratedly small steps.

def score(pinfalls):
    def to_roman(score):
        return "".join(["C" for _ in range(score // 100)])

    return to_roman(300)

Is it a "unit test"?

The implementation of to_roman hasn't become noticeably better; there's a lot more cleanup to be done. But having created the function, I can now start exploring the concept of distance.

def to_roman(score):
    return "".join(["C" for _ in range(score // 100)])

def score(pinfalls):
    return to_roman(300)

Is it a "unit test"?

from previous.exercise import to_roman

def score(pinfalls):
    return to_roman(300)

Is it a "unit test"?

In the first refactoring, the to_roman method definition was nested. Does it matter to us if we nest the import?

def score(pinfalls):
    from previous.exercise import to_roman

    return to_roman(300)

Is it a "unit test"?

Is there any point in this sequence of refactorings where you find yourself thinking "gee, I'm not really comfortable with this; I need to introduce a test double"?

My answers, today? I don't care if it is a unit test or not. I care whether it is still suitable for running between edits, and it still is. The test is still fast and isolated. Sure, roman.bowling.score depends on previous.exercise.to_roman, but to_roman has stable requirements and lots of tests; it's not a significantly riskier dependency than the python library itself.

Introducing a test double to "mitigate the risks" of an external dependency is a lousy trade when the risk of the external dependency is already low.

If we are concerned about having a suite of tests that help us to locate bugs, an interesting alternative to using a test double is to write a test that uses a more complicated description of the behavior. In this case, for instance, we might write

def test_perfect_game(self):
    from roman import bowling

    pinfalls = ["X", "X", "X", "X", "X", "X", "X", "X", "X", "X",
                "X", "X"]
    score = bowling.score(pinfalls)
    from previous.exercise import to_roman
    self.assertEqual(to_roman(300), score)

Is it a "unit test"?

I think of this pattern as "behaves like". We aren't saying that roman.bowling.score calls previous.exercise.to_roman with a specific input; that's coupling to an implementation detail that we might reasonably want to change later. Instead, we are adopting a looser constraint that the code should produce the same behavior. The test and the code might both depend on the same external dependency, or not.

Taken to extremes, you end up with something like an oracle - a complete reference implementation in the test, and an assertion that the test subject behaves the same way for all inputs. But be careful: somewhere in the mix you want at least one test to ensure that you haven't inadvertently introduced the same bug into both implementations.
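
A sketch of that extreme, where reference_score and interesting_games stand in for the oracle and for some source of inputs:

def test_behaves_like_the_oracle(self):
    from roman import bowling
    from previous.exercise import reference_score

    # interesting_games() is a hypothetical source of input programs.
    for pinfalls in interesting_games():
        self.assertEqual(reference_score(pinfalls), bowling.score(pinfalls))

def test_perfect_game_is_CCC(self):
    from roman import bowling

    # One literal check, so that a bug shared by both implementations
    # cannot hide behind the oracle comparison.
    self.assertEqual("CCC", bowling.score(["X"] * 12))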

Wednesday, April 8, 2020

TDD: Roman Bowling Challenge

An adaptation of familiar coding problems.

Martin and Koss introduced bowling scoring as a demonstration piece for how tests could fit into the programming process. In the distilled version, which helps to limit the required time, we focus our attention on computing the score of the game after it has been completed.

In May of 2001, Kent Beck and Alan Francis demonstrated the Roman numerals conversion exercise to a live audience.

So...

Let's design a program that computes the final score of bowling games, where the inputs (pin falls) and the output (final score) are expressed using Roman numerals.

Discussion

Notice the character of the tests that you write. Are they unit tests? acceptance tests? something else?

Are you working top down? bottom up? something else?

Bonus Challenge

Bowling is much more exciting in the original Klingon. Can you extend your solution quickly enough to capture this vibrant new market?

Discussion....