Tuesday, December 8, 2020

Jason Gorman: Outside In

Jason Gorman recently published a short video where he discusses outside in TDD. It's good, you should watch it.

That said, I do not like everything about it.

So I want to start by focusing on a few things that I did like:

First, it is a very clean presentation.  He had a clear vision of the story he wanted to tell about his movement through the design, and his choice to mirror that movement helps to reinforce the lesson he is sharing.

Second, I really like his emphasis on the consistency within an abstraction; that everything within a module is at the same abstraction level.

Third, I really liked the additional emphasis he placed on the composition root.  It's an honest presentation of some of the trade-offs that we face.  I myself tend to eschew "unit tests" in part because I don't like the number of additional edges they add to the graph - here, he doesn't sweep those edges under the rug; they are right there where you can see them.


I learned TDD during an early part of the Look ma, no hands! era, which has left me somewhat prone to question whether the ritual is providing the value, or if instead it is the prior experience of the operator that should get the credit for the designs we see on the page. 

Let us imagine, for a moment, that we were to implement the first test that he shows, and furthermore that we were going to write that code "assert first".  Our initial attempt might look like the implementation below.
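Something like this, perhaps (a sketch; the exact spellings of the names don't matter):

import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

public class StockAlertTest {
    @Test
    public void alertIsSentWhenStockFallsBelowReorderLevel() {
        // assert first: neither alert nor product exists yet
        verify(alert).send(product);
    }
}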

Of course, this code does not compile.  Let's try "just enough code to compile".  So we need a Product with a constructor, and an Alert interface with a send method, and then to create a product instance and an alert mock.  In the interest of a clean presentation, I'll take advantage of the fact that Java allows me to have more than one implementation within the same source file.
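Continuing the sketch (the details here are mine, not a transcription of Gorman's code):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

interface Alert {
    void send(Product product);
}

class Product {
    private final String name;
    private final int reorderLevel;

    Product(String name, int reorderLevel) {
        this.name = name;
        this.reorderLevel = reorderLevel;
    }
}

public class StockAlertTest {
    private final Alert alert = mock(Alert.class);
    private final Product product = new Product("Widget", 10);

    @Test
    public void alertIsSentWhenStockFallsBelowReorderLevel() {
        verify(alert).send(product);
    }
}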

What more do we need to make this test pass? Not very much:
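An "act" step in the test, plus one small class to act on (again, the names are illustrative):

class StockMonitor {
    private final Alert alert;

    StockMonitor(Alert alert) {
        this.alert = alert;
    }

    void stockChanged(Product product, int unitsInStock) {
        // no reorder-level logic yet; just enough behavior to satisfy the verify
        alert.send(product);
    }
}

and in the test itself:

    @Test
    public void alertIsSentWhenStockFallsBelowReorderLevel() {
        new StockMonitor(alert).stockChanged(product, 9);

        verify(alert).send(product);
    }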

This is not, by any means, completed code - there is certainly refactoring to be done, and duplication to remove, and all the good stuff.

The text of Jason Gorman's test, and his presentation of it, lead me to believe that the test was not supporting refactoring.  What I believe happened instead is that Gorman had a modular design in his mind, and he typed it in.  The behavior of ReorderLevel, in particular, has no motivation at all in the statement of the problem - it is an element introduced in the test because we've already decided that there should be a boundary there.

This isn't a criticism of the design, but rather of the notion that TDD can lead to modular code.  I'm not seeing leading here, but something more akin to the fact that TDD happened to be nearby while outside-in modular design was happening.

The assignment of credit is completely suspect.

The second thing that caught my eye is expressed within that assertion itself.  The presentation at the end of the exercise shows us tests, and code coverage, and contract tests, all great stuff... but after all that is done, we still ended up with a "data class" that doesn't implement its own version of `equals`.  The Mockito verification shown in the first test is an identity check - the test asks: did this specific instance move through the code from where we put it in to where we took it out?

Product, here, is a data structure that holds information that was copied over the web.  The notion that we need a specific instance of it is suspect.

Is it an oversight?  I've been thinking on that, and my current position is that it is indistinguishable from an oversight.  We're looking at a snapshot of a toy project, so it's very hard to know from the outside whether or not `Product` should override the behavior of `Object.equals` -- but there's no obvious point in the demonstration that you can point to and say "if that were a mistake, we would catch it here".

In other words: what kinds of design mistakes are the mistake detectors detecting?  Do we actually make those mistakes?  How long must we wait until the savings from detecting the mistakes offsets the cost of implementing and maintaining the mistake detectors?

Myself, I like designing from the outside in.  I have a lot of respect for endotesting. But at the end of the day, I find that the demonstrated technique doesn't lead to a modular design, but rather locks in a specific modular design.  And that makes me question whether the benefits of locking in offset the costs of maintaining the tests.

Thursday, November 26, 2020

TDD: Controlled Data

I have been paying more attention to parsing of late, and that in turn has led me to some new understandings of TDD.

Back in the day when I was first learning about RED/GREEN/REFACTOR, the common approach to attacking a problem was to think about what the design might look like at the end, choose a candidate class from that design, and see what could be done to get it working.

Frequently, people would begin from a leaf class, expecting to build their way up, and that seemed wrong to me.  The practical flaw was the amount of work they would need to do before they could finally point to something that delivered value.  The theoretical flaw was this: I already knew how to guess what the design should be.  What I wanted, and what I thought I was being offered, was emergent design -- or failing that, experiments that would honestly test the proposition that a sound design could emerge from tests.

The approach I preferred would start from the boundary; let's create a sample client, and using that requirement begin enumerating the different behaviors we expect, and discover all of the richness underneath by rearranging the implementation until it is well aligned with our design sense.

Of course, we're still starting with a guess, but it's a small guess -- we know that if our code is going to be useful there must be a way to talk to it.  Bootstrapping can be a challenge -- what does the interface to our code look like?

And in group exercises, I've had a fair amount of success with this simple heuristic: choose something that's easy to type.

Let's take a look at the bowling game, in the original Klingon.  These days, I most commonly see the bowling game exercise presented as a series of drills (practice these tests and refactorings until it becomes second nature); but in an early incarnation it was a re-enactment of a pair programming episode.

Reviewing their description, two things stand out.  First, that they had initially guessed at an interface with extra data types, but rejected it when they realized the code that they "wanted to type" would not compile.  And second, that they deferred the question of what the implementation should do with an input that doesn't belong to the domain:

tests in the rest of the system will catch an invalid argument.

I want to take a moment to frame that idea using a different language.  Bertrand Meyer, in Object-oriented Software Construction, begins with a discussion of Correctness and Robustness.  What I see here is that Koss and Martin chose to defer Robustness concerns in order to concentrate on Correctness.  For the implementation that they are currently writing, properly sanitized inputs are a pre-condition imposed on the client code.

But when you are out near an unsanitized boundary, you aren't guaranteed that all inputs can be mapped to sanitized inputs.  If your solution is going to be robust, then you are going to need graceful affordances for handling abnormal inputs.

For an implementation that can assume sanitized inputs, you can measure the code under test with a relatively straightforward client.
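Something in the spirit of the Koss and Martin version, as I remember its shape (a paraphrase of their Game API, not a quotation of it):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class BowlingGameTest {
    @Test
    public void aPerfectGameScoresThreeHundred() {
        Game game = new Game();
        for (int i = 0; i < 12; i++) {
            game.add(10);  // twelve throws, each knocking down all ten pins
        }

        assertEquals(300, game.score());
    }
}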


But near the boundary?
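Perhaps something like this (my sketch; assertThrows is the JUnit 5 spelling):

import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

public class BowlingGameTest {
    @Test
    public void aThrowThatKnocksDownElevenPinsIsRejected() {
        Game game = new Game();

        assertThrows(IllegalArgumentException.class, () -> game.add(11));
    }
}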


I don't find that this produces a satisfactory usage example.  Even if we were to accept that throwing an unchecked exception is a satisfactory response to an invalid input, this example doesn't demonstrate graceful handling of the thrown exception.

Let's look at this first example again.  What is the example really showing us?

I would argue that what this first example is showing us is that we can create a bowling score from a sanitized sequence of inputs.  It's a recipe that requires a single external ingredient.

Can we do the same with our unsanitized inputs? Yes, if we allow ourselves to be a little bit flexible in the face of ambiguity.

A parser is just a function that consumes less-structured input and produces more-structured output. -- Alexis King

I want to offer a definition that is consistent in spirit, but with a slightly different spelling: a parser is just a function that consumes less-structured input and produces ambiguous more-structured output.

When we parse our unsanitized sequence of throws, what gets produced is either a sanitized sequence of throws or an abnormal sequence of throws, but without extra constraints on the input sequence we don't necessarily know which.

In effect, we have a container with ambiguous, but sanitized, contents.

That ambiguity is part of the essential complexity of the problem when we have to include sanitizing our inputs.  And therefore it follows that we should be expressing that idea explicitly in our design.

We don't have to guess all of the complexity at once, because we can start out limiting our focus to those controlled inputs that should always produce a bowling score; which means that all of the other cases that we haven't considered yet can be lumped into an "unknown" category -- that's going to be safe, because a correct implementation must not use the unknown code path when provided with pre-sanitized inputs.

When we later replace the two-alternative parser with one that produces more alternatives -- that's just more dead branches of code that can again be expressed as unknown.

In the simple version of the bowling game exercise, we need three ingredients:

  • our unsanitized input
  • a transform to use when we terminate the example with a valid score
  • a transform to use when we terminate the example with abnormal inputs.

So we can immediately reach for whatever our favorite pattern for transforming different results might be.

Following these ideas, you can spin up something like this on your first pass:
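Here is a sketch of one possible shape -- the names are invented for this write-up; the important part is the pair of transforms:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Collections;
import java.util.List;
import java.util.function.Function;

import org.junit.jupiter.api.Test;

interface BowlingScores {
    // consumes the unsanitized throws, and announces the result through
    // exactly one of the two transforms supplied by the caller
    <T> T score(List<String> unsanitizedThrows,
                Function<Integer, T> onValidScore,
                Function<List<String>, T> onUnknown);
}

public class BowlingScoresTest {
    private final BowlingScores bowling = null;  // no implementation yet: this is the RED we start from

    @Test
    public void aPerfectGameScoresThreeHundred() {
        List<String> pinfalls = Collections.nCopies(12, "X");

        int score = bowling.score(pinfalls, valid -> valid, unknown -> -1);

        assertEquals(300, score);
    }
}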


And now that everything is compiling, you can dig into the RED/GREEN/REFACTOR cycle and start exploring possible designs.

Now, I've palmed a card here, and I want to show it because it is all part of the same lesson.  The interface provided here embraces the complexity of unsanitized data, but it drops the ball on timing -- I built into this interface an assumption that the unsanitized data all arrives at the same time.  If we are designing for a context where unsanitized throws arrive one at a time, then our examples may need to show how we explicitly handle memory; and we may have to make good guesses about whether we need to announce the abnormal input at once, or if we can let it wait until the score method is invoked.

The good news: often we have oversimplified because our attention was on the most common case; so in the case that we discover our errors late, there can be a lot of re-use in our implementation.

The bad news: if our clients are oversimplified, then the implementations that are using our clients are going to need rewrites.

If we're aware, we can still tackle the problem in slices - first releasing a client that delivers business value in the simplest possible context, and then expanding our market share with new clients that deliver that value to other contexts.

Friday, May 8, 2020

HTML5 forms with base.href

I'm working on a REST API, and to keep things simple for myself I'm using text/html for my primary media type.  The big advantage is that I can use a general purpose web browser to test my API by hand.

In this particular case, the client communicates with the API via a series of web form submissions.  That allows my server to control which cached representations will be invalidated when the form is submitted.

As an experiment, instead of copying the target-uri into the form actions, I decided to try using a BASE element, with no action at all, expecting that the form would be submitted to the base href.

But what the HTML 5 specification says is:

If action is the empty string, let action be the URL of the form document.

So that doesn't work.  Via Stack Overflow, I discovered a candidate workaround - the target of the form can be a fragment; the fragment delimiter itself means the form action is not empty, and therefore relative lookup via the base element is enabled.

Somewhat worrisome: I haven't managed to find clear evidence that form behavior with fragments is well defined.  It occurred to me that perhaps  the standard was playing clever with URL vs URI, but I abandoned that idea after clicking a few more links to review the examples of URLs, and discovered fragments there.

I suspect that I need to let go of this idea - it's not that much extra work to copy the base uri into the forms that want it, and I've got a much clearer picture of what happens in that case.

Saturday, April 18, 2020

TDD Test Bulk Heads

I wanted to illustrate some of the ideas in my previous essay with some code examples, in an attempt to spell out more clearly what I think can, and perhaps even should, be going on.

So let's imagine that we are working on a Mars Rover exercise, and we want to start from a trivial behavior: if we don't send any instructions to the rover, then it will report back its initial position.

Such a use case could be spelled a lot of different ways.  But if we are free of the constraints of legacy code, then we might reasonably begin with a primitive-obsessed representation of this trivial behavior; the simple reason being that such a representation is easy to type.
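(The code in this post is a sketch of the progression being described; the names and signatures are illustrative, nothing more.)

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class RoverTest {
    @Test
    public void anEmptyProgramReportsTheInitialPosition() {
        String report = new Rover().run("0 0 N", "");

        assertEquals("0 0 N", report);
    }
}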

The fastest way I can see to get from RED to GREEN is to simply wire in the correct response.
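In other words, hard code it:

public class Rover {
    public String run(String initialPosition, String program) {
        return "0 0 N";
    }
}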

At this point in most TDD demonstrations, a second test will be introduced, so that one can begin triangulating on a solution. Alternatively, one could begin removing duplication right away.

For my purposes here, it is convenient to attack this problem in a very specific way - which is to recognize that the representation of the answer is a separate concern from the calculation of the answer. I assert that this sort of "extract method" refactoring should occur to the programmer at some point during the exercise.

I'm going to accelerate that refactoring to appear at the beginning of the exercise, before the function becomes too cluttered. The motivation here is to keep the example clear of distractions; please forgive that I am willing to give up some verisimilitude to achieve that.

That means that we can produce an equivalent behavior (and therefore pass the test), with this implementation:
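Keeping to the sketch:

public class Rover {
    public String run(String initialPosition, String program) {
        return extractedMethod(0, 0, "N");
    }

    static String extractedMethod(int x, int y, String orientation) {
        // presentation: how a position is spelled when we report it
        return x + " " + y + " " + orientation;
    }
}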

The naming is dreadful, of course, but we've taken a step forward in separating calculation from presentation.

If we happen to notice that we've committed this refactoring, then an interesting possibility appears; we can decompose our existing assertion into two distinct pieces.
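Something along these lines:

@Test
public void anEmptyProgramReportsTheInitialPosition() {
    String report = new Rover().run("0 0 N", "");

    // piece one: the formatting logic produces the expected representation
    assertEquals("0 0 N", Rover.extractedMethod(0, 0, "N"));
    // piece two: running the empty program produces that same representation
    assertEquals(Rover.extractedMethod(0, 0, "N"), report);
}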

This version, with the double assertion, doesn't improve our real circumstances at all - but it begins to pave the way to a third version, which does have some interest:
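Splitting the two pieces into two tests (again, a sketch):

@Test
public void theReportDescribesTheInitialPosition() {
    String report = new Rover().run("0 0 N", "");

    assertEquals(Rover.extractedMethod(0, 0, "N"), report);
}

@Test
public void positionsAreReportedAsSpaceSeparatedText() {
    assertEquals("0 0 N", Rover.extractedMethod(0, 0, "N"));
}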

With this change, we're constraining our behavior exactly as before... except that the new arrangement is somewhat more resilient when the required behaviors change in the future. For example, if we needed to replace this somewhat ad hoc string representation with a JSON document, only one of the two tests -- that test which constrains the behavior of the formatting logic -- fails and needs to be rewritten/retired.

I tend to use the pronunciation "behaves like" when considering tests of this form; the test doesn't assert that rover.run calls extractedMethod, only that it returns the same thing you would get by calling that method with the assigned arguments. For instance, if we were to inline that particular call (leaving the extracted method in place), the tests would still correctly pass.

That sounds pretty good. Where's the catch?

As best I can tell, the main catch is that we have to be confident that the signature of the new method is stable. By expressing the intended behavior in this way, we're increasing the coupling between the tests and the implementation, making that part of the implementation harder to change than if it were a hidden detail.

TDD: Return to Mars

The Boston Software Crafters meetup this week featured a variant on the Mars Rover exercise.  I decided to take another swing at it on my own; Paul Reilly had introduced in the problem a clean notation for obstacles, and I wanted to try my hand at introducing that innovation at a late stage.

As is common in "Classic TDD", all of the usual I/O and network concerns are abstracted away in the pre-game, leaving only raw calculation and logic; given this input, produce the corresponding output.

My preferred approach is to work with tests that are representations of the complete behavior, and then introduce implementation elements as I need them.  Another way of saying the same thing: rather than trying to guess where my "units" should be right out of the gate, I guess only at a single candidate interface that allows me to describe the problem in a way that the machine can understand, and then flesh out further details as they become clear.

So for this exercise session, I began by creating a test checklist.  The "acceptance" tests provided by Paul went to the top of the list; these help to drive questions like "what should the API look like".  Rather than diving directly into trying to solve those cases, I took the time to add to the checklist a bunch of simple behaviors to write - what should the solution output when the input is an empty program? or if the program contains only turn commands? or if the program describes an orbit around the origin (which, in this exercise, meant worming into each corner of the finite grid).

I came up with the empty program and the turns right away; the orbit of the origin occurred to me later when I was thinking about how to properly test the relation between move and turn.  If the problem had included cases of rovers starting away from the origin, I might not have addressed the coordinate wrapping effects quite so soon.

Along the way, one "extra" acceptance test occurred to me.

The empty program obviously had the simplest behavior, and the reporting format was separated from the calculation right away.  After passing a test with two left turns, there was a period of refactoring to introduce the notion of working through program instructions (which are just primitive strings at this point), the remaining lefts were trivial to add, and rights were painless because they are simply a reflection of the lefts.

It happened that the handling of rights and lefts suggested a similar pattern for handling moves in X and Y, so those four tests were fine with only one bobble where I had inadvertently transposed X and Y.  And then TA-DA; put in all of the acceptance tests that ignore the question of obstacles, and they are all passing.

At this point, what do things look like?  I've about 20 "high level" tests, a representation of the solution that the machine understands, but no particular fondness for the human properties of the solution - the design is easy to work in (primarily, I suspect, because it is still familiar), but it doesn't describe the domain very well, or support other alternative interfaces or problems.  In short, it's a function monolith.

I had expected the obstacle to add a challenge; but a quirk in the Java language actually made the API change I needed trivial; with that, adding a bit more to the monolith solved the whole mess.

In short, the code I've got most resembles the Gilded Rose; not quite such a rats' nest, but that's more a reflection of the fact that the rules I'm working in are a lot more regular.

The good news is that I have test coverage; so I've no concerns about making changes to the implementation.

The disappointing news - the modules that I want to have aren't really teased out.  Borrowing from the language of Parnas, the tests as written span quite a few decisions that might change later.  As a consequence, the tests as written are brittle.

For instance, we might imagine that there had been a misunderstanding of the output message format; our code reports x:y:orientation, but the requirement is orientation:x:y.  This isn't a difficult fix to make -- in my implementation, this is one line of code to change.  But the tests as written are all tightly coupled to the wrong output representation, so those tests are all going to require some modification to reflect the changed requirements.

This can certainly be fixed - the scope of work is small, we've caught the problem in time, and so on.

But what I do want to emphasize is that this is, in a sense, extra work.  The design that I want didn't organically appear during the refactoring steps.

Why is that?  My thinking is that I was "optimizing" my design for wall clock time of getting all of the constraints in place; having the constraints means that other design work is safe.  But if I had caught the design error sooner, I would have taken slightly longer to finish implementing the constraints, but would also have less work to do to be "finished", in the sense of having good bulkheads to guard against future change.

Short exercises don't, in my experience, express changing requirements very well.  Can we decouple the logic from the input representation? from the output representation? Can we decouple the algorithm from the data representations?  Is it even clear where these boundaries are, so that we can approach the problem 6 months later?

And most importantly - do you effortlessly produce these boundaries on your first pass through the code?

Saturday, April 11, 2020

Testing in the Time of Time

Programmer tests - meaning those tests that we run between edits to estimate how many new mistakes we have added to the code - run eye blink fast.

Which means that the lifetime of an entity within our test scenario is an instant.

But the scenario that motivates our simulation often has measurable duration.

If this is to be reflected in our design; if the test is to communicate this duration to its future maintainer, then the design of the test needs to imply measurable time.
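One way to spell that idea, as a sketch (every name here is invented for the illustration): make time an explicit value, so that the duration of the scenario is visible in the test itself.

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.time.Duration;
import java.time.Instant;

import org.junit.jupiter.api.Test;

class Session {
    private final Instant lastActivity;

    Session(Instant lastActivity) {
        this.lastActivity = lastActivity;
    }

    boolean isExpiredAt(Instant now) {
        return Duration.between(lastActivity, now).compareTo(Duration.ofHours(1)) > 0;
    }
}

public class SessionTest {
    @Test
    public void aSessionExpiresAfterAnHourOfInactivity() {
        Instant login = Instant.parse("2020-04-11T09:00:00Z");

        Session session = new Session(login);

        // the test runs in an eye blink, but the scenario spans two hours
        assertTrue(session.isExpiredAt(login.plus(Duration.ofHours(2))));
    }
}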

Thursday, April 9, 2020

TDD: Roman Bowling Discussion

I discovered the Roman Bowling concept because I was looking for a familiar example to draw upon when discussing classifications of tests.

Let's start with a trivial test - we want to demonstrate that our code produces the correct output when the input is a representation of a perfect game. Many TDD exercises like to start with a trivial example, but my preference is to skip ahead to something with a bit of meat on it. Certain kinds of mistakes are easier to detect when working with real data.

def test_perfect_game(self):
    from roman import bowling

    pinfalls = ["X", "X", "X", "X", "X", "X", "X", "X", "X", "X",
                "X", "X"]
    score = bowling.score(pinfalls)
    self.assertEqual("CCC", score)

There's nothing here that's particularly beautiful, or enduring. We have a straightforward representation of the behavior that we want, written at a scaffolding level of quality. It's enough to get started, and certainly enough for this demonstration; the machine understands exactly the check we want to make, but a human being coming into this code cold might want a more informative prose style.

Is this a "unit test"?

The motivation for this exercise is to give some extra thought to what that question means, and which parts of that meaning are actually important.

To get to green from red, we can lean on the simplest thing - simply hard coding the desired output.

def score(pinfalls):
    return "CCC"

This arrangement has a lot of the properties we want of a test to be run after each refactoring: the test shares no mutable state with any other code that might be executing, it has no I/O or network dependencies. The test is eye blink quick to run, and the behavior is stable - if the test reports a problem, we really have introduced a mistake in the code somewhere.

Let's try a quick bit of refactoring; we're going to change this implementation here so that it is just slightly better aligned with our domain model. It's going to be a bit more convenient to work internally with integer values, and treat the romanization of that value as a presentation problem. So what I want to do here is introduce a 300, somehow, and then use that 300 to craft the desired representation.

def score(pinfalls):
    return "".join(["C" for _ in range(300 // 100)])

I find the syntax for joins in python a bit clumsy (I'm not fluent, and certainly don't think in the language), but I don't believe that the code written here would startle a maintainer with working knowledge of the language. Executing the test shows that the machine still understands.

Is it a "unit test"?

Now I want to perform a simple "extract method" refactoring. It may be obvious what's coming, but for this demonstration let's take exaggeratedly small steps.

def score(pinfalls):
    def to_roman(score):
        return "".join(["C" for _ in range(score // 100)])

    return to_roman(300)

Is it a "unit test"?

The implementation of to_roman hasn't become noticeably better; there's a lot more cleanup to be done. But having created the function, I can now start exploring the concept of distance.

def to_roman(score):
    return "".join(["C" for _ in range(score // 100)])

def score(pinfalls):
    return to_roman(300)

Is it a "unit test"?

from previous.exercise import to_roman

def score(pinfalls):
    return to_roman(300)

Is it a "unit test"?

In the first refactoring, the to_roman method definition was nested. Does it matter to us if we nest the import?

def score(pinfalls):
    from previous.exercise import to_roman

    return to_roman(300)

Is it a "unit test"?

Is there any point in this sequence of refactorings where you find yourself thinking "gee, I'm not really comfortable with this; I need to introduce a test double?"

My answers, today? I don't care if it is a unit test or not. I care whether it is still suitable for running between edits, and it still is. The test is still fast and isolated. Sure, roman.bowling.score depends on previous.exercise.to_roman, but to_roman has stable requirements and lots of tests; it's not a significantly riskier dependency than the python library itself.

Introducing a test double to "mitigate the risks" of an external dependency is a lousy trade when the risk of the external dependency is already low.

If we are concerned about having a suite of tests that help us to locate bugs, an interesting alternative to using a test double is to write a test that uses a more complicated description of the behavior. In this case, for instance, we might write

def test_perfect_game(self):
    from roman import bowling

    pinfalls = ["X", "X", "X", "X", "X", "X", "X", "X", "X", "X",
                "X", "X"]
    score = bowling.score(pinfalls)
    from previous.exercise import to_roman
    self.assertEqual(to_roman(300), score)

Is it a "unit test"?

I think of this pattern as "behaves like". We aren't saying that roman.bowling.score calls previous.exercise.to_roman with a specific input; that's coupling to an implementation detail that we might reasonably want to change later. Instead, we are adopting a looser constraint that the code should produce the same behavior. The test and the code might both depend on the same external dependency, or not.

Taken to extremes, you end up with something like an oracle - a complete reference implementation in the test, and an assertion that the test subject behaves the same way for all inputs. But be careful: somewhere in the mix you want at least one test to ensure that you haven't inadvertently introduced the same bug into both implementations.

Wednesday, April 8, 2020

TDD: Roman Bowling Challenge

An adaptation of familiar coding problems.

Martin and Koss introduced bowling scoring as a demonstration piece for how tests could fit into the programming process. In the distilled version, which helps to limit the required time, we focus our attention on computing the score of the game after it has been completed.

In May of 2001, Kent Beck and Alan Francis demonstrated the Roman numerals conversion exercise to a live audience.

So...

Let's design a program that computes the final score of bowling games, where the inputs (pin falls) and the output (final score) are expressed using Roman numerals.

Discussion

Notice the character of the tests that you write. Are they unit tests? acceptance tests? something else?

Are you working top down? bottom up? something else?

Bonus Challenge

Bowling is much more exciting in the original Klingon. Can you extend your solution quickly enough to capture this vibrant new market?

Discussion....

Tuesday, March 31, 2020

Lessons from a skeleton

I've been working on a vanity project recently; as part of the exercise, I decided to try to build out a walking skeleton. The basic idea behind a walking skeleton is to front load a bunch of plumbing concerns. The Ur skeleton is probably Hello World - in effect a warm up exercise to ensure that we can compile a program, link with our I/O library, and run the result. In 2020, we normally add into this a CI or CD pipeline, source control, and so on. In my case, I'm working toward a REST API with a serverless reference implementation, so the walking skeleton includes configuring a number of AWS offerings.

First lesson: I targeted a transient deliverable. I had considered obeying the testing goat and writing a Django implementation, then doing a lift and shift. The problem is that we would then be front loading the wrong pain -- I don't care, in the long term, about the leaky abstractions I'll encounter in Django, because that is just a temporary structure. The lessons of Django are unlikely to transfer to the intended staging, so I might as well skip that pain and move more directly toward my destination.

Second lesson: in so far as is possible, work inside out.

Everything should be built top-down, except the first time.

Walking skeleton is very much a "the first time" situation; there's a lot of comfort to be had in introducing a single new element to something that has been proven to be sound. Working outward tends to concentrate the uncertainty cloud in the new elements, because the lower level traps have already been discovered.

Third lesson: manual first. We're not done until we've achieved our goal of a one click deploy. BUT working through the skeleton manually first helps to clarify the automation requirements. This is a form of separation of concerns; we start out with two problems -- one is to map our own custom design onto the general purpose model of our supplier, the other is to automate the steps of the general purpose model. Keeping these two ideas separate allows us to bring our concentration to bear on a single concern.