Friday, December 6, 2019

Banking: Skip to the Owl

I've been looking at the Codurance Banking exercise this evening:

Given a client makes a deposit of 1000 on 10-01-2012
And a deposit of 2000 on 13-01-2012
And a withdrawal of 500 on 14-01-2012
When they print their bank statement
Then they would see:

Date       || Amount || Balance
14/01/2012 || -500   || 2500
13/01/2012 || 2000   || 3000
10/01/2012 || 1000   || 1000

which is introduced with an additional interface constraint:

public interface AccountService {
    void deposit(int amount);
    void withdraw(int amount);
    void printStatement();
}

Here are the things that I see when I look at this problem.

First, I notice that we have dates. That implies that somewhere I'm going to need a calendar or a clock, because my interface doesn't have a way to provide a date. Similarly, I'm going to need some sort of output device that accepts the statement. I'm going to need some way to capture the dates and amounts so that they are available to be printed - a storage data structure of some form. And I'm going to need some functional compute to actually do the work.

Let's draw the rest of the owl!

class TheRestOfTheOwl implements AccountService {
    interface StatelessCore {
        void deposit(int amount, Clock clock, Storage storage);
        void withdraw(int amount, Clock clock, Storage storage);
        void printStatement(Storage storage, Console console);
    }

    private final Clock clock;
    private final Storage storage;
    private final Console console;
    private final StatelessCore core;

    TheRestOfTheOwl(Clock clock, Storage storage, Console console, StatelessCore core) {
        this.clock = clock;
        this.storage = storage;
        this.console = console;
        this.core = core;
    }

    public void deposit(int amount) {
        core.deposit(amount, clock, storage);
    }

    public void withdraw(int amount) {
        core.withdraw(amount, clock, storage);
    }

    public void printStatement() {
        core.printStatement(storage, console);
    }
}

You could write tests for that, but I don't see how you are going to get a lot of return out of that investment.
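The listing above leaves Clock, Storage and Console undefined. As a minimal sketch of where the actual work might end up, here is one possible shape for those collaborators and for a core that produces the statement from the exercise. Every name and signature below (today, append, entries, printLine, Entry, InMemoryStorage, StatementCore) is an assumption made for illustration, not part of the exercise.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical collaborators; the exercise only fixes AccountService.
interface Clock { LocalDate today(); }
interface Console { void printLine(String line); }
interface Storage {
    void append(LocalDate date, int amount);
    List<Entry> entries();
}

// One recorded transaction: deposits are positive, withdrawals negative.
final class Entry {
    final LocalDate date;
    final int amount;
    Entry(LocalDate date, int amount) { this.date = date; this.amount = amount; }
}

final class InMemoryStorage implements Storage {
    private final List<Entry> entries = new ArrayList<>();
    public void append(LocalDate date, int amount) { entries.add(new Entry(date, amount)); }
    public List<Entry> entries() { return new ArrayList<>(entries); }
}

// The core holds no fields; everything it needs is passed in.
final class StatementCore {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    void deposit(int amount, Clock clock, Storage storage) {
        storage.append(clock.today(), amount);
    }

    void withdraw(int amount, Clock clock, Storage storage) {
        storage.append(clock.today(), -amount);
    }

    void printStatement(Storage storage, Console console) {
        List<String> lines = new ArrayList<>();
        int balance = 0;
        for (Entry entry : storage.entries()) {
            balance += entry.amount;
            // Insert at the front so the newest transaction is printed first.
            lines.add(0, String.format(Locale.ROOT, "%-10s || %-6d || %d",
                    entry.date.format(FORMAT), entry.amount, balance));
        }
        console.printLine("Date       || Amount || Balance");
        for (String line : lines) {
            console.printLine(line);
        }
    }
}
```

With a clock that the test controls and a console that collects lines into a list, the scenario from the exercise can be checked without touching the real calendar or standard output.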

Now, let's suppose you didn't have the insight that Storage is a separate concept from the working core; you might have guessed that you could keep everything in an in-memory data structure, for example, because in your independent tests the lifetime of "the" account is coincident with the test universe itself. So you might start with a different owl:

class SimpleOwl implements AccountService {
    interface StatelessCore {
        void deposit(int amount, Clock clock);
        void withdraw(int amount, Clock clock);
        void printStatement(Console console);
    }

    private final Clock clock;
    private final Console console;
    private final StatelessCore core;

    SimpleOwl(Clock clock, Console console, StatelessCore core) {
        this.clock = clock;
        this.console = console;
        this.core = core;
    }

    public void deposit(int amount) {
        core.deposit(amount, clock);
    }

    public void withdraw(int amount) {
        core.withdraw(amount, clock);
    }

    public void printStatement() {
        core.printStatement(console);
    }
}

Riddle: what happens now when you discover that storage should be a separate component?

One possible answer is to recognize that SimpleOwl is simple enough to throw away when it isn't useful any more.

Another possibility is to introduce an adapter....

class Adapter implements SimpleOwl.StatelessCore {
    final Storage storage;
    final TheRestOfTheOwl.StatelessCore core;

    Adapter(Storage storage, TheRestOfTheOwl.StatelessCore core) {
        this.storage = storage;
        this.core = core;
    }

    public void deposit(int amount, Clock clock) {
        core.deposit(amount, clock, storage);
    }

    public void withdraw(int amount, Clock clock) {
        core.withdraw(amount, clock, storage);
    }

    public void printStatement(Console console) {
        core.printStatement(storage, console);
    }
}

Note that this illustrates that "Stateless Core" is a really lousy interface name; Adapter is a core that has state. Naming is one of the two hard problems.

You'd see a similar scramble when you realize that the description of your use case doesn't have any sort of representation of a unique account identifier to ensure that you retrieve the correct information from shared storage.

Thursday, November 14, 2019

Refactoring: Reads Before Writes

One of the advantages of practice coding is that it gives you permission to pause your progress and investigate process smells; did I just make a mistake?

Today, I was working on a refactoring exercise; I had code, and tests in place, and all of my tests were passing.  I made a change, ran the tests, and tests failed.  Startled, I reverted the change.  The tests passed again.

But I had to stop here, because I happened to notice a problem: the change that I reverted to pass the test was correct.

What had happened?  Well, that's easy -- one of my earlier edits included a mistake, and because of that mistake the behavior of the code was incorrect.  But the test didn't detect the mistake at the point that it was introduced.  Instead, the failure appeared later.

This concerns me, because one of the illusions that I carry around with me is that there is a payoff in TDD that you spend less time in the debugger because mistakes are caught when you make them.   Test-commit-revert is, to some degree, founded on this idea -- that you can discard the mistake by simply reverting the code.

Assuming, for the moment, that all other things are equal, I prefer habits that produce short intervals between the mistake and its detection to those that produce longer intervals.

So, what happened today?

I was working on a greenfield project; write a test, make it pass, clean it up, repeat. And my "imagined design" was not expressed in the code. The changes I want to introduce appear to be getting harder, so I'm interrupting my "progress" to make the next change easy.

My imagined design is a finite state machine, and I'm trying to organize the code in a way that introducing a new state (with new behavior) will be easy.

I'm writing a game, and tracking the token as it moves from the starting square to the second square; the tests verify the descriptions of the squares against a fixed transcript.  The test that I am anticipating moves the token back to the original square, with a different description than was initially displayed.

My thought was to introduce a predicate which is always true (preserving the current behavior), and then to introduce the new test which would pass if the predicate were false.  Thus, I would meet one of my goals: minimizing the time required to complete the "Green" task.

And, having practiced that approach several times this morning, it works fine.

But not the first time that I tried it.

The first time through, there were two differences in the technique that I used which both contributed.  One difference was that I typed my code, rather than using the refactoring tools.  That made it possible to introduce an error - a line of code I thought was assigning a value to a variable was in fact a no-op.

The other difference is that I started out my refactoring by introducing the writes; creating the new variable, assigning the data to it.  I introduced the error during this phase, but because the variable is not yet being read anywhere, the coding error I've introduced has no impact at all on the behavior.  So all of the tests continue to pass.  When I introduced the new variable into the predicate, now the bug appears.

On the other hand, on the iterations where I started from "If True", mistakes were detected at the moment I introduced them.

It felt a bit like TDD as if you meant it: if True is trivially correct, then true is rewritten into an equivalent comparison between two literal values, next we extract a variable from one of those literals, and then that variable can be moved around in the source code.  At each mistake, a simple revert/undo would take me back to not merely a passing state, but actual bug-free code that passed its tests.
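As a rough sketch of that sequence, here are the intermediate steps written side by side as separate methods. The class name, the strings and the visits variable are all hypothetical; the original exercise code isn't shown here. The point is only that each step already reads the value it introduces, so a botched edit changes behavior right away.

```java
// Hypothetical sketch: names and descriptions are illustrative only.
final class SquareDescriptions {
    static final String FIRST_VISIT = "You are in the starting square.";
    static final String RETURN_VISIT = "You are back in the starting square.";

    // Step 1: a predicate that is trivially true; behavior is unchanged.
    static String step1() {
        if (true) {
            return FIRST_VISIT;
        } else {
            return RETURN_VISIT;
        }
    }

    // Step 2: rewrite true as an equivalent comparison of two literals.
    static String step2() {
        if (0 == 0) {
            return FIRST_VISIT;
        } else {
            return RETURN_VISIT;
        }
    }

    // Step 3: extract a variable from one of the literals. The read already
    // exists, so a wrong assignment here would fail the existing tests.
    static String step3() {
        int visits = 0;
        if (visits == 0) {
            return FIRST_VISIT;
        } else {
            return RETURN_VISIT;
        }
    }

    // Step 4: move the variable outward, in this sketch to a parameter,
    // where a new test can supply a different value.
    static String step4(int visits) {
        if (visits == 0) {
            return FIRST_VISIT;
        } else {
            return RETURN_VISIT;
        }
    }
}
```

Steps 1 through 3 are behavior preserving, and step 4 behaves the same as long as callers pass 0; only a new test passing a non-zero value reaches the second branch.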

Another way of considering this lesson is that we should start as close as possible to the actual constraints of the tests, and work backwards from there, gradually moving our changes into the parts of our implementation where we have more freedom.

It's similar in nature to starting with a hard coded return value, and then removing duplication as you refactor your way toward the function arguments.

Sunday, November 10, 2019

TDD: Transport Tycoon, Exercise 1

I decided that I wanted to take a quick swing at this as a TDD exercise...

Because what we are being presented is the behavior of some pure function, I used a facade as my test subject -- a single function accepting a string argument and returning an integer.  Behind that facade I can refactor my way towards a fully buzzword compliant domain model, but these behavior tests aren't coupled to the model.

That means, incidentally, that these tests aren't some beautiful enduring artifact that describes the model in exquisite detail; they are disposable scaffolding tests.

Because I wasn't sure where I was going, I simply implemented everything in a straightforward way within the facade.  I went three examples in with `if` statements before I started trying to tease out the implicit duplication of the model.  Eventually, all three branches turned into the same code, and then they were simplified to remove that duplication.

It became clear in working on some of the longer problems that I really wanted to have more confidence in the intermediate state.  That insight led me back to the idea that I wanted a "pure function" that could handle each piece of cargo one at a time, so that I could track the evolution of the system at each step.  In theory, such a thing can be created via refactoring, but since I wanted confidence in the implementation, I decided to perform a separate TDD run on just that piece, and then verified that the tests against the original facade continued to pass when I applied the refactoring.

When that refactoring was complete, the implementation behind the facade was broken into two pieces -- a state machine to manage the bookkeeping of my "fleet", and a pure function that computed transitions from one state to another.

I deliberately declined to maintain CQS discipline; separating the queries from the commands in my bookkeeping component appeared to be ceremony with no particular payoff in the exercise.

Friday, October 25, 2019

Following a better path

I intend to follow a better path, now that @marlenac has brought it to my attention.

When you are able to condense your software expertise into a practice that takes <10min and a relative beginner can mimic, and repeating that exercise thousands of times brings fresh insight to everyone at every level, you can call it a code kata. -- @jtu

The good TDD exercises do offer fresh insights at many levels, but let's face it; the creation of these exercises doesn't demonstrate a lifetime of mastery so much as a clever idea that survived a few rough drafts.  Maybe.

The time limit is a really interesting constraint -- my main complaint with the TDD exercises that I've found is that their scope is much too brief; an hour of exploration isn't enough time to have the problems that TDD purports to solve.


Friday, October 18, 2019

Refactoring Lessons at the Gilded Rose

This month, the local meetup took on a variant of the Gilded Rose kata.  As I was in a facilitator's chair, I didn't get to explore any new ideas this time around.  So I decided to work through the exercise this evening.

In doing so, I picked up a couple of interesting tricks from IntelliJ IDEA.

First, I learned that IntelliJ understands the idea of "run all tests in this package", which saves some of the headache of creating a regression suite when your tests are spread out across multiple files.

Second, I learned that IntelliJ can be quite clever about reducing predicates if it has a clear idea which invariant holds.  In various methods I introduced, an assert that described the preconditions for the Strings that were in scope allowed intellisense to remove a bunch of redundancy in the conditionals.

Either it wasn't entirely clever, or I made some bad choices in choosing which refactoring option to take when presented with a choice, but in a number of cases I ended up with ifs in front of empty blocks.  IntelliJ was also able to perform the refactoring to remove them, but I had to ask.

The grand strategy wasn't quite what I had expected it to be.  Back in the "Look, Ma, no hands" era, we tended to focus on micro refactorings, from which some Platonic ideal design would eventually emerge.  But for the Gilded Rose problem, the game seems to be to origami the code into a shape that intellisense recognizes.  So that means a bit of preparatory judo followed by swinging large blocks of code around so that the static code analysis can consider one domain problem at a time.

In short, instead of chasing a good design, I'm first chasing a design that the machine understands well enough to manipulate and simplify without requiring that I type anywhere near the complexity.  Let's face it, the carbonware is barely qualified to type at all -- its role is to point, and let the sand do the dirty work.

Tuesday, October 1, 2019

TDD: Safety in Numbers - a Bowling Game Adventure

Last night, I decided to work through a bowling game exercise, but it didn't quite turn out as I had expected.

The goal was to eliminate duplication; how much intention revealing code could I introduce before moving on to the second test?

As suggested by Uncle Bob, I started with the degenerate case:

It is, of course, trivial to get this test passing. We simply hard code the required answer into the score method.

The step that I expected to follow was to immediately start introducing domain concepts, like frames, into the production code while there was still but a single test constraint in place.

And it was a very uncomfortable experience - I realized fairly quickly that a simple pass signal wasn't enough to give me confidence that the math I was introducing was actually manipulating the figures correctly -- not enough to give me confidence that I wasn't introducing silly fence post errors.

After discarding the work and contemplating the ceiling for a time, I decided that I was having trouble because zero is the additive identity -- I couldn't look at my actual result and deduce how many numbers had been added together, because 10, 20, 100 zeros all sum to the same amount.

This evening, I tried a different initial test:

The results are much better - fence post mistakes change the observable behavior in this circumstance, and are therefore easy to catch. The deviations from the expected results give an immediate hint at the error. We know from taking small steps which edit introduced a fault, but the distinct behaviors mean that it is easy to recognize the precise nature of the fault.
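To make the additive identity point concrete, here is a small sketch. The class, the method names and the choice of a game of all ones are assumptions for illustration; they are not the tests from this exercise. A scoring loop with a deliberate fence post error is indistinguishable from the correct loop when every roll is zero, and is off by exactly one roll's worth of pins otherwise.

```java
import java.util.Arrays;

// Hypothetical sketch: a plain sum of rolls, with no spares or strikes.
final class FencePostDemo {
    // Intended behavior: add up every roll.
    static int score(int[] rolls) {
        int total = 0;
        for (int i = 0; i < rolls.length; i++) {
            total += rolls[i];
        }
        return total;
    }

    // Deliberate fence post error: the last roll is never added.
    static int faultyScore(int[] rolls) {
        int total = 0;
        for (int i = 0; i < rolls.length - 1; i++) {
            total += rolls[i];
        }
        return total;
    }

    // Twenty rolls, each knocking down the same number of pins.
    static int[] game(int pinsPerRoll) {
        int[] rolls = new int[20];
        Arrays.fill(rolls, pinsPerRoll);
        return rolls;
    }
}
```

For the gutter game both methods return 0, so the fault is invisible. For a game of all ones the correct loop returns 20 and the faulty one returns 19, and the size of the gap hints at how many rolls were dropped.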

By the end of this evening, I had managed to get all the way to "make the next change easy": using the gutter game as my second test produced a trivial pass, because the faults I introduced during refactoring had already been detected and mitigated.

Saturday, September 7, 2019

TDD: On Fake Code

This past spring, David Tanzer published a short essay on transitioning from fake implementations to real ones.
When you do TDD, “fake implementations” or “wrong code” are OK, as long as they pass all the tests you have so far
But when do you stop to fake? When do you start writing “real code”?
Tanzer is using this as a stepping stone to introduce Uncle Bob's heuristic: as the tests get more specific, the code gets more generic.

But there is another answer, which I eventually learned from a comment written by Kent Beck:
Do you have some refactoring to do first?
Here is Tanzer's passing implementation:
And that's fine for our test calibration; we have successfully demonstrated that the test can distinguish the correct behavior from an incorrect behavior in this specific case.

But... the current implementation implicitly describes two pieces of domain knowledge that we can make explicit.
  • The length of the hint should be the same as the length of the secret word.
  • The initial representation of the hint should conceal all of the letters in the secret word, which is to say it should be entirely composed of the unrevealed letter token "_".
We don't have to wait for permission to introduce these ideas; they are always going to appear in a refactoring step, so we can cut to the chase and introduce them immediately.

From there, we might notice that the secretWord we are using in the hint method is the same that was passed to the constructor, and extract that duplication. Or we might decide that the creation of the hint of the correct length is a single idea that can be extracted into another function, and do that.
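A minimal sketch of where those refactorings might land, assuming a class that receives the secret word in its constructor and exposes a hint method. The class name, the concealed helper and the UNREVEALED constant are hypothetical; this is not Tanzer's code.

```java
import java.util.Arrays;

// Hypothetical sketch of the refactored "fake" implementation.
final class Hangman {
    // The unrevealed letter token, named instead of buried in a literal.
    private static final char UNREVEALED = '_';

    private final String secretWord;

    Hangman(String secretWord) {
        this.secretWord = secretWord;
    }

    // The hint has the same length as the secret word.
    String hint() {
        return concealed(secretWord.length());
    }

    // The initial hint conceals every letter.
    private static String concealed(int length) {
        char[] letters = new char[length];
        Arrays.fill(letters, UNREVEALED);
        return new String(letters);
    }
}
```

Both pieces of domain knowledge now have a name in the code, and the single original test still passes without any new test having been written.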

You can start writing the real code as soon as you have a green bar.

Because I was reviewing Saff and Boshernitsan today, I have been thinking about Beck's Money demonstration.  Translated into Python, Beck's first test looks like

Riddle: what's the simplest implementation that will pass this test? There are probably several different answers, but the simplest I can come up with looks like:

No implementation, no variable names. Just 10. It's clear to me that this is "wrong code", in Tanzer's sense. But we don't need more tests to make it better, we can immediately refactor (in Beck's sense) to restore sanity to the implementation.

If we were being very small and deliberate in our refactoring, the refactoring sequence might look like:
Like "triangulation", small and deliberate steps are not required - they are a technique to practice so that you can get small when larger steps aren't working.

Monday, September 2, 2019

Thoughts on an Acceptance Tests

In my recent experiments with Hunt the Wumpus, I started thinking about what an "acceptance" test might look like.

To get started, I reviewed the walking skeleton example in Growing Object Oriented Software.  Freeman and Pryce wrote that the initial iteration should include delivery of a completely automated checkout/build/deploy/test pipeline, front loading the work of solving a number of critical system and political issues.  The acceptance test, in their example, launches the application and uses the user interface to probe and measure the app.

For an interactive shell app like Wumpus, the test harness is relatively straightforward; we control stdin to pass data to the app and control stdout to read data from the app.

What I struggle with, at this point in the narrative, is the amount of work required to create stable acceptance tests.

A point of view: automated checks are mistake detectors.  They don't provide value to the user - you can delete all of your automated checks and the behavior of your production code doesn't change.  Economically, the justification for the tests is that they reduce the costs of future work.  More precisely, we adopt processes that shut down when a mistake is detected, ensuring that the mistakes cannot be overlooked, and that we don't expose our test subjects to expensive evaluation when the more cost effective checks have already detected problems.

There's another potential benefit to checks, which the TDD ritual seeks to exploit: thinking about how the checks will validate the behavior of your application creates space to discover important ideas in your app before you start coding it.

The acceptance tests, what with all the work we need to do to set them up, are expensive relative to other mechanisms for checking the correctness of the program.  In the case of the Auction Sniper, those tests included measuring that the app could talk to other processes.

In the case of Wumpus, there really aren't other processes to talk to unless we choose a particularly contrived design.  Only the interface to the user is interesting.  So there isn't a lot of complexity that needs to be evaluated from the outside.

Which is good, because that evaluation is painful.

Wumpus has three awkward aspects to it; hidden information, non-deterministic behavior, and message schema.

The hidden information aspect is what introduces uncertainty in the game - with complete knowledge of the hazards in the maze, the game can be won trivially by shooting the wumpus in its lair.  But without that hidden information, one cannot know the correct outcome of any action by the hunter.

The location of the hazards in the game is non-deterministic - that's part of the mechanism for hiding that information from the player.  In addition, each of the hunter's actions can induce random behavior by the hazards in the game.  These random effects mean that any given action by the player can have multiple candidate responses, depending on how the dice fall.

The feedback from the game to the player is all via messages written to the console.  Those messages were designed (such as it is) for human readability, rather than machine readability.  Understanding the semantics of those messages requires introducing a parser into the acceptance test.

What this means is that we have some work cut out for us if we want anything more than a trivial verification that some message was written to standard out.

One possibility is that we can introduce the idea of specifying a seed for the non-deterministic behavior from outside the program.  The acceptance test can fix the seed, then perform a domain agnostic comparison of the output to some golden master that we specify.  This is somewhat brittle: the current mapping of random values to representations is arbitrary, and the domain agnostic match over fits the representation of the messages.
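A minimal sketch of the seed idea, assuming a hypothetical HazardPlacement class, a 20 room map and made-up message text; none of these come from the actual port. The seed arrives from outside, so two runs with the same seed produce the same transcript, and the transcript can be compared as an opaque string.

```java
import java.util.Random;

// Hypothetical sketch: room count and message wording are assumptions.
final class HazardPlacement {
    private static final int ROOMS = 20;

    // All randomness flows from the seed supplied by the caller.
    static String transcript(long seed) {
        Random rng = new Random(seed);
        int wumpus = 1 + rng.nextInt(ROOMS);
        int pit = 1 + rng.nextInt(ROOMS);
        return "WUMPUS IN ROOM " + wumpus + "\n"
                + "PIT IN ROOM " + pit + "\n";
    }
}
```

A golden master check would store the output for one chosen seed and compare against it byte for byte. That is where the brittleness comes from: changing the order of the two nextInt calls, or the wording of a message, changes the transcript even though the game is no less correct.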

Another possibility is to introduce an affordance that allows the specification of a message schema to use; the acceptance test simply switches the application into a mode where the responses are easy to parse, much like an http request might distinguish between text/plain and application/json.  Even without fixing the seed, our acceptance test can still easily identify that all of the messages are well formed.

The schema approach, while straightforward, feels like a lot of work that will not pay off.  I think the issue here is that, while Wumpus is a more interesting toy exercise than the bowling game or a Fibonacci calculator, it is still fundamentally a toy problem -- one with an arbitrary and limited scope.

My null-design port of Wumpus from BASIC to Java is only 375 lines long; it's hard to envision that project having a lifetime that justifies heavy upfront investment in acceptance tests.

What we can do, from the outset, is decide that the behaviors that the acceptance test needs to control - the random seed, the interpretation of the random values, the message schema - can be controlled from the outside, and that the idiom for changing those behaviors in the future is to extend the application with new selectable behaviors, rather than replacing the existing behaviors.

Saturday, August 10, 2019

Purchase Approval

One problem I've had with Domain Driven Design is coming up with good realistic examples that exhibit the sorts of complexity we need to be thinking about, without getting lost in the weeds.

My latest idea is to try working through a purchase approval in an analog office.

Bob wants the company to pay for something.  So he gets a blank form, and fills in the details, and drops off the form with Alice.  Alice does the work of comparing the details to the current policies, and approving / rejecting the request.  The resolved request is returned to Bob, so that he can act on the decision that has been made.

There are a lot of requests, and checking the details is a lot of work.  So Alice has become a bottleneck.  She offloads some of the work to Terry the intern; Terry does the legwork for requests when the approval doesn't require Alice's domain expertise.

As a proxy for easy, we'll use a trivial condition like "amount less than 100 USD".

The form acts as a sort of lock; an actor in this protocol can only change the paper when they have physical control of it.  So the process is serial, only one person can record information at a time.

That means we need to think more precisely about how the requests are shared by Alice and Terry.  Perhaps all requests go first to Alice, and she passes the easy requests to Terry; or perhaps the requests all go to Terry, and the hard cases are forwarded to Alice.

We can think of the request as a single form, that gets modified.  Alternatively, we can think of an envelope filled with "immutable" documents; each actor adds new paperwork to the envelope.

The process is asynchronous, in this sense - the request can be in Alice's office even though Alice herself is out at lunch, or home sick.  The movement of paper allows the people to communicate, even though they aren't necessarily in the office at the same time.

The paperwork is anemic - all of the domain knowledge is locked in the heads of Alice and Terry (and, to some degree, Bob).  The paperwork is just the bookkeeping.

It's also worth noting that the paper is immutable, in this sense: once the paperwork has left Bob's control, he cannot correct errors until the paperwork is returned to him.

Bob's "view" of this process is the stack of blank forms, and his collection of resolved requests.  Alice and Terry have similar views: stacks of pending requests.

Exercise 1: what changes when we take this process digital?  So instead of physical paperwork moving from place to place, we now have information being copied from one place to another.

Exercise 2: what changes when we extend the digital process to automate Terry's work?

Saturday, August 3, 2019

TDD: Random is Arbitrary

I happened across a Yahtzee kata today. Although I didn't work that exercise today, it got me thinking again about random behaviors.
Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.
If your goal is to write fast, deterministic tests that you can use to detect unintended changes during a refactoring, then you need to treat a random number generator the way you would a clock.  The test itself has to decide which random sequence to provide, and pass it to the test subject.

But there's a second problem, which is this -- if a random choice from a list is an acceptable behavior, then so too must be the same random choice from every permutation of that list.

If we can map a sequence of random numbers to [HEADS, TAILS], and produce from that an acceptable behavior, then it must also be true that mapping the same sequence of random numbers to [TAILS, HEADS] must also be acceptable.  You can replace one with the other any time you want.

But doing that isn't a refactoring; when we make this change, the same inputs produce different outputs, which are measured by the test harness.

The choice of which permutation to use is an implementation detail; any tests that depend on the specific requirements are overfit.

Can you beat this by passing in an ordered list of choices?  Not really, for the same reason - if the permutation of items that you pass in produces an acceptable behavior, then it will also be acceptable for the code under test to re-order those items before using them.

Another related problem is that there are different ways to interact with the random number generator that produce equally acceptable results.  If we want to randomly toss three coins, we can pull three random numbers and project each onto [0,1], and encode the result, or we can pull a single random number, project onto [0,7], and then use a squashed encoding to interpret the result.  If one is valid, then so too is the other -- but they leave the RNG in different states, and therefore we have additional risks of overfitting.

What does this mean?  That simply decoupling the random number generator isn't enough.  We need to be passing the result of the random choice - pass in the coin after it has been flipped, pass in the dice after they have been rolled, pass in the deck after it has been shuffled.

It's not enough to inject the random number generator into the test; you need to leave the arbitrary mapping of the random number to some result value in the imperative shell as well.
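As a sketch of that split, assuming a Yahtzee-style "chance" category that scores the sum of the dice: the core below only ever sees dice that have already been rolled, while the mapping from random numbers to faces stays in the shell. The class and method names are hypothetical.

```java
import java.util.Random;

// Functional core: scores dice that have already been rolled.
final class Yahtzee {
    static int chance(int[] dice) {
        int total = 0;
        for (int die : dice) {
            total += die;
        }
        return total;
    }
}

// Imperative shell: owns the generator and the arbitrary mapping
// from random numbers to die faces.
final class DiceShell {
    static int[] roll(Random rng, int count) {
        int[] dice = new int[count];
        for (int i = 0; i < count; i++) {
            dice[i] = 1 + rng.nextInt(6);
        }
        return dice;
    }
}
```

A test of the core passes in the faces directly, for example chance of 1, 2, 3, 4, 5, and so never depends on how the shell consumes the generator. The shell can switch to a different mapping without any core test changing.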

Sunday, June 30, 2019

Usage Kata

The usage kata is intended as an experiment in applying Test Driven Development at the program boundary.

Create a command line application with the following behavior:

The command supports a single option: --help

When the command is invoked with the help option, a usage message is written to STDOUT, and the program exits successfully.

When the command is invoked without the help option, a diagnostic is written to STDERR, and the program exits unsuccessfully.
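One possible starting point, sketched under assumptions: the run method, the exit code constants and the message wording are all hypothetical, since the kata only fixes the observable behavior. Passing the streams in as arguments keeps the decision logic testable, leaving only a thin main at the actual program boundary.

```java
import java.io.PrintStream;

// Hypothetical sketch: message text and names are illustrative only.
final class UsageCommand {
    static final int EXIT_SUCCESS = 0;
    static final int EXIT_FAILURE = 1;

    // Returns the exit status instead of exiting, so tests can observe it.
    static int run(String[] args, PrintStream out, PrintStream err) {
        if (args.length == 1 && "--help".equals(args[0])) {
            out.println("usage: command --help");
            out.flush();
            return EXIT_SUCCESS;
        }
        err.println("command: expected the --help option");
        err.flush();
        return EXIT_FAILURE;
    }

    // The program boundary: wires in the real streams and exit status.
    public static void main(String[] args) {
        System.exit(run(args, System.out, System.err));
    }
}
```

What this sketch leaves untested is main itself, which is exactly the part of the program the kata is meant to make you think about.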


Sunday, June 23, 2019

Variations on the Simplest Thing That Could Possibly Work

Simplest Thing that Could Possibly Work is an unblocking technique.

Once we get something on the screen, we can look at it. If it needs to be more we can make it more. Our problem is we've got nothing.

Hunt The Wumpus begins with a simple prompt, which asks the human operator whether she would like to review the game instructions before play begins.

That's a decent approximation of the legacy behavior.

But other behaviors may also be of interest, either as replacements for the legacy behavior, or supported alternatives.

For instance, we might discover that the prompt should behave like a diagnostic message, rather than like data output, in which case we'd be interested in something like

Or we might decide that UPPERCASE lacks readability

Or that the input hints should appear in a different order

And so on.

The point here being that even this simple line of code spans many different decisions.  If those decisions aren't stable, then neither is the behavior of the whole.

When we have many tests evaluating the behavior of an unstable decision, the result is a brittle test suite.

A challenging aspect to this: within the scope of a single design session, behaviors tend to be stable.  "This is the required behavior, today."  If we are disposing of the tests at the end of our design session, then there's no great problem to solve here.

On the other hand, if the tests are expected to be viable for many design sessions, then protecting the tests from the unstable decision graph constrains our design still further.

One way to achieve a stable decision graph is to enforce a constraint that new behaviors are added by extension, and that new behaviors will be delivered beside the old, and that clients can choose which behavior they prefer.  There's some additional overhead compared with making the change in "one" place.
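A small sketch of the extension constraint, assuming a hypothetical PromptStyle enum. Both prompt spellings are assumed for illustration; the legacy text is not quoted from the original program. The new behavior is added as a second constant, and the first one is never edited.

```java
// Hypothetical sketch: both spellings are assumptions.
enum PromptStyle {
    // The existing behavior stays in place, untouched.
    LEGACY("INSTRUCTIONS (Y-N)"),
    // The new behavior is delivered beside the old one.
    MIXED_CASE("Instructions (y-n)");

    private final String instructionsPrompt;

    PromptStyle(String instructionsPrompt) {
        this.instructionsPrompt = instructionsPrompt;
    }

    String instructionsPrompt() {
        return instructionsPrompt;
    }
}
```

Tests written against LEGACY keep passing after MIXED_CASE arrives, because clients select the style they want. The overhead is visible too: every caller now has to make that selection somewhere.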

Another approach is to create bulkheads within the design, so that only single elements are tightly coupled to a specific decision, and the behavior of compositions is evaluated in comparison to their simpler elements.  James Shore describes this approach in more detail within his Testing Without Mocks pattern language.

What I haven't seen yet: a good discussion of when.  Do we apply YAGNI, and defend against the brittleness on demand?  Do we speculate in advance, and invest more in the design to insure against an uncertain future?  Is there a checklist that we can work, to reduce the risk that we foul our process for reducing the risk?

Thursday, June 20, 2019

Design Decisions After the First Unit Test

Recently, I turned my attention back to one of my early "unit" tests in Hunt The Wumpus.

This is an "outside-in" test, because I'm still curious to learn different ways that the tests can drive the designs in our underlying implementations.

Getting this first test to pass is about Hello World difficulty level -- just write out the expected string.

At that point, we have this odd code smell - the same string literal is appearing in two different places. What does that mean, and what do we need to do about it?

One answer, of course, is to ignore it, and move on to some more interesting behavior.

Another point of attack is to go after the duplication directly. If you can see it, then you can tease it apart and start naming it. JBrains has described the design dynamo that we can use to weave a better design out of the available parts and a bit of imagination.

To be honest, I find this duplication a bit difficult to attack that way. I need another tool for bootstrapping.

It's unlikely to be a surprise that my tool of choice is Parnas. Can we identify the decision, or chain of decisions, that contribute to this particular string being the right answer?

In this case, the UPPERCASE spelling of the prompt helps me to discover some decisions; what if, instead of shouting, we were to use mixed case?

This hints that, perhaps, somewhere in the design is a string table, that defines what "the" correct representation of this prompt might be.

Given such a table, we can then take this test and divide it evenly into two parts - a sociable test that verifies that the interactive shell behaves like the string table, and a second solitary test that the string table is correct.
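A minimal sketch of that split, in Java. The names `Prompts`, `Shell`, and `PromptChecks`, and the exact spelling of the prompt, are my assumptions for illustration; they are not taken from the original test.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical string table: the one place that decides how the prompt is spelled.
final class Prompts {
    // Assumed spelling, for illustration only.
    static final String INSTRUCTIONS = "INSTRUCTIONS (Y-N)";

    private Prompts() {}
}

// Hypothetical interactive shell that writes the prompt to whatever stream it is given.
final class Shell {
    private final PrintStream out;

    Shell(PrintStream out) {
        this.out = out;
    }

    void start() {
        out.println(Prompts.INSTRUCTIONS);
    }
}

final class PromptChecks {
    private PromptChecks() {}

    // Sociable check: the shell behaves like the string table, whatever the table says.
    static boolean shellMatchesTable() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new Shell(new PrintStream(buffer, true)).start();
        String expected = Prompts.INSTRUCTIONS + System.lineSeparator();
        return buffer.toString().equals(expected);
    }

    // Solitary check: the only place where the literal spelling is pinned down.
    static boolean tableIsCorrect() {
        return "INSTRUCTIONS (Y-N)".equals(Prompts.INSTRUCTIONS);
    }
}
```

With this arrangement, a switch to mixed case would touch the table and the solitary check, while the sociable check keeps passing unchanged.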

If we squint a bit, we might see that the prompt itself is composed of a number of decisions -- the new line terminator for the prompt, the format for displaying the hints about acceptable responses, the ordering of those responses, the ordering of the prompt elements (which we might want to change if the displayed language were right to left instead of left to right).

The prompt itself looks like a single opaque string, but there is duplication between the spelling of the hints and the checks that the shell will perform against the user input.

Only a single line of output, and already there is a lot of candidate variation that we will want to have under control.

Do we need to capture all of these variations in our tests? I believe that depends on the stability of the behavior. If what we are creating is a clone of the original behavior -- well, that behavior has been stable for forty years. It is pretty well understood, and the risk that we will need to change it after all this time is pretty low. On the other hand, if we are intending an internationalized implementation, then the English spellings are only the first increment, and we will want a test design that doesn't require a massive rewrite after each change.

Sunday, May 5, 2019

Testing at the seams

A seam is a place where you can alter behaviour in your program without editing in that place -- Michael Feathers, Working Effectively with Legacy Code
When I'm practicing Test Driven Development, I prefer to begin on the outside of the problem, and work my way inwards.  This gives me the illusion that I am discovering the pieces that I need; no abstraction is introduced in the code without at least one consumer, and the "ease of use" concern gets an immediate evaluation.

As an example, let's consider the case of an interactive shell.  We can implement a simple shell using java.lang.System, which gives us access to System.out, System::getenv, System::currentTimeMillis, and so on.

We probably don't want our test subjects to be coupled to System, however, because that's a shared resource.  Developer tests should be embarrassingly parallel; shared mutable resources get in the way of that.

By introducing a seam that we can plug System into, we get the decoupling that we need in our tests.
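One way such a seam might look, as a sketch only. `GreetingShell` and its `greet` behavior are hypothetical; the point is that each capability of System arrives as a constructor argument, so a test can substitute its own.

```java
import java.io.PrintStream;
import java.util.function.LongSupplier;
import java.util.function.UnaryOperator;

// Hypothetical shell: everything it needs from java.lang.System comes in through the constructor.
final class GreetingShell {
    private final PrintStream out;
    private final UnaryOperator<String> env;
    private final LongSupplier clock;

    GreetingShell(PrintStream out, UnaryOperator<String> env, LongSupplier clock) {
        this.out = out;
        this.env = env;
        this.clock = clock;
    }

    // Production wiring: plug the real System into the seam.
    static GreetingShell onSystem() {
        return new GreetingShell(System.out, System::getenv, System::currentTimeMillis);
    }

    // Illustrative behavior that touches all three capabilities.
    void greet() {
        out.println("hello " + env.apply("USER") + " at " + clock.getAsLong());
    }
}
```

A test can then build `new GreetingShell(...)` with an in-memory stream, a canned environment lookup, and a fixed clock value, and no two tests share any mutable state.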

If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. -- J. B. Rainsberger, 2013
My thinking is that this is correctly understood as two different steps; we introduce the indirection, and we also try to discover the better abstraction.

But I am deliberately trying to avoid committing to an abstraction prematurely.  In particular, I don't want to invest in an abstraction without first accumulating evidence that it is a good one.  I don't want to make changes expensive when change is still likely - the investment odds are all wrong.

Tuesday, April 23, 2019

Saturday, April 20, 2019

Sketching Evil Hangman, featuring GeePaw

GeePaw Hill christened his twitch channel this week with a presentation of his approach to TDD, featuring an implementation of Evil Hangman.

Evil Hangman is a mimic of the traditional word guessing game, with a twist -- evil hangman doesn't commit to a solution immediately.  It's a good mimic - the observable behavior of the game is entirely consistent with a fair implementation that has committed to some word in the corpus.  But it will beat you unfairly if it can.

So, as a greenfield problem, how do you get started?

From what I can see, there are three approaches that you might take:
  • You can start with a walking skeleton, making some arbitrary choices about I/O, and work your way inward.
  • You can start with the functional core, and work your way inward.
  • You can start with an element from a candidate design, and work your way outward.
GeePaw, it seems, is a lot more comfortable when he has pieces that he can microtest.  I got badly spooked on that approach years ago when I looked into the sudoku exercise.  So my preference is to choose an observable behavior that I understand, fake it, remove the duplication, and then add additional tests to the suite that force me to extend the solution's design.

Ultimately, if you examine the complexity of the test subjects, you might decide that I'm writing "integrated" or "acceptance" tests.  From my perspective, I don't care - the tests are fast, decoupled from the environment, and embarrassingly parallel.  Furthermore, the purpose of the tests is to facilitate my design work, not to prove correctness.

What this produces, if I do it right, is tests that are resilient to changes in the design, but which may be brittle to changes in requirements.

My early tests, especially in greenfield work, tend to be primitive obsessed.  All I have in the beginning are domain agnostic constructs to work with, so how could they be anything but?  I don't view this as a failing, because I'm working outside in -- which is to say that my tests are describing the boundary of my system, where things aren't object oriented.  Primitives are the simplest thing that could possibly work, and allow me to move past my writer's block into having arguments with the code.

As a practice exercise, I would normally choose to start from the boundary of the functional core -- we aren't being asked to integrate with anything in particular, and my experiences so far haven't suggested that there is a lot of novelty there.
One should not ride in the buggy all the time. One has the fun of it and then gets out.
So, where to begin?

I'm looking for a function - something that will accept some domain agnostic arguments and return a domain agnostic value that I can measure.

Here, we know that the basic currency is that the player will be making guesses, and the game will be responding with clues.  So we could think in terms of a list of string inputs and a list of string outputs.  But the game also has hidden state, and I know from hard lessons that making that state an input to the test function will make certain kinds of verification easier.

The tricky bit, of course, is that I won't always know what that hidden state is until I get into the details.  I may end up discovering that my initial behaviors depend on some hidden variable that I hadn't considered as part of the API, and I'll need to remediate that later.

In this case, one of the bits of state is the corpus - the list of words that the game has to choose from.  Using a restricted word list makes it easier to specify the behavior of the implementation.  For instance, if all of the words in the corpus are the same length, then we know exactly how many dashes are going to be displayed in the initial hint.  If there is only a single word in the corpus, then we know exactly how the app will respond to any guess.

Making the corpus an input is the affordance that we need to specify degenerate cases.

Another place where degeneracy might be helpful is allowing the test to control the player's mistake allotment.  Giving the human player no tolerance for errors allows us to explore endgame behavior much more easily.
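As a rough sketch of the corpus affordance only, here is a hypothetical `initialHint` function that takes the word list as an argument. The name, the dash rendering, and the decision to reject mixed-length corpora are all assumptions of mine; the mistake allotment would be a further parameter on the guessing entry point, which is not shown.

```java
import java.util.List;

final class Hangman {
    private Hangman() {}

    // Hypothetical entry point: the corpus is an argument, so a test can pass a degenerate word list.
    static String initialHint(List<String> corpus) {
        if (corpus.isEmpty()) {
            throw new IllegalArgumentException("corpus must not be empty");
        }
        int length = corpus.get(0).length();
        for (String word : corpus) {
            if (word.length() != length) {
                throw new IllegalArgumentException("this sketch only handles words of one length");
            }
        }
        // One dash per letter; every word has the same length, so the hint is fully determined.
        return "-".repeat(length);
    }
}
```

A check with a one-word corpus, such as `Hangman.initialHint(List.of("owl"))`, pins the hint down without knowing anything about how the game chooses among words.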

And if we don't think of these affordances in our initial guess?  That's fine - we introduce a new entry point with an "extract method refactoring", eliminating duplication by having the original test subject delegate its behavior to our improved API, deprecating the original, and eliminating it when it is no longer required.

Simplest thing that can possibly work is a proposal, not a promise.

For the most part, that's what my sketches tend to look like: some exploration of the problem space, a guess at the boundary, and very little early speculation about the internals.

Friday, April 19, 2019

TDD and incremental delivery

I spend a lot of time thinking about breaking tests, and what that means about TDD as a development ritual.

I recently found a 2012 essay by Steven Thomas reviewing Jeff Patton's 2007 Mona Lisa analogy.  This in turn got me thinking about iteration specifically.

A lot of the early reports of Extreme Programming came out of Chrysler Comprehensive Compensation, and there's a very interesting remark in the post mortem
Subsequent launches of additional pay populations were wanted by top management within a year.
To me, that sounds like shorthand for the idea that the same (very broad) use case was to be extended to cover a larger and more interesting range of inputs with only minor changes to the behaviors already delivered.

The tests that we leave behind serve to describe the constraints necessary to harvest the low hanging fruit, as best we understood them at the time, to identify regressions when the next layer of complexity was introduced to the mix.

We're writing more living documentation because we are expecting to come back to this code next year, or next month, or next sprint.

I envision something like a ten case switch statement -- we'll implement the first two cases now, to cover perhaps a third of the traffic, and then defer the rest of the work until "later", defined as far enough away that the context has been evicted from our short term memory.

If the requirements for the behaviors that you implemented in the first increment are not stable, then there is non trivial risk that you'll need several iterations to get those requirements right.  Decisions change, and the implications of those changes are going to ripple to the nearest bulkhead, in which case we may need a finer grain testing strategy than we would if the requirements were stable.

At the other extreme, I'm working on an old legacy code base; this code base has a lot of corners that are "done" -- modules that haven't changed in many years.  Are we still profiting by running those tests?

This is something we should keep in mind as we kata.  If we want to be preparing for increments with significant time intervals between them, then we need bigger input spaces with stable requirements.

A number of the kata are decent on the stable requirements bit -- Roman numerals haven't changed in a long time -- but tend to be too small to justify not solving the whole thing in one go.  Having done that, you can thank your tests and let them go.

The best of the kata I'm familiar with for this approach would be the Gilded Rose - we have a potentially unlimited catalog of pricing rules for items, so we'll incrementally adapt the pricing logic until the entire product catalog is covered.

But - to do that without breaking tests, we need stable pricing rules, and we need to know in advance which products follow which rules.  If we were to naively assume, for example, that Sulfuras is a normal item, and we used it as part of our early test suite, then the updated behavior would break those tests.  (Not an expensive break, in this case -- we'd likely be able to replace Sulfuras with some other normal item, and get on with it).

Expressing the same idea somewhat differently: in an iterative approach, we might assume that Sulfuras is priced normally, and then make adjustments to the tests until they finally describe the right pricing constraints; in an incremental approach, Sulfuras would be out of scope until we were ready to address it.

I think the scheduling of refactoring gets interesting in an incremental approach - how much refactoring do you do now, when it is still uncertain which increment of work you will address next?  Is Shameless Green the right conclusion for a design session?

The Sudoku problem is one that I struggle to classify.  On the one hand, the I/O is simple, and the requirements are stable, so it ought to hit the sweet spot for TDD.  You can readily imagine partitioning the space of sudoku problems into trivial, easy, medium, hard, diabolical sets, and working on one grouping at a time, and delivering each increment in turn.

On the other hand, Dr Norvig showed that, if you understand the problem, you can simply draw the rest of the fucking owl. The boundaries between the categories are not inherent to the problem space.

Wednesday, April 10, 2019

Read Models vs Write Models

At Stack Exchange, I answered a question about DDD vs SQL, which resulted in a question about CQRS that I think requires more detail than is appropriate for that setting.
The "read model" is not the domain model (or part of it)? I am not an expert on CQRS, but I always thought the command model is quite different from the classic domain model, but not the read model. So maybe you can give an example for this?
So let's lay some ground work

A domain model is not a particular diagram; it is the idea that the diagram is intended to convey.  It is not just the knowledge in the domain expert's head; it is a rigorously organized and selective abstraction of that knowledge.  -- Eric Evans, 2003.
A Domain Model creates a web of interconnected objects, where each object represents some meaningful individual, whether as large as a corporation or as small as a single line on an order form.  -- Martin Fowler, 2003.
I think that Fowler's definition is a bit tight; there's no reason that we should need to use a different term when modeling with values and functions, rather than objects.

I think it is important to be sensitive to the fact that in some contexts we are talking about the abstraction of expert knowledge, and in others we are talking about an implementation that approximates that abstraction.

Discussions of "read model" and "write model" almost always refer to the implemented approximations.  We take a single abstraction of domain knowledge, and divide our approximation of it into two parts - one that handles our read use cases, and another that handles our write use cases.

When we are handling a write, there are usually constraints to ensure the integrity of the information that we are modeling.  That might be as simple as a constraint that we not overwrite information that was previously written, or it might mean that we need to ensure that new writes are consistent with the information already written.

So to handle a write, we will often take information from our durable store, load it into volatile memory, then create from that information a structure in memory into which the new information will be integrated.  The "domain logic" calculates new information, which is written back to the durable store.

On the other hand, reads are safe; "asking the question shouldn't change the answer".  In that case, we don't need the domain logic, because we aren't going to integrate new information. We can take the on disk representation of the information, and transform it directly into our query response, without passing through the intermediate representations we would use when writing.

We'll still want input sanitation, and message semantics that reflect our understanding of the domain expert's abstraction, but we aren't going to need "aggregate roots", or "locks" or the other patterns that prevent the introduction of errors when changing information.  We still need the data, and the semantics that approximate our abstraction, but we don't need the rules.

We don't need the parts of our implementation that manage change.
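To make the contrast concrete, here is a small sketch with an in-memory map standing in for the durable store. `AccountStore`, `WriteModel`, `ReadModel`, and the overdraft rule are hypothetical names and rules chosen for illustration, not part of the original answer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical durable store: account id -> balance in cents, standing in for a database table.
final class AccountStore {
    final Map<String, Long> rows = new HashMap<>();
}

// Write side: load, check the rule, integrate the new information, write back.
final class WriteModel {
    private final AccountStore store;

    WriteModel(AccountStore store) {
        this.store = store;
    }

    void withdraw(String accountId, long cents) {
        long balance = store.rows.getOrDefault(accountId, 0L);
        // The "domain logic": new writes must be consistent with what is already written.
        if (cents > balance) {
            throw new IllegalStateException("insufficient funds");
        }
        store.rows.put(accountId, balance - cents);
    }
}

// Read side: no rules, only a translation from the stored data to a response message.
final class ReadModel {
    private final AccountStore store;

    ReadModel(AccountStore store) {
        this.store = store;
    }

    String balanceReport(String accountId) {
        return accountId + ": " + store.rows.getOrDefault(accountId, 0L);
    }
}
```

The read side never constructs the structure that enforces the overdraft rule; it goes straight from the stored row to the response.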

When I answer "the query itself is unlikely to pass through the domain model", that's shorthand for the idea that we don't need to build domain specific data structures as we translate the information we retrieved from our durable store into our response message.

Monday, April 1, 2019

TDD from Edge to Edge

In my recent TDD practice, I've been continuing to explore the implications of edge to edge tests.

The core idea is this - if we design the constraints on our system at the edge, then we maximize the degrees of freedom we have in our design.

Within a single design session, this works just fine. My demonstration of the Mars Rover kata limits its assumptions to the edge of the system, and the initial test is made to pass by simply removing the duplication.

The advantage of such a test is that it is resilient to changes in the design. You can change the arrangement of the internals of the test subject, and the test itself remains relevant.

The disadvantage of such a test is that it is not resilient to changes in the requirements.

It's common in TDD demonstrations to work with a fixed set of constraints throughout the design session. Yes, we tend to introduce the constraints in increments, but taken as a set they tend to be consistent.

The Golden Master approach works just fine under those conditions; we can extend our transcript with descriptions of extensions, and then amend the test subject to match.

But a change in behavior? And suddenly an opaque comparison to the Golden Master fails, and we have to discard all of the bath water in addition to the baby.

We might describe the problem this way: the edge to edge test spans many different behaviors, and a change to a single behavior in the system may butterfly all the way out to the observable behavior. In other words, the same property that makes the test useful when refactoring acts against us when we introduce a modification to the behavior.

One way to side step this is to take as given that a new behavior means a new test subject. We'll flesh out an element from scratch, using the refactoring task in the TDD cycle to absorb our previous work into the new solution. I haven't learned that this is particularly convenient for consumers. "Please update your dependency to the latest version of my library and also change the name you use to call it" isn't a message I expect to be well received by maintainers that haven't already introduced design affordances for this sort of change.

So what else? How do we arrange our tests so that we don't need to start from scratch each time we get a request for a breaking change?

Recently, I happened to be thinking about this check in one of my tests.

When this check fails, we "know" that there is a bug in the test subject.  But why do we know that?

If you squint a bit, you might realize that we aren't really testing the subject in isolation, but rather whether or not the behavior of the subject is consistent with these other elements that we have high confidence in.  "When you hear hoof beats, expect horses not zebras".

Kent Beck describes similar ideas in his discussion of unit tests.

There is a sort of transitive assertion that we can make: if the behavior of the subject is consistent with some other behavior, and we are confident that the other behavior is correct, then we can assume the behavior of the test subject is correct.

What this affords is that we can take the edge to edge test and express the desired behavior as a composition of other smaller behaviors that we are confident in.  The Golden Master can be dynamically generated from the behavior of the smaller elements.

Of course, the confidence in those smaller elements comes from having tests of their own, verifying that those behaviors are consistent with simpler, more trusted elements.  It's turtles all the way down.

In this sort of design, the smaller components in the system act as bulkheads for change.

I feel that I should call out the fact that some care is required in the description of the checks we create in this style.  We should not be trying to verify that the larger component is implemented using some smaller component, but only that its behavior is consistent with that of the smaller component.

Wednesday, March 20, 2019

Isolation at the boundary

Recently, I was looking through Beyond Mock Objects.  Rainsberger invokes one of my favorite dependencies - the system clock.

We’ve made an implicit dependency more explicit, and I value that a lot, but Clock somehow feels like indirection without abstraction. By this I mean that we’ve introduced a seam to improve testability, but that the resulting code exposes details rather than hiding them, and abstractions, by definition, hide details.
If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. 
I agree with this idea - but I would emphasize that the indirection and the better abstraction are different elements in the design.

The boundary represents the part of our design where things become uncertain - we're interacting with elements that aren't under our control.  Because of the uncertainty, measuring risk becomes more difficult.  Therefore, we want to squeeze risk out of the boundary and back toward the core.

What I'm describing here is an adapter: at the outer end, the adapter is pluggable with the system clock; the inner end satisfies the better abstraction -- perhaps the stream of timestamps briefly described by Rainsberger, perhaps a higher abstraction more directly related to your domain.

In other words, one of my design constraints is that I should be able to isolate and exercise my adapters in a controlled test environment.

Let's consider Unusual Spending; our system needs to interact with the vendor environment - reading payments from a payments database, dispatching emails to a gateway.  Since the trigger is supposed to produce "current" reports, we need some kind of input to tell us when "now" is.  So three external ports.  My test environment, therefore, needs substitutes for those three ports; my composition needs the ability to specify the ports.  Because the real external ports aren't going to be part of the development test suite, we want the risk squeezed out of them.

The API of the external port is tightly coupled to the live implementation -- if that changes on us, then we're going to need a new port and a new adapter.  If our inner port abstraction is good, then the adapter acts as a bulkhead, protecting the rest of the solution from the change.

Somewhere in the solution, our composition root will describe how to hook up all of the pieces together.  For instance, if we were constrained to use the default constructor as the composition root, then we might end up with something like:
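A minimal sketch of that shape, under my own naming assumptions. The class name `TriggersUnusualSpendingEmail`, the `Payments` and `Emails` ports, the placeholder `Live...` adapters, and the message text are all hypothetical; only `java.time.Clock` is a real library type.

```java
import java.time.Clock;
import java.util.List;

// Hypothetical inner ports; the names and signatures are illustrative only.
interface Payments {
    List<String> paymentsFor(String userId);
}

interface Emails {
    void send(String userId, String body);
}

// Placeholder adapters for the vendor environment; the real calls are out of scope here.
final class LivePayments implements Payments {
    public List<String> paymentsFor(String userId) {
        throw new UnsupportedOperationException("vendor payments database call goes here");
    }
}

final class LiveEmails implements Emails {
    public void send(String userId, String body) {
        throw new UnsupportedOperationException("vendor email gateway call goes here");
    }
}

final class TriggersUnusualSpendingEmail {
    private final Payments payments;
    private final Emails emails;
    private final Clock clock;

    // Composition root: the default constructor hooks up the live adapters.
    TriggersUnusualSpendingEmail() {
        this(new LivePayments(), new LiveEmails(), Clock.systemUTC());
    }

    // The same composition with the three ports specified, for a controlled test environment.
    TriggersUnusualSpendingEmail(Payments payments, Emails emails, Clock clock) {
        this.payments = payments;
        this.emails = emails;
        this.clock = clock;
    }

    void trigger(String userId) {
        List<String> recent = payments.paymentsFor(userId);
        // Illustrative message only; the real report logic is not part of this sketch.
        emails.send(userId, clock.instant() + ": " + recent.size() + " payments reviewed");
    }
}
```

In a test, the three-argument constructor can take a canned `Payments`, a recording `Emails`, and `Clock.fixed(...)`, so "now" is whatever the test says it is.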

None of this is "driven" by the tests, except in the loose sense that we've noticed that we're going to be running the tests a lot, and therefore need a stable controlled environment.  For example, in his demonstration, Justin Searls decided to couple his temporal logic to java.util.GregorianCalendar rather than java.lang.System.  With the right abstractions in place, the cost of reversing the decision is pretty small - try the simplest thing that could possibly work, prepare to change your mind later.

Sunday, March 17, 2019

TDD: Probes

I've been thinking about TDD this week through the lens of the Unusual Spending kata.

The unusual spending kata is superficially similar to Thomas Mayrhofer's employee report: the behavior of the system is to produce a human readable report. In the case of unusual spending, the interesting part of the report is the body of the email message.

At the API boundary, the body of the email message is a String, which is to say it is an opaque sequence of bytes.  We're preparing to pass the email across a boundary, so it's normal that we transition from domain specific data representations to domain agnostic data representations.

But there's a consequence associated with that -- we're pretty much limited to testing the domain agnostic properties of the value.  We can test the length, the prefix, the suffix; we can match the entire String against a golden master.

What we cannot easily do is extract domain specific semantics back out of the value.  It's not impossible, of course; but the ROI is pretty lousy.

Writing tests that are coupled to an opaque representation like this isn't necessarily a problem; but as Mayrhofer showed, it's awkward to have a lot of tests that are tightly coupled to unstable behaviors.

In the case of the Unusual Spending kata, we "know" that the email representation is unstable because it is up near the top of the value chain; it's close to the human beings in the system - the ones we want to delight so that they keep paying us to do good work.

It's not much of a stretch to extend the Unusual Spending kata with Mayrhofer's changing requirements.  What happens when, after we have shipped our solution, the customers tell us that the items in the email need to be reordered?  What happens after we ship that patch, when we discover that the capitalization of categories also needs to be changed?

Our automated checks provide inputs to some test subject, and measure the outputs.  Between the two lie any number of design decisions.  The long term stability of the check is going to be some sort of sum over the stability of each of the decisions we have made.

Therefore, when we set our probes at the outer boundary of the system, the stability of the checks is weakest.

A way to attack this problem is to have probes at different levels.  One advantage is that we can test subsets of decisions in isolation; aka "unit tests".

Less familiar, but still effective, is that we can use the probes to compare the behaviors we see on different paths.  Scott Wlaschin describes a related idea in his work on property based testing.  Is behavior A consistent with the composition of behaviors B and C?  Is behavior B consistent with the composition of D, E, and F?

There's a little bit of care to be taken here, because we aren't trying to duplicate the implementation in our tests, nor are we trying to enforce a specific implementation.  The "actual" value in our check will be the result of plugging some inputs into a black box function; the "expected" value will plug some (possibly different) inputs into a pipeline.
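Here is one way such a check could look, as a sketch with invented names and message formats. `SpendingEmail.body` plays the black box; `greeting` and `line` are the smaller elements that would have checks of their own. None of these spellings come from the kata itself.

```java
import java.util.List;

// Hypothetical value carrying one category of spending.
final class Spend {
    final String category;
    final int dollars;

    Spend(String category, int dollars) {
        this.category = category;
        this.dollars = dollars;
    }
}

final class SpendingEmail {
    private SpendingEmail() {}

    // Smaller elements; each spelling is a separate decision with its own solitary check.
    static String greeting(String name) {
        return "Hello " + name + "!";
    }

    static String line(Spend spend) {
        return "* You spent $" + spend.dollars + " on " + spend.category;
    }

    // The black box: the check below does not care how this is implemented.
    static String body(String name, List<Spend> spends) {
        StringBuilder text = new StringBuilder(greeting(name)).append('\n');
        for (Spend spend : spends) {
            text.append(line(spend)).append('\n');
        }
        return text.toString();
    }
}

final class SpendingEmailCheck {
    private SpendingEmailCheck() {}

    // "Actual" comes from the black box; "expected" comes from a pipeline of the smaller parts.
    static boolean bodyIsConsistentWithItsParts() {
        Spend groceries = new Spend("groceries", 148);
        Spend travel = new Spend("travel", 928);
        String actual = SpendingEmail.body("Ann", List.of(groceries, travel));
        String expected = String.join("\n",
                SpendingEmail.greeting("Ann"),
                SpendingEmail.line(groceries),
                SpendingEmail.line(travel)) + "\n";
        return actual.equals(expected);
    }
}
```

If the capitalization of categories changes, only `line` and its own check move; this comparison keeps passing because both sides move together.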

Following this through to its logical conclusion: the tests are trying to encourage a design where pipeline segments can be used independently of our branching logic.

Parnas taught us to encapsulate decisions in modules; it's a strategy we can use to mitigate the risk of change.  Similarly, when we are writing tests, we need designs that do not overfit those same decisions.

That in turn suggests an interesting feedback loop - the arrival of a requirement to change behavior may require the addition of a number of "redundant" tests of the existing implementation, so that older tests that were too tightly coupled to the unstable behavior can be decommissioned.

Perhaps this pattern of pipeline equivalencies is useful in creating those separate tests.

Tuesday, March 12, 2019

TDD and the Sunday Roster

I got access to an early draft of this discussion, so I've been chewing on it for a while.

The basic outline is this: given a list of employees, produce a report of which employees are eligible to work on Sunday.

What makes this example interesting to me is that the requirements of the report evolve.  The sample employee roster is the same, the name of the function under test is the same, the shape of the report is the same, but the actual values change over time.

This is very different from exercises like the bowling game, or roman numerals, or Fibonacci, where once a given input is producing the correct result, that behavior gets locked in until the end of the exercise where we live happily ever after.

If we look at Mayrhofer's demonstration, and use a little bit of imagination, the implementation goes through roughly this lifecycle
  1. It begins as an identity function
  2. The identity function is refactored into a list filter using a predicate that accepts all entries
  3. The predicate is replaced
  4. The result is refactored to include a sort using the stable comparator
  5. The comparator is replaced
  6. The result is refactored to include a map using an identity function
  7. The map function is replaced
  8. The comparator is reversed
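The end state of that lifecycle might be sketched as a three-stage stream pipeline. The concrete rules below (an age threshold of 18, ordering by name, upper-casing the name) are assumptions for illustration, as are the `Employee` and `SundayRoster` names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical roster entry.
final class Employee {
    final String name;
    final int age;

    Employee(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

final class SundayRoster {
    private SundayRoster() {}

    // Each constant is one replaceable decision; the specific rules are assumed, not sourced.
    static final Predicate<Employee> ELIGIBLE = e -> e.age >= 18;
    static final Comparator<Employee> ORDER =
            Comparator.comparing((Employee e) -> e.name).reversed();
    static final Function<Employee, String> DISPLAY =
            e -> e.name.toUpperCase(Locale.ROOT);

    // Steps 2 through 8 of the list above: filter, then sort, then map.
    static List<String> report(List<Employee> employees) {
        return employees.stream()
                .filter(ELIGIBLE)
                .sorted(ORDER)
                .map(DISPLAY)
                .collect(Collectors.toList());
    }
}
```

Each step in the list corresponds to swapping one of the three constants while the shape of `report` stays the same.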
It's really hard to see, in the ceremony of patching up the tests, how the tests are paying for themselves -- in the short term, they are overhead costs, in the long term, they are being discarded.

It also doesn't feel like investing more up front on the tests helps matters very much.  One can imagine, for instance, breaking up the unsatisfactory checks that are coupled directly to the report with collections of coarse grained constraints that can be evaluated independently.  But unless that language flows out of you naturally, that's extra work to amortize once again.

It's not clear that TDD is leading to a testing strategy that has costs of change commensurate with the value of the change, nor is it clear to me that the testing strategy is significantly reducing the risk.  Maybe we're supposed to make it up on volume?  Lots of reports, each cheap because they share the same core orchestration logic?

This starts to feel like one of the cases where Coplien has the right idea: that there are more effective ways to address the risk than TDD -- for example, introducing and reviewing assertions?

Notes from my first HN surge

TDD: Hello World was shared on Hacker News.
  • 2500 "pageviews", which to be honest seems awfully small.
  • 500 "pageviews" of the immediately prior essay.
  • 2 comments on HN.
  • 0 comments locally.
A search of HN suggests that TDD isn't a very popular topic; I had to search back about three months to find a link with much discussion going on.  Ironically enough, the subject of that link: "Why Developers Don't TDD".

Monday, March 11, 2019

TDD: Retrospective from a train

I went into Boston this evening for the Software Crafters Meetup.  Unfortunately, the train schedule and other obligations meant that I had to leave the party early.  This evening's exercise was a stripped down version of Game of Life.
Given a 3 x 3 grid, create function that will tell you whether center cell in next generation is dead or live.
Upon review, I see that I got bit a number of different ways.

The first problem was that I skipped the step of preparing a list of test cases to work through.  The whole Proper Planning and Preparation thing is still not a habit.  In this case, I feel that I lost track of two important test cases; one that didn't matter (because we had a passing implementation already) and one where it did (I had lost track of the difference between two neighbors and three neighbors).

One of my partners suggested that "simplest way to pass the test" was to introduce some accidental duplication, and my goodness does that make a mess of things.  Sandi Metz argues that duplication is cheaper than the wrong abstraction; but I'm not even sure this counts as "wrong abstraction" -- even in a laboratory setting, the existing implementation is a powerful attractor for similar code.

Fundamentally, this problem is a simple function, which can be decomposed into three smaller parts:
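One plausible reading of that decomposition, sketched in Java. The split into "count the neighbors", "read the center", and "apply the rule" is my assumption about what the three parts are; `countNeighbors` is the only name taken from the session, and the rule is the standard Game of Life rule.

```java
final class Life {
    private Life() {}

    // Part 1: how many of the eight cells around the center are alive.
    static int countNeighbors(boolean[][] grid) {
        int count = 0;
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                boolean isCenter = (row == 1 && col == 1);
                if (!isCenter && grid[row][col]) {
                    count++;
                }
            }
        }
        return count;
    }

    // Part 2: the current state of the center cell.
    static boolean centerIsAlive(boolean[][] grid) {
        return grid[1][1];
    }

    // Part 3: the rule. Three neighbors always gives life; two neighbors preserves a live cell.
    static boolean aliveNext(boolean alive, int neighbors) {
        return neighbors == 3 || (alive && neighbors == 2);
    }

    // The whole function is the composition of the three parts.
    static boolean nextCenter(boolean[][] grid) {
        return aliveNext(centerIsAlive(grid), countNeighbors(grid));
    }
}
```

Written this way, the difference between two neighbors and three neighbors lives in exactly one expression.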

But we never got particularly close to that; the closest we came was introducing a countNeighbors function in parallel, and then introducing that element as a replacement for our prior code.  We didn't "discover" anything meaningful when refactoring.

I suspect that this is, at least in part, a side effect of the accidental duplication -- the coupling of the output and the input was wrong, and therefore more difficult to refactor into something that was correct.

In retrospect, I think "remove duplication" is putting the emphasis in the wrong place.  Getting the first test to pass by hard coding the correct answer is a great way to complete test calibration in minimal wall clock time.  But before moving on there is a "show your work" step that removes the implicit duplication between the inputs and the outputs.

We talked a bit about the fact that the tests were hard to read; not a problem if they are scaffolding, because we can throw them out, but definitely a concern if we want the tests to stay behind as living documentation or examples.  Of course, making things more readable means more names -- are they part of the test, or is the fact that the tests need them a hint that they will be useful to consumers as well?  Do we need additional tests for those names? How do we make those tests understandable?

Sunday, March 3, 2019

Constraint Driven Development and the Unusual Spending Kata

I recently discovered the Unusual Spending Kata, via testdouble.

One of the things that I liked immediately about the kata is that it introduces some hard constraints.  The most important one is that the entrypoint is fixed; it SHALL conform to some pre-determined contract.

This sort of constraint, common if you are working outside in, is a consequence of the dependency inversion principle.  The framework owns the contract, the plugin implements it.

Uncle Bob's Bowling Game kata takes a similar approach...
Write a class named Game that implements two methods...

In other words, your API design has already happened; in earlier work, in a spike, because you are committed to supporting a specific use case.

When you don't have that sort of constraint in place already, the act of writing your tests is supposed to drive the design of your API.  This is one of the places where we are really employing sapient testing; applying our human judgment to the question "is the API I'm creating suitable for consumers?".

The Unusual Spending Kata introduces two other constraints; constraints on the effects on which the core logic depends.  In order to produce any useful value, the results of the local calculation have to be published, and the kata constrains the solution to respect a particular API to do so.  Similarly, a stateless process is going to need input data from somewhere, and that API is also constrained by the kata.

So the programmer engaging in this exercise needs to align their design with the surface of the boundary.  Excellent.

Because the write constraint and the read constraint are distinct in this kata, it helps to reveal the fact that isolating the test environment is much easier for writes than it is for reads.

The EmailsUser::email is a pure sink; you can isolate yourself from the world by simply replacing the production module with a null object.  I find this realization to be particularly powerful, because it helps unlock one of the key ideas in the doctrine of useful objects -- that if you need to be able to observe a sink in test, it's likely that you also need to be able to observe the sink in production.  In other words, much of the logic that might inadvertently be encoded into your mock objects in practice may really belong in the seam that you use to protect the rest of the system from the decision you have made about which implementation to use.

In contrast, FetchesUserPaymentsByMonth::fetch is a source -- that's data coming into the system, and we need to have a much richer understanding of what that data can look like, so that we can correctly implement a test double that emits data in the right shape.

Implementing an inert source is relatively straightforward; we simply return an empty value each time the method is invoked.
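A minimal sketch of the two inert substitutes, assuming Java shapes for the kata's collaborators. The parameter lists and the Payment type are assumptions made for illustration; the kata's own signatures may differ.

```java
import java.util.Collections;
import java.util.List;

// Assumed shape of the sink: data leaves the system.
interface EmailsUser {
    void email(long userId, String subject, String body);
}

// Hypothetical payment value; the kata leaves this schema to the reader.
final class Payment {
    final long amountInCents;
    final String category;

    Payment(long amountInCents, String category) {
        this.amountInCents = amountInCents;
        this.category = category;
    }
}

// Assumed shape of the source: data enters the system.
interface FetchesUserPaymentsByMonth {
    List<Payment> fetch(long userId, int year, int month);
}

/** Null object for the sink: accepts every message and does nothing with it. */
final class NullEmailsUser implements EmailsUser {
    @Override
    public void email(long userId, String subject, String body) {
        // intentionally empty
    }
}

/** Inert source: always reports that there were no payments. */
final class InertPayments implements FetchesUserPaymentsByMonth {
    @Override
    public List<Payment> fetch(long userId, int year, int month) {
        return Collections.emptyList();
    }
}
```

The asymmetry shows up in the types: the null sink never has to know what a message looks like, while even the inert source has to commit to a Payment shape.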

An inert implementation alone doesn't give you very much coverage of the system under test, of course.  If you want to exercise anything other than the trivial code path in your solution, you are going to need substitutes that emit interesting data, which you will normally arrange within a given test case.

On the other hand, a cache has some interesting possibilities, in so far as you can load the responses that you want to use into the cache during arrange, and then they will be available to the system under test when needed.  The cache can be composed with the read behavior, so you get real production behavior even when using an inert substitute.
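One way the composition might look, as a sketch only: a hypothetical caching decorator with a preload method, wrapped around an assumed fetch interface. None of these names come from the kata.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical payment value and assumed source interface, for illustration only.
final class Payment {
    final long amountInCents;
    final String category;

    Payment(long amountInCents, String category) {
        this.amountInCents = amountInCents;
        this.category = category;
    }
}

interface FetchesUserPaymentsByMonth {
    List<Payment> fetch(long userId, int year, int month);
}

/** Caching decorator: answers from the cache, falling back to the wrapped source on a miss. */
final class CachedPayments implements FetchesUserPaymentsByMonth {
    private final Map<String, List<Payment>> cache = new HashMap<>();
    private final FetchesUserPaymentsByMonth origin;

    CachedPayments(FetchesUserPaymentsByMonth origin) {
        this.origin = origin;
    }

    /** Used during arrange: load the response we want the subject to see. */
    void preload(long userId, int year, int month, List<Payment> payments) {
        cache.put(key(userId, year, month), payments);
    }

    @Override
    public List<Payment> fetch(long userId, int year, int month) {
        return cache.computeIfAbsent(
                key(userId, year, month),
                k -> origin.fetch(userId, year, month));
    }

    private static String key(long userId, int year, int month) {
        return userId + "/" + year + "/" + month;
    }
}
```

In a test, the origin can be an inert source that always returns an empty list; the subject still exercises the real cache lookup path.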

Caching introduces cache invalidation, which is one of the two hard problems.  Loading information into the cache requires having access to the cache, ergo either injecting a configured cache into the test subject or having cache loading methods available as part of the API of the test subject.

Therefore, we may not want to go down the rabbit hole right away.

Another aspect of the source is that the data coming out needs to be in some shared schema.  The consumer and the provider need to understand the same data the same way.

This part of the kata isn't particularly satisfactory - the fact that the constrained connection to our database allows the consumer to specify the schema, with no configuration required...?  The simplest solution is probably to define the payments API as part of the contract, rather than leaving that bit for the client to design.

Wednesday, February 20, 2019

Bowling Driven Design

The bowling game kata is odd.

I've never seen Uncle Bob demonstrate the kata himself; so I can't speak to its presentation or its effectiveness beyond the written description.  It has certainly inspired some number of readers, including myself at various times.


I'm becoming more comfortable with the idea that the practice is just a ritual... a meditation... bottle shaking....

In slide two of the Powerpoint deck, the first requirement is presented: Game must implement a specific API.

What I want to call attention to here: this API is not motivated by the tests. The requirements are described on slide #3, followed by a modeling session on slides #4-#9 describing some of the underlying classes that may appear. As of slide #52 at the end of the deck, none of the classes from the modeling session have appeared in the solution.

Furthermore, over the course of writing the tests, some helper methods are discovered; but those helper methods remain within the boundary of the test -- the API remains fixed.

All of the complexity in the problem is within the scoring logic itself, which is a pure function -- given a complete legal sequence of pin falls, compute the game score. Yet the API of the test subject isn't a function, but a state machine.  The scoring logic never manages to escape from the boundaries of the Game object -- boundaries that were set before we had any tests at all.
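To make the contrast concrete, here is a sketch that places the two shapes side by side. It assumes the kata's Game API is roll(int) plus score(); the Scoring class is hypothetical and does not appear in the deck.

```java
/** The pure function: a complete, legal sequence of pin falls in, a score out. */
final class Scoring {
    private Scoring() {}

    static int score(int... rolls) {
        int total = 0;
        int i = 0; // index of the first roll of the current frame
        for (int frame = 0; frame < 10; frame++) {
            if (rolls[i] == 10) {                       // strike: 10 plus the next two rolls
                total += 10 + rolls[i + 1] + rolls[i + 2];
                i += 1;
            } else if (rolls[i] + rolls[i + 1] == 10) { // spare: 10 plus the next roll
                total += 10 + rolls[i + 2];
                i += 2;
            } else {                                    // open frame
                total += rolls[i] + rolls[i + 1];
                i += 2;
            }
        }
        return total;
    }
}

/** The state machine: accumulates rolls one at a time, then delegates. */
final class Game {
    private final int[] rolls = new int[21]; // at most 21 rolls in a legal game
    private int count = 0;

    void roll(int pins) {
        rolls[count++] = pins;
    }

    int score() {
        return Scoring.score(rolls);
    }
}
```

All of the interesting logic lives in Scoring.score; Game contributes only the bookkeeping needed to collect its argument.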
If you want to learn to think the way I think, to design the way I design, then you must learn to react to minutia the way I react. -- Uncle Bob
What the hell are we driving here?  Some logic, I suppose.

Monday, February 18, 2019

Aggregates: Separation of concerns

A question on Stack Overflow led me to On Aggregates and Domain Service interaction, written by Marco Pivetta in January of 2017.  Of particular interest to me were the comments by Mathias Verraes.

What I recognized is that the description of aggregates given by Mathias is very similar to the description of protocols given by Cory Benfield.  So I wanted to try to write that out, long hand.

Aggregates are information (state), and also logic that describes how to integrate new information with the existing information.  In accordance with the usual guidelines of object oriented development, we package the data structure responsible for tracking the information with the rules for mutating the data structure and the transformations that use the data structure to answer queries.

Because the responsibility of the object is this data structure and its operations, and because this data structure is a local, in memory artifact, there's no room (in the responsibility sense) for effects.

How do we read, or write, information that isn't local to the aggregate?

The short answer is that responsibility goes out into the application layer (which in turn may delegate the responsibility to the infrastructure layer; those details aren't important here).

The aggregate incorporates information and decides what needs to be done; the application layer does it, and reports the results back to the aggregate as new information.

Spelling the same idea a different way - the aggregate is a state machine, and it supports two important queries.  One is "what representation can I use to recover your current state?", so that we can persist the work rather than needing to keep the aggregate live in memory for its entire lifetime.  The other is "what work can the application layer do for you?".

Put another way, we handle the aggregate's demands for remote data asynchronously.  The processing of the command ends when the model discovers that it needs data which isn't available.  The application queries the model, discovering the need for more data, and can then fetch the data.  If the data is available now, it is passed to the aggregate, which can integrate that information into its state.

If the information isn't available now, then we can simply persist the existing work, and resume it later when the information does become available.  This might look like a scheduled callback, for example.
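A rough sketch of that conversation, using an invented example: a Quote aggregate that needs a unit price it does not hold locally. Every name below is hypothetical, and the price list map stands in for whatever remote lookup the application layer would really perform.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical aggregate: tracks state and decides what is needed, but performs no effects. */
final class Quote {
    private final String sku;
    private final int quantity;
    private Integer unitPrice; // null until the application reports it

    Quote(String sku, int quantity) {
        this.sku = sku;
        this.quantity = quantity;
    }

    /** Query 1: a representation from which the current state can be recovered. */
    String snapshot() {
        return sku + ";" + quantity + ";" + unitPrice;
    }

    /** Query 2: what work can the application layer do for me? */
    Optional<String> priceNeededFor() {
        return unitPrice == null ? Optional.of(sku) : Optional.empty();
    }

    /** New information, reported back by the application layer. */
    void priceReported(int price) {
        this.unitPrice = price;
    }

    Optional<Integer> total() {
        return unitPrice == null ? Optional.empty() : Optional.of(unitPrice * quantity);
    }
}

/** Hypothetical application layer: performs the lookup the aggregate asked for. */
final class QuoteApplication {
    private QuoteApplication() {}

    static void advance(Quote quote, Map<String, Integer> priceList) {
        quote.priceNeededFor().ifPresent(sku -> {
            Integer price = priceList.get(sku);
            if (price != null) {
                quote.priceReported(price);
            }
            // Otherwise the caller would persist quote.snapshot() and retry later.
        });
    }
}
```

The aggregate never touches the price list; it only says what it is missing, and integrates the answer when one arrives.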

If your model already understands "time", then it can report its own timing requirements to the application, so that those can also be included in the scheduling.


Refactoring: paint by numbers

I've been working on a rewrite of an API, and I've been trying to ensure that my implementation of the new API has the same behavior as the existing implementation.  This has meant building up a suite of regression checks, and an adapter that allows me to use the new implementation to support a legacy client.

In this particular case, there is a one-to-one relationship between the methods in the old API and the new -- the new variant just uses different spellings for the arguments and introduces some extra seams.

The process has been "test first" (although I would not call the design "test driven").  It begins with a check, using the legacy implementation.  This stage of the exercise is to make sure that I understand how the existing behavior works.

We call a factory method in the production code, which acts as a composition root to create an instance of the legacy implementation. We pass a reference to the interface to a check, which exercises the API through a use case, validating various checks along the way.

Having done this, we then introduce a new test; this one calling a factory method that produces an instance of an adapter, that acts as a bridge between legacy clients and the new API.

The signature of the factory method here is a consequence of the pattern that follows, where I work in three distinct cycles:
  • RED: begin a test calibration, verifying that the test fails
  • GREEN: complete the test calibration, verifying that the test passes
  • REPLACE: introduce the adapter into the mix, and verify that the test continues to pass.
To begin, I create an implementation of the API that is suitable for calibrating a test, by ensuring that a broken implementation fails. This is straightforward; I just need to throw UnsupportedOperationExceptions.

Then I create an abstract decorator, implementing the legacy API by simply dispatching each method to another implementation of the same interface.

And then I define my adapter, which extends the wrapper of the legacy API, and also accepts an instance of the new API.

Finally, with all the plumbing in place, I return a new instance of this class from the factory method.

My implementation protocol then looks like this; first, I run the test using the adapter as is. With no overrides in place, each call in the API gets directed to TEST_CALIBRATION_FACADE, which throws an UnsupportedOperationException, and the check fails.

To complete the test calibration, I override the implementation of the method(s) I need locally, directing them to a local instance of the legacy implementation, like so:
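A compact sketch of the plumbing described above, ending with the GREEN-phase override. Apart from TEST_CALIBRATION_FACADE, every name is a hypothetical stand-in, and the two-method legacy API is invented to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical legacy (v1) API and its existing implementation.
interface LegacyApi {
    String book(String origin, String destination);
    List<String> locations();
}

final class LegacyImpl implements LegacyApi {
    public String book(String origin, String destination) { return origin + "->" + destination; }
    public List<String> locations() { return Arrays.asList("SESTO", "FIHEL"); }
}

// Hypothetical new (v2) API; its methods don't matter until the REPLACE phase.
interface V2Api {}

final class Calibration {
    private Calibration() {}

    /** RED: a broken implementation, so that an unported method makes the check fail. */
    static final LegacyApi TEST_CALIBRATION_FACADE = new LegacyApi() {
        public String book(String origin, String destination) { throw new UnsupportedOperationException(); }
        public List<String> locations() { throw new UnsupportedOperationException(); }
    };
}

/** Abstract decorator: dispatches each method to another implementation of the same interface. */
abstract class LegacyWrapper implements LegacyApi {
    private final LegacyApi target;

    LegacyWrapper(LegacyApi target) { this.target = target; }

    public String book(String origin, String destination) { return target.book(origin, destination); }
    public List<String> locations() { return target.locations(); }
}

/** The adapter: a wrapper of the legacy API that also holds the new API. */
class V2Adapter extends LegacyWrapper {
    protected final V2Api v2;

    V2Adapter(LegacyApi fallback, V2Api v2) {
        super(fallback);
        this.v2 = v2;
    }
}

final class AdapterFactory {
    private AdapterFactory() {}

    /** RED: no overrides, so every call lands on the calibration facade. */
    static LegacyApi red(V2Api v2) {
        return new V2Adapter(Calibration.TEST_CALIBRATION_FACADE, v2);
    }

    /** GREEN: override only the method under test, pointing it at a local legacy instance. */
    static LegacyApi green(V2Api v2) {
        final LegacyApi legacy = new LegacyImpl();
        return new V2Adapter(Calibration.TEST_CALIBRATION_FACADE, v2) {
            @Override
            public String book(String origin, String destination) {
                return legacy.book(origin, destination);
            }
        };
    }
}
```

In this sketch the two stages are separate factory methods only so that both can be shown at once; in practice a single factory method would be edited in place as the cycle advances.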

The test passes, of course, because we're using the same implementation that we used to set up the use case originally.

In the replace phase, the legacy implementation gets inlined here in the factory method, so that I can see precisely what's going on, and I can start moving the implementation details to the new API.

Once I've reached the point that all of the methods have been implemented, I can ditch this scaffolding, and provide an implementation of the legacy interface that delegates all of the work directly to v2; no abstract wrapper required.

There's an interesting mirroring here; the application to model interface is v1 to v2, and then I have a bunch of coordination in the new idiom, but at the bottom of the pile, the v2 implementation is just plugging back into v1 persistence. You can see an example of that here - Booking looks fairly normal, just an orchestration with the repository element. WrappedCargo looks a little bit odd, perhaps -- it's not an "aggregate root" in the usual sense, it's not a domain entity lifted into a higher role. Instead, it's a plumbing element wrapped around the legacy root object (with some additional plumbing to deal with the translations).

Longer term, I'll create a mapping from the legacy storage schema to an entity that understands the V2 API, and eventually swap out the O/RM altogether by migrating the state from the RDBMS to a document store.

Friday, February 8, 2019

TDD: Hello World

As an experiment, I recently tried developing HelloWorld using a "test driven" approach.

You can review the commit history on GitHub.

In Java, HelloWorld is a one-liner -- except that you are trapped in the Kingdom of Nouns, so there is boilerplate to manage.

Now you can implement HelloWorld in a perfectly natural way, and test it -- System.setOut allows you to replace the stream, so the write happens to a buffer that is under the control of the test.
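For illustration, a sketch of that approach; HelloWorldCheck is a hypothetical helper standing in for whatever the test harness does, not code from the commit history.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

final class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}

/** Hypothetical test helper: runs main with System.out redirected to a buffer. */
final class HelloWorldCheck {
    private HelloWorldCheck() {}

    static String captureMain() {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            HelloWorld.main(new String[0]);
        } finally {
            System.setOut(original); // System.out is shared global state; always restore it
        }
        return buffer.toString();
    }
}
```

The try/finally is a reminder that the stream being replaced is shared by everything running in the same JVM.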

It's not entirely clear to me what happens, however, if you have multiple tests concurrently writing to that stream.  The synchronization primitives ensure that each write is atomic, but there is a lot of time for the stream to be corrupted with other writes by the time the test harness gets to inspect the result.

This is why we normally design our tests so that they are isolated from shared mutable state; we want predictable results.  So in HelloWorld, this means we need to be able to ensure that the write happens to an isolated, rather than a shared stream.

So instead of testing HelloWorld::main, we end up testing HelloWorld.writeTo, or some improved spelling of the same idea.

Another pressure that shows up quickly is duplication - the byte sequence we want to test needs to be written into both the test and the implementation.  Again, we've learned patterns for dealing with that -- the data should move toward the test, so we have a function that accepts a message/prompt as an argument (in addition to passing along the target stream).  As an added bonus, we get a more general implementation for free.
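A sketch of where those two pressures lead, assuming the writeTo spelling mentioned earlier; the exact signature is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

final class HelloWorld {
    /** The "more general" implementation: both the stream and the message are arguments. */
    static void writeTo(PrintStream out, String message) {
        out.println(message);
    }

    public static void main(String[] args) {
        writeTo(System.out, "Hello, World!");
    }
}

/** Hypothetical check: the write goes to an isolated buffer rather than a shared stream. */
final class WriteToCheck {
    private WriteToCheck() {}

    static String observed(String message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true);
        HelloWorld.writeTo(out, message);
        return buffer.toString();
    }
}
```

The message now appears in exactly one place per caller, at the cost of an entry point that does little more than supply two arguments.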

Did we really need a more general implementation of HelloWorld?

Another practice that I associate with TDD is using the test as an example of how the subject may be used -- if the test is clumsy, then that's a hint that maybe the API needs some work.  The test needs a mutable buffer, and a PrintStream around it, and then needs to either unpack the contents of the buffer or express the specification as a byte array, when the natural primitive to use is a String literal.

You can, indeed, simplify the API, replacing the buffer with a useful object that serves a similar role.  At which point you either have two parallel code paths in your app (duplication of idea), or you introduce a bunch of additional composition so that the main logic always sees the same interface.

Our "testable" code turns more and more into spaghetti.

Now, it's possible that I simply lack imagination, and that once all of these tests are in place, you'll be able to refactor your way to an elegant implementation.  But to me, it looks like a trash fire.

There's a lesson here, and I think it is: left-pad.

Which is to say, not only is HelloWorld "so simple that there are obviously no deficiencies", but it is also too simple to share: the integration cost required to share the element exceeds the cost of writing it from scratch each time you need it.

Expressed a different way: there is virtually no chance that the duplication is going to burn you, because once written the implementation will not require any kind of coordinated future change (short of a massive incompatibility being introduced in the language runtime itself, in which case you are going to have bigger fires to fight).

Tuesday, February 5, 2019

The Influence of Tests

Some years ago, I became disenchanted with the notion that TDD uses tests to "drive" design in any meaningful way.

I came to notice two things: first, that the tests were just as happy to pass whatever cut and paste hack served as "the simplest thing that could possibly work"; second, that all of the refactoring patterns are reversible.

So what is being test infected buying me?

One interesting constraint on tests is that we want them to be reliable.  If the test subject hasn't changed, then we should get the same collection of observations if we move the test bed in time and space.  This in turn means we need to restrict the tests' interaction with unstable elements -- I/O, the clock, the network, random entropy.  Our test subjects often expect to interact with these elements, so within the test environment we need to be able to provide a substitute.

So one of the design patterns driven by testing is "dependency injection".  Somewhere recently I came across the spelling "configurable dependency", which I think is better.  It helps to sharpen my attention on the fact that we are describing something that we change when we transition from a production environment to a test environment, which in turn suggests certain approaches.

But we're really talking about something more specific: configurable effects or perhaps configurable non-determinism.

The test itself doesn't care much about how much buffer surrounds the effect; but if we allow test coverage to influence us here, then we want the substituted code to be as small as we can manage.  To lean on Gary Bernhardt's terminology, we want the test to be able to control a thin imperative shell.
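A small sketch of a configurable effect, using java.time.Clock as the substitutable element. The Greeter class and its greeting rule are invented for illustration.

```java
import java.time.Clock;
import java.time.LocalTime;

final class Greeter {
    /** Functional core: deterministic, no effects, needs no substitute. */
    static String greeting(int hourOfDay) {
        return hourOfDay < 12 ? "Good morning" : "Good afternoon";
    }

    /** Thin imperative shell: the only code that touches the clock. */
    private final Clock clock;

    Greeter(Clock clock) {
        this.clock = clock;
    }

    String greet() {
        return greeting(LocalTime.now(clock).getHour());
    }
}
```

Production code might pass Clock.systemDefaultZone(), while a test passes Clock.fixed(...); the only code that changes between the two environments is the one-line read of the clock.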

But then what?  We can keep pouring inputs through the shell without introducing any new pressures on the design.
Our designs must consist of many highly cohesive, loosely coupled components, just to make testing easy. -- Kent Beck, Test Driven Development by Example
I came across this recently, and it helps.

A key problem with the outside in approach is that the "costs" of setting up a test are disproportionate to the constraint we are trying to establish.  Composition of the test subject requires us to draw the rest of the owl when all we need is a couple of circles.

To borrow an idea from Dan North, testing all the way from the boundary makes for really lousy examples, because the noise gets in the way of the idea.

The grain of the test should match the grain of the constraint it describes - if the constraint is small, then we should expect that the composition will have low complexity.

What we have then, I think, is a version of testing in which the human author applies a number of heuristics when designing an automated check, to ensure that the subject(s) will exhibit the appropriate properties.  In other words, we're getting a lot of mileage out of aligning the test/subject boundaries before we even get to green.

The kinds of design improvements that we make while refactoring?
There is definitely a family of refactorings that are motivated by the idea of taking some implementation detail and "lifting" it into the testable space. I think that you can fairly say that the (future) test is influencing the design that emerges during the refactoring.

I'm not convinced that we can credit tests for the results that emerge from the Design Dynamo.  My current thinking is that they are playing only a supporting role - repeatedly evaluating compliance with the constraints after each change, but not encouraging the selection of a particular change.

Further Reading

Mark Seemann: The TDD Apostate.
Michael Feathers: Making Too Much of TDD.