Wednesday, March 20, 2019

Isolation at the boundary

Recently, I was looking through Beyond Mock Objects.  Rainsberger invokes one of my favorite dependencies - the system clock.

We’ve made an implicit dependency more explicit, and I value that a lot, but Clock somehow feels like indirection without abstraction. By this I mean that we’ve introduced a seam3 to improve testability, but that the resulting code exposes details rather than hiding them, and abstractions, by definition, hide details.
If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. 
I agree with this idea - but I would emphasize that the indirection and the better abstraction are different elements in the design.

The boundary represents the part of our design where things become uncertain - we're interacting with elements that aren't under our control.  Because of the uncertainty, measuring risk becomes more difficult.  Therefore, we want to squeeze risk out of the boundary and back toward the core.

What I'm describing here is an adapter: at the outer end, the adapter is plugable with the system clock; the inner end satisfies the better abstraction -- perhaps the stream of timestamps briefly described by Rainsberger, perhaps a higher abstraction more directly related to your domain.

In other words, one of my design constraints is that I should be able to isolate and exercise my adapters in a controlled test environment.

Let's consider Unusual Spending; our system needs to interact with the vendor environnment - reading payments from a payments database, dispatching emails to a gateway.  Since the trigger is supposed to produce "current" reports, we need some kind of input to tell us when "now" is.  So three external ports.  My test environment, therefore, needs substitutes for those three ports; my composition needs the ability to specify the ports.  Because the real external reports aren't going to be part of the development test suite, we want the risk squeezed out of them.

The API of the external port is tightly coupled to the live implementation -- if that changes on us, then we're going to need a new port and a new adapter.  If our inner port abstraction is good, then the adapter acts as a bulkhead, protecting the rest of the solution from the change.

Somewhere in the solution, our composition root will describe how to hook up all of the pieces together.  For instance, if we were constrained to use the default constructor as the composition root, then we might end up with something like:

None of this is "driven" by the tests, except in the loose sense that we've noticed that we're going to be running the tests a lot, and therefore need a stable controlled environment.  For example, in his demonstration, Justin Searls decided that couple his temporal logic to java.util.GregorianCalendar rather than java.lang.System.  With the right abstractions in place, the cost of reversing the decision is pretty small - try the simplest thing that could possibly work, prepare to change your mind later.

Sunday, March 17, 2019

TDD: Probes

I've been thinking about TDD this week through the lens of the Unusual Spending kata.

The unusual spending kata is superficially similar to Thomas Mayrhofer's employee report: the behavior of the system is to produce a human readable report. In the case of unusual spending, the interesting part of the report is the body of the email message.

At the API boundary, the body of the email message is a String, which is to say it is an opaque sequence of bytes.  We're preparing to pass the email across a boundary, so it's normal that we transition from domain specific data representations to domain agnostic data representations.

But there's a consequence associated with that -- we're pretty much limited to testing the domain agnostic properties of the value.  We can test the length, the prefix, the suffix; we can match the entire String against a golden master.

What we cannot easily do is extract domain specific semantics back out of the value.  It's not impossible, of course; but the ROI is pretty lousy.

Writing tests that are coupled to an opaque representation like this isn't necessarily a problem; but as Mayrhofer showed, it's awkward to have a lot of tests that are tightly coupled to unstable behaviors.

In the case of the Unusual Spending kata, we "know" that the email representation is unstable because it is up near the top of the value chain; it's close to the human beings in the system - the one's we want to delight so that they keep paying us to do good work.

It's not much of a stretch to extend the Unusual Spending kata with Mayrhofer's changing requirements.  What happens to us after we have shipped our solution, the customers tell us that the items in the email need to be reordered?  What happens after we ship that patch, when we discover that the capitalization of categories also needs to be changed?

Our automated checks provide inputs to some test subject, and measure the outputs.  Between the two lie any number of design decisions.  The long term stability of the check is going to be some sort of sum over the stability of each of the decisions we have made.

Therefore, when we set our probes at the outer boundary of the system, the stability of the checks is weakest.

A way to attack this problem is to have probes at different levels.  One advantage is that we can test subsets of decisions in isolation; aka "unit tests".

Less familiar, but still effective, is that we can use the probes to compare the behaviors we see on different paths.  Scott Wlaschin describes a related idea in his work on property based testing.  Is behavior A consistent with the composition of behaviors B and C?  Is behavior B consistent with the composition of D, E, and F?

There's a little bit of care to be taken here, because we aren't trying to duplicate the implementation in our tests, nor are we trying to enforce a specific implementation.  The "actual" value in our check will be the result of plugging some inputs into a black box function; the "expected" value will plug some (possibly different) inputs into a pipeline.

Following this through to its logical conclusion: the tests are trying to encourage a design where pipeline segments can be used independently of our branching logic.

Parnas taught us to encapsulate decisions in modules; it's a strategy we can use to mitigate the risk of change.  Similarly, when we are writing tests, we need designs that do not overfit those same decisions.

That in turn suggests a interesting feedback loop - the arrival of requirement to change behavior may require the addition of a number of "redundant" tests of the existing implementation, so that older tests that were too tightly coupled to the unstable behavior can be decommissioned.

Perhaps this pattern of pipeline equivalencies is useful in creating those separate tests.

Tuesday, March 12, 2019

TDD and the Sunday Roster

I got access to an early draft of this discussion, so I've been chewing on it for a while.

The basic outline is this: given a list of employees, produce a report of which employees are eligible to work on Sunday.

What makes this example interesting to me is that the requirements of the report evolve.  The sample employee roster is the same, the name of the function under test is the same, the shape of the report is the same, but the actual values change over time.

This is very different from exercises like the bowling game, or roman numerals, or Fibonacci, where once a given input is producing the correct result, that behavior gets locked in until the end of the exercise where we live happily ever after.

If we look at Mayrhofer's demonstration, and use a little bit of imagination, the implementation goes through roughly this lifecycle
  1. It begins as an identity function
  2. The identity function is refactored into a list filter using a predicate that accepts all entries
  3. The predicate is replaced
  4. The result is refactored to include a sort using the stable comparator
  5. The comparator is replaced
  6. The result is refactored to include a map using an identity function
  7. The map function is replaced
  8. The comparator is reversed
It's really hard to see, in the ceremony of patching up the tests, how the tests are paying for themselves -- in the short term, they are overhead costs, in the long term, they are being discarded.

It also doesn't feel like investing more up front on the tests helps matters very much.  One can imagine, for instance, breaking up the unsatisfactory checks that are coupled directly to the report with collections of coarse grained constraints that can be evaluated independently.  But unless that language flows out of you naturally, that's extra work to amortize once again.

It's not clear that TDD is leading to a testing strategy that has costs of change commiserate with the value of the change, nor is it clear to me that the testing strategy is significantly reducing the risk.  Maybe we're supposed to make it up on volume?  Lots of reports, each cheap because they share the same core orchestration logic?

This starts to feel like one of the cases where Coplien has the right idea: that there are more effective ways to address the risk than TDD -- for example, introducing and reviewing assertions?

Notes from my first HN surge

TDD: Hello World was shared on Hacker News.
  • 2500 "pageviews", which to be honest seems awfully small.
  • 500 "pageviews" of the immediately prior essay.
  • 2 comments on HN.
  • 0 comments locally.
A search of HN suggests that TDD isn't a very popular topic; I had to search about back about three months to find a link with much discussion going on.  Ironically enough, the subject of that link: "Why Developers Don't TDD".

Monday, March 11, 2019

TDD: Retrospective from a train

I went into Boston this evening for the Software Crafters Meetup.  Unfortunately, the train schedule and other obligations meant that I had to leave the party early.  This evenings exercise was a stripped down version of Game of Life.
Given a 3 x 3 grid, create function that will tell you whether center cell in next generation is dead or live.
 Upon review, I see that I got bit a number of different ways.

The first problem was that I skipped the step of preparing a list of test cases to work through.  The whole Proper Planning and Preparation thing is still not a habit.  In this case, I feel that I lost track of two important test cases; one that didn't matter (because we had a passing implementation already) and one where it did (I had lost track of the difference between two neighbors and three neighbors).

One of my partners suggested that "simplest way to pass the test" was to introduce some accidental duplication, and my goodness does that make a mess of things.  Sandi Metz argues that duplication is cheaper than the wrong abstraction; but I'm not even sure this counts as "wrong abstraction" -- even in a laboratory setting, the existing implementation is a powerful attractor for similar code.

Fundamentally, this problem is a simple function, which can be decomposed into three smaller parts:

But we never got particularly close to that; the closest we came was introducing a countNeighbors function in parallel, and then introducing that element as a replacement for our prior code.  We didn't "discover" anything meaningful when refactoring.

I suspect that this is, at least in part, a side effect of the accidental duplication -- the coupling of the output and the input was wrong, and therefore more difficult to refactor into something that was correct.

In retrospect, I think "remove duplication" is putting the emphasis in the wrong place.  Getting the first test to pass by hard coding the correct answer is a great way to complete test calibration in minimal wall clock time.  But before moving on there is a "show your work" step that removes the implicit duplication between the inputs and the outputs.

We talked a bit about the fact that the tests were hard to read; not a problem if they are scaffolding, because we can throw them out, but definitely a concern if we want the tests to stay behind as living documentation or examples.  Of course, making things more readable means more names -- are they part of the test, or is the fact that the tests needs them a hint that they will be useful to consumers as well?  Do we need additional tests for those names? How do we make those tests understandable?

Sunday, March 3, 2019

Constraint Driven Development and the Unusual Spending Kata

I recently discovered the Unusual Spending Kata, via testdouble.

One of the things that I liked immediately about the kata is that it introduces some hard constraints.  The most important one is that the entrypoint is fixed; it SHALL conform to some pre-determined contract.

This sort of constraint, common if you are working outside in, is a consequence of the dependency inversion principle.  The framework owns the contract, the plugin implements it.

Uncle Bob's Bowling Game kata takes a similar approach...
Write a class named Game that implements two methods...

In other words, you API design has already happened; in earlier work, in a spike, because you are committed to supporting a specific use case.

When you don't have that sort of constraint in place already, the act of writing your tests is supposed to drive the design of your API.  This is one of the places where we are really employing sapient testing; applying our human judgment to the question "is the API I'm creating suitable for consumers?".

The Unusual Spending Kata introduces two other constraints; constraints on the effects on which the core logic depends.  In order to produce any useful value, the results of the local calculation have to be published, and the kata constrains the solution to respect a particular API to do so.  Similarly, a stateless process is going to need input data from somewhere, and that API is also constrained by the kata.

So the programmer engaging in this exercise needs to align their design with the surface of the boundary.  Excellent.

Because the write constraint and the read constraint are distinct in this kata, it helps to reveal the fact that isolating the test environment is much easier for writes than it is for reads.

The EmailsUser::email is a pure sink; you can isolate yourself from the world by simply replacing the production module with a null object.  I find this realization to be particularly powerful, because it helps unlock one of the key ideas in the doctrine of useful objects -- that if you need to be able to observe a sink in test, it's likely that you also need to be able to observe the sink in production.  In other words, much of the logic that might inadvertently be encoded into your mock objects in practice may really belong in the seam that you use to protect the rest of the system from the decision you have made about which implementation to use.

In contrast, FetchesUserPaymentsByMonth::fetch is a source -- that's data coming into the system, and we need to have a much richer understanding of what that data can look like, so that we can correctly implement a test double that emits data in the right shape.

Implementing an inert source is relatively straight forward; we simply return an empty value each time the method is invoked.

An inert implementation alone doesn't give you very much coverage of the system under test, of course.  If you want to exercise anything other than the trivial code path in your solution, you are going to need substitutes that emit interesting data, which you will normally arrange within a given test case.

On the other hand, a cache has some interesting possibilities, in so far as you can load the responses that you want to use into the cache during arrange, and then they will be available to the system under test when needed.  The cache can be composed with the read behavior, so you get real production behavior even when using an inert substitute.

Caching introduces cache invalidation, which is one of the two hard problems.  Loading information into the cache requires having access to the cache, ergo either injecting a configured cache into the test subject or having cache loading methods available as part of the API of the test subject.

Therefore, we may not want to go down the rabbit hole right away.

Another aspect of the source is that the data coming out needs to be in some shared schema.  The consumer and the provider need to understand the same data the same way.

This part of the kata isn't particularly satisfactory - the fact that the constrained connection to our database allows the consumer to specify the schema, with no configuration required...?  The simplest solution is probably to define the payments API as part of the contract, rather than leaving that bit for the client to design.