Sunday, March 17, 2019

TDD: Probes

I've been thinking about TDD this week through the lens of the Unusual Spending kata.

The Unusual Spending kata is superficially similar to Thomas Mayrhofer's employee report: the behavior of the system is to produce a human-readable report. In the case of unusual spending, the interesting part of the report is the body of the email message.

At the API boundary, the body of the email message is a String, which is to say it is an opaque sequence of bytes.  We're preparing to pass the email across a boundary, so it's normal that we transition from domain specific data representations to domain agnostic data representations.

But there's a consequence associated with that -- we're pretty much limited to testing the domain agnostic properties of the value.  We can test the length, the prefix, the suffix; we can match the entire String against a golden master.
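To make that concrete, here is a rough sketch of what those domain agnostic checks look like. The `email_body` function and its exact wording are my own stand-ins for illustration, not the kata's required output.

```python
def email_body(unusual):
    """Hypothetical formatter: turns {category: amount} into the email text."""
    lines = [
        "Hello card user!",
        "",
        "We have detected unusually high spending on your card in these categories:",
        "",
    ]
    lines += [f"* You spent ${amount} on {category}" for category, amount in unusual.items()]
    lines += ["", "Love,", "", "The Credit Card Company"]
    return "\n".join(lines)


# Assumed golden master, written out by hand for this one input.
GOLDEN_MASTER = (
    "Hello card user!\n"
    "\n"
    "We have detected unusually high spending on your card in these categories:\n"
    "\n"
    "* You spent $148 on groceries\n"
    "* You spent $928 on travel\n"
    "\n"
    "Love,\n"
    "\n"
    "The Credit Card Company"
)

body = email_body({"groceries": 148, "travel": 928})

# All we can cheaply ask of an opaque String: prefix, suffix, length, equality.
assert body.startswith("Hello card user!")
assert body.endswith("The Credit Card Company")
assert len(body) == len(GOLDEN_MASTER)
assert body == GOLDEN_MASTER
```

Every one of these assertions breaks if the greeting, the ordering, or the capitalization changes, even though none of them is really about which categories were unusual.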

What we cannot easily do is extract domain specific semantics back out of the value.  It's not impossible, of course; but the ROI is pretty lousy.

Writing tests that are coupled to an opaque representation like this isn't necessarily a problem; but as Mayrhofer showed, it's awkward to have a lot of tests that are tightly coupled to unstable behaviors.

In the case of the Unusual Spending kata, we "know" that the email representation is unstable because it is up near the top of the value chain; it's close to the human beings in the system - the ones we want to delight so that they keep paying us to do good work.

It's not much of a stretch to extend the Unusual Spending kata with Mayrhofer's changing requirements.  What happens when, after we have shipped our solution, the customers tell us that the items in the email need to be reordered?  What happens after we ship that patch, when we discover that the capitalization of categories also needs to be changed?

Our automated checks provide inputs to some test subject, and measure the outputs.  Between the two lie any number of design decisions.  The long term stability of the check is going to be some sort of sum over the stability of each of the decisions we have made.

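Therefore, when we set our probes only at the outer boundary of the system, our checks are at their least stable: every design decision in the system lies between the inputs we provide and the output we measure.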

A way to attack this problem is to have probes at different levels.  One advantage is that we can test subsets of decisions in isolation; aka "unit tests".
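As a sketch of a probe set at an inner level, the check below measures a domain specific value (a mapping of category to amount) rather than the email text. The function name `unusual_categories` and the "at least 50% more than before" rule are assumptions I'm making for illustration.

```python
def unusual_categories(current_totals, previous_totals, threshold=1.5):
    """Return {category: amount} for categories whose spending grew enough.

    Assumed rule: a category is unusual when the current total is at least
    `threshold` times the previous total. Categories with no previous
    spending are ignored in this sketch.
    """
    return {
        category: amount
        for category, amount in current_totals.items()
        if category in previous_totals
        and amount >= threshold * previous_totals[category]
    }


def test_flags_categories_that_grew_by_half():
    result = unusual_categories(
        {"groceries": 148, "travel": 300},
        {"groceries": 90, "travel": 290},
    )
    assert result == {"groceries": 148}


def test_ignores_steady_spending():
    assert unusual_categories({"travel": 300}, {"travel": 290}) == {}


test_flags_categories_that_grew_by_half()
test_ignores_steady_spending()
```

Nothing in these checks knows about greetings, ordering, or capitalization, so a change to the email wording leaves them alone.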

Less familiar, but still effective, is that we can use the probes to compare the behaviors we see on different paths.  Scott Wlaschin describes a related idea in his work on property based testing.  Is behavior A consistent with the composition of behaviors B and C?  Is behavior B consistent with the composition of D, E, and F?

There's a little bit of care to be taken here, because we aren't trying to duplicate the implementation in our tests, nor are we trying to enforce a specific implementation.  The "actual" value in our check will be the result of plugging some inputs into a black box function; the "expected" value will plug some (possibly different) inputs into a pipeline.
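Here is one way such a check might look, as a sketch. All of the names (`report`, `find_unusual`, `format_body`) and the simplified email wording are hypothetical. In this toy version the black box happens to be built from the same segments, but the check doesn't depend on that.

```python
from collections import defaultdict


def totals_by_category(payments):
    """Sum (category, amount) pairs into {category: total}."""
    totals = defaultdict(int)
    for category, amount in payments:
        totals[category] += amount
    return dict(totals)


def find_unusual(current, previous):
    """Assumed rule: current total is at least 50% more than the previous total."""
    cur, prev = totals_by_category(current), totals_by_category(previous)
    return {c: a for c, a in cur.items() if c in prev and a >= 1.5 * prev[c]}


def format_body(unusual):
    """Pipeline segment: domain value in, opaque String out."""
    lines = [f"* You spent ${amount} on {category}"
             for category, amount in sorted(unusual.items())]
    return "\n".join(["Unusual spending detected:", *lines])


def report(current, previous):
    """The black box under test: payments in, email body out."""
    return format_body(find_unusual(current, previous))


current = [("groceries", 100), ("groceries", 48), ("travel", 300)]
previous = [("groceries", 90), ("travel", 290)]

# "Actual" comes from the black box, fed raw payments.
actual = report(current, previous)

# "Expected" comes from a pipeline segment, fed a different (domain level) input.
expected = format_body({"groceries": 148})

assert actual == expected
```

The check never spells out the email text. If the wording or capitalization in `format_body` changes, both sides of the comparison change together and the check keeps passing; it only fails when the decision about which categories are unusual drifts.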

Following this through to its logical conclusion: the tests are trying to encourage a design where pipeline segments can be used independently of our branching logic.

Parnas taught us to encapsulate decisions in modules; it's a strategy we can use to mitigate the risk of change.  Similarly, when we are writing tests, we need designs that do not overfit those same decisions.

That in turn suggests an interesting feedback loop - the arrival of a requirement to change behavior may require the addition of a number of "redundant" tests of the existing implementation, so that older tests that were too tightly coupled to the unstable behavior can be decommissioned.

Perhaps this pattern of pipeline equivalencies is useful in creating those separate tests.
