Sunday, June 30, 2019

Usage Kata

The usage kata is intended as an experiment in applying Test Driven Development at the program boundary.

Create a command line application with the following behavior:

The command supports a single option: --help

When the command is invoked with the help option, a usage message is written to STDOUT, and the program exits successfully.

When the command is invoked without the help option, a diagnostic is written to STDERR, and the program exits unsuccessfully.

Example:
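Sketched in Java (the class name, usage text, and diagnostic text are placeholders, not part of the kata):

```java
public class UsageKata {
    public static void main(String[] args) {
        if (args.length == 1 && "--help".equals(args[0])) {
            // help option: usage message to STDOUT, successful exit
            System.out.println("usage: app [--help]");
            System.exit(0);
        }
        // anything else: diagnostic to STDERR, unsuccessful exit
        System.err.println("app: unrecognized invocation; try --help");
        System.exit(1);
    }
}
```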

Sunday, June 23, 2019

Variations on the Simplest Thing That Could Possibly Work

Simplest Thing that Could Possibly Work is an unblocking technique.

Once we get something on the screen, we can look at it. If it needs to be more we can make it more. Our problem is we've got nothing.

Hunt The Wumpus begins with a simple prompt, which asks the human operator whether she would like to review the game instructions before play begins.
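Something like this sketch, where the spelling of the prompt is an assumption borrowed from the classic BASIC listing:

```java
public class Wumpus {
    public static void main(String[] args) {
        // The legacy game opens by asking whether to show the instructions.
        System.out.println("INSTRUCTIONS (Y-N)");
    }
}
```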


That's a decent approximation of the legacy behavior.

But other behaviors may also be of interest, either as replacements for the legacy behavior, or supported alternatives.

For instance, we might discover that the prompt should behave like a diagnostic message, rather than like data output, in which case we'd be interested in something like
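(Again just a sketch, with the same assumed prompt spelling.)

```java
public class Wumpus {
    public static void main(String[] args) {
        // The prompt behaves like a diagnostic: it goes to STDERR rather than STDOUT.
        System.err.println("INSTRUCTIONS (Y-N)");
    }
}
```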

Or we might decide that UPPERCASE lacks readability

Or that the input hints should appear in a different order


And so on.

The point here being that even this simple line of code spans many different decisions.  If those decisions aren't stable, then neither is the behavior of the whole.

When we have many tests evaluating the behavior of an unstable decision, the result is a brittle test suite.

A challenging aspect to this: within the scope of a single design session, behaviors tend to be stable.  "This is the required behavior, today."  If we are disposing of the tests at the end of our design session, then there's no great problem to solve here.


On the other hand, if the tests are expected to be viable for many design sessions, then protecting the tests from the unstable decision graph constrains our design still further.

One way to achieve a stable decision graph is to enforce a constraint that new behaviors are added by extension, and that new behaviors will be delivered beside the old, and that clients can choose which behavior they prefer.  There's some additional overhead compared with making the change in "one" place.
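A minimal sketch of delivering the new behavior beside the old (the names are invented for illustration):

```java
interface InstructionsPrompt {
    String text();
}

// The legacy behavior stays exactly as it was...
class LegacyInstructionsPrompt implements InstructionsPrompt {
    public String text() { return "INSTRUCTIONS (Y-N)"; }
}

// ...and the new behavior is added by extension, beside the old; clients choose
// which implementation they plug in.
class MixedCaseInstructionsPrompt implements InstructionsPrompt {
    public String text() { return "Instructions (y-n)"; }
}
```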

Another approach is to create bulkheads within the design, so that only single elements are tightly coupled to a specific decision, and the behavior of compositions is evaluated in comparison to their simpler elements.  James Shore describes this approach in more detail within his Testing Without Mocks pattern language.

What I haven't seen yet: a good discussion of when.  Do we apply YAGNI, and defend against the brittleness on demand?  Do we speculate in advance, and invest more in the design to insure against an uncertain future?  Is there a checklist that we can work, to reduce the risk that we foul our process for reducing the risk?

Thursday, June 20, 2019

Design Decisions After the First Unit Test

Recently, I turned my attention back to one of my early "unit" tests in Hunt The Wumpus.

This is an "outside-in" test, because I'm still curious to learn different ways that the tests can drive the designs in our underlying implementations.

Getting this first test to pass is about Hello World difficulty level -- just write out the expected string.
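A sketch of that situation, assuming a JUnit harness and names of my own invention:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.junit.jupiter.api.Test;

class WumpusTest {
    // The outside-in check pins down the expected prompt...
    @Test
    void promptsForInstructions() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Wumpus.writeInstructionsPrompt(new PrintStream(buffer, true));
        assertEquals("INSTRUCTIONS (Y-N)\n", buffer.toString());
    }
}

class Wumpus {
    // ...and the simplest passing implementation writes out the same string literal.
    static void writeInstructionsPrompt(PrintStream out) {
        out.print("INSTRUCTIONS (Y-N)\n");
    }
}
```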

At that point, we have this odd code smell - the same string literal is appearing in two different places. What does that mean, and what do we need to do about it?

One answer, of course, is to ignore it, and move on to some more interesting behavior.

Another point of attack is to go after the duplication directly. If you can see it, then you can tease it apart and start naming it. JBrains has described the design dynamo that we can use to weave a better design out of the available parts and a bit of imagination.

To be honest, I find this duplication a bit difficult to attack that way. I need another tool for bootstrapping.

It's unlikely to be a surprise that my tool of choice is Parnas. Can we identify the decision, or chain of decisions, that contribute to this particular string being the right answer?

In this case, the UPPERCASE spelling of the prompt helps me to discover some decisions; what if, instead of shouting, we were to use mixed case?

This hints that, perhaps, somewhere in the design there is a string table that defines what "the" correct representation of this prompt might be.

Given such a table, we can then take this test and divide it evenly into two parts - a sociable test that verifies that the interactive shell behaves like the string table, and a second solitary test that the string table is correct.
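As a sketch (the names, and the string table itself, are assumptions):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.junit.jupiter.api.Test;

class PromptTest {
    // Solitary: the string table holds the correct spelling.
    @Test
    void stringTableSpellsThePrompt() {
        assertEquals("INSTRUCTIONS (Y-N)", Strings.INSTRUCTIONS_PROMPT);
    }

    // Sociable: the interactive shell behaves like the string table.
    @Test
    void shellAgreesWithTheStringTable() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new Shell(new PrintStream(buffer, true)).promptForInstructions();
        assertEquals(Strings.INSTRUCTIONS_PROMPT + "\n", buffer.toString());
    }
}

final class Strings {
    static final String INSTRUCTIONS_PROMPT = "INSTRUCTIONS (Y-N)";
}

class Shell {
    private final PrintStream out;

    Shell(PrintStream out) { this.out = out; }

    void promptForInstructions() {
        out.print(Strings.INSTRUCTIONS_PROMPT + "\n");
    }
}
```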

If we squint a bit, we might see that the prompt itself is composed of a number of decisions -- the new line terminator for the prompt, the format for displaying the hints about acceptable responses, the ordering of those responses, the ordering of the prompt elements (which we might want to change if the displayed language were right to left instead of left to right).

The prompt itself looks like a single opaque string, but there is duplication between the spelling of the hints and the checks that the shell will perform against the user input.

Only a single line of output, and already there is a lot of candidate variation that we will want to have under control.

Do we need to capture all of these variations in our tests? I believe that depends on the stability of the behavior. If what we are creating is a clone of the original behavior -- well, that behavior has been stable for forty years. It is pretty well understood, and the risk that we will need to change it after all this time is pretty low. On the other hand, if we are intending an internationalized implementation, then the English spellings are only the first increment, and we will want a test design that doesn't require a massive rewrite after each change.

Sunday, May 5, 2019

Testing at the seams

A seam is a place where you can alter behaviour in your program without editing in that place -- Michael Feathers, Working Effectively with Legacy Code
When I'm practicing Test Driven Development, I prefer to begin on the outside of the problem, and work my way inwards.  This gives me the illusion that I am discovering the pieces that I need; no abstraction is introduced in the code without at least one consumer, and the "ease of use" concern gets an immediate evaluation.

As an example, let's consider the case of an interactive shell.  We can implement a simple shell using java.lang.System, which gives us access to System.in, System.out, System::getenv, System::currentTimeMillis, and so on.

We probably don't want our test subjects to be coupled to System, however, because that's a shared resource.  Developer tests should be embarrassingly parallel; shared mutable resources get in the way of that.

By introducing a seam that we can plug System into, we get the decoupling that we need in our tests.
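A sketch of such a seam (the names here are mine, not a prescribed API):

```java
import java.io.InputStream;
import java.io.PrintStream;

// The shell depends on this seam rather than on java.lang.System directly.
interface Terminal {
    InputStream in();
    PrintStream out();
    long currentTimeMillis();
}

// In production, System plugs straight into the seam...
final class SystemTerminal implements Terminal {
    public InputStream in() { return System.in; }
    public PrintStream out() { return System.out; }
    public long currentTimeMillis() { return System.currentTimeMillis(); }
}

// ...while each test can supply its own isolated substitute built on in-memory streams.
```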


If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. -- JB Rainsberger, 2013
My thinking is that this is correctly understood as two different steps; we introduce the indirection, and we also try to discover the better abstraction.

But I am deliberately trying to avoid committing to an abstraction prematurely.  In particular, I don't want to invest in an abstraction without first accumulating evidence that it is a good one.  I don't want to make changes expensive when change is still likely - the investment odds are all wrong.

Saturday, April 20, 2019

Sketching Evil Hangman, featuring GeePaw

GeePaw Hill christened his Twitch channel this week with a presentation of his approach to TDD, featuring an implementation of Evil Hangman.

Evil Hangman is a mimic of the traditional word guessing game, with a twist -- evil hangman doesn't commit to a solution immediately.  It's a good mimic - the observable behavior of the game is entirely consistent with a fair implementation that has committed to some word in the corpus.  But it will beat you unfairly if it can.

So, as a greenfield problem, how do you get started?

From what I can see, there are three approaches that you might take:
  • You can start with a walking skeleton, making some arbitrary choices about I/O, and work your way inward.
  • You can start with the functional core, and work your way inward.
  • You can start with an element from a candidate design, and work your way outward.
GeePaw, it seems, is a lot more comfortable when he has pieces that he can microtest.  I got badly spooked on that approach years ago when I looked into the sudoku exercise.  So my preference is to choose an observable behavior that I understand, fake it, remove the duplication, and then add additional tests to the suite that force me to extend the solution's design.

Ultimately, if you examine the complexity of the test subjects, you might decide that I'm writing "integrated" or "acceptance" tests.  From my perspective, I don't care - the tests are fast, decoupled from the environment, and embarrassingly parallel.  Furthermore, the purpose of the tests is to facilitate my design work, not to prove correctness.

What this produces, if I do it right, is tests that are resilient to changes in the design, but which may be brittle to changes in requirements.

My early tests, especially in greenfield work, tend to be primitive obsessed.  All I have in the beginning are domain agnostic constructs to work with, so how could they be anything but?  I don't view this as a failing, because I'm working outside in -- which is to say that my tests are describing the boundary of my system, where things aren't object oriented.  Primitives are the simplest thing that could possibly work, and allow me to move past my writer's block into having arguments with the code.

As a practice exercise, I would normally choose to start from the boundary of the functional core -- we aren't being asked to integrate with anything in particular, and my experiences so far haven't suggested that there is a lot of novelty there.
One should not ride in the buggy all the time. One has the fun of it and then gets out.
So, where to begin?

I'm looking for a function - something that will accept some domain agnostic arguments and return a domain agnostic value that I can measure.

Here, we know that the basic currency is that the player will be making guesses, and the game will be responding with clues.  So we could think in terms of a list of string inputs and a list of string outputs.  But the game also has hidden state, and I know from hard lessons that making that state an input to the test function will make certain kinds of verification easier.

The tricky bit, of course, is that I won't always know what that hidden state is until I get into the details.  I may end up discovering that my initial behaviors depend on some hidden variable that I hadn't considered as part of the API, and I'll need to remediate that later.

In this case, one of the bits of state is the corpus - the list of words that the game has to choose from.  Using a restricted word list makes it easier to specify the behavior of the implementation.  For instance, if all of the words in the corpus are the same length, then we know exactly how many dashes are going to be displayed in the initial hint.  If there is only a single word in the corpus, then we know exactly how the app will respond to any guess.

Making the corpus an input is the affordance that we need to specify degenerate cases.

Another place where degeneracy might be helpful is allowing the test to control the player's mistake allotment.  Giving the human player no tolerance for errors allows us to explore endgame behavior much more easily.
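A sketch of the kind of entry point this suggests; the names and shapes here are my guesses, not anything taken from GeePaw's session:

```java
import java.util.List;

// Corpus, mistake allotment, and guesses all arrive as arguments, so a test can
// construct degenerate cases directly; the clues come back as plain strings.
// With a one-word corpus and zero allowed mistakes, the expected output is fully determined.
interface EvilHangman {
    List<String> play(List<String> corpus, int allowedMistakes, List<String> guesses);
}
```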

And if we don't think of these affordances in our initial guess?  That's fine - we introduce a new entry point with an "extract method refactoring", eliminating duplication by having the original test subject delegate its behavior to our improved API, deprecating the original, and eliminating it when it is no longer required.

Simplest thing that can possibly work is a proposal, not a promise.

For the most part, that's what my sketches tend to look like: some exploration of the problem space, a guess at the boundary, and very little early speculation about the internals.

Friday, April 19, 2019

TDD and incremental delivery

I spend a lot of time thinking about breaking tests, and what that means about TDD as a development ritual.

I recently found a 2012 essay by Steven Thomas reviewing Jeff Patton's 2007 Mona Lisa analogy.  This in turn got me thinking about iteration specifically.

A lot of the early reports of Extreme Programming came out of Chrysler Comprehensive Compensation, and there's a very interesting remark in the post mortem
Subsequent launches of additional pay populations were wanted by top management within a year.
To me, that sounds like shorthand for the idea that the same (very broad) use case was to be extended to cover a larger and more interesting range of inputs with only minor changes to the behaviors already delivered.

The tests that we leave behind serve to describe the constraints necessary to harvest the low hanging fruit, as best we understood them at the time, and to identify regressions when the next layer of complexity is introduced to the mix.

We're writing more living documentation because we are expecting to come back to this code next year, or next month, or next sprint.

I envision something like a ten case switch statement -- we'll implement the first two cases now, to cover perhaps a third of the traffic, and then defer the rest of the work until "later", defined as far enough away that the context has been evicted from our short term memory.
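A sketch of that shape (the case names and pricing are invented):

```java
public class Pricing {
    // Two cases implemented now, covering most of today's traffic; the remaining
    // cases are deferred until "later".
    static double price(String category, double basePrice) {
        switch (category) {
            case "STANDARD":
                return basePrice;
            case "CLEARANCE":
                return basePrice * 0.5;
            default:
                throw new UnsupportedOperationException("not priced yet: " + category);
        }
    }
}
```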

If the requirements for the behaviors that you implemented in the first increment are not stable, then there is non trivial risk that you'll need several iterations to get those requirements right.  Decisions change, and the implications of those changes are going to ripple to the nearest bulkhead, in which case we may need a finer grain testing strategy than we would if the requirements were stable.

At the other extreme, I'm working on an old legacy code base; this code base has a lot of corners that are "done" -- modules that haven't changed in many years.  Are we still profiting by running those tests?

This is something we should keep in mind as we kata.  If we want to be preparing for increments with significant time intervals between them, then we need bigger input spaces with stable requirements.

A number of the kata are decent on the stable requirements bit -- Roman numerals haven't changed in a long time -- but tend to be too small to justify not solving the whole thing in one go.  Having done that, you can thank your tests and let them go.

The best of the kata I'm familiar with for this approach would be the Gilded Rose - we have a potentially unlimited catalog of pricing rules for items, so we'll incrementally adapt the pricing logic until the entire product catalog is covered.

But - to do that without breaking tests, we need stable pricing rules, and we need to know in advance which products follow which rules.  If we were to naively assume, for example, that Sulfuras is a normal item, and we used it as part of our early test suite, then the updated behavior would break those tests.  (Not an expensive break, in this case -- we'd likely be able to replace Sulfuras with some other normal item, and get on with it).

Expressing the same idea somewhat differently: in an iterative approach, we might assume that Sulfuras is priced normally, and then make adjustments to the tests until they finally describe the right pricing constraints; in an incremental approach, Sulfuras would be out of scope until we were ready to address it.

I think the scheduling of refactoring gets interesting in an incremental approach - how much refactoring do you do now, when it is still uncertain which increment of work you will address next?  Is Shameless Green the right conclusion for a design session?

The Sudoku problem is one that I struggle to classify.  On the one hand, the I/O is simple, and the requirements are stable, so it ought to hit the sweet spot for TDD.  You can readily imagine partitioning the space of sudoku problems into trivial, easy, medium, hard, diabolical sets, and working on one grouping at a time, and delivering each increment in turn.

On the other hand, Dr Norvig showed that, if you understand the problem, you can simply draw the rest of the fucking owl. The boundaries between the categories are not inherent to the problem space.

Wednesday, April 10, 2019

Read Models vs Write Models

At Stack Exchange, I answered a question about DDD vs SQL, which resulted in a question about CQRS that I think requires more detail than is appropriate for that setting.
The "read model" is not the domain model (or part of it)? I am not an expert on CQRS, but I always thought the command model is quite different from the classic domain model, but not the read model. So maybe you can give an example for this?
So let's lay some groundwork.

A domain model is not a particular diagram; it is the idea that the diagram is intended to convey.  It is not just the knowledge in the domain expert's head; it is a rigorously organized and selective abstraction of that knowledge.  -- Eric Evans, 2003.
A Domain Model creates a web of interconnected objects, where each object represents some meaningful individual, whether as large as a corporation or as small as a single line on an order form.  -- Martin Fowler, 2003.
I think that Fowler's definition is a bit tight; there's no reason that we should need to use a different term when modeling with values and functions, rather than objects.

I think it is important to be sensitive to the fact that in some contexts we are talking about the abstraction of expert knowledge, and in others we are talking about an implementation that approximates that abstraction.

Discussions of "read model" and "write model" almost always refer to the implemented approximations.  We take a single abstraction of domain knowledge, and divide our approximation of it into two parts - one that handles our read use cases, and another that handles our write use cases.

When we are handling a write, there are usually constraints to ensure the integrity of the information that we are modeling.  That might be as simple as a constraint that we not overwrite information that was previously written, or it might mean that we need to ensure that new writes are consistent with the information already written.

So to handle a write, we will often take information from our durable store, load it into volatile memory, then create from that information a structure in memory into which the new information will be integrated.  The "domain logic" calculates new information, which is written back to the durable store.

On the other hand, reads are safe; "asking the question shouldn't change the answer".  In that case, we don't need the domain logic, because we aren't going to integrate new information. We can take the on disk representation of the information, and transform it directly into our query response, without passing through the intermediate representations we would use when writing.
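A sketch of the asymmetry; every name below is illustrative, not a prescribed design:

```java
import java.util.HashMap;
import java.util.Map;

public class Accounts {
    private final Map<String, Long> store = new HashMap<>(); // stand-in for the durable store

    // Write side: load the current state, let the "domain logic" integrate the new
    // information (here, a rule rejecting non-positive deposits), write the result back.
    public void deposit(String accountId, long amountInCents) {
        if (amountInCents <= 0) {
            throw new IllegalArgumentException("deposits must be positive");
        }
        long current = store.getOrDefault(accountId, 0L);   // durable -> volatile
        store.put(accountId, current + amountInCents);      // new information written back
    }

    // Read side: asking the question doesn't change the answer, so the stored
    // representation is translated straight into the query response -- no rules needed.
    public String balanceReport(String accountId) {
        return accountId + ": " + store.getOrDefault(accountId, 0L) + " cents";
    }
}
```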

We'll still want input sanitization, and message semantics that reflect our understanding of the domain expert's abstraction, but we aren't going to need "aggregate roots", or "locks" or the other patterns that prevent the introduction of errors when changing information.  We still need the data, and the semantics that approximate our abstraction, but we don't need the rules.

We don't need the parts of our implementation that manage change.

When I answer "the query itself is unlikely to pass through the domain model", that's shorthand for the idea that we don't need to build domain specific data structures as we translate the information we retrieved from our durable store into our response message.


Monday, April 1, 2019

TDD from Edge to Edge

In my recent TDD practice, I've been continuing to explore the implications of edge to edge tests.

The core idea is this: if we design the constraints on our system at the edge, then we maximize the degrees of freedom we have in our design.

Within a single design session, this works just fine. My demonstration of the Mars Rover kata limits its assumptions to the edge of the system, and the initial test is made to pass by simply removing the duplication.

The advantage of such a test is that it is resilient to changes in the design. You can change the arrangement of the internals of the test subject, and the test itself remains relevant.

The disadvantage of such a test is that it is not resilient to changes in the requirements.

It's common in TDD demonstrations to work with a fixed set of constraints throughout the design session. Yes, we tend to introduce the constraints in increments, but taken as a set they tend to be consistent.

The Golden Master approach works just fine under those conditions; we can extend our transcript with descriptions of extensions, and then amend the test subject to match.

But a change in behavior? And suddenly an opaque comparison to the Golden Master fails, and we have to discard all of the bath water in addition to the baby.

We might describe the problem this way: the edge to edge test spans many different behaviors, and a change to a single behavior in the system may butterfly all the way out to the observable behavior. In other words, the same property that makes the test useful when refactoring acts against us when we introduce a modification to the behavior.

One way to side step this is to take as given that a new behavior means a new test subject. We'll flesh out an element from scratch, using the refactoring task in the TDD cycle to absorb our previous work into the new solution. I haven't learned that this is particularly convenient for consumers. "Please update your dependency to the latest version of my library and also change the name you use to call it" isn't a message I expect to be well received by maintainers that haven't already introduced design affordances for this sort of change.

So what else? How do we arrange our tests so that we don't need to start from scratch each time we get a request for a breaking change?

Recently, I happened to be thinking about this check in one of my tests.

When this check fails, we "know" that there is a bug in the test subject.  But why do we know that?

If you squint a bit, you might realize that we aren't really testing the subject in isolation, but rather whether or not the behavior of the subject is consistent with these other elements that we have high confidence in.  "When you hear hoof beats, expect horses not zebras".

Kent Beck describes similar ideas in his discussion of unit tests.

There is a sort of transitive assertion that we can make: if the behavior of the subject is consistent with some other behavior, and we are confident that the other behavior is correct, then we can assume the behavior of the test subject is correct.

What this affords is that we can take the edge to edge test and express the desired behavior as a composition of other smaller behaviors that we are confident in.  The Golden Master can be dynamically generated from the behavior of the smaller elements.
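As a sketch of the idea, with an invented report standing in for the real subject:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class EdgeToEdgeTest {
    @Test
    void reportIsConsistentWithItsSmallerBehaviors() {
        String[] names = {"alpha", "beta"};
        // The "golden master" is generated from smaller behaviors we already trust...
        String expected = Header.standard() + Body.listing(names);
        // ...and compared against the edge-to-edge behavior of the subject.
        assertEquals(expected, Report.render(names));
    }
}

class Header {
    static String standard() { return "REPORT\n"; }
}

class Body {
    static String listing(String[] names) {
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            out.append("* ").append(name).append('\n');
        }
        return out.toString();
    }
}

class Report {
    // Not obliged to be implemented in terms of Header and Body --
    // only to behave consistently with them.
    static String render(String[] names) {
        StringBuilder out = new StringBuilder("REPORT\n");
        for (String name : names) {
            out.append("* ").append(name).append('\n');
        }
        return out.toString();
    }
}
```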

Of course, the confidence in those smaller elements comes from having tests of their own, verifying that those behaviors are consistent with simpler, more trusted elements.  It's turtles all the way down.

In this sort of design, the smaller components in the system act as bulkheads for change.

I feel that I should call out the fact that some care is required in the description of the checks we create in this style.  We should not be trying to verify that the larger component is implemented using some smaller component, but only that its behavior is consistent with that of the smaller component.


Wednesday, March 20, 2019

Isolation at the boundary

Recently, I was looking through Beyond Mock Objects.  Rainsberger invokes one of my favorite dependencies - the system clock.

We’ve made an implicit dependency more explicit, and I value that a lot, but Clock somehow feels like indirection without abstraction. By this I mean that we’ve introduced a seam to improve testability, but that the resulting code exposes details rather than hiding them, and abstractions, by definition, hide details.
If we want to introduce indirection, then we ought to introduce the smallest indirection possible. And we absolutely must try to introduce better abstraction. 
I agree with this idea - but I would emphasize that the indirection and the better abstraction are different elements in the design.

The boundary represents the part of our design where things become uncertain - we're interacting with elements that aren't under our control.  Because of the uncertainty, measuring risk becomes more difficult.  Therefore, we want to squeeze risk out of the boundary and back toward the core.


What I'm describing here is an adapter: at the outer end, the adapter is pluggable with the system clock; the inner end satisfies the better abstraction -- perhaps the stream of timestamps briefly described by Rainsberger, perhaps a higher abstraction more directly related to your domain.

In other words, one of my design constraints is that I should be able to isolate and exercise my adapters in a controlled test environment.
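A sketch of such an adapter for the clock (the "stream of timestamps" abstraction is borrowed from Rainsberger's description; the names are mine):

```java
import java.time.Instant;
import java.util.Iterator;
import java.util.List;

interface Timestamps {
    Instant next();
}

// Outer end: pluggable with the system clock in production...
final class SystemTimestamps implements Timestamps {
    public Instant next() { return Instant.now(); }
}

// ...inner end: a controlled substitute for the test environment.
final class FixedTimestamps implements Timestamps {
    private final Iterator<Instant> instants;

    FixedTimestamps(List<Instant> instants) { this.instants = instants.iterator(); }

    public Instant next() { return instants.next(); }
}
```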

Let's consider Unusual Spending; our system needs to interact with the vendor environment - reading payments from a payments database, dispatching emails to a gateway.  Since the trigger is supposed to produce "current" reports, we need some kind of input to tell us when "now" is.  So three external ports.  My test environment, therefore, needs substitutes for those three ports; my composition needs the ability to specify the ports.  Because the real external ports aren't going to be part of the development test suite, we want the risk squeezed out of them.

The API of the external port is tightly coupled to the live implementation -- if that changes on us, then we're going to need a new port and a new adapter.  If our inner port abstraction is good, then the adapter acts as a bulkhead, protecting the rest of the solution from the change.

Somewhere in the solution, our composition root will describe how to hook up all of the pieces together.  For instance, if we were constrained to use the default constructor as the composition root, then we might end up with something like:
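Perhaps something like this sketch, where the port spellings stand in for the kata's actual contracts:

```java
import java.util.List;

interface FetchesUserPaymentsByMonth { List<String> fetch(long userId, int year, int month); }
interface EmailsUser { void email(long userId, String subject, String body); }

public class TriggersUnusualSpendingEmail {
    private final FetchesUserPaymentsByMonth payments;
    private final EmailsUser emails;

    // Constrained to the default constructor, the composition root lives here,
    // wiring the live adapters to the external ports (the clock port is elided).
    public TriggersUnusualSpendingEmail() {
        this(new JdbcPayments(), new SmtpEmails());
    }

    // The seam that lets a test supply controlled substitutes instead.
    TriggersUnusualSpendingEmail(FetchesUserPaymentsByMonth payments, EmailsUser emails) {
        this.payments = payments;
        this.emails = emails;
    }
}

class JdbcPayments implements FetchesUserPaymentsByMonth {
    public List<String> fetch(long userId, int year, int month) { return List.of(); }
}

class SmtpEmails implements EmailsUser {
    public void email(long userId, String subject, String body) {}
}
```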

None of this is "driven" by the tests, except in the loose sense that we've noticed that we're going to be running the tests a lot, and therefore need a stable controlled environment.  For example, in his demonstration, Justin Searls decided to couple his temporal logic to java.util.GregorianCalendar rather than java.lang.System.  With the right abstractions in place, the cost of reversing the decision is pretty small - try the simplest thing that could possibly work, prepare to change your mind later.

Sunday, March 17, 2019

TDD: Probes

I've been thinking about TDD this week through the lens of the Unusual Spending kata.

The unusual spending kata is superficially similar to Thomas Mayrhofer's employee report: the behavior of the system is to produce a human readable report. In the case of unusual spending, the interesting part of the report is the body of the email message.

At the API boundary, the body of the email message is a String, which is to say it is an opaque sequence of bytes.  We're preparing to pass the email across a boundary, so it's normal that we transition from domain specific data representations to domain agnostic data representations.

But there's a consequence associated with that -- we're pretty much limited to testing the domain agnostic properties of the value.  We can test the length, the prefix, the suffix; we can match the entire String against a golden master.

What we cannot easily do is extract domain specific semantics back out of the value.  It's not impossible, of course; but the ROI is pretty lousy.

Writing tests that are coupled to an opaque representation like this isn't necessarily a problem; but as Mayrhofer showed, it's awkward to have a lot of tests that are tightly coupled to unstable behaviors.

In the case of the Unusual Spending kata, we "know" that the email representation is unstable because it is up near the top of the value chain; it's close to the human beings in the system - the ones we want to delight so that they keep paying us to do good work.

It's not much of a stretch to extend the Unusual Spending kata with Mayrhofer's changing requirements.  What happens to us after we have shipped our solution, the customers tell us that the items in the email need to be reordered?  What happens after we ship that patch, when we discover that the capitalization of categories also needs to be changed?

Our automated checks provide inputs to some test subject, and measure the outputs.  Between the two lie any number of design decisions.  The long term stability of the check is going to be some sort of sum over the stability of each of the decisions we have made.

Therefore, when we set our probes at the outer boundary of the system, the stability of the checks is weakest.


A way to attack this problem is to have probes at different levels.  One advantage is that we can test subsets of decisions in isolation; aka "unit tests".

Less familiar, but still effective, is that we can use the probes to compare the behaviors we see on different paths.  Scott Wlaschin describes a related idea in his work on property based testing.  Is behavior A consistent with the composition of behaviors B and C?  Is behavior B consistent with the composition of D, E, and F?

There's a little bit of care to be taken here, because we aren't trying to duplicate the implementation in our tests, nor are we trying to enforce a specific implementation.  The "actual" value in our check will be the result of plugging some inputs into a black box function; the "expected" value will plug some (possibly different) inputs into a pipeline.
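A domain-free stand-in for the idea: the check compares the behavior we see on one path with the behavior we see on another, rather than against a hard-coded answer.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Arrays;
import org.junit.jupiter.api.Test;

class ConsistencyTest {
    @Test
    void wholeIsConsistentWithItsParts() {
        int[] first = {3, 4};
        int[] second = {5, 12};
        assertEquals(
            Subject.sumOfSquares(first) + Subject.sumOfSquares(second), // composition of parts
            Subject.sumOfSquares(concat(first, second)));               // the whole
    }

    private static int[] concat(int[] a, int[] b) {
        int[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}

class Subject {
    static int sumOfSquares(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v * v;
        }
        return total;
    }
}
```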

Following this through to its logical conclusion: the tests are trying to encourage a design where pipeline segments can be used independently of our branching logic.

Parnas taught us to encapsulate decisions in modules; it's a strategy we can use to mitigate the risk of change.  Similarly, when we are writing tests, we need designs that do not overfit those same decisions.


That in turn suggests an interesting feedback loop - the arrival of a requirement to change behavior may require the addition of a number of "redundant" tests of the existing implementation, so that older tests that were too tightly coupled to the unstable behavior can be decommissioned.

Perhaps this pattern of pipeline equivalencies is useful in creating those separate tests.

Tuesday, March 12, 2019

TDD and the Sunday Roster

I got access to an early draft of this discussion, so I've been chewing on it for a while.

The basic outline is this: given a list of employees, produce a report of which employees are eligible to work on Sunday.

What makes this example interesting to me is that the requirements of the report evolve.  The sample employee roster is the same, the name of the function under test is the same, the shape of the report is the same, but the actual values change over time.

This is very different from exercises like the bowling game, or roman numerals, or Fibonacci, where once a given input is producing the correct result, that behavior gets locked in until the end of the exercise where we live happily ever after.

If we look at Mayrhofer's demonstration, and use a little bit of imagination, the implementation goes through roughly this lifecycle
  1. It begins as an identity function
  2. The identity function is refactored into a list filter using a predicate that accepts all entries
  3. The predicate is replaced
  4. The result is refactored to include a sort using the stable comparator
  5. The comparator is replaced
  6. The result is refactored to include a map using an identity function
  7. The map function is replaced
  8. The comparator is reversed
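A sketch of the shape the implementation ends up with; the specific predicate, comparator, and mapping below are stand-ins for whatever the evolving requirements demand:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SundayRoster {
    record Employee(String name, int age) {}

    static List<String> eligibleForSunday(List<Employee> roster) {
        return roster.stream()
            .filter(e -> e.age() >= 18)                               // the predicate that replaced "accept all"
            .sorted(Comparator.comparing(Employee::name).reversed())  // the comparator, later reversed
            .map(e -> e.name().toUpperCase())                         // the map that replaced identity
            .collect(Collectors.toList());
    }
}
```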
It's really hard to see, in the ceremony of patching up the tests, how the tests are paying for themselves -- in the short term, they are overhead costs, in the long term, they are being discarded.

It also doesn't feel like investing more up front on the tests helps matters very much.  One can imagine, for instance, breaking up the unsatisfactory checks that are coupled directly to the report with collections of coarse grained constraints that can be evaluated independently.  But unless that language flows out of you naturally, that's extra work to amortize once again.

It's not clear that TDD is leading to a testing strategy that has costs of change commensurate with the value of the change, nor is it clear to me that the testing strategy is significantly reducing the risk.  Maybe we're supposed to make it up on volume?  Lots of reports, each cheap because they share the same core orchestration logic?

This starts to feel like one of the cases where Coplien has the right idea: that there are more effective ways to address the risk than TDD -- for example, introducing and reviewing assertions?

Notes from my first HN surge

TDD: Hello World was shared on Hacker News.
  • 2500 "pageviews", which to be honest seems awfully small.
  • 500 "pageviews" of the immediately prior essay.
  • 2 comments on HN.
  • 0 comments locally.
A search of HN suggests that TDD isn't a very popular topic; I had to search back about three months to find a link with much discussion going on.  Ironically enough, the subject of that link: "Why Developers Don't TDD".



Monday, March 11, 2019

TDD: Retrospective from a train

I went into Boston this evening for the Software Crafters Meetup.  Unfortunately, the train schedule and other obligations meant that I had to leave the party early.  This evening's exercise was a stripped down version of Game of Life.
Given a 3 x 3 grid, create a function that will tell you whether the center cell in the next generation is dead or live.
 Upon review, I see that I got bit a number of different ways.

The first problem was that I skipped the step of preparing a list of test cases to work through.  The whole Proper Planning and Preparation thing is still not a habit.  In this case, I feel that I lost track of two important test cases; one that didn't matter (because we had a passing implementation already) and one where it did (I had lost track of the difference between two neighbors and three neighbors).

One of my partners suggested that "simplest way to pass the test" was to introduce some accidental duplication, and my goodness does that make a mess of things.  Sandi Metz argues that duplication is cheaper than the wrong abstraction; but I'm not even sure this counts as "wrong abstraction" -- even in a laboratory setting, the existing implementation is a powerful attractor for similar code.

Fundamentally, this problem is a simple function, which can be decomposed into three smaller parts:
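My guess at those three parts, sketched:

```java
public class Life {
    // 1. Count the live neighbors of the center cell.
    static int countNeighbors(boolean[][] grid) {
        int count = 0;
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                if ((row != 1 || col != 1) && grid[row][col]) {
                    count++;
                }
            }
        }
        return count;
    }

    // 2. Apply the rule to the current state and the neighbor count.
    static boolean nextState(boolean alive, int neighbors) {
        return neighbors == 3 || (alive && neighbors == 2);
    }

    // 3. Compose the two for the 3 x 3 grid.
    static boolean centerCellNext(boolean[][] grid) {
        return nextState(grid[1][1], countNeighbors(grid));
    }
}
```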

But we never got particularly close to that; the closest we came was introducing a countNeighbors function in parallel, and then introducing that element as a replacement for our prior code.  We didn't "discover" anything meaningful when refactoring.

I suspect that this is, at least in part, a side effect of the accidental duplication -- the coupling of the output and the input was wrong, and therefore more difficult to refactor into something that was correct.

In retrospect, I think "remove duplication" is putting the emphasis in the wrong place.  Getting the first test to pass by hard coding the correct answer is a great way to complete test calibration in minimal wall clock time.  But before moving on there is a "show your work" step that removes the implicit duplication between the inputs and the outputs.

We talked a bit about the fact that the tests were hard to read; not a problem if they are scaffolding, because we can throw them out, but definitely a concern if we want the tests to stay behind as living documentation or examples.  Of course, making things more readable means more names -- are they part of the test, or is the fact that the tests needs them a hint that they will be useful to consumers as well?  Do we need additional tests for those names? How do we make those tests understandable?




Sunday, March 3, 2019

Constraint Driven Development and the Unusual Spending Kata

I recently discovered the Unusual Spending Kata, via testdouble.

One of the things that I liked immediately about the kata is that it introduces some hard constraints.  The most important one is that the entrypoint is fixed; it SHALL conform to some pre-determined contract.

This sort of constraint, common if you are working outside in, is a consequence of the dependency inversion principle.  The framework owns the contract, the plugin implements it.

Uncle Bob's Bowling Game kata takes a similar approach...
Write a class named Game that implements two methods...

In other words, your API design has already happened; in earlier work, in a spike, because you are committed to supporting a specific use case.

When you don't have that sort of constraint in place already, the act of writing your tests is supposed to drive the design of your API.  This is one of the places where we are really employing sapient testing; applying our human judgment to the question "is the API I'm creating suitable for consumers?".

The Unusual Spending Kata introduces two other constraints; constraints on the effects on which the core logic depends.  In order to produce any useful value, the results of the local calculation have to be published, and the kata constrains the solution to respect a particular API to do so.  Similarly, a stateless process is going to need input data from somewhere, and that API is also constrained by the kata.

So the programmer engaging in this exercise needs to align their design with the surface of the boundary.  Excellent.


Because the write constraint and the read constraint are distinct in this kata, it helps to reveal the fact that isolating the test environment is much easier for writes than it is for reads.

The EmailsUser::email is a pure sink; you can isolate yourself from the world by simply replacing the production module with a null object.  I find this realization to be particularly powerful, because it helps unlock one of the key ideas in the doctrine of useful objects -- that if you need to be able to observe a sink in test, it's likely that you also need to be able to observe the sink in production.  In other words, much of the logic that might inadvertently be encoded into your mock objects in practice may really belong in the seam that you use to protect the rest of the system from the decision you have made about which implementation to use.
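A sketch of the substitutes (the signature of EmailsUser::email is assumed):

```java
import java.util.ArrayList;
import java.util.List;

interface EmailsUser {
    void email(long userId, String subject, String body);
}

// Isolation alone: a null object that swallows the write.
class NullEmailsUser implements EmailsUser {
    public void email(long userId, String subject, String body) {}
}

// If the test needs to observe the sink, that observation may really belong in the
// seam itself, where production code can use it too.
class RecordingEmailsUser implements EmailsUser {
    final List<String> sent = new ArrayList<>();

    public void email(long userId, String subject, String body) {
        sent.add(userId + ": " + subject);
    }
}
```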

In contrast, FetchesUserPaymentsByMonth::fetch is a source -- that's data coming into the system, and we need to have a much richer understanding of what that data can look like, so that we can correctly implement a test double that emits data in the right shape.

Implementing an inert source is relatively straightforward; we simply return an empty value each time the method is invoked.

An inert implementation alone doesn't give you very much coverage of the system under test, of course.  If you want to exercise anything other than the trivial code path in your solution, you are going to need substitutes that emit interesting data, which you will normally arrange within a given test case.

On the other hand, a cache has some interesting possibilities, in so far as you can load the responses that you want to use into the cache during arrange, and then they will be available to the system under test when needed.  The cache can be composed with the read behavior, so you get real production behavior even when using an inert substitute.

Caching introduces cache invalidation, which is one of the two hard problems.  Loading information into the cache requires having access to the cache, ergo either injecting a configured cache into the test subject or having cache loading methods available as part of the API of the test subject.

Therefore, we may not want to go down the rabbit hole right away.

Another aspect of the source is that the data coming out needs to be in some shared schema.  The consumer and the provider need to understand the same data the same way.

This part of the kata isn't particularly satisfactory - the fact that the constrained connection to our database allows the consumer to specify the schema, with no configuration required...?  The simplest solution is probably to define the payments API as part of the contract, rather than leaving that bit for the client to design.




Wednesday, February 20, 2019

Bowling Driven Design

The bowling game kata is odd.

I've never seen Uncle Bob demonstrate the kata himself; so I can't speak to its presentation or its effectiveness beyond the written description.  It has certainly inspired some number of readers, including myself at various times.

But....


I'm becoming more comfortable with the idea that the practice is just a ritual... a meditation... bottle shaking....

In slide two of the Powerpoint deck, the first requirement is presented: Game must implement a specific API.

What I want to call attention to here: this API is not motivated by the tests. The requirements are described on slide #3, followed by a modeling session on slides #4-#9 describing some of the underlying classes that may appear. As of slide #52 at the end of the deck, none of the classes from the modeling session have appeared in the solution.

Furthermore, over the course of writing the tests, some helper methods are discovered; but those helper methods remain within the boundary of the test -- the API remains fixed.

All of the complexity in the problem is within the scoring logic itself, which is a pure function -- given a complete legal sequence of pin falls, compute the game score. Yet the API of the test subject isn't a function, but a state machine.  The scoring logic never manages to escape from the boundaries of the Game object -- boundaries that were set before we had any tests at all.
If you want to learn to think the way I think, to design the way I design, then you must learn to react to minutia the way I react. -- Uncle Bob
What the hell are we driving here?  Some logic, I suppose.


Monday, February 18, 2019

Aggregates: Separation of concerns

A question on Stack Overflow led me to On Aggregates and Domain Service interaction, written by Marco Pivetta in January of 2017.  Of particular interest to me were the comments by Mathias Verraes.

What I recognized is that the description of aggregate described by Mathias is very similar to the description of protocols described by Cory Benfield.  So I wanted to try to write that out, long hand.

Aggregates are information (state), and also logic that describes how to integrate new information with the existing information.  In accordance with the usual guidelines of object oriented development, we package the data structure responsible for tracking the information with the rules for mutating the data structure and the transformations that use the data structure to answer queries.


Because the responsibility of the object is this data structure and its operations, and because this data structure is a local, in memory artifact, there's no room (in the responsibility sense) for effects.

How do we read, or write, information that isn't local to the aggregate?

The short answer is that responsibility goes out into the application layer (which in turn may delegate the responsibility to the infrastructure layer; those details aren't important here).

The aggregate incorporates information and decides what needs to be done; the application layer does it, and reports the results back to the aggregate as new information.

Spelling the same idea a different way - the aggregate is a state machine, and it supports two important queries.  One is "what representation can I use to recover your current state?", so that we can persist the work rather than needing to keep the aggregate live in memory for its entire lifetime.  The other is "what work can the application layer do for you?".
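A sketch of those two queries, with names and shapes of my own choosing:

```java
import java.util.List;

interface Aggregate<STATE, COMMAND, EVENT> {
    // "What representation can I use to recover your current state?"
    STATE currentState();

    // "What work can the application layer do for you?"
    List<COMMAND> pendingWork();

    // New information, reported back by the application layer, gets integrated here.
    void integrate(EVENT information);
}
```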

Put another way, we handle the aggregate's demands for remote data asynchronously.  The processing of the command ends when the model discovers that it needs data which isn't available.  The application queries the model, discovering the need for more data, and can then fetch the data.  Maybe it's available now?  If so, that data is passed to the aggregate, which can integrate that information into its state.

If the information isn't available now, then we can simply persist the existing work, and resume it later when the information does become available.  This might look like a scheduled callback, for example.

If your model already understands "time", then it can report its own timing requirements to the application, so that those can also be included in the scheduling.

 

Refactoring: paint by numbers

I've been working on a rewrite of an API, and I've been trying to ensure that my implementation of the new API has the same behavior as the existing implementation.  This has meant building up a suite of regression checks, and an adapter that allows me to use the new implementation to support a legacy client.

In this particular case, there is a one-to-one relationship between the methods in the old API and the new -- the new variant just uses different spellings for the arguments and introduces some extra seams.

The process has been "test first" (although I would not call the design "test driven").  It begins with a check, using the legacy implementation.  This stage of the exercise is to make sure that I understand how the existing behavior works.



We call a factory method in the production code, which acts as a composition root to create an instance of the legacy implementation. We pass a reference to the interface to a check, which exercises the API through a use case, validating various checks along the way.

Having done this, we then introduce a new test; this one calling a factory method that produces an instance of an adapter, that acts as a bridge between legacy clients and the new API.


The signature of the factory method here is a consequence of the pattern that follows, where I work in three distinct cycles
  • RED: begin a test calibration, verifying that the test fails
  • GREEN: complete the test calibration, verifying that the test passes
  • REPLACE: introduce the adapter into the mix, and verify that the test continues to pass.
To begin, I create an implementation of the API that is suitable for calibrating a test, by ensuring that a broken implementation fails. This is straightforward; I just need to throw UnsupportedOperationExceptions.

Then I create an abstract decorator, implementing the legacy API by simply dispatching each method to another implementation of the same interface.

And then I define my adapter, which extends the wrapper of the legacy API, and also accepts an instance of the new API.
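A hedged sketch of that plumbing, with invented interface and method names standing in for the real APIs:

```java
interface LegacyApi {
    String describeCargo(String trackingId);
}

interface V2Api {
    String describe(String trackingId);
}

// Calibration facade: every call fails, so we can verify that a broken
// implementation fails the check.
class TestCalibrationFacade implements LegacyApi {
    public String describeCargo(String trackingId) {
        throw new UnsupportedOperationException();
    }
}

// Abstract decorator: dispatches each legacy method to another implementation
// of the same interface.
abstract class LegacyApiWrapper implements LegacyApi {
    protected final LegacyApi delegate;

    protected LegacyApiWrapper(LegacyApi delegate) { this.delegate = delegate; }

    public String describeCargo(String trackingId) {
        return delegate.describeCargo(trackingId);
    }
}

// Adapter: extends the wrapper, and also accepts an instance of the new API;
// methods are overridden one at a time as they are ported.
class LegacyToV2Adapter extends LegacyApiWrapper {
    protected final V2Api v2;

    LegacyToV2Adapter(LegacyApi fallback, V2Api v2) {
        super(fallback);
        this.v2 = v2;
    }
}
```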

Finally, with all the plumbing in place, I return a new instance of this class from the factory method.

My implementation protocol then looks like this: first, I run the test using the adapter as-is. With no overrides in place, each call in the API gets directed to TEST_CALIBRATION_FACADE, which throws an UnsupportedOperationException, and the check fails.

To complete the test calibration, I override the implementation of the method(s) I need locally, directing them to a local instance of the legacy implementation, like so:
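Building on the sketch above, the override might look something like:

```java
class CalibrationStep {
    // Override just the method the check needs, directing it to a local instance
    // of the legacy implementation; everything else still fails loudly.
    static LegacyApi subject(LegacyApi legacy, V2Api v2) {
        return new LegacyToV2Adapter(new TestCalibrationFacade(), v2) {
            @Override
            public String describeCargo(String trackingId) {
                return legacy.describeCargo(trackingId);
            }
        };
    }
}
```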

The test passes, of course, because we're using the same implementation that we used to set up the use case originally.

In the replace phase, the legacy implementation gets inlined here in the factory method, so that I can see precisely what's going on, and I can start moving the implementation details to the new API.

Once I've reached the point that all of the methods have been implemented, I can ditch this scaffolding, and provide an implementation of the legacy interface that delegates all of the work directly to v2; no abstract wrapper required.

There's an interesting mirroring here; the application to model interface is v1 to v2, then I have a bunch of coordination in the new idiom, but at the bottom of the pile, the v2 implementation is just plugging back into v1 persistence. You can see an example of that here - Booking looks fairly normal, just an orchestration with the repository element. WrappedCargo looks a little bit odd, perhaps -- it's not an "aggregate root" in the usual sense, it's not a domain entity lifted into a higher role. Instead, it's a plumbing element wrapped around the legacy root object (with some additional plumbing to deal with the translations).

Longer term, I'll create a mapping from the legacy storage schema to an entity that understands the V2 API, and eventually swap out the O/RM altogether by migrating the state from the RDBMS to a document store.

Friday, February 8, 2019

TDD: Hello World

As an experiment, I recently tried developing HelloWorld using a "test driven" approach.

You can review the commit history on GitHub.

In Java, HelloWorld is a one-liner -- except that you are trapped in the Kingdom of Nouns, so there is boilerplate to manage.

Now you can implement HelloWorld in a perfectly natural way, and test it -- System.setOut allows you to replace the stream, so the write happens to a buffer that is under the control of the test.
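A sketch of that approach (and of the tidying-up it obliges the test to do):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.junit.jupiter.api.Test;

class HelloWorldTest {
    @Test
    void writesGreetingToStdout() {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            HelloWorld.main(new String[0]);
            assertEquals("Hello, World!" + System.lineSeparator(), buffer.toString());
        } finally {
            System.setOut(original);  // put the shared resource back
        }
    }
}

class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
```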

It's not entirely clear to me what happens, however, if you have multiple tests concurrently writing to that stream.  The synchronization primitives ensure that each write is atomic, but there is a lot of time for the stream to be corrupted with other writes by the time the test harness gets to inspect the result.

This is why we normally design our tests so that they are isolated from shared mutable state; we want predictable results.  So in HelloWorld, this means we need to be able to ensure that the write happens to an isolated, rather than a shared stream.

So instead of testing HelloWorld::main, we end up testing HelloWorld.writeTo, or some improved spelling of the same idea.

Another pressure that shows up quickly is duplication - the byte sequence we want to test needs to be written into both the test and the implementation.  Again, we've learned patterns for dealing with that -- the data should move toward the test, so we have a function that accepts a message/prompt as an argument (in addition to passing along the target stream).  As an added bonus, we get a more general implementation for free.
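A sketch of where those pressures push the API; the method name is just one possible spelling:

```java
import java.io.PrintStream;

public class HelloWorld {
    public static void main(String[] args) {
        writeTo(System.out, "Hello, World!");
    }

    // The target stream and the message both move toward the caller,
    // so the test can use an isolated buffer and its own literal.
    static void writeTo(PrintStream out, String message) {
        out.println(message);
    }
}
```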

Did we really need a more general implementation of HelloWorld?

Another practice that I associate with TDD is using the test as an example of how the subject may be used -- if the test is clumsy, then that's a hint that maybe the API needs some work.  The test needs a mutable buffer, and a PrintStream around it, and then needs to either unpack the contents of the buffer or express the specification as a byte array, when the natural primitive to use is a String literal.

You can, indeed, simplify the API, replacing the buffer with a useful object that serves a similar role.  At which point you either have two parallel code paths in your app (duplication of idea), or you introduce a bunch of additional composition so that the main logic always sees the same interface.

Our "testable" code turns more and more into spaghetti.

Now, it's possible that I simply lack imagination, and that once all of these tests are in place, you'll be able to refactor your way to an elegant implementation.  But to me, it looks like a trash fire.

There's a lesson here, and I think it is: left-pad.

Which is to say, not only is HelloWorld "so simple that there are obviously no deficiencies", it is also too simple to share: the integration cost required to share the element exceeds the cost of writing it from scratch each time you need it.

Expressed a different way: there is virtually no chance that the duplication is going to burn you, because once written the implementation will not require any kind of coordinated future change (short of a massive incompatibility being introduced in the language runtime itself, in which case you are going to have bigger fires to fight).

Tuesday, February 5, 2019

The Influence of Tests

Some years ago, I became disenchanted with the notion that TDD uses tests to "drive" design in any meaningful way.



I came to notice two things: first, that the tests were just as happy to pass whatever cut and paste hack served as "the simplest thing that could possibly work", second that all of the refactoring patterns are reversible.

So what is being test infected buying me?

One interesting constraint on tests is that we want them to be reliable.  If the test subject hasn't changed, then we should get the same collection of observations if we move the test bed in time and space.  This in turn means we need to restrict the tests' interaction with unstable elements -- I/O, the clock, the network, random entropy.  Our test subjects often expect to interact with these elements, so within the test environment we need to be able to provide a substitute.

So one of the design patterns driven by testing is "dependency injection".  Somewhere recently I came across the spelling "configurable dependency", which I think is better.  It helps to sharpen my attention on the fact that we are describing something that we change when we transition from a production environment to a test environment, which in turn suggests certain approaches.

But we're really talking about something more specific: configurable effects or perhaps configurable non-determinism.

The test itself doesn't care much about how much buffer surrounds the effect; but if we allow test coverage to influence us here, then we want the substituted code to be as small as we can manage.  To lean on Gary Bernhardt's terminology, we want the test to be able to control a thin imperative shell.

But then what?  We can keep pouring inputs through the shell without introducing any new pressures on the design.
Our designs must consist of many highly cohesive, loosely coupled components, just to make testing easy. -- Kent Beck, Test Driven Development by Example
I came across this recently, and it helps.

A key problem with the outside-in approach is that the "costs" of setting up a test are disproportionate to the constraint we are trying to establish.  Composition of the test subject requires us to draw the rest of the owl when all we need is a couple of circles.

To borrow an idea from Dan North, testing all the way from the boundary makes for really lousy examples, because the noise gets in the way of the idea.

The grain of the test should match the grain of the constraint it describes - if the constraint is small, then we should expect that the composition will have low complexity.

What we have then, I think, is a version of testing, the human author applying a number of heuristics when designing an automated check to ensure that the subject(s) will exhibit the appropriate properties.  In other words, we're getting a lot of mileage out of aligning the test/subject boundaries before we even get to green.

The kinds of design improvements that we make while refactoring?
There is definitely a family of refactorings that are motivated by the idea of taking some implementation detail and "lifting" it into the testable space. I think that you can fairly say that the (future) test is influencing the design that emerges during the refactoring.

I'm not convinced that we can credit tests for the results that emerge from the Design Dynamo.  My current thinking is that they are playing only a supporting role - repeatedly evaluating compliance with the constraints after each change, but not encouraging the selection of a particular change.

Further Reading


Mark Seemann: The TDD Apostate.
Michael Feathers: Making Too Much of TDD.

Sunday, January 27, 2019

Refactoring in the Wild

The past few days, I've been reviewing some code that looks like it could use some love.

It's part of an adapter that connects the core logic running in our process with some shared mutable state managed by Postgres.


It's code that works, rather than clean code that works.  The abstractions aren't very good, separable concerns are coupled to no great advantage, division of work is arbitrary.  And yet...

In the course of our recent work, one of the developers noticed that we could eliminate a number of calls to the remote database by making a smarter version of the main query.  We're using JDBC, and in this case the change required modifying the select clause of the query and making the matching change in the handling of the result set.

Both bits of code were in the same function and fit on the same screen.  There's a duplicated pattern for working with queries, connections, statements, and result sets -- but nobody had come along and tried to "eliminate" that duplication, so the change was easy.

Also, because of changes we've made to the hosting of the database, we needed to change the strategy we use for managing the connection to the database.  That change ended up touching a lot of different methods that were accessing the connection data member directly - so we needed to do an Encapsulate Variable refactoring to avoid duplicating the details of the strategy everywhere.
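The shape of that refactoring, as a sketch (the names here -- PostgresAdapter, countOrders, the connection URL -- are stand-ins, not our production code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class PostgresAdapter {
    private Connection conn;

    // Encapsulate Variable: callers that used to read the conn field directly
    // now go through this accessor, so the connection-management strategy
    // can change in exactly one place.
    private Connection connection() throws SQLException {
        if (conn == null || conn.isClosed()) {
            conn = DriverManager.getConnection("jdbc:postgresql://localhost/app");
        }
        return conn;
    }

    int countOrders() throws SQLException {
        try (Statement stmt = connection().createStatement();
             ResultSet rs = stmt.executeQuery("select count(*) from orders")) {
            rs.next();
            return rs.getInt(1);
        }
    }
}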


Applying that refactoring today was no more difficult than introducing it six years ago would have been.


YAGNI... yet.

Saturday, January 19, 2019

QotD: Kata Kata Kata!

If you don't force a kata to yield insights about working on real problems, you are wasting opportunities.

A Lesson of a Small Refactoring

We demand rigidly defined areas of doubt and uncertainty. -- Douglas Adams
Working through a Fibonacci number kata last night, I discovered that I was frequently using a particular pattern in my refactoring.
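The original listing isn't reproduced here, but the pattern looks roughly like this (the dividing line and the details inside each branch are my own):

class Fibonacci {
    long fibonacci(int n) {
        if (n <= 2) {
            // these inputs are pinned down by tests, so I can rework this
            // branch aggressively
            return n == 0 ? 0 : 1;
        } else {
            // these inputs are not covered yet, so the original code is
            // preserved unchanged
            return fibonacci(n - 1) + fibonacci(n - 2);
        }
    }
}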


What this gives me in the code is a clear separation between the use cases that are covered by tests and those that are not. I can then be more aggressive in the section of code that is covered by tests, because I've already mitigated the risk of introducing an inadvertent change.

The early exit variation I didn't discover until later, but I think I like it better.  In particular, when you get to a state where the bottom is done, the branch with the early exit gets excised and everything just falls through to the "real" implementation.

This same trick shows up again when it is time to make the next change easy:
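Again as a sketch rather than the original listing: an early exit protects the behavior the tests already pin down, while the untested code below is free to take whatever shape the next test will need.

class FibonacciScaffold {
    long fibonacci(int n) {
        if (n <= 5) {
            // early exit: the inputs the tests cover keep their current behavior
            return previous(n);
        }
        // the "real" implementation grows down here; no test reaches it yet,
        // so line coverage drops -- scaffolding, not meant to be published
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    private long previous(int n) {
        return n < 2 ? n : previous(n - 1) + previous(n - 2);
    }
}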

It's a bit odd, in that we go from an implementation with full line coverage to one with additional logic that reduces the line coverage. I'm OK with this; it is a form of scaffolding that isn't expected to survive until the implementation is published.

What I find intriguing here is the handling of the code path that doesn't yet have tests, and the inferences one might draw for working on "legacy" code....

Thursday, January 10, 2019

Why I Practice TDD

TL;DR: I practice because I make mistakes; each mistake I recognize is an opportunity to learn, and thereby not make that mistake where it would be more expensive to do so.

Last night, I picked up the Fibonacci kata again.

I believe it was the Fibonacci kata where I finally got my first understanding of what Kent Beck meant by "duplication", many years ago.  So it is something of an old friend.  On the other hand, there's not a lot of meat to it - you go through the motions of TDD, but it is difficult to mine for insights.

Nonetheless, I made three significant errors, absolute howlers that have had me laughing at myself all day.

Error Messages

Ken Thompson has an automobile which he helped design. Unlike most vehicles, it has neither a speedometer, nor gas gauge, nor any of the other numerous idiot lights which plague the modern driver. Rather, if the driver makes a mistake, a giant "?" lights up in the center of the dashboard. "The experienced driver," says Thompson, "will usually know what's wrong."

In the early going of the exercise, I stopped to review my error messages, writing up notes about the motivations for doing them well.

Mid-exercise, I had a long stare at the messages.  I was in the middle of a test calibration, and I happened to notice a happy accident in the way that I had made the test(s) fail.

But it wasn't until endgame that I finally discovered that I had transposed the expected and actual arguments in my calls to the Assertions library.
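Concretely (a sketch, not my actual test code): TestNG's Assert.assertEquals takes (actual, expected), while JUnit 5's Assertions.assertEquals takes (expected, actual).  Transposing them doesn't change when the test passes or fails -- it only scrambles the failure message.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class FibonacciTest {
    @Test
    void sixthFibonacciNumber() {
        // habit carried over from TestNG -- still green, but a failure would
        // report expected and actual the wrong way around:
        //     assertEquals(fibonacci(6), 8L);
        // what JUnit 5 wants -- expected value first:
        assertEquals(8L, fibonacci(6));
    }

    private long fibonacci(int n) {
        return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
    }
}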

The contributing factors -- I've abandoned TestNG in favor of JUnit5, but my old JUnit habits haven't kicked back in yet.  While all the pieces are still fresh in my mind, I don't really see the data.  I just see pass and fail, and a surprise there means revert to the previous checkpoint.

Fence Posts

The one really interesting point in Fibonacci is the relationship between a recursive implementation and an iterative one.  A recursive implementation falls out pretty naturally when removing the duplication (as I learned from Beck long ago), but as Steve McConnell points out in Code Complete, recursion is a really powerful technique and Fibonacci is just an awful application of it.

What recursion does give you is a lovely oracle that you can use to check your production implementation.  It's an intermediate step along the way.
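Something along these lines (a reconstruction; the production method here is just a stand-in):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class FibonacciOracleTest {
    // the naive recursive definition: too slow to ship, but an independent
    // statement of the requirement
    private long oracle(int n) {
        return n < 2 ? n : oracle(n - 1) + oracle(n - 2);
    }

    // stand-in for the production (iterative) implementation under test
    private long fibonacci(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    @Test
    void agreesWithTheRecursiveDefinition() {
        for (int n = 0; n <= 20; n++) {
            assertEquals(oracle(n), fibonacci(n), "fibonacci(" + n + ")");
        }
    }
}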

In "refactoring" away from the recursive implementation, I managed to introduce an off-by-one error in my implementation.  As a result, my implementation was returning a Fibonacci number, just not the one it was supposed to.

Well, that happens, right?  You run the tests, they go red, and you discover your mistake.

Not. So Much.

I managed to hit a perfect storm of errors.  At the time, I had failed to recall the idea of using an oracle, so I didn't have that security blanket.  I went after the recursion as soon as it appeared, so I was working in the neighborhood of fibonacci(2), which is of course equal to fibonacci(1).

My other tests didn't catch the problem, because I had tests that measured the internal consistency of the implementation, rather than independently verifying the result.  That's one of the hazards that comes about from testing an implementation "in the calculus of itself."  Using property tests was "fine", but I needed one more unambiguous success before relying on them.

Independent Verification

One problem I don't think I've ever seen addressed in a Fibonacci kata is integer overflow.  The Fibonacci series grows roughly geometrically.  Java integers have a limited number of bits, so ending up with a number that is too big is inevitable.

But my tests kept passing - well past the point where my estimates told me I should be seeing symptoms.

The answer?  In Java, the integer operators do not indicate overflow or underflow in any way (JLS 4.2.2).

And that's just as true in the test code as it is in the production code.  In precisely the same way that I was only checking the internal consistency of my own algorithm, I also fell into the trap of checking the internal consistency of the plus operator.

There are countermeasures one can take -- asserting a property that the returned value must be non-negative, for instance.
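For instance, a property along these lines (a sketch; the implementation shown is only a stand-in) would have surfaced the problem:

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class FibonacciOverflowTest {
    // stand-in for the production implementation, using int arithmetic
    private int fibonacci(int n) {
        int a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            int next = a + b;   // silently wraps on overflow (JLS 4.2.2)
            a = b;
            b = next;
        }
        return a;
    }

    // this property doesn't depend on my own addition: every Fibonacci number
    // is non-negative, so a negative result can only mean something went wrong
    @Test
    void resultsAreNonNegative() {
        for (int n = 0; n <= 50; n++) {
            assertTrue(fibonacci(n) >= 0, "suspicious result at fibonacci(" + n + ")");
        }
    }
}

With 32-bit arithmetic this starts failing around n = 47.  Math.addExact is another countermeasure: it throws ArithmeticException instead of wrapping.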

Conclusions

One thing to keep in mind is that production code makes a really lousy test of the test code, especially in an environment where the tests are "driving" the implementation.  The standard for test code should be that it obviously has no deficiencies, which doesn't leave much room for creativity.

Monday, January 7, 2019

A study in ports and adapters

I've got some utilities that I use in the interactive shell which write representations of documents to standard out.  I've been trying to "go fast to go fast" with them, but that's been a bit rough, so I've been considering a switch to a "go slow to go smooth" approach.

My preference is to work from the outside in; if the discipline of test driven development is going to be helpful as a design tool, then we should really be allowing it to guide where the module boundaries belong.

So let's consider an outside in approach to these shell applications.  My first goal is to establish how I can disconnect the logic that I want to test from the context that it runs in.  Having that separation will allow me to shift that logic into a test harness, where I can measure its behavior.

In the case of these little shell tools, there are three "values" I need to think about.  The command line arguments are one, the environment that the app is executing in is a second, and the output effect is the third.  So in Java, I can be thinking about a world that looks like this:
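The listing isn't reproduced here, so this is a reconstruction of the shape I have in mind -- in particular, the signature of f is a guess:

import java.io.PrintStream;
import java.util.Map;

public class TheApp {
    // the adapter: connects the Java runtime to the port
    public static void main(String[] args) {
        f(args, System.getenv(), System.out);
    }

    // the port: arguments, environment, and the output effect all arrive as parameters
    static void f(String[] args, Map<String, String> env, PrintStream out) {
        out.println("...");   // the document-writing logic lives here
    }
}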


In the ports and adapters lingo, TheApp::main is taking on the role of an adapter, connecting the Java runtime to the port: TheApp::f.

For testing, I don't want effects, I want values.  More specifically, I want values that I can evaluate for correctness independently from the test subject.  And so I'm aiming for an adapter that looks like a function.

So I'll achieve that by providing a PrintStream that I can later query for information.
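Something like this sketch, assuming the shape of TheApp.f above:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Map;

class Harness {
    // the adapter-as-function: same inputs, but the output effect comes back
    // as a value the test can inspect
    static byte[] run(String[] args, Map<String, String> env) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, "UTF-8")) {
            TheApp.f(args, env, out);
        }
        return buffer.toByteArray();
    }
}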

In my tests, I can then verify, independently, that the array of bytes returned by the adapter has the correct properties.

Approaching the problem from the outside like this, I'm no longer required to guess about my modules -- they naturally arise as I identify interesting decisions in my implementation through the processes of removing duplication and introducing expressive names.

Instead, my guesses are about the boundary itself: what do I need from the outside world? How can I describe that element in such a way that it can be implemented completely in process?

Standard input, standard output, and standard error are all pretty straightforward in Java, which somewhat obscures the general case.  Time requires a little more thought.  Connecting to remote processes may require more still -- we probably don't want to be working down at the level of a database connection pool, for example.

We'll often need to invent an abstraction to express each of the roles that are being provided by the boundary.  A rough cut that allows the decoupling of the outer world gives us a starting point from which to discover these more refined boundaries.

Wednesday, January 2, 2019

Don't Skip Trivial Refactorings

Summary: if you are going to use test driven development as your process for implementing new features, and you don't want to just draw the owl, then you need to train yourself to refactor aggressively.

My example today is Tom Oram's 2018-11-11 essay on Uncle Bob's Transformation Priority Premise.  Oram uses four tests and seven distinct implementations of the even? predicate.

If he were refactoring aggressively, I think we would see three implementations from two tests.

Kent Beck, in the face of objections to such a refactoring under the auspices of "removing duplication", wrote:
Sorry. I really really really am just locally removing duplication, data duplication, between the tests and the code. The nicely general formula that results is a happy accident, and doesn't always appear. More often than not, though, it does.

No instant mental triangulation. No looking forward to future test cases. Simple removal of duplication. I get it now that other people don't think this way. Believe me, I get it. Now I have to decide whether to begin doubting myself, or just patiently wait for most of the rest of the world to catch up. No contest--I can be patient.
At that time, Beck was a bit undisciplined with his definition of "refactoring".  Faced with the same problem, my guess is that he would have performed the constant -> scalar transformation during his "refactoring" step.  If there isn't a test, then it's not observable behavior?
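To make that concrete (a sketch in Java against a hypothetical isEven, since Oram's listings aren't reproduced here): the test pins down one example, the implementation starts as a constant, and the refactoring step removes the data duplication by generalizing the constant over the argument.

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class EvenTest {
    @Test
    void twoIsEven() {
        assertTrue(isEven(2));
    }

    // first passing version:  return true;
    // after refactoring, the duplication between the test's 2 and the
    // implementation is gone, and the general formula is the happy accident
    static boolean isEven(int n) {
        return n % 2 == 0;
    }
}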

Personally, I find that the disciplined approach reduces error rates.