Extra Cheese

A Blog


The Limits of TDD

Nov 10, 2009

My last post about TDD generated some great responses, some of which were skeptical. A few common complaints about TDD were brought up, and posed with civility, so I'd like to address them.

Complaint: You weren't stupid enough

When TDDing Fibonacci, we could get to a point where we have this function (and I did write exactly this code in my last post):

def fib(n):
    if n == 0:
        return 0
    else:
        return 1

But why should we write that? Why not this instead?

def fib(n):
    return [0, 1, 1][n]

This comes down to how we define "simple". In TDD, we make tests pass by making the simplest possible change. So, which of the above two is simpler?

Defining that word is our job; TDD as a process says nothing about it. The definition is a huge variable and, in my experience, it's the primary axis along which our skill as TDDers grows once we reach minimal competence. Note that we still have to define "simple" even if we're not doing TDD, but we won't have the test-driving pressure forcing the definition to be refined.

Regardless of how "simple" is defined, we must eventually accept that an arbitrarily long list is not the simplest thing. At that point, we refactor. Depending on the definition of simple, it may take seven tests to get to the final refactor instead of five. So what? Two TDDers need not generate the same tests, and this isn't a problem at all.
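To make the comparison concrete, here is a minimal sketch, assuming the tests written so far only cover n = 0, 1, and 2. The names fib_branch and fib_table are made up for this comparison and don't appear in the earlier post. It only shows that the tests at that stage can't tell the two candidates apart, and that both break on the next input, just in different ways.

```python
def fib_branch(n):
    # The if/else candidate from above.
    if n == 0:
        return 0
    else:
        return 1


def fib_table(n):
    # The lookup-table candidate from above.
    return [0, 1, 1][n]


# Both candidates agree on every input the early tests exercise.
for n, expected in [(0, 0), (1, 1), (2, 1)]:
    assert fib_branch(n) == expected
    assert fib_table(n) == expected

# The next test, fib(3) == 2, breaks both: one returns the wrong value,
# the other runs off the end of its list.
assert fib_branch(3) != 2
try:
    fib_table(3)
    table_failed = False
except IndexError:
    table_failed = True
assert table_failed
```

Whichever candidate we start from, the next failing test pushes us to the same place, which is why the choice between them matters less than it might seem.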

Complaint: TDDed tests are prescriptive

This is a complaint that TDDed code does exactly what the tests say it should do, so there might be bugs. If I write the wrong test, the reasoning goes, then it will drive me to write the wrong code.

When would we write the wrong test? Only when we misunderstand the problem. If we misunderstand the problem, and we go straight to the code, then we'll be encoding our incorrect understanding directly in the code. That's bad. By writing the tests first, we have some extra protection against misunderstanding: every assumption about what the system should do is encoded as a test, and each test has a good name.

Often, this will point out our confusion during the TDD process – we'll find that we want to write a test whose name contradicts another test's name. Even if we translate our misunderstanding into a bug, however, good test names make it easy to revisit our assumptions later. A subtle, five-character change to the code may have been driven by a sixty-character test name, which will be easier to understand.
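As a rough illustration of assumptions living in test names, here is a sketch using Python's unittest module. The test names and the stand-in fib are my own choices for this example, not the tests from the previous post.

```python
import unittest


def fib(n):
    # Stand-in implementation so the tests have something to run against.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


class FibonacciTest(unittest.TestCase):
    # Each name states one assumption about the problem in plain words.
    def test_sequence_starts_at_zero(self):
        self.assertEqual(fib(0), 0)

    def test_second_number_is_one(self):
        self.assertEqual(fib(1), 1)

    def test_each_later_number_is_the_sum_of_the_previous_two(self):
        self.assertEqual(fib(7), fib(6) + fib(5))


# Run the suite explicitly so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FibonacciTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If we later believed the sequence should start at one, we'd find ourselves writing a test named something like test_sequence_starts_at_one, and the clash with the existing name would be visible before any code changed.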

Complaint: Choosing tests is hard

When TDDing Fibonacci, I tested fib(0) first. Why did I test fib(1) next instead of fib(37) or fib(51)?

Because it was obvious! The problem domain of a unit test is necessarily small, so it's usually clear what the next step is. If the next step isn't clear, it probably means that the unit under test is too large (making it hard to think about extending it for another case), or that we don't understand the problem well enough (making it hard to think about what the code should do at all). In either case, TDD has just helped us: it's either pointed out a bad design, which we should fix, or it's pointed out a gap in our knowledge about the problem, in which case we should put the keyboard away and fill that gap.

Complaint: The code you TDDed was bad

The particular code I came up with in my last blog post was a slow, recursive Fibonacci solution. Two people mentioned this in the comments.

TDD doesn't solve problems like "my run time is superlinear" or "my database loads aren't eager enough." It's not supposed to solve those problems! TDD frees us to solve those hard problems well by (1) pushing us toward a good, decoupled design and (2) providing us with large, fast test suites.
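Here's a sketch of what that freedom looks like for Fibonacci, assuming the existing tests are a handful of small input/output pairs. The check helper is a hypothetical stand-in for the real suite, and the iterative version is one common way to get linear time, not something the tests themselves would have produced.

```python
def fib_recursive(n):
    # The slow version: exponential time, but easy to arrive at test by test.
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    # A linear-time replacement: carry the last two values forward.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def check(fib):
    # Stand-in for the existing test suite.
    for n, expected in [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (10, 55)]:
        assert fib(n) == expected


# The same tests pass before and after the performance fix.
check(fib_recursive)
check(fib_iterative)
```

The tests didn't tell us to make this change, but they let us make it without fear.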

Complaint: TDD requires too much typing

This one has the easiest answer of all: typing is not the bottleneck. Just think about it for a minute. Go back and look at how many lines of code you actually generated yesterday. How long would it take you to type it all in one long burst? A few minutes? Seriously, typing is not the bottleneck.

TDD is not magic

Let's recap:

  • Complaint: You weren't stupid enough
  • Response: There's more than one legitimate definition of "stupid".
  • Complaint: TDDed tests are prescriptive
  • Response: This is a feature. Stating our assumptions up front exposes misunderstandings.
  • Complaint: Choosing tests is hard
  • Response: This is also a feature. It tells us that our design is bad or that we don't understand the problem.
  • Complaint: The code you TDDed was bad!
  • Response: TDD does not free us from thinking. TDD is not magic.
  • Complaint: It's too much typing.
  • Response: Typing is not the bottleneck.

Many complaints about TDD are complaints that it doesn't solve some problem. These are not problems with TDD – it's not supposed to solve every problem!

Dynamic languages don't make coffee, continuous integration doesn't shine shoes, and TDD doesn't make code scale. It's simply the basis of a solid, disciplined process for building software – a beginning, not an end.



Showing 15 comments

Posted by Jason Gorman at Tue Nov 10 06:07:33 2009

Some very good points.

Just wanted to ask, though: are the first two code snippets directly equivalent? My Ruby's a bit hazy.

Surely just "return n" would be the simplest analog to the first code snippet?

It's a very well-made point that if you can't write the right test, you don't understand the problem.

When choosing the next test, I like to think of rock climbing. It's like choosing the next foothold. What would be the smallest, safest move I could make that would be progress towards the summit?


Posted by Jason Gorman at Tue Nov 10 06:08:19 2009

...and when I say "Ruby", I mean Python, obviously ;-)


Posted by Carl Manaster at Tue Nov 10 06:30:08 2009

To your point that "typing is not the bottleneck," I would add that the solutions to the problem - if any - of typing include better languages and better development (ie typing) environments.  If typing is a problem, it's not just a problem when writing tests.  Solve it, make it not a problem - and drive your design with tests.


Posted by Alan Francis at Tue Nov 10 12:02:11 2009

I'd just mention that we did define "simple" a long time ago.  TDD, in case anyone isn't familiar, began as two of twelve practices which made up a methodology called "Extreme Programming" - "Test-First Programming" and "Refactoring" ended up merged as "Test-Driven Development".

The XP definition (and some alternatives) of simple is here on Ward's Wiki: http://c2.com/cgi/wiki?XpSimplicityRules

-- AlanFrancis (greybeard TDDer :)


Posted by odrzut at Tue Nov 10 13:30:45 2009

@Jason Gorman they are equivalent if the argument is 0, 1, or 2 :)


Posted by gregK at Wed Nov 11 13:33:59 2009

You did not address the essence of the problem. 

Your refactoring was not a valid one since it was not a structure preserving transformation.  You had an incorrect implementation that passed all the tests (already scary enough) and changed the code to the correct solution that passes all the tests and called it a refactoring.  Why? The test cases certainly did not drive you that way.

So you have shown:

1. The passing test cases do not ensure that the code is correct.
2. That TDD did not drive you to the correct solution.  You magically changed it in the last step even though you had no way of knowing that it was incorrect from the test cases.

This is the best proof by contradiction on TDD I have seen.


Posted by Jon at Wed Nov 11 21:08:01 2009

You still haven't addressed the biggest problem: your TDD example produces code that would perform terribly in production, and doesn't even test for those cases. (See below on how to fix that part)


Posted by Gary Bernhardt at Wed Nov 11 21:41:17 2009

Greg,

Regarding point (1): Any finite set of tests executing against an algorithm with an infinite domain will generate false positives for an infinite set of incorrect implementations. Your point (1) dismisses tests that don't ensure correctness, which means that it dismisses all finite tests, presumably in favor of proofs. You should have extended your concluding line to say (preserving your grammar) "This is the best proof by contradiction on tests I have seen."
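A small sketch of that claim, assuming a suite that only checks n = 0 through 4. The impostor function is invented for this illustration, and there are infinitely many others like it.

```python
def fib(n):
    # Reference implementation: iterate the recurrence.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def impostor(n):
    # Correct only on the inputs the suite happens to check.
    known = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
    return known.get(n, -1)


tested_inputs = [0, 1, 2, 3, 4]

# The finite suite cannot distinguish the impostor from the real thing...
assert all(impostor(n) == fib(n) for n in tested_inputs)

# ...even though they disagree on the very next input.
assert impostor(5) != fib(5)
```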

I'm perfectly willing to admit that I think tests are, at a fundamental level, the wrong solution to the problem of quality. I just don't know of a better solution that can be employed for commercial software in 2009. I've never even heard of someone building modern commercial software with rigorous proofs, but I have heard of many people building such software without tests or proofs, and I'd like to help those people move to something better.

Regarding point (2): Finite test suites cannot show the correctness of solutions across infinite domains. If we only accept proofs of correctness, we can't accept test suites at all, which means that the particular method used to arrive at the test suite is entirely irrelevant.

If, however, we accept that finite test suites are useful in practice, if not mathematically rigorous, and that TDD generates test suites effectively, then we have to admit TDD as a valid method. The question then becomes "is TDD more effective at generating high-quality code and tests than alternate methods?" My intuition before learning TDD was that "no, it can't be," as yours clearly is. My experience, however, after humoring TDD to understand why people spoke so highly of it, was that it was more effective, much to my surprise.

There is no double-blind, empirical study backing this up. There is no underlying mathematical model from which it was derived. It's just something that we've found to work. When presented with the problem of building commercial software, our choices are effectively (1) write no tests, (2) write tests afterward, and (3) write tests before. "Write code with proofs" is not on the table as of 2009. I happen to have found TDD more effective than "build-then-test", despite my skepticism going into it. The only thing I have to offer is that anecdote, and those of many others.

The primary opposition to those anecdotes is complaints from people who have not done TDD (where "done TDD" is defined as "written at least 1,000 TDDed tests"). You have sided with those who have not become competent at the technique, but dismiss it anyway. I have sided with those who have written code before tests, code after tests, and code without tests, and have made a choice based on weighing each of these experiences.


Posted by Gary Bernhardt at Wed Nov 11 21:45:26 2009

Jon,

Please see my follow-up blog post, where I address exactly the complaint that you have just claimed that I "haven't addressed". I'll even excerpt it for you:

"""
Many complaints about TDD are complaints that it doesn't solve some problem. These are not problems with TDD – it's not supposed to solve every problem!
"""

TDD is not a magical process that produces perfect code, flawless in every way, immutable for all time. It produces working, cleanly-designed code along with a comprehensive test suite. The test suite allows us to improve the code along axes that are orthogonal to those TDD addresses, such as scalability.


Posted by Gary Bernhardt at Wed Nov 11 21:53:24 2009

Wow, I just noticed that this is the follow-up blog post. Jon, did you actually read it? I explicitly said that performance is beyond the scope of TDD's capabilities. No one claims that TDD magically writes perfect code and that you never have to engage your brain. Or, at least, no one worth listening to.


Posted by Andrew Dalke at Thu Nov 12 05:36:10 2009

Thank you sincerely for taking the time to address the points I raised in my comments to your previous post on this topic.

I have been thinking about the concept of simplicity. I read AlanFrancis' pointer to http://c2.com/cgi/wiki?XpSimplicityRules but saw nothing there which directly addresses this. On the topic "WhatIsSimplest" a clarification is "apply your own educated definition of simplicity until you're out of ideas". WardCunningham in "SimplestOrEasiest" writes "We choose simple solutions first so that we can maintain focus on the customer's problem" but that seems like it's not unit-test related because TDD is a developer approach, not a customer approach.

Both of these seem outside the context of your Fibonacci example. I still recall in college doing the analysis of computing the sequence iteratively and recursively, and testing it out on real code, and I've likely implemented the code a few dozen times in my life. I know what the error possibilities and the edge cases are, and I know that only a few tests are needed to check for those.

In a discussion about simplicity then, the simplest solution for me is one where I don't have to think about Fibonacci-like solutions which are deliberately wrong and serve only as a way to implement approximation functions to the real solution.

When I wrote "a lot of typing" I didn't mean as you inferred only the typing time. I also mean the time needed to think up those solutions, only to have them thrown away.

This does generalize because there is a large set of problems where I have enough experience to get a feel for what the solution will look like even when I haven't coded it up before. It's far simpler for me to work along the track of the solution I expect than to strike off on a tangent I know will be wrong, like "return n" would be for one of your implementations. I don't even know if I would have thought of that as a solution because it's so obviously not right to me. (It being a linear solution to an exponential problem, over a domain where a linear solution is not a good approximation.)

In short, my definition of "simplest" would be "least mental strain" as given on that "XpSimplicityRules" page. You characterized my comment as saying you weren't "stupid enough" but I don't think of these as anywhere near equivalent statements.

I think you misunderstood my use of the word "prescriptive" and indeed I may have used it wrong, so let me clarify my meaning. I was thinking specifically of  developing a scientific theory or mathematical proof. Observations constitute test cases. If I have a new theory of behavioral dynamics in elk then I have plenty of observational data which it must agree with. However, that is not sufficient. It must make predictions about future dynamics. That's what keeps you honest, as you might have overfit the theory so it's perfect for the observed data and that's all.

(This happens with stock market analysis, where some high-order equation describes a time period nearly perfectly, then fails elsewhere.)

Mapping that into testing, perhaps this is more validation testing rather than unit testing, so might be outside the realm of TDD, which insists that newly added tests must fail.

But I would insist that my own tests for Fibonacci sequence generation include high-numbered values which I added, expecting them to work on my final code, and whose failure would tell me I got something wrong.

My tests would also include values which would test if there was a problem with 32 bit integers (fib(48) or higher) and 64 bit integers (fib(94) or higher) since my experience says those are common failure points, and a future refactoring, say to Cython with specified integer types, might cause a future problem. (This is not YAGNI - I would insist on those tests solely for the first reason.)
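A sketch of what such boundary tests could look like, assuming the boundaries meant are those of unsigned 32 bit and 64 bit integers. The iterative fib here is just a stand-in implementation. Python integers are arbitrary precision, so these assertions pass as written; their value would show up after a port to fixed-width types.

```python
def fib(n):
    # Stand-in implementation: iterate the recurrence.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# fib(48) is the first value too large for an unsigned 32 bit integer.
assert fib(47) < 2**32 <= fib(48)
assert fib(48) == 4807526976

# fib(94) is the first value too large for an unsigned 64 bit integer.
assert fib(93) < 2**64 <= fib(94)
assert fib(94) == 19740274219868223167
```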

Incidentally, adding those tests would have forced you to address the performance problems in the recursive solution, since fib(35) recursively takes 10 seconds on my laptop, and unit tests are supposed to be fast.

This leads to my testing approach, which is to build up unit tests (which fail first), corner-case tests (not all of which fail first, but where I'm not sure whether I handled the case correctly), and falsifiability tests where I try something in the expected applicability domains assuming it will pass, and include oddball cases out of a sense of perversion.

These last are not part of TDD.


Posted by Andrew Dalke at Thu Nov 12 12:28:14 2009

Wanted also to comment about this: "There is no double-blind, empirical study backing this up. There is no underlying mathematical model from which it was derived. It's just something that we've found to work. When presented with the problem of building commercial software, our choices are effectively (1) write no tests, (2) write tests afterward, and (3) write tests before."

How have you found it to work? The reason for double-blind, empirical studies is that it's so easy to fool oneself. That's not saying it's the only way, but anecdotal accounts are a poor basis for conclusions, and fraught with problems like bias error (only the good results are reported) and the "No True Scotsman" fallacy (1,000 TDD tests before one can draw a conclusion?) The well-known saying is "the plural of anecdote is not data".

You also present here a false trilemma. It's completely possible, as I argue, to write some tests before, some tests during, some tests afterwards, AND to leave some parts of the code completely untested. (Notoriously, handling out-of-memory errors.)


Posted by Steve Weller at Thu Nov 12 20:58:14 2009

Simple is easy to define: anything not complex is simple. So the problem shifts to defining complexity.

Complexity is the ability to hide defects. Anything that demonstrably hides defects is complex and anything that is suspected to be complex can be shown to be so if at least one hidden defect is found.

Why test simple things? Because you cannot prove a negative, that they do not hide defects, and therefore are not simple after all.


Posted by Gary Bernhardt at Thu Nov 12 23:00:27 2009

Andrew,

I think I was a bit too vague about "simplicity". I was referring mostly to the low-level, minute-by-minute definition of syntactical simplicity – the question of "which is simple, an if or an array index?"

When you begin to address simplicity as "amount of work done to implement software", you step into a higher level of discussion about how to solve the problem, not how to make the test pass. That's a fine discussion to have, but it's not what I was originally referring to. :)

The issue of "deliberately wrong" intermediate forms hits at the core of many misperceptions about TDD. For most of the code you write, most of the time, you don't know the correct solution ahead of time. This is where TDD shines – if you know how to write the test, you at least know the next small step; if you don't know how to write the test, you don't even know what you're trying to achieve, and need to stop and think hard. This happens a lot, and it's a signal that you're probably about to descend into a design quagmire.

I suspect that much of this applies much less in scientific computing, where you're often either doing exploratory programming (where I've found TDD to work terribly) or implementing a complex algorithm (where TDD can work just fine if you're used to it and have a good direction, but can also lead to horrible quagmires). It seems as though you do a lot of scientific computing, in which case pure TDD might often get in the way, although it can surely be applied quite well to a subset of the work, if the proper judgement about when and where is applied.

Regarding generalizing a solution you can foresee, I think it's important to realize just how fast my early "stupid" tests happen. Sometimes, the first one takes a while, because I must carefully consider what my API is going to look like. The next few tests, which are stupid things like "return n", may literally only survive for five to ten seconds each. I have my editor configured so that, with two keystrokes, it saves the file, runs the tests, jumps the cursor to the failing line in the correct file, and prints a red or green bar at the bottom of the screen along with the name of the exception. That takes less than a second. It's very hard to get a sense of this from reading a textual listing of the code, where each revision seems equal. :)

Of course, I wasn't that fast at the beginning. TDD will slow you down a lot at the very beginning, but for me it was justified by how much the process taught me about OO design in such a small amount of time.

Regarding tests for large Fibonacci values and border cases like integer boundaries: I would absolutely add those tests! The original post's example was not a complete process for the creation of a test suite; it was only illustrating the transitions between steps during the TDD process. After the tests drove the first fully-functional design out, I'd add exactly the types of tests you describe. These wouldn't fail at first, but that's fine; TDD doesn't preclude such things, they're just outside its scope. What I would do, to make sure the tests were honest, is to intentionally break the code, watch them fail (probably along with several other tests), then unbreak the code. This gives me at least some of the confidence that TDD does – I know that something is actually being tested. Again, because of my editor and environment, this break-test-unbreak process would take around five seconds. As you point out, these large tests would force consideration of performance, but that force would not be exerted by TDD, as my post and your comment both point out. :)
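The break-test-unbreak step can be sketched like this. The broken_fib variant and the large_value_check helper are invented for the example; in practice the break would be a temporary edit to the real code rather than a second function.

```python
def fib(n):
    # Working implementation.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def broken_fib(n):
    # Deliberate break: the second seed value is wrong.
    a, b = 0, 2
    for _ in range(n):
        a, b = b, a + b
    return a


def large_value_check(f):
    # A test added after the fact; it passes immediately against fib.
    return f(50) == 12586269025


# The new test passes on the real code...
assert large_value_check(fib)

# ...and fails once the code is broken, so it is testing something.
assert not large_value_check(broken_fib)
```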

It sounds to me like you are very close to TDD, but just not using the word. You're right that the last stages of your process are not TDD, but this goes back to the core of this blog post: TDD is not a self-contained method for the generation of software; it's a scaffold to drive out behavior in a way that encourages good design and provides a very rich test suite. It would be crazy to expect that TDD alone, with no other practices, could produce production-quality software.

Regarding empirical studies etc., I'd like to note that, as far as I know, none of the software practices I've ever used on a day-to-day basis were taken up because of scientifically valid data. Even the best parts of our field are at a level of rigor that is slightly better than your average alternative medicine practice. This pains me, but I don't know what to do about it. I could go do a PhD in the field, but my contribution would be narrow almost by definition. :) For now, I'm learning everything I can, and trying everything I can in real-life code. I actually switched to TDD in the middle of an 18,000 line app I was building by myself at a startup. That's where I vetted it and, although I certainly had many pain points as I learned, I think it made a positive contribution even in the short term.

Explaining exactly why I thought it helped would take a lot of words, many of which I've forgotten in the time since then. The first big thing I noticed was that it was much less stressful – I didn't have to constantly think of how the code I was writing would impact the entire system. (This is largely because of the isolation and decoupling that TDD forces.) Eventually, I'd have to address integration, of course, but not at every moment.

The next great benefit was when I realized that in just a couple months, TDD had completely changed my design style, teaching me to recognize tight coupling much earlier than I ever could before, and to separate concerns much more carefully. Looking at code I'd written just a couple months before, I could immediately spot coupling points that I had missed completely before I started TDD.

The "1,000 tests" bar is somewhat arbitrary, but many people seem to consider 1,000 to 1,500 tests to be a reasonable average for competency. You have to realize that these are small tests. :)

The false trilemma was really a syntactic mistake on my part. I didn't mean to imply that one can't write tests both before and after the design is complete (I do this myself, as does everyone I know who practices TDD). It was just an attempt to point out that some people have written tests first and almost uniformly found it useful, while others have not done so and claim that it can't possibly be useful – a very bold claim.

This is closely related to the "1,000 tests" bar: if someone hasn't practiced a technique to competency, but will claim that it can't possibly work, despite the fact that I have reached competency and have found it to work wonderfully, then I will probably not take them seriously. :)


Posted by Andrew Dalke at Sat Nov 14 17:11:57 2009

"Simplicity" is a complex thing. To me, an array index is the simplest solution to this problem, for the tests you wrote. But the c2 wiki has plenty of discussion on that, although I will point out that AlanFrancis pointed out the XP definition, which does not seem to apply to this topic at all.

You wrote: "For most of the code you write, most of the time, you don't know the correct solution ahead of time. This is where TDD shines"

I would like to see some example of that, even if it's a long one. The most famous counterexample would be Ron Jeffries' attempt at solving Sudoku. He didn't know the correct solution ahead of time and TDD did not help. Commentary from at least one TDD proponent said that TDD is not meant for algorithm development, but that's exactly the case you're promoting it for here.

It is true that I do a lot of scientific programming, but very little of it is in novel algorithm development. A lot of it is wrapping existing tools, and being paranoid enough to check for failure cases which are likely to occur, and trying to figure out some way to induce those failures to happen. Another part of it is developing better APIs for existing tools, and in that case I start from the high-level because TDD focuses too quickly on minutia.

I understand well how fast TDD can be, but my complaints here are about the specific Fibonacci example you described. It was about as quick for me to come up with my solution as it would be for you to write the first 3 of your test cases.

We do code katas in our local Python user's group. I've been trying to do the TDD development but I find that it takes so long to write (and this is after about 12-15 attempts) and the end result, while well tested, is tedious to read through. At the last session, our TDD code ended up being about 2-3 times longer than that of those who used the interactive prompt to help lead them into the right solution, and didn't use tests. Our code grew and we never spent the time thinking about how to minimize the code, I think in part because it would mean throwing away half the tests we wrote, which weren't needed to solve the actual problem but were only needed to solve the intermediate steps leading towards the solution.

BTW, you mentioned also how it helped out with OO design, but I've seen nothing where TDD shows a preference towards OO design. For example, your Fibonacci solution doesn't use any sort of OO design.


You wrote: "These wouldn't fail at first, but that's fine; TDD doesn't preclude such things, they're just outside its scope."

I don't understand that at all. TDD seems to absolutely preclude writing tests which don't fail. Every place I've seen says that the tests must fail, otherwise there's no need to add a given feature. Then again, when they say "feature" they seem to mean something like "implement fib(n)" and not "implement fib(0)", "implement fib(1)", "implement fib(2)", and so on, so the tests you showed weren't really directly mapped to the feature request. Which is I think my point.

You wrote: "It sounds to me like you are very close to TDD". I can assure you that I am not. I rarely start writing the automated tests until I'm most of the way through figuring out how the pieces should go together. I've tried to do TDD and I find I spend time working on tests that I end up throwing away because I've decided to change the API, get rid of classes, and otherwise change things around. I also find that higher-level tests are as effective as the minute ones that TDD pushes - backed of course by my experience on what makes a good test, along with code coverage and other instrumentation.

On the topic of "as far as I know, none of the software practices I've ever used on a day-to-day basis were taken up because of scientifically valid data", one of the things which came up in our local Python User's Group was the idea of small functions and methods. This came from one of the members reading Robert Martin's "Clean Code", which has classes with dozens of methods, each with about 3 lines in it. To me it reads like a bunch of go-to statements where I have to jump some place else to see what's really going on, and where method names are only used once in the entire program. I would rather have 1 20-line function than 10 2-line functions (plus the function definition lines).

The idea he promulgates is not new. It goes back to the Smalltalk community. On the other hand, if you read "Code Complete", McConnell points out that the research which has been done says that there's no observable difference during testing up until about 200 lines of code.

So what I'm interpreting based on what you've written is that TDD taught you good development practices. I seem to have many of the same development practices, but I do it without TDD, so I think it's the practices which are making you (and me) successful, and not specifically using TDD.

Best regards - Andrew Dalke

