Testing, Debugging and Optimization

Conrad Huang

April 15, 2013

Portions Copyright © 2005-06 Python Software Foundation.

Introduction

  • The more you invest in quality, the less time it takes to develop working software [Glass 2002]
  • Quality is not just testing
    • “Trying to improve the quality of software by doing more testing is like trying to lose weight by weighing yourself more often.” (Steve McConnell)
  • Quality is:
    • Designed in
    • Monitored and maintained through the whole software lifecycle
  • This lecture looks at basic things every developer can do to maintain quality

Limits to Testing

  • Suppose you have a function that compares two 7-digit phone numbers, and returns True if the first is greater than the second
    • (10^7)^2 = 10^14 possible inputs
    • At ten million tests per second, that's 10^7 seconds, or about 116 days
  • If they're 7-character alphabetic strings (26 possible letters), that's 26^14 ≈ 6.5 × 10^19 inputs, or roughly 200,000 years
    • Then you move on to the second function…
  • And how do you know that your tests are correct?
  • All a test can do is show that there may be a bug
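
The arithmetic behind these estimates can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the 26-letter alphabet and the helper name exhaustive_test_time are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TESTS_PER_SECOND = 10 ** 7   # ten million tests per second

def exhaustive_test_time(values_per_argument):
    '''Seconds needed to try every pair of inputs to a two-argument function.'''
    total_inputs = values_per_argument ** 2
    return total_inputs / float(TESTS_PER_SECOND)

# Two 7-digit numbers: 10^7 possible values for each argument.
phone_days = exhaustive_test_time(10 ** 7) / SECONDS_PER_DAY

# Two 7-letter strings: 26^7 possible values for each argument (assumed alphabet).
string_years = exhaustive_test_time(26 ** 7) / (365.0 * SECONDS_PER_DAY)

print('7-digit numbers: about %d days' % round(phone_days))
print('7-letter strings: about %d years' % round(string_years))
```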

Terminology

  • A unit test exercises one component in isolation
    • Developer-oriented: tests the program's internals
  • An integration test exercises the whole system
    • User-oriented: tests the software's overall behavior
  • Regression testing is the practice of rerunning tests to check that the code still works
    • i.e., make sure that today's changes haven't broken things that were working yesterday
    • Programs that don't have regression tests are difficult (sometimes impossible) to maintain [Feathers 2005]

Test Results and Specifications

  • Any test can have one of three outcomes:
    • Pass: the actual outcome matches the expected outcome
    • Fail: the actual outcome is different from what was expected
    • Error: something went wrong inside the test (i.e., the test contains a bug)
      • In this case, we learn nothing about the system being tested
  • A specification is something that tells you how to classify a test's result
    • You can't test without some sort of specification

Structuring Tests

  • How to write tests so that:
    • It's easy to add or change tests
    • It's easy to see what's been tested, and what hasn't
  • A test consists of a fixture, an action, and an expected result
    • A fixture is something that a test is run on
    • Can be as simple as a single value, or as complex as a networked database
  • Every test should be independent
    • I.e., the outcome of one test shouldn't depend on what happened in another test
    • Otherwise, faults in early tests can distort the results of later ones
  • So each test:
    • Creates a fresh instance of the fixture
    • Performs the operation
    • Checks and records the result
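
The three steps can be sketched as follows. The fixture-building function and the choice of list.sort as the operation under test are illustrative assumptions, not something the lecture prescribes.

```python
def make_fixture():
    '''Build a brand-new list for every test so tests cannot affect each other.'''
    return [3, 1, 2]

def test_sort_ascending():
    values = make_fixture()          # create a fresh fixture
    values.sort()                    # perform the operation
    return values == [1, 2, 3]       # compare with the expected result

def test_sort_descending():
    values = make_fixture()          # unaffected by the previous test's sort
    values.sort(reverse=True)
    return values == [3, 2, 1]

results = {}
for test in [test_sort_ascending, test_sort_descending]:
    results[test.__name__] = test()  # record the result

for name in sorted(results):
    print('%s: %s' % (name, 'pass' if results[name] else 'fail'))
```

Because each test calls make_fixture itself, the tests can be run in any order, or one at a time, with the same outcome.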

A Simple Example

  • Test string.startswith
    • Specification: returns True if the string starts with the given prefix, and False otherwise
    • Hm… What if the prefix is the empty string?
  • Store the tests in a table
    • Easy to read and add to
      Tests = [
          # String  Prefix  Expected
          ['a',     'a',    True],
          ['a',     'b',    False],
          ['abc',   'a',    True],
          ['abc',   'ab',   True],
          ['abc',   'abc',  True],
          ['abc',   'abcd', False],
          ['abc',   '',     True]
      ]
    • The string and the prefix together are the fixture
  • Now run them
      passes = 0
      failures = 0
      for (s, p, expected) in Tests:
          # Run the operation on the fixture and compare with the expected result.
          actual = s.startswith(p)
          if actual == expected:
              passes += 1
          else:
              failures += 1
      print 'passed', passes, 'out of', passes+failures, 'tests'
    • Hm… Where's the code to handle and report errors in the tests themselves?

Catching Errors

  • Python uses exceptions for error handling
    • Separates normal operation from error handling
    • Makes both easier to read
  • Structured like if/else
    • Code for healthy case goes in a try block
    • Error handling code goes in a matching except block
  • When something goes wrong in the try block, Python raises an exception
  • Can add an optional else block
    • Executed when things don't go wrong inside the try block

Simple Exception Example

  • Try dividing by zero and some non-zero values:
    for num in [-1, 0, 1]:
        try:
            inverse = 1/num
        except:
            print 'inverting', num, 'caused error'
        else:
            print 'inverse of', num, 'is', inverse
    
    inverse of -1 is -1
    inverting 0 caused error
    inverse of 1 is 1
    

[Figure: Flow of Control in Try/Except/Else]

Exception Objects

  • When Python raises an exception, it creates an object to hold information about what went wrong
    • Typically contains an error message
  • Can choose which errors to handle by specifying an exception type in the except statement
    • E.g., handle division by zero, but not out-of-bounds list index
      # Note: mix of numeric and non-numeric values.
      values = [0, 1, 'momentum']

      # Note: top index will be out of bounds.
      for i in range(4):
          try:
              print 'dividing by value', i
              x = 1.0 / values[i]
              print 'result is', x
          except ZeroDivisionError, e:
              print 'divide by zero:', e
          except IndexError, e:
              print 'index error:', e
          except:
              # A naked except has no exception object of its own,
              # so there is no message to print here.
              print 'some other error'

      dividing by value 0
      divide by zero: float division
      dividing by value 1
      result is 1.0
      dividing by value 2
      some other error
      dividing by value 3
      index error: list index out of range
      
  • The except blocks are tested in order—whichever matches first, wins
    • If a “naked” except appears, it must come last (since it catches everything)
    • Generally better to use except Exception, e so that you have the exception object

Exception Hierarchy

  • Exceptions are organized in a hierarchy
    • e.g., ZeroDivisionError, OverflowError, and FloatingPointError are all types of ArithmeticError
    • A handler for the general type catches all its specific sub-types
    Name                Purpose
    ------------------  -----------------------------------------------------
    Exception           Root of exception hierarchy
    ArithmeticError     Illegal arithmetic operation
    FloatingPointError  Generic error in floating point calculation
    OverflowError       Result too large to represent
    ZeroDivisionError   Attempt to divide by zero
    IndexError          Bad index to sequence (out of bounds or illegal type)
    TypeError           Illegal type (e.g., trying to add integer and string)
    ValueError          Illegal value (e.g., math.sqrt(-1))
    EnvironmentError    Error interacting with the outside world
    IOError             Unable to create or open file, read data, etc.
    OSError             No permissions, no such device, etc.

    Table 11.1: Common Exception Types in Python
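
A small sketch of how a handler for a general type catches its specific sub-types. The helper name describe is made up for this example, and the code uses the "except ... as e" spelling, which works in Python 2.6+ and Python 3, rather than the older "except ..., e" form used in the slides.

```python
import math

def describe(operation):
    '''Run operation() and report which arithmetic error, if any, it raised.'''
    try:
        return 'result: %r' % (operation(),)
    except ArithmeticError as e:
        # One handler for the general type catches every specific sub-type.
        return 'caught %s' % type(e).__name__

print(describe(lambda: 1.0 / 2))            # no error
print(describe(lambda: 1.0 / 0))            # raises ZeroDivisionError
print(describe(lambda: math.exp(1000.0)))   # raises OverflowError

# The relationship can also be checked directly.
print(issubclass(ZeroDivisionError, ArithmeticError))
```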

Functions and Exceptions

[Figure: Stacking Exception Handlers]

  • Each time Python enters a try/except block, it pushes the except handlers on a stack
    • Just like the function call stack
  • When an exception is raised, Python searches this stack for the top-most matching handler
    • Often means jumping out of the middle of a function
      def invert(vals, index):
          try:
              vals[index] = 10.0/vals[index]
          except ArithmeticError, e:
              print 'inner exception handler:', e

      def each(vals, indices):
          try:
              for i in indices:
                  invert(vals, i)
          except IndexError, e:
              print 'outer exception handler:', e
      
      # Once again, the top index will be out of bounds.
      values = [-1, 0, 1]
      print 'values before:', values
      each(values, range(4))
      print 'values after:', values
      
      values before: [-1, 0, 1]
      inner exception handler: float division
      outer exception handler: list index out of range
      values after: [-10.0, 0, 10.0]
      

Raising Exceptions

  • Use raise to trigger exception processing
    • Specify the type of exception you're raising using raise Exception('this is an error message')
    • Please make your error messages more informative…
      for i in range(4):
          try:
              if (i % 2) == 1:
                  raise ValueError('index is odd')
              else:
                  print 'not raising exception for %d' % i
          except ValueError, e:
              print 'caught exception for %d' % i, e
      
      not raising exception for 0
      caught exception for 1 index is odd
      not raising exception for 2
      caught exception for 3 index is odd
      

Exceptional Style

  • Always use exceptions to report errors instead of returning None, -1, False, or some other value
    • Allows callers to separate normal code from error handling
    • And sooner or later, your function will probably actually want to return that “special” value
    • Note: Python's own str.find breaks this rule
      • Returns -1 if something can't be found
  • Throw low, catch high
    • i.e., throw lots of very specific exceptions…
    • …but only catch them where you can actually take corrective action
    • Because every application handles errors differently
      • If someone is using your library in a GUI, you don't want to be printing to stderr

Handling Errors in Tests

  • Now we know how to check for errors in tests
    • Wrap the test in try/except
    Tests = [
        ['a',     'a',    False],    # wrong expected value
        ['a',     1,      False],    # wrong type
        ['abc',   'a',    True]      # everything legal
    ]
    
    passes = failures = errors = 0
    for (s, p, expected) in Tests:
        try:
            actual = s.startswith(p)
            if actual == expected:
                passes += 1
            else:
                failures += 1
        except:
            errors += 1
    
    print 'tests:', passes + failures + errors
    print 'passes:', passes
    print 'failures:', failures
    print 'errors:', errors
    
    tests: 3
    passes: 1
    failures: 1
    errors: 1
    
  • Note the deliberate errors in the test cases to exercise the testing code

Test-Driven Design

  • Tests are actually specifications
    • “Given these inputs, this code should behave the following way”
  • So write the tests first, then the application code
  • Sounds backward, but:
    • A great way to clarify specifications
      • I write the tests
      • “All” you have to do is write code that passes those tests
    • Gives programmers a definite goal
      • Coding is finished when all tests pass
      • Particularly useful when trying to fix bugs in old code, as it forces you to figure out how to re-create the bug
      • Helps prevent the “one more feature” syndrome
    • Ensures that tests actually get written
      • People are often too tired, or too rushed, to test after coding
    • Helps clarify the Application Programming Interface (API) before it is set in stone
      • If something is awkward to test, it can be redesigned before it's written

TDD Example

  • I want you to write a function that calculates a running sum of the values in a list
    • Doesn't specify whether to create a new list, or overwrite the input
    • Doesn't specify how to handle errors
  • You'd probably prefer something like this:
      Tests = [
        [[],        [],          'empty list'],
        [[1],       [1],         'single value'],
        [[1, 3],    [1, 4],      'two values'],
        [[1, 3, 7], [1, 4, 11],  'three values'],
        [[-1, 1],   [-1, 0],     'negative values'],
        [[1, 3.0],  [1, 4.0],    'mixed types'],
        ["string",  ValueError,  'non-list input'],
        [['a'],     ValueError,  'non-numeric value']
        ]
    • If the expected result is an exception, pass only if that exception is raised
    • If the test doesn't pass, print the comment so that the programmer knows what to look at
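
One way to run such a table, together with one implementation that passes it, might look like the sketch below. The names running_sum and run_tests are assumptions, and returning a new list instead of overwriting the input is just one of the choices the original request left open.

```python
Tests = [
    [[],        [],          'empty list'],
    [[1],       [1],         'single value'],
    [[1, 3],    [1, 4],      'two values'],
    [[1, 3, 7], [1, 4, 11],  'three values'],
    [[-1, 1],   [-1, 0],     'negative values'],
    [[1, 3.0],  [1, 4.0],    'mixed types'],
    ["string",  ValueError,  'non-list input'],
    [['a'],     ValueError,  'non-numeric value']
]

def running_sum(values):
    '''Return a new list of running totals (one possible reading of the spec).'''
    if not isinstance(values, list):
        raise ValueError('expected a list, got %r' % (values,))
    result = []
    total = 0
    for v in values:
        if not isinstance(v, (int, float)):
            raise ValueError('non-numeric value %r' % (v,))
        total += v
        result.append(total)
    return result

def run_tests(func, tests):
    '''Return the comments of all tests that did not pass.'''
    not_passed = []
    for (argument, expected, comment) in tests:
        try:
            passed = (func(argument) == expected)
        except Exception as e:
            # Pass only if the expected result is this kind of exception.
            passed = isinstance(expected, type) and isinstance(e, expected)
        if not passed:
            not_passed.append(comment)
    return not_passed

for comment in run_tests(running_sum, Tests):
    print('did not pass: %s' % comment)
```

Nothing is printed when every test passes; otherwise the comments point at the cases to look at.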

Design by Contract

  • Functions ought to carry their specifications around with them
    • Keeping specification and implementation together makes both easier to understand
    • And improves the odds that programmers will keep them in sync
  • A function is defined by:
    • Its pre-conditions: what must be true in order for the function to work correctly
    • Its post-conditions: what the function guarantees will be true if its pre-conditions are met
    • May also have invariants: things that are true throughout the execution of the function
  • Leads to a style of programming called design by contract
  • Pre- and post-conditions constrain how the function can evolve
    • Can only ever relax pre-conditions (i.e., take a wider range of input)…
    • …or tighten post-conditions (i.e., produce a narrower range of output)
    • Tightening pre-conditions, or relaxing post-conditions, would violate the function's contract with its callers

Assertions

  • Normally specify pre- and post-conditions using assertions
    • A statement that something is true at a particular point in a program
    • If the assertion's condition is not met, Python raises an AssertionError exception
  • For example:
    • Pre-condition: input argument is a non-empty list
    • Post-condition: returns two values from the list such that the first is less than or equal to the second
      def find_range(values):
          '''Find the non-empty range of values in the input sequence.'''
          # Pre-condition: the input is a non-empty list.
          assert (type(values) is list) and (len(values) > 0)
          left = min(values)
          right = max(values)
          # Post-condition: both results come from the list and are ordered.
          assert (left in values) and (right in values) and (left <= right)
          return left, right
      
  • Note that the post-condition isn't as exacting as it should be
    • Doesn't check that left is less than or equal to all other values, or that right is greater than or equal to all other values
    • The code to check the condition exactly is as likely to contain errors as the function itself
    • Which is one of the reasons design by contract isn't as popular as it might be
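
For comparison, a sketch of find_range with a more exacting post-condition. Here the stricter check happens to be short, but it walks the whole list a second time and mirrors what min and max already do, which illustrates the concern raised above.

```python
def find_range(values):
    '''Find the non-empty range of values in the input sequence.'''
    # Pre-condition: the input is a non-empty list.
    assert (type(values) is list) and (len(values) > 0)
    left = min(values)
    right = max(values)
    # Stricter post-condition: both results come from the list, and every
    # value in the list lies between them.
    assert (left in values) and (right in values)
    assert all(left <= v <= right for v in values)
    return left, right

print(find_range([3, 1, 2]))
```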

Defensive Programming

  • You can (and should) test for errors liberally
    • Even if you don't practice design by contract
    • Use assert or custom code that provides more information to help identify detected problems
  • Defensive programming is like defensive driving
    • Program as if the rest of the world is out to get you
    • “Fail early, fail often”
      • The less distance there is between the error and you detecting it, the easier it will be to find and fix
  • Good practice: every time you fix a bug, put in an error test and a comment
    • Because if you made the error, the right code can't be obvious
    • And you should protect yourself against someone “simplifying” the bug back in
    • def can_transmute(element):
        '''Can this element be turned into gold?'''
      
        # Bug #172: make sure the input is actually an element.
        assert is_valid_element(element)
      
        # Gold is trivial.
        if element is Gold:
            return True
      
        # Trans-uranic metals and halogens are impossible.
        if (element.atomic_number > Uranium.atomic_number) or \
           (element in Halogens):
            return False
      
        # Look for a sequence of steps that leads to gold.
        steps = search_transmutations(element, Gold)
        if steps == []:
            return False
        else:
            # Bug #201: must be at least two elements in sequence.
            assert len(steps) >= 2
            return True
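
As a sketch of the difference between a bare assert and checks that carry more information, the hypothetical helpers below attach the offending value to the error. The names check_fraction and blend are invented for this example.

```python
def check_fraction(name, value):
    '''Raise a descriptive error unless 0.0 <= value <= 1.0.'''
    if not (0.0 <= value <= 1.0):
        raise ValueError('%s must be between 0.0 and 1.0, got %r' % (name, value))
    return value

def blend(a, b, weight):
    '''Weighted average of a and b; weight is the share given to b.'''
    # An assertion can carry a message too, which makes failures easier to read.
    assert isinstance(weight, float), 'weight must be a float, got %r' % (weight,)
    check_fraction('weight', weight)
    return a * (1.0 - weight) + b * weight

print(blend(0.0, 10.0, 0.25))
```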
          

Summary

  • The real goal of “quality assurance” isn't to find bugs: it's to figure out where they're coming from, so that they can be prevented
  • But without testing, no one (including you) has any right to rely on the program's output
  • Just because a program passes all the tests does not mean it is of high quality
  • The only way to ensure quality is to design it in

Debugging

  • You're going to spend half your professional life debugging
    • So you should learn how to do it systematically
  • First, some simple rules
  • Then, two common debugging tools

Agans' Rules

  • Many people make debugging harder than it needs to be by:
    • Not going about it systematically
    • Becoming impatient
    • Using inadequate tools
  • Agans' Rules [Agans 2002] describe how to apply the scientific method to debugging
    • Observe a failure
    • Invent a hypothesis explaining the cause
    • Test the hypothesis by running an experiment (i.e., a test)
    • Repeat until the bug has been found

Rule 0: Get It Right the First Time

  • The simplest bugs to fix are the ones that don't exist
  • Design, reflect, discuss, then code
    • “A week of hard work can sometimes save you an hour of thought.”
  • Design and build your code with testing and debugging in mind
    • Minimize the amount of “spooky action at a distance”
    • Minimize the number of things programmers have to keep track of at any one time
    • Train yourself to do things right, so that you'll code well even when you're tired, stressed, and facing a deadline
  • “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” (Brian Kernighan)

Rule 1: What Is It Supposed to Do?

  • First step is knowing what the problem is
    • “It doesn't work” isn't good enough
    • What exactly is going wrong?
    • How do you know?
    • You will learn a lot by following execution in a debugger and trying to anticipate what the program is going to do next
  • Requires you to know how the software is supposed to behave
    • Is this case covered by the specification?
    • If not:
      • Do you have enough knowledge to extrapolate?
      • Do you have the right to do so?
  • Try not to let what you want to see influence what you actually observe

Rule 2: Is It Plugged In?

  • Are you actually exercising the problem that you think you are?
    • Are you giving it the right test data?
    • Is it configured the way you think it is?
    • Is it the version you think it is?
    • Has the feature actually been implemented yet?
    • Why are you sure?
      • Maybe the reason you can't isolate the problem is that it's not there (I wouldn't use this one too often)
  • Another argument in favor of automatic regression tests
    • Guaranteed to rerun the test the same way each time

Rule 3: Make It Fail

  • You can only debug things when they go wrong
  • So find a test case that makes the code fail every time
    • Then try to find a simpler one
    • Or start with a trivially simple test case that passes, then add complexity until it fails
  • Each experiment becomes a test case
    • So that you can re-run all of them with a single command
    • How else are you going to know that the bug has actually been fixed?
  • Use the scientific method
    • Formulate a hypothesis, make a prediction, conduct an experiment, repeat
    • Remember, it's computer science, not computer flip-a-coin

Alternatives

  • What if you can't make it fail reliably?
    • Problem involves timing, network load, etc.
    • Or you just don't know enough about the cause yet
  • Use post-mortem inspection
    • But then you have to reason backwards to figure out why the program crashed
  • Or logging
    • But this can distort the program's behavior
    • And you'll have to wade through a lot of irrelevant information

Rule 4: Divide and Conquer

  • The smaller the gap between cause and effect, the easier the relationship is to see
  • So once you have a test that makes the system fail, use it to isolate the faulty subsystem
    • Examine the input of the code that's failing
    • If that's wrong, look at the preceding code's input, and so on
  • Use assert to check things that ought to be right
    • “Fail early, fail often”
    • A good way to stop yourself from introducing new bugs as you fix old ones
  • When you do fix the bug, see whether you can add assertions to prevent it reappearing
    • If you made the mistake once, odds are that you, or someone, will make it again
  • Another argument against duplicated code
    • Few things are as frustrating as fixing a bug, only to have it crop up again elsewhere

Rule 5: One Change at a Time, For a Reason

  • Replacing random chunks of code is unlikely to do much good
    • If you got it wrong the first time, what makes you think you'll get it right the second? Or the ninth?
    • So always have a hypothesis before making a change
  • Every time you make a change, re-run all of your tests immediately
    • The more things you change at once, the harder it is to know what's responsible for what
    • And the harder it is to keep track of what you've done, and what effect it had
    • Changes can also often uncover (or introduce) new bugs

Rule 6: Write It Down

  • Science works because scientists keep records
    • “Did left followed by right with an odd number of lines cause the crash? Or was it right followed by left? Or was I using an even number of lines?”
  • Records are particularly useful when getting help
    • People are more likely to listen when you can explain clearly what you did

Rule 7: Be Humble

  • If you can't find it in 15 minutes, ask for help
    • Just explaining the problem aloud is often enough
    • “Never debug standing up.” (Gerald Weinberg)
  • Don't keep telling yourself why it should work: if it doesn't, it doesn't
    • Never debug while grinding your teeth, either…
  • Keep track of your mistakes
    • Just as runners keep track of their time for the 100 meter sprint
    • “You cannot manage what you cannot measure.” (Bill Hewlett)
  • And read [Zeller 2006] to learn more

Debugging Tools

  • Print statement
    • Easy to use, but…
  • Symbolic debugger
    • Very powerful, but…

What's Wrong with Print Statements

  • Many people still debug by adding print statements to their programs
  • It's error-prone
    • Adding print statements is a good way to add typos
    • Particularly when you have to modify the block structure of your program
  • And time-consuming
    • All that typing…
    • And (if you're using Java, C++, or Fortran) all that recompiling…
  • And can be misleading
    • Moves things around in memory, changes execution timing, etc.
    • Common for bugs to hide when print statements are added, and reappear when they're removed
  • But print statements can be extremely effective
    • Can be added with the same tools you already use for programming
    • Are less likely to hide bugs in interpreted languages
    • Can collect lots of data in a single run

Symbolic Debuggers

  • A debugger is a program that runs another program on your behalf
    • Sometimes called a symbolic debugger because it shows you the source code you wrote, rather than raw machine code
  • While the target program (or debuggee) is running, the debugger can:
    • Pause, resume, or restart the target
    • Display or change values
    • Watch for calls to particular functions, changes to particular variables, etc.
  • Do not need to modify the source of the target program!
    • Depending on your language, you may need to compile it with different flags
  • And yes, the debugger modifies the target's layout in memory, and execution speed…
    • …but a lot less than print statements… (maybe)
    • …with a lot less effort from you
  • But you need to invest the time to learn to use it well
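
Python's standard-library debugger is pdb. A program can be run under it without any changes using "python -m pdb script.py". The sketch below shows the other common approach, an explicit breakpoint. The environment variable DEBUG_NORMALIZE and the normalize function are hypothetical; the guard is there so that normal runs are not interrupted.

```python
import os
import pdb

def normalize(values):
    '''Scale values so that they sum to 1.0.'''
    total = sum(values)
    if os.environ.get('DEBUG_NORMALIZE'):   # hypothetical switch
        # Execution pauses here. At the (Pdb) prompt: p total prints a value,
        # n runs the next line, s steps into a call, c continues.
        pdb.set_trace()
    return [v / float(total) for v in values]

print(normalize([1, 1, 2]))
```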

Debugging Summary

  • Debugging is not a black art
  • Like medical diagnosis, it's a skill that can be studied and improved
  • You're going to spend a lot of time doing it: you might as well learn how to do it well

Optimization

  • Does my program work properly?
    • Think about optimization during design
    • Get your program to work before optimizing
  • How do I make my program run faster?
    • Where is my program spending all its time?
    • What can I do about it?
    • Is it worth your time?
  • First rule of optimization: “Measure, measure, measure.”
    • Don't guess!
    • Performance bottlenecks are often in unexpected parts of the code
    • It's not just how slow a particular function is, but also how many times that function is called
    • If you speed up code that takes 10% of run time by a factor of ten, total run time drops by 9%; if you speed up code that takes 50% of run time by a factor of two, total run time drops by 25%
    • Moral: optimize the right section of code
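
The arithmetic in the last example can be written out as a short function. The name time_saved is an assumption; the formula just says that the untouched part of the program takes as long as before, while the improved part shrinks by the speedup factor.

```python
def time_saved(fraction, speedup):
    '''Share of total run time saved when `fraction` of it runs `speedup` times faster.'''
    new_total = (1.0 - fraction) + fraction / speedup
    return 1.0 - new_total

# 10% of run time made ten times faster, versus 50% made twice as fast.
print('%.0f%% of run time saved' % (100 * time_saved(0.10, 10)))
print('%.0f%% of run time saved' % (100 * time_saved(0.50, 2)))
```

Even an infinitely large speedup of the 10% section could save at most 10% of the run time, which is why it matters where the time is actually being spent.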

Execution Profile

  • The execution profile of a program is a description of its run-time behavior
    • Different inputs generate different profiles
    • Sections of code that consume more computation time than others are known as “hot spots”
  • Use a profiler to identify hot spots for optimization
    • A profiler collects statistics on the execution profiles, e.g.,
      • counts the number of times a function is called
      • tracks how long the calls take
    • Data collection makes profiled program run slower than normal, sometimes a lot slower
  • Python's standard library includes the profile and cProfile profilers; line_profiler is a third-party package for line-by-line profiling
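
A minimal sketch of collecting a profile with cProfile and printing it with pstats, written for Python 3. The functions slow_sum and work are stand-ins for real code.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    '''Deliberately naive loop so that it shows up in the profile.'''
    total = 0
    for i in range(n):
        total += i * i
    return total

def work():
    return [slow_sum(10000) for _ in range(50)]

# Collect statistics only while work() is running.
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists, for each function, how many times it was called and how much time was spent in it, which are the two statistics mentioned above.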

Speeding Things Up

  • Replace hot spots with faster code
    • If a collection of data will be searched repeatedly, instead of a linear list, use a sorted list or a dictionary
    • There is usually a tradeoff between bookkeeping overhead and search speed when using “faster” data structures, e.g.,
      • Keeping a list sorted
      • Managing a dictionary
    • Which data structure is “best” depends on the data
  • Restructure the entire program
    • Take a different approach to solving your problem
    • Throw hardware at “embarrassingly parallel” problems
      • If you have access to a computing cluster, and the problem can be partitioned into multiple jobs easily, use one CPU per job to improve performance
      • Embarrassingly parallel tasks include:
        • Processing multiple independent data sets
        • Repeating simulations with different initial conditions
      • Write a shell script to:
        • Partition big job into a bunch of little jobs
        • Run the jobs, either directly or by submitting them to a batch job queue
        • Wait for all jobs to complete
        • Collate results
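
Returning to the first point above, on replacing a linear list with a sorted list or a dictionary: the sketch below times the same membership test three ways using the standard timeit and bisect modules. The data set and function names are illustrative, and the actual timings will vary from machine to machine.

```python
import bisect
import timeit

as_list = list(range(0, 100000, 2))   # even numbers, already in sorted order
as_dict = dict.fromkeys(as_list)
target = 99999                        # absent, so the linear scan is worst-case

def in_list(x):
    return x in as_list                  # linear scan of the whole list

def in_sorted_list(x):
    i = bisect.bisect_left(as_list, x)   # binary search; needs sorted data
    return i < len(as_list) and as_list[i] == x

def in_dict(x):
    return x in as_dict                  # hash lookup

for func in (in_list, in_sorted_list, in_dict):
    seconds = timeit.timeit(lambda: func(target), number=100)
    print('%-15s %.6f s for 100 lookups' % (func.__name__, seconds))
```

The sorted-list version only works if the list is kept sorted as items are added, and the dictionary uses extra memory, which is the bookkeeping tradeoff described above.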

Optimization Summary

  • Measure, Measure, Measure
  • Use efficient data structures when possible
  • Take advantage of multiple processors