Unstatistical hypothesis testing

Testing when you do not know what to test

Posted by Javier Mermet

on December 11, 2020 · 7 mins read

On your desk, your coffee's already gone cold. In the middle of the night, the cursor blinks almost defiantly on the screen, mocking you. You are in for a long night. Everything is alright, though. After learning about modern Python dependency management, you published your brand-new package called foobar. What a stroke of originality. It's been unexpectedly successful. Stars are skyrocketing on GitHub, but someone across the globe has opened an issue. So here we are now.

They were planning on running some A/B tests to see how their userbase responds to your brilliant and feature-complete package, only to find out that... there are no tests!

Testing

A quick recap on testing, first. It's code that tests other code. There, done. But what about unit testing? Let's go with Martin Fowler's take on the subject, since he knows a thing or two about this. Unit tests should:

consider a unit
be fast(er than other kinds of tests)
be written by programmers

And at this point you might be wondering: what is a unit? There are lots of resources out there covering this topic, and I don't want to add noise to the signal. But let's stick to the notion that it is a thingy. More specifically, an atomic thingy, to some degree of atomicity that lets you sleep at night.

I'm a hands-on learner, so let's start with a simple example of how to write some tests in Python. You have the following code in foobar.add:

from foobar.optimized_types import BigNumber


def add_numbers(x: BigNumber, y: BigNumber) -> BigNumber:
    """Add two big numbers and return a new big number.

    Parameters
    ----------
      x (BigNumber): The first operand.
      y (BigNumber): The second operand.

    Returns
    -------
      BigNumber: The sum of x and y.
    """
    ...  # brilliant implementation follows

How do you go about testing this? Hello pytest, my old friend, I've come to test with you again.

Some tests

Let's add some tests!

def test_add_numbers_zero_and_zero():
    assert add_numbers(0, 0) == 0


def test_add_numbers_one_and_zero():
    assert add_numbers(1, 0) == 1


def test_add_numbers_zero_and_one():
    assert add_numbers(0, 1) == 0

Notice anything wrong? Can you smell any code smells? These tests have (at least) a couple issues:

ill-defined tests
repeated code

While for the first item the issue is between the chair and the ergo split keyboard, the second one is where frameworks come to the rescue:

import pytest

# Each tuple is (first operand, second operand, expected sum).
data = [(0, 0, 0), (1, 0, 1), (0, 1, 1)]


@pytest.mark.parametrize("n1,n2,expected", data)
def test_add_numbers(n1, n2, expected):
    assert add_numbers(n1, n2) == expected

That's more like it. Don't you love the smell of deleted code in the morning?

Test properties, not cases

But, let's take a moment to think about this. Your function adds. You are testing just a few cases. When I first learned about testing I asked why you don't test all possible cases. The answer is pretty straightforward: time. So you usually test just a few edge cases and happy paths along with their expected outcomes, based on your domain and software knowledge. But what defines a sum operation to be correct?

Commutativity

$a + b = b + a$

Associativity

$a + (b + c) = (a + b) + c$

Identity

$a + 0 = a$

So, how do you go about testing this? Enter hypothesis:

[...] a Python library for creating unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for.

from hypothesis import given
from hypothesis import strategies as st

@given(st.integers(), st.integers())
def test_code_add_commutativity(a, b):
    assert add_numbers(a, b) == add_numbers(b, a)


@given(st.integers())
def test_code_add_identity(a):
    assert add_numbers(a, 0) == a


@given(st.integers(), st.integers(), st.integers())
def test_code_add_associativity(a, b, c):
    assert add_numbers(
        a,
        add_numbers(b, c)
    ) == add_numbers(
        add_numbers(a, b),
        c
    )

And that's it! Instead of testing cases you are now testing properties of your code. How cool is that?

So, what's property based testing?

Let's start with the funny definition:

The thing that QuickCheck does

And now, let's go for something more... comprehensive. What is it that QuickCheck does?

Assertions are written about logical properties that a function should fulfill
QuickCheck attempts to generate a test case that falsifies such assertions
QuickCheck tries to reduce it to a minimal failing subset by removing or simplifying input data that are unneeded to make the test fail

So, is this the same as fuzz testing? Well yes, but actually no. From the end user's point of view, you get a fuzzer and a library of tools that make it easy to construct property based tests on top of said fuzzer.

Dissecting our example

So, if you take a look at our example above, you'll see that you decorate tests with given to indicate an entrypoint for hypothesis. You are also using a strategy to generate integers. Then you write your asserts as usual!
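To make those moving parts visible, here is a minimal sketch that records what the strategy feeds into a test. The `seen` list and the test name are made up for illustration, and the built-in `+` stands in for add_numbers so the snippet runs on its own.

```python
from hypothesis import given, settings
from hypothesis import strategies as st

seen = []  # illustration only: collects the inputs Hypothesis generates


@settings(max_examples=20, database=None)  # keep the run small and stateless
@given(st.integers(), st.integers())       # one strategy per test argument
def test_builtin_add_commutes(a, b):
    seen.append((a, b))
    assert a + b == b + a


# A @given-decorated function takes no arguments from the caller:
# calling it runs the body once per generated example.
test_builtin_add_commutes()
print(f"ran {len(seen)} examples, e.g. {seen[:5]}")
```

Under pytest you would not call the function yourself. The runner collects it like any other test.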

Examples

But there's a lot to this, and perhaps you need a couple ideas of what kind of properties you can test.

Sorting

Let's talk about sorting. What are its properties?

sort(l) returns a list
a sorted list has the same elements as the original list
there is an ordering between elements
sorting a sorted list does not change anything

So, with hypothesis you would need to generate, say, lists of integers and test said properties:

from collections import Counter


@given(st.lists(st.integers()))
def test_sorting(l):
    sl = my_sort(l)

    assert isinstance(sl, list)

    assert Counter(sl) == Counter(l)

    assert all(x <= y for x, y in zip(sl, sl[1:]))

    assert my_sort(sl) == sl

Easy, right?

Encode/Decode

Imagine you have a pair of encode/decode functions. Besides testing some particular cases, you might find it useful to check that encode(decode(x)) == x and decode(encode(x)) == x.

A simple example could be a to_binary and from_binary encoding/decoding pair of functions:

import pytest
from hypothesis import given, reject
from hypothesis import strategies as st


def to_binary(i):
    res = []
    while i != 0:
        i, mod = divmod(i, 2)
        res.append(mod)
    return "".join(map(str, res))[::-1]


def from_binary(b):
    return sum((2 ** idx) * int(v) for idx, v in enumerate(b[::-1]))


@given(st.text(alphabet="1", min_size=1))
def test_only_ones(x):
    assert from_binary(x) == 2 ** len(x) - 1


@given(st.integers(min_value=0))
def test_encode_decode(x):
    assert from_binary(to_binary(x)) == x


@given(st.text(alphabet="01", min_size=1))
def test_decode_encode(x):
    x = x.lstrip("0")
    if len(x) == 0:
        reject()
    assert to_binary(from_binary(x)) == x

Optimization

Imagine you have a ground truth function which has been proved and tested (perhaps even using property based testing!) and you want to optimize it for shorter run times. Say we had multiply_numbers(x: BigNumber, y: BigNumber) in foobar.multiply, implemented using long multiplication, and let's assume it is correct. Now you want to re-implement it using Karatsuba's algorithm. You could run both and assert that the results are the same to ensure correctness, while at the same time measuring their wall-clock times and asserting that, for large enough numbers, Karatsuba has shorter run times.

So, you'd consider your previous implementation as an oracle and check that the new implementation agrees with it.
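As a rough sketch of the correctness half of that idea, the snippet below compares a toy Karatsuba on plain Python integers against the built-in `*`, which plays the role of the trusted oracle. Both the `karatsuba` function and the choice of oracle are assumptions made for illustration. They are not foobar's actual code, and the timing assertion is left out because wall-clock checks tend to be flaky.

```python
from hypothesis import given
from hypothesis import strategies as st


def karatsuba(x: int, y: int) -> int:
    """Toy Karatsuba multiplication for non-negative integers."""
    if x < 0 or y < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if x < 16 or y < 16:
        return x * y  # small operands: fall back to the built-in
    # Split both operands at roughly half the bit length of the larger one.
    n = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << n) - 1
    x_hi, x_lo = x >> n, x & mask
    y_hi, y_lo = y >> n, y & mask
    z0 = karatsuba(x_lo, y_lo)
    z2 = karatsuba(x_hi, y_hi)
    # Three recursive products instead of four.
    z1 = karatsuba(x_lo + x_hi, y_lo + y_hi) - z0 - z2
    return (z2 << (2 * n)) + (z1 << n) + z0


big_naturals = st.integers(min_value=0, max_value=2**512)


@given(big_naturals, big_naturals)
def test_karatsuba_agrees_with_oracle(x, y):
    # The built-in product is the oracle the new code must agree with.
    assert karatsuba(x, y) == x * y
```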

Other cases

I had a lot of fun using hypothesis to find a non-canonical coin system for the change making problem, which you can find here. But that's a story for another day!

Tell me more!

Hypothesis has a lot of cool features. It integrates nicely with pytest, it has strategies to generate the most common types and allows you to generate your own!
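To give an idea of what generating your own looks like, here is a small sketch using `st.composite`, which lets one drawn value depend on another. The `list_and_index` strategy and the test are hypothetical names chosen for this example.

```python
from hypothesis import given
from hypothesis import strategies as st


@st.composite
def list_and_index(draw):
    """Generate a non-empty list together with a valid index into it."""
    xs = draw(st.lists(st.integers(), min_size=1))
    # The index range depends on the list that was just drawn.
    i = draw(st.integers(min_value=0, max_value=len(xs) - 1))
    return xs, i


@given(list_and_index())
def test_index_points_inside_list(pair):
    xs, i = pair
    assert xs[i] in xs
```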

But let's imagine it finds a breaking case, and let's imagine it's a really complicated case. That's probably of no use. Luckily, there's integrated shrinking, which means that it will reduce the failing example to one that is as simple as possible!
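One way to watch shrinking at work, sketched here with a made-up condition, is `hypothesis.find`, which searches for a value satisfying a predicate and then shrinks it before returning it.

```python
from hypothesis import find
from hypothesis import strategies as st

# Look for any list of integers whose sum is at least 100, then let
# Hypothesis shrink it to the simplest such list it can reach.
minimal = find(st.lists(st.integers()), lambda xs: sum(xs) >= 100)
print(minimal)
```

The first list that satisfies the condition is usually long and full of large numbers. What gets printed should be far simpler.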

So, now you've got a simple failing case to work on and fix. But next time you run the tests, you'd like to run that very same example to see if it's fixed. But we said examples were random(-ish) 😞 . Well, not quite: hypothesis keeps a small database of failing examples to check on future runs.
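The database works on its own, but it lives on your machine. If you want a known tricky input to be checked on every run, everywhere, a complementary option is the `@example` decorator. The sketch below uses Python's built-in `bin` and `int` as a stand-in for an encode/decode pair, and the pinned values are arbitrary picks for illustration.

```python
from hypothesis import example, given
from hypothesis import strategies as st


@given(st.integers(min_value=0))
@example(0)      # always tried, in addition to the generated inputs
@example(2**64)  # so is this one
def test_binary_roundtrip(x):
    encoded = bin(x)[2:]  # drop the "0b" prefix
    assert int(encoded, 2) == x
```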

One feature I haven't yet got to try out is test ghostwriting, but it sure looks promising and really interesting. The ghostwriter module generates test functions, which allow you to get started with property based testing more quickly and more easily. Several of the examples provided here can be mapped to one of the ghostwriters currently implemented.

Although most of the time seeing the generated inputs is of no use, you can raise a test's verbosity to inspect the generated examples.
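A minimal sketch of how that could look, using the `settings` decorator with `Verbosity.verbose` (the test itself is just a placeholder property):

```python
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5)
@given(st.lists(st.integers(), max_size=3))
def test_reversing_twice_is_identity(xs):
    # With verbose output, Hypothesis reports each example it tries.
    assert list(reversed(list(reversed(xs)))) == xs
```

When running under pytest, remember to pass `-s` so the captured output actually reaches your terminal.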

Hypothesis for data projects

As a data scientist, I didn't come across hypothesis by accident. I was working on a project based on social network analysis. Our networks had well defined properties, and the data related to each node had some vague format rules but well defined properties (think phone numbers, number of edges, and so on). We needed to run some transformations on the data and assert the results made sense. But the input data was relatively complex, so the transforms still held some complexity.

Testing all edge cases would have been difficult, if not impossible, and extremely time consuming. Some properties of the transformations encode business rules, while others are purely logical.

As you might have guessed, most of this was done using pandas.DataFrame. Hypothesis has strategies for pandas, including Series, indexes, DataFrames and they are really easy to use. It also has numpy strategies!
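Here is a rough sketch of what that can look like. The column names and the `drop_isolated` transform are invented for this example and are not taken from the actual project. The strategies come from `hypothesis.extra.pandas`.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes


def drop_isolated(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: keep only nodes with at least one edge."""
    return df[df["n_edges"] > 0]


node_frames = data_frames(
    columns=[
        column("node_id", dtype=int),
        column("n_edges", elements=st.integers(0, 1000), dtype=int),
    ],
    index=range_indexes(max_size=20),
)


@settings(max_examples=25, deadline=None)  # pandas can be slow to warm up
@given(node_frames)
def test_drop_isolated_properties(df):
    out = drop_isolated(df)
    assert len(out) <= len(df)               # never invents rows
    assert (out["n_edges"] > 0).all()        # no isolated node survives
    assert list(out.columns) == list(df.columns)  # schema is untouched
```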

Can I use it for web development?

I won't claim to be an expert on this, because I am not. But the hypothesis[django] extra has strategies to test Django models and forms.

Last words

Of course this is no silver bullet; there are no silver bullets. Property based testing has some drawbacks. Most notably, it's slow, or at least slower than your old unit tests. You should take a moment to think about when and where to use it. But it's always great to have more tools in your toolbelt.

References

This article has been adapted from an internal talk, which was given during pre-pandemic times, when we could safely gather at the office and livestream to remote locations. Unfortunately, it wasn't recorded. You can find the original slides here, which link to the examples code if you want to dive deeper. Now, go test your code's properties!