On your desk, your coffee’s already gone cold. In the middle of the night, the cursor blinks almost defiantly on the screen, mocking you. You are in for a long night. Everything is alright, though. After learning about modern Python dependency handling, you published your brand-new package called foobar (what a stroke of originality). It’s been unexpectedly successful. Stars are skyrocketing on GitHub, but someone across the globe has opened an issue. So here we are now.
They were planning on running some A/B tests to see how their userbase responds to your brilliant and feature-complete package, only to find out that… there are no tests!
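A quick recap on testing, first. It’s code that tests other code. There, done. But what about unit testing? Let’s go with Martin Fowler’s take on the subject; he knows a thing or two about this. Unit tests should be low-level, focusing on a small part of the software system; be written by the programmers themselves, using their regular tools; and run significantly faster than other kinds of tests.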
And at this point you might be wondering “what is a unit?”, and there are lots of resources out there covering this topic. I don’t want to add noise to the signal, so let’s stick to the notion that it is a thingy. More specifically, an atomic thingy, to some degree of atomicity that lets you sleep at night.
I’m a hands-on learner, so let’s start with a simple example of how to write some tests in Python. Say your package contains the following code:
from foobar.optimized_types import BigNumber


def add_numbers(x: BigNumber, y: BigNumber) -> BigNumber:
    """Add two big numbers and return a new big number.

    Parameters
    ----------
    x (BigNumber): The first operand.
    y (BigNumber): The second operand.

    Returns
    -------
    BigNumber: The sum of x and y.
    """
    ...  # brilliant implementation follows
How do you go about testing this? Hello pytest, my old friend, I’ve come to test with you again.
Let’s add some tests!
def test_add_numbers_zero_and_zero():
    assert add_numbers(0, 0) == 0


def test_add_numbers_one_and_zero():
    assert add_numbers(1, 0) == 1


def test_add_numbers_zero_and_one():
    assert add_numbers(0, 1) == 0
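Notice anything wrong? Can you smell any code smells? These tests have (at least) a couple of issues: the last one asserts the wrong result (0 + 1 is not 0), and all three repeat the very same code with different values.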
While for the first item the issue is between the chair and the ergo split keyboard, the second one is where frameworks come to the rescue:
import pytest

# Each tuple is (first operand, second operand, expected sum).
data = [(0, 0, 0), (1, 0, 1), (0, 1, 1)]


@pytest.mark.parametrize("n1,n2,expected", data)
def test_add_numbers(n1, n2, expected):
    assert add_numbers(n1, n2) == expected
That’s more like it. Don’t you love the smell of deleted code in the morning?
But let’s take a moment to think about this. Your function adds. You are testing just a few cases. When I first learned about testing I asked why you don’t test all possible cases. The answer is pretty straightforward: time. So you usually test just a few edge cases and happy paths along with their expected outcomes, based on your domain and software knowledge. But what makes a sum operation correct? At the very least, it should be commutative, associative, and have zero as its identity element:
\(a + b = b + a\)
\(a + (b + c) = (a + b) + c\)
\(a + 0 = a\)
So, how do you go about testing this? Enter hypothesis:
[…] a Python library for creating unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for.
from hypothesis import given
from hypothesis import strategies as st


@given(st.integers(), st.integers())
def test_code_add_commutativity(a, b):
    assert add_numbers(a, b) == add_numbers(b, a)


@given(st.integers())
def test_code_add_identity(a):
    assert add_numbers(a, 0) == a


@given(st.integers(), st.integers(), st.integers())
def test_code_add_associativity(a, b, c):
    assert add_numbers(a, add_numbers(b, c)) == add_numbers(add_numbers(a, b), c)
And that’s it! Instead of testing cases you are now testing properties of your code. How cool is that?
Let’s start with the funny definition of property based testing:
The thing that Quickcheck does
And now, let’s go for something more… comprehensive, what is it that quickcheck does:
So, is this the same as fuzz testing? Well yes, but actually no. From the end user’s point of view, you get a fuzzer and a library of tools that make it easy to construct property based tests on top of said fuzzer.
So, if you take a look at our example above, you’ll see that you decorate tests with @given to indicate an entry point for hypothesis. You are also using a strategy, st.integers(), to generate integers. Then you write your asserts as usual!
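To make the given/strategy relationship a bit more tangible, here is a minimal sketch that records whatever hypothesis feeds into a test. The test name and the `generated` list are made up for illustration; `given`, `settings` and the strategies are regular hypothesis tools.

```python
from hypothesis import given, settings
from hypothesis import strategies as st

generated = []  # collects every input hypothesis passes in, for inspection


@settings(max_examples=20, database=None)
@given(
    st.integers(min_value=-5, max_value=5),
    st.lists(st.booleans(), max_size=3),
)
def test_inputs_respect_their_strategies(n, flags):
    generated.append((n, flags))
    # Each argument comes from the strategy in the same position.
    assert -5 <= n <= 5
    assert len(flags) <= 3
    assert all(isinstance(flag, bool) for flag in flags)


# A @given-decorated test takes no arguments: one call runs many examples.
test_inputs_respect_their_strategies()
print(generated[:5])
```

Under pytest you would not call the test yourself; the explicit call is only there so the sketch also runs as a plain script.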
But there’s a lot to this, and perhaps you need a couple of ideas of what kinds of properties you can test.
Let’s talk about sorting. What are its properties?
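sort(l) returns a list
sort(l) has exactly the same elements as l
sort(l) is ordered: each element is less than or equal to the next one
sort(sort(l)) == sort(l)
With hypothesis you would need to generate, say, lists of integers and test said properties: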
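from collections import Counter

from hypothesis import given
from hypothesis import strategies as st


@given(st.lists(st.integers()))
def test_sorting(l):
    sl = my_sort(l)
    assert isinstance(sl, list)  # returns a list
    assert Counter(sl) == Counter(l)  # same elements, same multiplicities
    assert all(x <= y for x, y in zip(sl, sl[1:]))  # ordered
    assert my_sort(sl) == sl  # sorting twice changes nothing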
Imagine you have an encode/decode pair of functions. Besides testing some particular cases, you might find it useful to check that encode(decode(x)) == x and decode(encode(x)) == x.
A simple example could be a to_binary/from_binary encoding/decoding pair of functions:
from hypothesis import given, reject
from hypothesis import strategies as st


def to_binary(i):
    """Encode a non-negative integer as a binary string (0 becomes "")."""
    res = []
    while i != 0:
        i, mod = divmod(i, 2)
        res.append(mod)
    # Digits were collected least-significant first, so reverse them.
    return "".join(map(str, res))[::-1]


def from_binary(b):
    """Decode a string of 0s and 1s into an integer."""
    return sum((2 ** idx) * int(v) for idx, v in enumerate(b[::-1]))


@given(st.text(alphabet="1", min_size=1))
def test_only_ones(x):
    assert from_binary(x) == 2 ** len(x) - 1


@given(st.integers(min_value=0))
def test_encode_decode(x):
    assert from_binary(to_binary(x)) == x


@given(st.text(alphabet="01", min_size=1))
def test_decode_encode(x):
    # Leading zeros do not survive a round trip, so strip them first
    # and discard inputs that were nothing but zeros.
    x = x.lstrip("0")
    if len(x) == 0:
        reject()
    assert to_binary(from_binary(x)) == x
Imagine you have a ground truth function which has been proved and tested (perhaps even using property based testing!) and you want to optimize it for shorter run times. Say we had multiply_numbers(x: BigNumber, y: BigNumber) in foobar.multiply, implemented using long multiplication, and let’s assume it is correct. Now you want to re-implement it using Karatsuba’s algorithm. You could run both and assert that the results are the same to ensure correctness, while at the same time measuring their run times and asserting that, for large enough numbers, Karatsuba is faster.
So, you’d consider your previous implementation as an oracle and check that the new implementation agrees with it.
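Here is a rough sketch of the oracle idea, under a few loud assumptions: plain Python ints stand in for BigNumber, the built-in `*` plays the role of the trusted long multiplication, and `karatsuba` is a hypothetical candidate written for this example rather than foobar’s actual code. The timing comparison is left out, since wall-clock assertions tend to be flaky.

```python
from hypothesis import given, settings
from hypothesis import strategies as st


def reference_multiply(x: int, y: int) -> int:
    """The oracle: assumed correct. Here it simply defers to Python's `*`."""
    return x * y


def karatsuba(x: int, y: int) -> int:
    """Hypothetical candidate implementation for non-negative integers."""
    if x < 16 or y < 16:
        return x * y  # small operands: fall back to direct multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask  # x == x_hi * 2**half + x_lo
    y_hi, y_lo = y >> half, y & mask  # y == y_hi * 2**half + y_lo
    low = karatsuba(x_lo, y_lo)
    high = karatsuba(x_hi, y_hi)
    # Three recursive products instead of four: the Karatsuba trick.
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi) - low - high
    return (high << (2 * half)) + (cross << half) + low


@settings(max_examples=200, deadline=None)
@given(st.integers(min_value=0), st.integers(min_value=0))
def test_karatsuba_agrees_with_oracle(x, y):
    assert karatsuba(x, y) == reference_multiply(x, y)


test_karatsuba_agrees_with_oracle()
```

The test never states what the right product is; it only demands that the new implementation and the oracle never disagree.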
But let’s imagine hypothesis finds a breaking case, and let’s imagine it’s a really complicated one. That’s probably of little use. Luckily, there’s integrated shrinking, which means that hypothesis will reduce the failing example to one that is as simple as possible!
So, now you’ve got a simple failing case to work on and fix. But next time you run the tests, you’d like to run that very same example to see if it’s fixed. But we said examples were random(-ish) 😞. Well, not quite: hypothesis keeps a small database of failing examples to check on future runs.
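The database does this for you without any code. If you would rather pin a known troublemaker in the source itself, hypothesis also offers the example decorator, which is a complement to the database rather than the mechanism described above. A minimal sketch, where the pinned value [100] and the test name are made up for illustration:

```python
from hypothesis import example, given, settings
from hypothesis import strategies as st

seen = []  # records inputs in the order the test receives them


@settings(max_examples=20, database=None)
@given(st.lists(st.integers()))
@example([100])  # hypothetical failing case from an earlier run, pinned by hand
def test_sum_ignores_order(xs):
    seen.append(xs)
    assert sum(xs) == sum(reversed(xs))


test_sum_ignores_order()
print("first input tried:", seen[0])
```

Explicit examples are tried before any generated ones, so the pinned case is checked on every run.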
One feature I haven’t yet got to try out is test ghostwriting, but it sure looks promising and really interesting. The ghostwriter module generates test functions, which lets you get started with property based testing more quickly and more easily. Several of the examples provided here map to one of the ghostwriters currently implemented.
Although most of the time seeing the generated inputs is of no use, you can raise a test’s verbosity to inspect the generated examples.
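A minimal sketch of what that looks like, using the settings decorator and the Verbosity enum from hypothesis. The property itself is a throwaway, picked only so that there is something to run.

```python
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5)
@given(st.integers())
def test_abs_is_non_negative(n):
    assert abs(n) >= 0


# With verbose output, hypothesis reports each example it tries.
test_abs_is_non_negative()
```

Under pytest, remember to pass -s (or look at the captured output) to actually see the reported examples.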
As a data scientist, I didn’t come across hypothesis by accident. I was working on a project based on social network analysis. Our networks had well defined properties, and the data related to each node had some vague format rules but well defined properties (think phone numbers, for instance, number of edges, and so on). We needed to run some transformations on the data and assert that the results made sense. But the input data was relatively complex, so the transforms still held some complexity.
Testing all edge cases would have been difficult, if not impossible, and extremely time consuming. Some properties of the transformations are encoded business rules, while others are logical ones.
As you might have guessed, most of this was done using
pandas.DataFrame. Hypothesis has strategies for pandas, including
DataFrames and they are really easy to use. It also has
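As a rough idea of how the pandas strategies read in practice, here is a sketch built around a made-up transformation. The column names, the `drop_isolated_nodes` function and its properties are all assumptions for illustration, not the project’s actual code; `data_frames` and `column` come from hypothesis.extra.pandas, which needs pandas installed.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames


def drop_isolated_nodes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: keep only nodes with at least one edge."""
    return df[df["degree"] > 0].reset_index(drop=True)


# One row per node: an id and its number of edges.
nodes = data_frames(
    columns=[
        column("node_id", elements=st.integers(min_value=0, max_value=1000), dtype=int),
        column("degree", elements=st.integers(min_value=0, max_value=50), dtype=int),
    ]
)


@settings(max_examples=50, deadline=None)
@given(nodes)
def test_drop_isolated_nodes(df):
    out = drop_isolated_nodes(df)
    assert (out["degree"] > 0).all()  # no isolated node survives
    assert len(out) <= len(df)  # rows are only ever removed
    assert len(drop_isolated_nodes(out)) == len(out)  # applying it twice changes nothing


test_drop_isolated_nodes()
```

The properties here are of the “business rule” kind: nothing about exact outputs, only statements that must hold for every input frame.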
I won’t claim to be an expert on this, I am not. But the
hypothesis[django] extra has strategies to test Django models and forms.
Of course this is no silver bullet; there are no silver bullets. Property based testing has some drawbacks. Most notably, it’s slow, or at least slower than your old unit tests. You should take a moment to think about when and where to use it. But it’s always great to have more tools in your toolbelt.
This article has been adapted from an internal talk, which was given during pre-pandemic times, when we could safely gather at the office and livestream to remote locations. Unfortunately, it wasn’t recorded. You can find the original slides here, which link to the examples code if you want to dive deeper. Now, go test your code’s properties!