Writing a boolean array for pandas that can deal with missing values

When working with missing data in pandas, one often runs into issues because the main workaround is to convert the data into float columns. pandas provides efficient, native support for boolean columns through numpy.dtype('bool'). Sadly, this dtype only supports True/False as possible values and has no way to store missing values. Additionally, numpy uses a whole byte to store the True/False information although a single bit would be sufficient.

The current way in pandas

Intuitively, one would simply create a boolean array with missings by calling either pandas.Series([True, None, False]) or pandas.Series([True, numpy.nan, False]). Sadly, this creates a Series with an object dtype. Besides the fact that an object dtype means all operations run in native Python and are not accelerated by numpy, the behaviour of reduction functions like any() feels a bit odd at first. The actual algorithm implements Series([x, y, z]).any() as x or y or z, which seems reasonable but doesn’t match what the docstring of pandas.Series.any says about dealing with missing data. According to it, the result of any() should always be a boolean, irrespective of whether the data contains missing values:

skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

In the case of pd.Series([False, np.nan]).any(skipna=False), we get the surprising result of False or np.nan = np.nan. The behaviour of any() with None as the missing value representation is even less intuitive, as the operation is then not commutative. I’ve opened a discussion about this behaviour upstream and it may be changed soon to match the docstring, but this still won’t get us a fast solution.

As previously mentioned, for dealing with missing values in pandas, you normally have to resort to float columns. The same is true for booleans with missing values. Using pd.Series([False, np.nan]).astype(float).any(skipna=False), we get True as the result. This is what the pandas docstring also describes as the correct behaviour of any() on arrays with missing values. While we now have a working solution, we ended up using the wrong data type (float instead of bool) and we waste even more space storing a tri-state (true/false/missing).
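
To make this concrete, the following snippet reproduces the behaviour described above (with the pandas version in use at the time of writing):

import numpy as np
import pandas as pd

s = pd.Series([False, np.nan])       # falls back to object dtype
s.any(skipna=False)                  # nan, not a boolean
s.astype(float).any(skipna=False)    # True, the documented behaviour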

Making it more efficient

Some time ago, I started a prototype(!) implementation of a pandas.ExtensionArray backed by Apache Arrow. This project is called fletcher and is available via PyPI and conda-forge. Apache Arrow uses two memory buffers to represent a boolean column with missing values: one bitmap that stores whether a row is missing or has a value, and a second bitmap that stores whether the value is true or false. This uses 2 bits per value instead of 8 bits (numpy.dtype('bool')) or even 16-64 bits in the case of a float column.
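
You can see these two buffers directly on a pyarrow array; .buffers() returns the validity bitmap first and the value bitmap second:

import pyarrow as pa

arr = pa.array([True, None, False], type=pa.bool_())
validity, values = arr.buffers()  # both buffers are bit-packed
arr.null_count  # 1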

While we won’t present a boolean array that is on par with the functionality you would get from a float-typed column, we want to have a minimally working, efficient boolean column. fletcher in its current form already supports the creation of a column and the proxying of basic pandas functionality that is independent of the data type. It doesn’t support type-specific functionality though. Thus we need to implement the any() and all() reductions on this column. As Apache Arrow does not yet provide these operations as built-ins, we need to implement them on our own.
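
As a quick illustration, creating such a column could look like the following; the FletcherArray wrapper name is an assumption based on the fletcher version at the time of writing:

import fletcher as fr
import pandas as pd
import pyarrow as pa

# Wrap an Arrow boolean array so pandas can store it in a Series
# (FletcherArray is assumed to be the ExtensionArray wrapper here).
ser = pd.Series(fr.FletcherArray(pa.array([True, None, False], type=pa.bool_())))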

Testing the implementation

Following good software engineering practices, we are going to implement this using test-driven development. As we already have a working implementation and would like to implement an alternative one with a more efficient storage, we can use the existing one as a reference for checking the correctness of the new one. This is a perfect case for property-based testing, since for any generated input we can retrieve the correct result from the reference implementation. As a helpful tool for property-based testing in Python, we use hypothesis.

For writing such a unit test, we first need to take care of the creation of random input data. pandas.Series.any takes two arguments relevant to our case: the data itself and a flag skipna that controls whether missing values are skipped or considered part of the algorithm. The generation of the skipna parameter is straightforward through the st.booleans() strategy, which emits either True or False. For our data parameter, there is no predefined strategy that would generate the expected data, so we need to construct a combined strategy out of hypothesis’ existing strategies. Using st.one_of(st.booleans(), st.none()) we can create a strategy that returns either True, False or None as a missing value. By wrapping this in st.lists(…), we generate a list of these values instead of a single scalar.

import hypothesis.strategies as st
from hypothesis import example, given

@given(data=st.lists(st.one_of(st.booleans(), st.none())), skipna=st.booleans())

hypothesis is really good at generating edge cases for the given input types, but we also want to test them explicitly. Thus we use its example decorator to enforce that the empty list is always passed in as test data.

@example([], False)
@example([], True)

Now that the random input generation is solved, we need to run the code under test and check its properties. Testing the properties is simple, as we can verify all properties of the result by comparing it against a reference result. As mentioned in the first section, for dealing with boolean arrays with missing values in pandas, we need a float-typed Series.

import pandas as pd
import pyarrow as pa

arrow = pa.array(data, type=pa.bool_())
pandas = pd.Series(data).astype(float)

assert any_op(arrow, skipna) == pandas.any(skipna=skipna)

One speciality of the fletcher implementation is that it doesn’t only work on contiguous arrays but also on chunked arrays. As such, the later implementation will contain code that merges the results of any(…) over separate memory regions. This is an edge case our data generation above doesn’t cover, so we take the random data, split it in half to get a chunked array, and test that this leads to the very same result.

if len(data) > 2:
    arrow = pa.chunked_array(
        [data[: len(data) // 2], data[len(data) // 2 :]], type=pa.bool_()
    )
    assert any_op(arrow, skipna) == pandas.any(skipna=skipna)

Putting all pieces together, we get the following unit test:

@given(data=st.lists(st.one_of(st.booleans(), st.none())), skipna=st.booleans())
@example([], False)
@example([], True)
def test_any_op(data, skipna):
    arrow = pa.array(data, type=pa.bool_())
    pandas = pd.Series(data).astype(float)

    assert any_op(arrow, skipna) == pandas.any(skipna=skipna)

    # Split in the middle and check whether this still works
    if len(data) > 2:
        arrow = pa.chunked_array(
            [data[: len(data) // 2], data[len(data) // 2 :]], type=pa.bool_()
        )
        assert any_op(arrow, skipna) == pandas.any(skipna=skipna)

Implementing any(…)/all(…) reductions

With the unit test in place, we can now start implementing the functions. The first major difference to NumPy booleans is that we don’t have a bytemask but a bitmask. This means that every byte in memory contains eight values. As we will be working value-by-value, we first need some code that, given a position, selects the correct byte in memory and from that byte the correct bit. For example, the value for row i = 10 lives in byte 1 at bit 2. As Arrow columns store the value and the missing flag in two separate memory buffers, we can use the same index with a different base address to select these values.

byte_offset = i // 8                     # which byte in the buffer
bit_offset = i % 8                       # which bit within that byte
mask = np.uint8(1 << bit_offset)
valid = valid_bits[byte_offset] & mask   # is the row non-null?
value = data[byte_offset] & mask         # is the row's value True?

The all implementation is quite straightforward, as adhering to the pandas definition of skipna means that its value is actually irrelevant: skipping a missing value or treating it as True never turns an all result from True into False. Thus we come to the following solution. The only important aspect here is that we give numba the hint that valid and value are boolean variables.

import numba
import numpy as np

@numba.njit(locals={"valid": numba.bool_, "value": numba.bool_})
def _all_op(length, valid_bits, data):
    # Pandas semantics: return True as long as there is no (valid) False in the data.
    for i in range(length):
        byte_offset = i // 8
        bit_offset = i % 8
        mask = np.uint8(1 << bit_offset)
        valid = valid_bits[byte_offset] & mask
        value = data[byte_offset] & mask
        if valid and not value:
            return False
    return True

For the end-user facing operation, we provide a function that takes a pyarrow.Array instead of the individual memory buffers. Unpacking those can be done by calling arr.buffers(), where in the case of a BooleanArray the first buffer is the validity bitmap (this is true for all Arrow arrays) and the second is the memory buffer holding the actual values. Additionally, we don’t only support passing in pyarrow.Array objects but also pyarrow.ChunkedArray objects that are made up of a list of pyarrow.Array objects. Here we need to call all_op on a per-array basis and merge the results at the end.

def all_op(arr, skipna):
    if isinstance(arr, pa.ChunkedArray):
        return all(all_op(chunk, skipna) for chunk in arr.chunks)

    # skipna is not relevant in the Pandas behaviour
    return _all_op(len(arr), *arr.buffers())

The any operation is slightly more complex, as here we also need to take care of the differing behaviour depending on the setting of skipna. Thus the user-facing function dispatches to different jitted functions depending on whether we want to take missing values into account.

def any_op(arr, skipna):
    if isinstance(arr, pa.ChunkedArray):
        return any(any_op(chunk, skipna) for chunk in arr.chunks)

    if skipna:
        return _any_op_skipna(len(arr), *arr.buffers())
    return _any_op(len(arr), *arr.buffers())

In the case of skipna=True, we return True if one element satisfies the condition valid and value. For skipna=False we interpret missing values as True (as defined in the pandas documentation), and thus any_op is True if there is an entry for which (valid and value) or (not valid) holds.
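
For illustration, the two jitted kernels could look like the following sketch; the actual fletcher code may differ in its details:

@numba.njit(locals={"valid": numba.bool_, "value": numba.bool_})
def _any_op_skipna(length, valid_bits, data):
    # True as soon as a single valid entry is True.
    for i in range(length):
        byte_offset = i // 8
        bit_offset = i % 8
        mask = np.uint8(1 << bit_offset)
        valid = valid_bits[byte_offset] & mask
        value = data[byte_offset] & mask
        if valid and value:
            return True
    return False

@numba.njit(locals={"valid": numba.bool_, "value": numba.bool_})
def _any_op(length, valid_bits, data):
    # Missing entries count as True, so they also short-circuit the loop.
    for i in range(length):
        byte_offset = i // 8
        bit_offset = i % 8
        mask = np.uint8(1 << bit_offset)
        valid = valid_bits[byte_offset] & mask
        value = data[byte_offset] & mask
        if (valid and value) or not valid:
            return True
    return False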

A last, small special case

There is sadly one more thing to handle: Arrow allows omitting the validity bitmap when there are no nulls (i.e. null_count == 0). This saves a bit of memory in the quite common case of no missing data (in our case it amounts to a saving of half of the used memory) but sadly introduces another branch in our code. For the all operation, we can use the same code as in the case where we have a validity bitmap, but instead of accessing it, we simply assume that every entry is valid. Therefore the non-null case of the all operation can be implemented using:

@numba.njit(locals={"value": numba.bool_})
def _all_op_nonnull(length, data):
    for i in range(length):
        byte_offset = i // 8
        bit_offset = i % 8
        mask = np.uint8(1 << bit_offset)
        value = data[byte_offset] & mask
        if not value:
            return False
    return True
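
The user-facing all_op then needs to branch on null_count before unpacking the buffers. A possible sketch of this dispatch, using the helpers defined above:

def all_op(arr, skipna):
    if isinstance(arr, pa.ChunkedArray):
        return all(all_op(chunk, skipna) for chunk in arr.chunks)

    # skipna is not relevant in the Pandas behaviour
    if arr.null_count == 0:
        # The validity bitmap was omitted, only the value buffer is present.
        return _all_op_nonnull(len(arr), arr.buffers()[1])
    return _all_op(len(arr), *arr.buffers())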

Performance

While we can already fall back to float{32,64} columns in pandas to work correctly with boolean columns with missings, one thing we also want to achieve is better performance. As boolean operations are quite simple, the biggest performance gain comes from the reduced memory usage we get by using Arrow.

To compare the performance, we have written some simple benchmarks using airspeed velocity. In the case of all, we have taken the most expensive version where all entries are True and thus the whole array must be scanned to compute a valid result. We do the same for any, but here we set all entries to False. To also test the performance difference when missing data is present, we added another pair of benchmarks where we set the last entry to None/NaN/missing. For NumPy this makes no difference, but in the Arrow case the validity bitmap is actually allocated and must then be checked for every entry.
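
Such a benchmark class could look like the following sketch; the real fletcher benchmarks may differ, and the array size of 2 ** 24 elements is an assumption that matches the tracked sizes below:

import numpy as np
import pandas as pd
import pyarrow as pa

class BooleanAll:
    def setup(self):
        n = 2 ** 24
        # All-True data, so all() has to scan the complete array.
        self.fletcher_data = pa.array([True] * n, type=pa.bool_())
        self.pandas_data = pd.Series(np.ones(n, dtype=np.float32))

    def time_fletcher(self):
        all_op(self.fletcher_data, skipna=True)  # all_op as defined above

    def time_numpy(self):
        self.pandas_data.all(skipna=True)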

Running the benchmarks on my MacBook with Python 3.7 gives the following results:

% asv run --python=same -b 'boolean*'
· Discovering benchmarks
· Running 12 total benchmarks (1 commits * 1 environments * 12 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_uwe_miniconda3_envs_fletcher_bin_python
[  4.17%] ··· Running (boolean.BooleanAll.time_fletcher--)......
[ 29.17%] ··· Running (boolean.BooleanAny.time_numpy--)..
[ 54.17%] ··· boolean.BooleanAll.time_fletcher               16.2±0.6ms
[ 58.33%] ··· boolean.BooleanAll.time_fletcher_withna        16.3±0.4ms
[ 62.50%] ··· boolean.BooleanAll.time_numpy                    35.6±1ms
[ 66.67%] ··· boolean.BooleanAll.time_numpy_withna             37.8±2ms
[ 70.83%] ··· boolean.BooleanAny.time_fletcher               15.9±0.1ms
[ 75.00%] ··· boolean.BooleanAny.time_fletcher_withna        16.5±0.8ms
[ 79.17%] ··· boolean.BooleanAny.time_numpy                  35.9±0.6ms
[ 83.33%] ··· boolean.BooleanAny.time_numpy_withna           35.2±0.2ms
[ 87.50%] ··· boolean.BooleanAny.track_size_fletcher            2097152
[ 91.67%] ··· boolean.BooleanAny.track_size_fletcher_withna     4194304
[ 95.83%] ··· boolean.BooleanAny.track_size_numpy              67108864
[100.00%] ··· boolean.BooleanAny.track_size_numpy_withna       67108864

From the results, we can see that runtime-wise we are at least 2.1x faster than the NumPy implementation in every case and achieve an increase of up to 2.32x in speed in the most extreme case. More remarkable though is the improvement in memory usage. We have been fair to NumPy/pandas and already used float32 instead of the typically allocated float64 type. Still, for the case with no missing data, we save a factor of 32 on RAM, and with missings we still save a factor of 16 while getting correct results. The np.bool dtype would also have saved us a factor of 4 but, as mentioned in the beginning, its results with missing data were inconsistent and thus sometimes incorrect.

Conclusion

In this blog post, we have shown how a simple operation on a pyarrow.Array or pyarrow.ChunkedArray can be implemented using numba, with hypothesis for property-based testing. The latter was especially helpful as there is already an existing but memory-wasting implementation available. We could use that implementation to verify that our new implementation shows the same behaviour on a different underlying storage.

With the Arrow-based implementation we now have a custom boolean type that uses only 2 bits of storage per row instead of the 8 to 64 bits the existing implementations need, and at the same time we achieve a speedup of about 2x in runtime.

The work described here has already been merged into fletcher 1, 2, 3 and released as 0.2.0 on PyPI and conda-forge.