Fletcher 0.3: A status report on the mission to get pandas hooked on Apache Arrow

25 Feb 2020

It has now been nearly two years since the idea came up to use pandas’ new ExtensionArray interface to provide columns in pandas that are backed by Apache Arrow. fletcher was started as a prototype project to show how this idea can be brought together. Since then there has been quite a lot of development in both pandas and Apache Arrow. Still, fletcher remains a prototype that shows what this could look like, as essential functionality is missing to use it productively. With the two-year mark now approaching, I thought it was a good time to give a progress report and tag a new intermediate release.

Although I strongly advise against any production use of fletcher, I would be curious to find out where the first entry point is that fails for users. Thus, please try to use it in your project and report the first exception you encounter in our issue tracker. This will be the focus of fletcher development, as it will give us insight into what functionality is required to provide a minimally useful library.

Choosing between chunked & continuous arrays as storage backend

Initially ExtensionArray instances in fletcher were solely backed by pyarrow.ChunkedArray instances. This was chosen as chunked arrays allow for the most flexibility, e.g. concatenating them can be done in constant time. But with the flexibility on the user side also comes a lot of complexity in implementing algorithms on top of them. Due to the nature of the chunking, you don’t deal with a simple linear index to access any element of an array but you always need to translate between the scalar index that indicates the position in the whole array and the tuple (chunked_index, index_in_chunk) that gives you the relative position of an element to its containing chunk.
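To make the index translation concrete, here is a minimal sketch in plain Python. The helper name `locate` and the modelling of a chunked array as a list of chunk lengths are assumptions for illustration; this is not fletcher's or pyarrow's code.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(chunk_lengths, index):
    """Translate a position in the whole array into (chunk_index, index_in_chunk).

    Hypothetical helper for illustration only.
    """
    total = sum(chunk_lengths)
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for length {total}")
    # ends[i] is the exclusive end position of chunk i within the whole array.
    ends = list(accumulate(chunk_lengths))
    # Pick the first chunk whose end lies strictly beyond the index.
    # Empty chunks share their end with the previous chunk and are skipped.
    chunk_index = bisect_right(ends, index)
    start = ends[chunk_index] - chunk_lengths[chunk_index]
    return chunk_index, index - start
```

Every element access on a chunked array needs a lookup of this kind (or a cached variant of it), whereas a continuous array can use the scalar index directly.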

Thus, we now provide two different extension array implementations. There is now the simpler FletcherContinuousArray, which is backed by a pyarrow.Array instance and thus is always a single continuous memory segment. The initial FletcherArray, which is backed by a pyarrow.ChunkedArray, is now renamed to FletcherChunkedArray. While pyarrow.ChunkedArray allows for more flexibility in how the data is stored, the implementation of algorithms is more complex for it. As this hinders contributions and also the adoption in downstream libraries, we now provide both implementations with an equal level of support. We don’t provide the more generally named class FletcherArray anymore, as there is no clear opinion on whether it should point to FletcherContinuousArray or FletcherChunkedArray. As usage increases, we might provide such an alias class again in the future.

Arithmetic, comparison and reduce operations

For numeric data, pandas has added in the last year a test suite that provides a vast number of tests to check all kinds of numeric operations on ExtensionArray. With the help of this suite, we were able to implement these operations on top of pyarrow.float* and pyarrow.int* types. The current implementation applies the mask on the input arrays and then delegates to numpy for the computations. With this, we are on the same performance level as pandas.IntegerArray. In the future, we want to use numeric operations that are directly implemented in Apache Arrow C++ and make direct use of the validity bitmap instead. This will save on memory bandwidth and storage, and it will also be faster on the actual numeric operations, as bitmap checking and operation calculation won’t be separate steps anymore.
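The role of the validity bitmap can be sketched in plain Python. Arrow stores validity as one bit per element (least-significant bit first, a set bit meaning "valid"). The function names below are hypothetical, and the code is neither fletcher's numpy-based implementation nor an Arrow C++ kernel; it only shows how missing values propagate through an addition.

```python
def is_valid(bitmap, i):
    """Return True if element i is marked valid (bit set, least-significant bit first)."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def add_with_validity(left, left_valid, right, right_valid):
    """Add two equal-length integer sequences, propagating missing values.

    Hypothetical helper: a result element is valid only where both inputs
    are valid, so the result bitmap is the bytewise AND of the input bitmaps.
    """
    if len(left) != len(right):
        raise ValueError("inputs must have the same length")
    n_bytes = (len(left) + 7) // 8
    out_valid = bytes(a & b for a, b in zip(left_valid[:n_bytes], right_valid[:n_bytes]))
    # Slots for missing results hold an arbitrary placeholder (0 here).
    out = [
        l + r if is_valid(out_valid, i) else 0
        for i, (l, r) in enumerate(zip(left, right))
    ]
    return out, out_valid


def to_optional(values, bitmap):
    """Render values with None in place of missing entries, for display."""
    return [v if is_valid(bitmap, i) else None for i, v in enumerate(values)]
```

A bitmap needs one bit per element where a boolean mask needs a full byte, which is part of where the savings in memory bandwidth and storage would come from.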

BooleanArray & StringArray in pandas and their fletcher counterparts

In the newest release, we have an implementation of a boolean array that supports missing values and behaves like a pandas.Series of float type for any and all. There was a blog post outlining its implementation. In pandas 1.0, a new BooleanArray was released with slightly different behaviour; we will adapt fletcher to this in the next release. Currently our tests are failing and it looks like an inconsistency in pandas’ implementation, which we are currently investigating.

Besides BooleanArray, pandas 1.0 also added StringArray, which brings in a check that all objects in that column are strings but doesn’t improve performance. Thus there is still a need for a fast string type like the one we are implementing in fletcher. As a first step in this direction, we now support .str.cat as an algorithm on fletcher string columns via .fr_text.cat.

Missing kernels / operations / .. in Arrow or fletcher

One of the main things making fletcher not practically usable at the moment is the lack of algorithm implementations on top of it. You can select / slice / store fletcher columns, but executing operations like zfill for strings or dt.year on top of its columns is not possible yet. These operations currently need a cast to an object-typed series, making them even slower than their current pandas counterparts.

Such operations are named kernels in the Apache Arrow C++ source code where a rudimentary set of them exists already. Sadly a lot of common functionality is missing for the basic data types and some of the existing kernels are only implemented for pyarrow.Array and not for ChunkedArray. Having good kernels available for ChunkedArray in Arrow C++ itself is crucial as applying the kernel to individual chunks often includes non-trivial transformations of intermediate results or indices that were given as an input.
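A small example of why kernels on chunked data need more than a per-chunk loop: a mean cannot be computed by averaging the per-chunk means, since chunks may differ in length and in the number of missing values. The sketch below is plain Python with chunks modelled as lists and None as the missing value (both assumptions for illustration, not Arrow's API). It carries a (sum, count) pair per chunk and combines the pairs at the end.

```python
def chunk_partial(chunk):
    """Return the (sum, count) of the non-missing values in one chunk."""
    valid = [v for v in chunk if v is not None]
    return sum(valid), len(valid)


def chunked_mean(chunks):
    """Mean over all chunks, combining per-chunk partial results.

    Hypothetical helper: returns None if there is no valid value at all.
    """
    total, count = 0, 0
    for chunk in chunks:
        partial_sum, partial_count = chunk_partial(chunk)
        total += partial_sum
        count += partial_count
    return total / count if count else None
```

For the chunks [1, 2, 3] and [10], averaging the per-chunk means (2 and 10) would give 6, while combining the partial results gives the correct mean of 4.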

With the pandas integration basics now in place in fletcher, we will be able to concentrate on exactly these kernels. As one of the points of fletcher is to explore how to implement kernels on top of Arrow in the most efficient way with numba, we first try to implement a kernel in fletcher and will only resort to Arrow if the implementation turns out to be too complex or too slow with numba. One of the drawbacks of putting a kernel implementation into Arrow C++ is that we need to wait for an Arrow release to make it available to end users. By implementing kernels first in fletcher, we can make releases on our own and thus get them to users faster. Afterwards, once Arrow is released, we can remove our implementation and point to the most likely (a bit) faster implementation in C++.

spatialpandas as an impact example

The main goal of fletcher is to make an impact on pandas and Apache Arrow, but we are also very pleased that we have an impact on the wider ecosystem. The influence of fletcher can be seen a bit in spatialpandas, an ExtensionArray implementation for spatial/geometric operations. spatialpandas also builds on top of pyarrow and numba to implement certain spatial-specific data types and reuses basic code from fletcher for accessing Arrow data.

Next steps

As the next steps in fletcher, we will focus on implementing more of the .str and .dt methods. These are the simple datatypes where fletcher can achieve massive improvements over the status quo in pandas. This is because strings are currently implemented as object dtype even when using the new StringDtype and thus are not comparable in performance to the dtypes that are implemented with numpy-native types. For date(time) columns, we can also improve a bit by allowing more-than-nanosecond precision and also the use of a 32-bit datatype for dates where 64 bits aren’t needed to represent the most commonly used timespans.
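To illustrate the 32-bit point: Arrow's date32 type stores a date as a signed 32-bit count of days since the UNIX epoch, whereas pandas' datetime64[ns] stores a 64-bit count of nanoseconds. The sketch below uses only the Python standard library to show that ordinary calendar dates fit comfortably into four bytes; the helper names are hypothetical and this is not fletcher or pyarrow code.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_date32(d):
    """Encode a date as 4 little-endian bytes holding days since the UNIX epoch."""
    days = (d - EPOCH).days
    # struct.pack raises struct.error if the value does not fit into an int32.
    return struct.pack("<i", days)


def from_date32(raw):
    """Decode 4 bytes produced by to_date32 back into a date."""
    (days,) = struct.unpack("<i", raw)
    return EPOCH + timedelta(days=days)
```

Even date.min and date.max from the standard library round-trip through this encoding, so a date column stored this way needs half the memory of a 64-bit representation.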

It would also be nice to have more operations on nested types as they are currently unavailable in pandas and fletcher supports them through the use of Arrow. But as the kernel implementations for them are much more complex, we are going to focus first on strings and dates.