Uwe’s Blog

My writing about data engineering, opensource development, general programming and thoughts about engineering culture.

  • Trimming down pyarrow’s conda footprint (Part 2 of X)

    We have again reduced the footprint of creating a conda environment with pyarrow. This time we have done some detective work on the package contents and removed contents from thrift-cpp and pyarrow that are definitely not needed at runtime.

  • Removing Python as a dependency of R

    Surprisingly Python was a runtime dependency of R on conda-forge. As R doesn’t need Python to run, this was a bit weird. We got rid of this by splitting up the GLib package.

  • Trimming down pyarrow’s conda footprint (Part 1 of X)

    We have substantially reduced the footprint of creating a conda environment with pyarrow. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for reading...

  • Building R Arrow on Windows: A tale of two compilers

    Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. One thing that has been missing though for a long time has been a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by Isuru Fernando (Thanks!), we...

  • The one pandas internal I teach all my new colleagues: the BlockManager

    When new members join our team, they usually are already fluent in data analysis with pandas and know their way around the typical quirks. They know that they should use vectorised functions where possible and avoid using apply with a slow Python callable. There are two main reasons, I teach them the BlockManager quite at the beginning....