Uwe’s Blog

My writing about data engineering, open-source development and general programming, plus thoughts on engineering culture.

  • Calculating Levenshtein distances with fletcher

    Levenshtein distance is a common measure for comparing two strings. It gives you the minimal number of insert, delete and replace operations needed to transform one string into the other.

  • Trimming down pyarrow’s conda footprint (Part 2 of X)

    We have again reduced the footprint of creating a conda environment with pyarrow. This time we did some detective work on the package contents and removed files from thrift-cpp and pyarrow that are not needed at runtime.

  • Removing Python as a dependency of R

    Surprisingly, Python was a runtime dependency of R on conda-forge. As R doesn’t need Python to run, this was a bit weird. We got rid of it by splitting up the GLib package.

  • Trimming down pyarrow’s conda footprint (Part 1 of X)

    We have substantially reduced the footprint of creating a conda environment with pyarrow. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for reading...

  • Building R Arrow on Windows: A tale of two compilers

    Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. One thing that has long been missing, though, is a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by Isuru Fernando (Thanks!), we...
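
To make the Levenshtein distance from the first post above concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach. It is only an illustration and not the fletcher-based implementation the post describes; the function name `levenshtein` is an assumption chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character insert, delete and replace
    operations needed to turn `a` into `b` (illustrative sketch)."""
    if len(a) < len(b):
        # The distance is symmetric, so swap to keep the rows short.
        a, b = b, a

    # previous[j] holds the distance between the empty prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + (char_a != char_b),  # replace or match
                )
            )
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # → 3
```

Only two rows of the distance table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.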