-
Trimming down pyarrow’s conda footprint (Part 1 of X)
·We have substantially reduced the footprint of creating a conda environment with pyarrow. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for...
-
Building R Arrow on Windows: A tale of two compilers
·Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. One thing that has long been missing, though, is a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by Isuru Fernando (Thanks!), we...
-
The one pandas internal I teach all my new colleagues: the BlockManager
·When new members join our team, they are usually already fluent in data analysis with pandas and know their way around the typical quirks. They know that they should use vectorised functions where possible and avoid using apply with a slow Python callable. There are two main reasons I teach them the BlockManager quite...
-
Fletcher 0.3: A status report on the mission to get pandas hooked on Apache Arrow
·It has now been nearly two years since the idea came up to use pandas’ new ExtensionArray interface to provide columns in pandas that are backed by Apache Arrow. fletcher was started as a prototype project to show how this idea can be brought together. Since then there has been quite...
-
Fast JDBC access in Python using pyarrow.jvm
·While most databases are accessible via ODBC, where turbodbc gives us an efficient way to turn results into a pandas.DataFrame, there are nowadays a lot of databases that either come only with a JDBC driver or whose non-JDBC drivers are not part of a free or open-source offering. To access these databases, you can use