05 October 2018
Identifying performance bottlenecks in long-running processes often involves either careful instrumentation up front or guessing where the root of the problem may be. Tools that help you diagnose problems in live systems without modifying them are therefore very welcome. One important tool I recently came across is the pyflame profiler.
03 August 2018
Apache Arrow is an in-memory format for columnar data. In more “plain” English, it is a standard for how to store DataFrames/tables in memory, independent of the programming language. One of its most prominent uses is for the @pandas_udf decorator in Apache Spark to move data quickly between Scala and Python/pandas.
19 May 2018
Three weeks ago, MAN AHL organised an open-source hackathon at their London office. As part of the hackathon, participants were asked to contribute to one of the PyData artifacts they regularly use. To support them in making their first contribution, AHL also arranged for several core committers of open-source projects to be present at the event. I joined as the representative of the Apache Arrow project.
17 December 2017
We often use pyarrow in a Jupyter Notebook at work. With the xeus-cling kernel, we can also use the C++ APIs directly and interactively in Jupyter.
24 February 2016
Use Akka Streams as a new technique to extract specific articles from the Wikipedia XML dump into individual files without needing to fit all the data into RAM.