Apache Arrow is provided for Python users through two package managers, pip and conda. The first mechanism, providing binary, pip-installable Python wheels, is currently unmaintained, as highlighted on the mailing list. There have been shout-outs for help, e.g. on Twitter, that we need new contributors who look after the builds. We sadly cannot point to all...
When working with missing data in pandas, one often runs into issues, as the main way is to convert data into float columns. pandas provides efficient/native support for boolean columns through the numpy.dtype('bool'). Sadly, this dtype only supports True/False as possible values, with no possibility of storing missing values. Additionally, ...
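The limitation described in this excerpt can be seen in a minimal sketch. It assumes only that pandas is installed; the variable names are illustrative and do not come from the original post.

```python
import pandas as pd

# A column containing only True/False uses NumPy's native boolean dtype.
complete = pd.Series([True, False, True])
print(complete.dtype)  # bool

# With a missing value the bool dtype can no longer be used;
# pandas falls back to storing generic Python objects.
with_missing = pd.Series([True, False, None])
print(with_missing.dtype)  # object

# Integer data with a missing value is converted to floating point,
# where NaN serves as the missing-value marker.
ints_with_missing = pd.Series([1, 2, None])
print(ints_with_missing.dtype)  # float64
```

The fallback to object columns gives up the compact, vectorised storage of the native boolean dtype, which is the kind of issue the excerpt refers to.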
The New York City Taxi & Limousine Commission Trip Record Data is a really nice dataset for getting started with Data Engineering or for teaching it. It has several properties that make it quite useful, which we will show in this article. We will look at this data using only pandas, not introducing any other tooling. Many properties...
At the moment in Computer Science, there are two hot topics: AI and Blockchain. Behind these two buzzwords, there are industries striving to build successful products. Currently, I work in the sector often labelled as AI. Usually, it is also described with other terms like Machine Learning or Big Data. In this sector the currently most sought-after job is the...
Data Science and Machine Learning are tasks that have their own requirements on I/O. Like many other tasks, they start out with tabular data in most cases. In contrast to a typical reporting task, they don’t work on aggregates but require the data at the most granular level. Some machine learning algorithms are able to work directly on aggregates but...