While most databases are accessible via ODBC, where turbodbc gives us an efficient way to turn results into a pandas.DataFrame, there are nowadays a lot of databases that either come only with a JDBC driver or whose non-JDBC drivers are not part of a free or open-source offering. To access these databases, you can use JayDeBeApi...
TL;DR: Recently DuckDB, a database that promises to become the SQLite-of-analytics, was released and I took it for an initial test drive. Install it via conda install python-duckdb or pip install duckdb.
Apache Arrow is provided for Python users through two package managers, pip and conda. The first mechanism, providing binary, pip-installable Python wheels, is currently unmaintained, as highlighted on the mailing list. There have been shout-outs for help, e.g. on Twitter, that we need new contributors to look after the builds. We sadly cannot point to all...
When working with missing data in pandas, one often runs into issues as the main way is to convert data into [...]. pandas provides efficient/native support for boolean columns through the numpy.dtype('bool'). Sadly, this allows only True/False as possible values, with no possibility for storing missing values. Additionally, ...
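The limitation can be seen in a few lines. This is a minimal sketch assuming pandas and NumPy are installed; the variable names are illustrative only. It shows that a boolean column keeps the NumPy bool dtype only as long as no value is missing.

```python
import numpy as np
import pandas as pd

# A complete boolean column is stored with NumPy's bool dtype.
complete = pd.Series([True, False, True])

# As soon as a value is missing, pandas falls back to the generic object dtype.
with_missing = pd.Series([True, False, None])

# Introducing a missing value later (here via reindex) has the same effect.
reindexed = complete.reindex([0, 1, 2, 3])

print(complete.dtype)      # bool
print(with_missing.dtype)  # object
print(reindexed.dtype)     # object
```

The object fallback stores Python objects instead of packed booleans, which is why columns with missing values lose the efficient native representation.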
The New York City Taxi & Limousine Commission Trip Record Data is a really nice dataset for getting started with data engineering or for teaching it. It has several properties that make it quite useful, which we will show in this article. We will look at this data using only pandas, not introducing any other tooling. Many properties...