In a previous issue of Let’s Talk Text, I professed my love for the Parquet file format. That’s why, when I heard about a database tool called DuckDB that can efficiently query Parquet files, I immediately decided to cover it in this issue.
DuckDB is an immensely powerful open-source tool for running queries on large datasets. I love it for its ability to query Parquet files efficiently, but it can also query CSV files, JSON files, and even Pandas dataframes. In fact, over the past few weeks, I’ve been using DuckDB as a replacement for Pandas because it’s so much more powerful.
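To give a sense of what that looks like, here’s a minimal sketch (the file names and data are placeholders, not real datasets). DuckDB can pick up a Pandas dataframe just by referencing its variable name, and the same SQL interface works against files on disk:

```python
import duckdb
import pandas as pd

# A small throwaway dataframe just to demonstrate the API
df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [10, 20, 30]})

# DuckDB can query the dataframe directly by referencing the variable name
print(duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").df())

# The same SQL works for files on disk (these paths are placeholders)
duckdb.sql("SELECT COUNT(*) FROM read_csv_auto('events.csv')").show()
duckdb.sql("SELECT COUNT(*) FROM read_json_auto('events.json')").show()
```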
DuckDB is natively parallelized and vectorized, whereas Pandas is not. Pandas loads all of the data into memory at once and runs on a single thread, which means you can’t use it natively with a dataset larger than what your system memory can hold. With DuckDB, you can simply point to a folder of partitioned Parquet files (even if the data is far larger than memory) and query it with SQL while the load is automatically distributed across all of your CPU cores.
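As a concrete example, a query over a folder of partitioned Parquet files might look like the following sketch; the path and column names here are placeholders:

```python
import duckdb

# The glob pattern lets DuckDB scan every Parquet file in the folder
# without loading the full dataset into memory first
top_users = duckdb.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM read_parquet('data/events/*.parquet')
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").df()
print(top_users)
```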
Despite years of experience with Pandas, I still find myself looking up the syntax for common operations over and over. DuckDB uses SQL, so there’s almost no learning curve. It’s the perfect solution for use cases where the dataset isn’t terabytes in size.
As part of this issue, I created a Colab notebook in which I compare the speeds of DuckDB and Pandas on a 6.7 GB dataset with ~259 million rows. The DuckDB query ran in ~21 seconds on average, while the Pandas query crashed the notebook with a memory error. Feel free to copy the notebook and play around with both tools.
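For reference, the comparison boils down to something like the sketch below; the file path and column names are placeholders rather than the ones used in the notebook:

```python
import glob
import time

import duckdb
import pandas as pd

# DuckDB aggregates the Parquet files directly, streaming them in parallel
start = time.time()
duckdb_result = duckdb.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM read_parquet('data/*.parquet')
    GROUP BY category
""").df()
print(f"DuckDB: {time.time() - start:.1f}s")

# The Pandas equivalent has to load every file into memory before it can
# aggregate, which is what caused the crash on the 6.7 GB dataset
start = time.time()
full = pd.concat(pd.read_parquet(f) for f in glob.glob("data/*.parquet"))
pandas_result = full.groupby("category")["value"].mean()
print(f"Pandas: {time.time() - start:.1f}s")
```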
Very cool, thanks for sharing - I will try it out!
Does the DuckDB package easily convert the final data to in-memory NumPy arrays (e.g. to run an sklearn or keras model afterwards)?