
How to query Parquet files

A lot of the world's data lives in Amazon S3 buckets. In this guide, we'll learn how to query that data using chDB.

Setup

Let's first create a virtual environment:

python -m venv .venv
source .venv/bin/activate

And now we'll install chDB. Make sure you have version 2.0.2 or higher:

pip install "chdb>=2.0.2"

Next, we'll install IPython:

pip install ipython

We'll use IPython to run the commands in the rest of this guide. You can launch it by running:

ipython

You can also use the code in a Python script or in your favorite notebook.
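Before moving on, it's worth a quick sanity check that chDB installed correctly. This snippet isn't part of the walkthrough proper, but running a trivial query should print the version of the ClickHouse engine that chDB embeds:

import chdb

# If the install worked, this prints the embedded ClickHouse version
print(chdb.query("SELECT version()", "TabSeparated"))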

Exploring Parquet metadata

We're going to explore a Parquet file from the Amazon reviews dataset. But first, let's import chDB:

import chdb

When querying Parquet files, we can use the ParquetMetadata input format to return the file's metadata rather than its contents. Let's use the DESCRIBE statement to see the fields returned when we use this format:

query = """
DESCRIBE s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet',
ParquetMetadata
)
SETTINGS describe_compact_output=1
"""

chdb.query(query, 'TabSeparated')
num_columns UInt64
num_rows UInt64
num_row_groups UInt64
format_version String
metadata_size UInt64
total_uncompressed_size UInt64
total_compressed_size UInt64
columns Array(Tuple(name String, path String, max_definition_level UInt64, max_repetition_level UInt64, physical_type String, logical_type String, compression String, total_uncompressed_size UInt64, total_compressed_size UInt64, space_saved String, encodings Array(String)))
row_groups Array(Tuple(num_columns UInt64, num_rows UInt64, total_uncompressed_size UInt64, total_compressed_size UInt64, columns Array(Tuple(name String, path String, total_compressed_size UInt64, total_uncompressed_size UInt64, have_statistics Bool, statistics Tuple(num_values Nullable(UInt64), null_count Nullable(UInt64), distinct_count Nullable(UInt64), min Nullable(String), max Nullable(String))))))

Now let's have a look at the metadata for this file. The columns and row_groups fields are both arrays of tuples containing many properties, so we'll exclude them for now.

query = """
SELECT * EXCEPT(columns, row_groups)
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet',
ParquetMetadata
)
"""

chdb.query(query, 'Vertical')
Row 1:
──────
num_columns: 15
num_rows: 41905631
num_row_groups: 42
format_version: 2.6
metadata_size: 79730
total_uncompressed_size: 14615827169
total_compressed_size: 9272262304

From this output, we learn that this Parquet file has over 40 million rows, split across 42 row groups, with 15 columns per row. A row group is a logical horizontal partitioning of the data into rows; each row group has associated metadata, and query tools can use that metadata to query the file efficiently.
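To make row groups concrete, here's a small sketch (not part of the original walkthrough) that lists the number of rows in each of the 42 row groups, using the same ParquetMetadata format and an ARRAY JOIN over the row_groups field we saw in the DESCRIBE output:

query = """
SELECT tupleElement(rg, 'num_rows') AS rows_in_group
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet',
ParquetMetadata
)
ARRAY JOIN row_groups AS rg
"""

chdb.query(query, 'DataFrame')

With roughly 42 million rows spread across 42 row groups, we'd expect each group to hold on the order of a million rows, though the exact sizes depend on how the file was written.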

Let's take a look at one of the row groups:

query = """
WITH rowGroups AS (
SELECT rg
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet',
ParquetMetadata
)
ARRAY JOIN row_groups AS rg
LIMIT 1
)
SELECT tupleElement(c, 'name') AS name, tupleElement(c, 'total_compressed_size') AS total_compressed_size,
tupleElement(c, 'total_uncompressed_size') AS total_uncompressed_size,
tupleElement(tupleElement(c, 'statistics'), 'min') AS min,
tupleElement(tupleElement(c, 'statistics'), 'max') AS max
FROM rowGroups
ARRAY JOIN tupleElement(rg, 'columns') AS c
"""

chdb.query(query, 'DataFrame')
                 name  total_compressed_size  total_uncompressed_size                                                min                                                max
0         review_date                    493                      646                                              16455                                              16472
1         marketplace                     66                       64                                                 US                                                 US
2         customer_id                5207967                  7997207                                              10049                                           53096413
3           review_id               14748425                 17991290                                     R10004U8OQDOGE                                      RZZZUTBAV1RYI
4          product_id                8003456                 13969668                                         0000032050                                         BT00DDVMVQ
5      product_parent                5758251                  7974737                                                645                                          999999730
6       product_title               41068525                 63355320  ! Small S 1pc Black 1pc Navy (Blue) Replacemen...                           🌴 Vacation On The Beach
7    product_category                   1726                     1815                                            Apparel                                       Pet Products
8         star_rating                 369036                   374046                                                  1                                                  5
9       helpful_votes                 538940                  1022990                                                  0                                               3440
10        total_votes                 610902                  1080520                                                  0                                               3619
11               vine                  11426                   125999                                                  0                                                  1
12  verified_purchase                 102634                   125999                                                  0                                                  1
13    review_headline               16538189                 27634740                  🤹🏽‍♂️🎤Great product. Practice makes perfect. D...
14        review_body              145886383                232457911                                        🚅 +🐧=💥 😀
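The same ARRAY JOIN technique works on the file-level columns field too. As a follow-up sketch (not part of the original guide), we can rank the file's columns by compressed size and surface the space_saved string that ParquetMetadata computes for each one:

query = """
SELECT tupleElement(c, 'name') AS name,
formatReadableSize(tupleElement(c, 'total_compressed_size')) AS compressed,
tupleElement(c, 'space_saved') AS space_saved
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet',
ParquetMetadata
)
ARRAY JOIN columns AS c
ORDER BY tupleElement(c, 'total_compressed_size') DESC
"""

chdb.query(query, 'DataFrame')

Judging by the row group above, review_body and product_title should dominate the file's total size.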

Querying Parquet files

Next, let's query the contents of the file. We can do this by adjusting the above query to remove ParquetMetadata and then, say, compute the most popular star_rating across all reviews:

query = """
SELECT star_rating, count() AS count, formatReadableQuantity(count)
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet'
)
GROUP BY ALL
ORDER BY star_rating
"""

chdb.query(query, 'DataFrame')
   star_rating     count formatReadableQuantity(count())
0            1   3253070                    3.25 million
1            2   1865322                    1.87 million
2            3   3130345                    3.13 million
3            4   6578230                    6.58 million
4            5  27078664                   27.08 million

Interestingly, there are more 5-star reviews than all the other ratings combined! It seems people either like the products on Amazon or, if they don't, they just don't submit a rating.
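To double-check that claim without summing the table by hand, a quick sketch (again, not from the original guide) computes the share of 5-star reviews with countIf:

query = """
SELECT round(100 * countIf(star_rating = 5) / count(), 1) AS pct_five_star
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet'
)
"""

chdb.query(query, 'DataFrame')

Based on the counts above, this should come out to roughly 65%.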