PyArrow – Robin on Linux

Get the number of rows for a parquet file

We were using Pandas to get the number of rows for a parquet file:

import pandas as pd
df = pd.read_parquet("my.parquet")
print(df.shape[0])

This is easy, but it costs a lot of time and memory when the parquet file is very large. For example, it may take more than 100GB of memory just to read a 10GB parquet file.

If we only need the number of rows, not the data itself, PyArrow is a better solution:

import pyarrow.parquet as pq
table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)

This method takes only a couple of seconds and about 2GB of memory for the same parquet file.

An old bug in PyArrow

To save memory in my program using Pandas, I changed the types of some columns from string to category, following the reference.

cols = ["os_type", "cpu_type", "chip_brand"]
df[cols] = df[cols].astype("category")
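To check how much such a conversion saves, the memory footprint can be compared before and after. This is only a sketch on made-up sample data (the values and the helper memory_bytes are hypothetical), so the ratio on real data will differ.

```python
import pandas as pd


def memory_bytes(df):
    """Total memory used by the dataframe, counting the string contents too."""
    return int(df.memory_usage(deep=True).sum())


# Hypothetical sample data: many rows, few distinct values per column
df = pd.DataFrame({
    "os_type": ["linux", "windows", "macos"] * 10_000,
    "cpu_type": ["x86_64", "arm64", "x86_64"] * 10_000,
})

before = memory_bytes(df)
cols = ["os_type", "cpu_type"]
df[cols] = df[cols].astype("category")
after = memory_bytes(df)

print(before, after)
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it works best on columns with few distinct values.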

The category conversion saved at least half of the memory in my case. But when I used pyarrow to store the dataframe as parquet

df.to_parquet("my.parquet")

it reported an error:

Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647

It’s a bug in old versions of pyarrow that was fixed in Sep 2019. Upgrading from pyarrow-0.12.1 to pyarrow-0.17.1 fixed the error.

But the story doesn’t end here.

With pyarrow-0.12.1, the snippet below returns an object of type <pyarrow.lib.Column>

import pyarrow.parquet as pq
table = pq.read_table(path)
table.column(0)

and this object also has an attribute holding the column name.

But with pyarrow-0.17.1, the same code returns an object of type <pyarrow.lib.ChunkedArray>, which doesn’t carry the column name.

This difference can make some code fail (it actually broke our program). Beware: after you upgrade pyarrow (or any other Python library), run the tests to make sure all the legacy code still works properly.