Apache Parquet is a popular columnar storage format used by data scientists and anyone working in the Hadoop ecosystem. It was designed to be very efficient in terms of compression and encoding. Check out the Parquet documentation if you want to know all the details about how Parquet files work.
You can read and write Parquet files with Python using the pyarrow package.
Let’s learn how that works now!
Installing pyarrow
The first step is to make sure you have everything you need. In addition to the Python programming language, you will also need pyarrow and the pandas package. You will use pandas because it is another Python package that uses columns as a data format and works well with Parquet files.
You can use pip to install both of these packages. Open up your terminal and run the following command:
python -m pip install pyarrow pandas
If you use Anaconda, you’ll want to install pyarrow using this command instead:
conda install -c conda-forge pyarrow
Anaconda should already include pandas, but if not, you can use the same command above by replacing pyarrow with pandas.
Now that you have pyarrow and pandas installed, you can use them to read and write Parquet files!
Writing Parquet Files with Python
Writing Parquet files with Python is pretty straightforward. The code to turn a pandas DataFrame into a Parquet file is about ten lines.
Open up your favorite Python IDE or text editor and create a new file. You can name it something like parquet_file_writer.py or use some other descriptive name. Then enter the following code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def write_parquet(df: pd.DataFrame, filename: str) -> None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, filename)


if __name__ == "__main__":
    data = {
        "Languages": ["Python", "Ruby", "C++"],
        "Users": [10000, 5000, 8000],
        "Dynamic": [True, True, False],
    }
    df = pd.DataFrame(data=data, index=list(range(1, 4)))
    write_parquet(df, "languages.parquet")
For this example, you have three imports:
- One for pandas, so you can create a DataFrame
- One for pyarrow, to create a special pyarrow.Table object
- One for pyarrow.parquet, to transform the table object into a Parquet file
The write_parquet() function takes in a pandas DataFrame and the file name or path to save the Parquet file to. Then, you transform the DataFrame into a pyarrow Table object before converting that into a Parquet file using the write_table() function, which writes it to disk.
Now you are ready to read that file you just created!
Reading Parquet Files with Python
Reading the Parquet file you created earlier with Python is even easier. You’ll need about half as many lines of code!
You can put the following code into a new file called something like parquet_file_reader.py if you want to:
import pyarrow.parquet as pq


def read_parquet(filename: str) -> None:
    table = pq.read_table(filename)
    df = table.to_pandas()
    print(df)


if __name__ == "__main__":
    read_parquet("languages.parquet")
In this example, you read the Parquet file into a pyarrow Table format and then convert it to a pandas DataFrame using the Table’s to_pandas() method.
When you print out the contents of the DataFrame, you will see the following:
   Languages  Users  Dynamic
1     Python  10000     True
2       Ruby   5000     True
3        C++   8000    False
You can see from the output above that the DataFrame contains all the data you saved.
One of the strengths of using a Parquet file is that you can read just parts of the file instead of the whole thing. For example, you can read in just some of the columns rather than the whole file!
Here’s an example of how that works:
import pyarrow.parquet as pq


def read_columns(filename: str, columns: list[str]) -> None:
    table = pq.read_table(filename, columns=columns)
    print(table)


if __name__ == "__main__":
    read_columns("languages.parquet", columns=["Languages", "Users"])
To read in just the “Languages” and “Users” columns from the Parquet file, you pass in a list that contains just those column names. Then when you call read_table() you pass in the columns you want to read.
Here’s the output when you run this code:
pyarrow.Table
Languages: string
Users: int64
----
Languages: [["Python","Ruby","C++"]]
Users: [[10000,5000,8000]]
This outputs the pyarrow Table format, which differs slightly from a pandas DataFrame. It tells you information about the different columns; for example, Languages are strings, and Users are of type int64.
If you prefer to work only with pandas DataFrames, the pyarrow package allows that too. As long as you know the Parquet file was written from a pandas DataFrame, you can use read_pandas() instead of read_table().
Here’s a code example:
import pyarrow.parquet as pq


def read_columns_pandas(filename: str, columns: list[str]) -> None:
    table = pq.read_pandas(filename, columns=columns)
    df = table.to_pandas()
    print(df)


if __name__ == "__main__":
    read_columns_pandas("languages.parquet", columns=["Languages", "Users"])
When you run this example, the output is a DataFrame that contains just the columns you asked for:
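   Languages  Users
1     Python  10000
2       Ruby   5000
3        C++   8000
One advantage of using read_pandas() together with to_pandas() is that the DataFrame keeps its original index, even when you read only some of the columns. With read_table() and a columns list, the stored index column may be left out, so the resulting DataFrame can fall back to a default index.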
Reading Parquet File Metadata
You can also get the metadata from a Parquet file using Python. Getting the metadata can be useful when you need to inspect an unfamiliar Parquet file to see what type(s) of data it contains.
Here’s a small code snippet that will read the Parquet file’s metadata and schema:
import pyarrow.parquet as pq


def read_metadata(filename: str) -> None:
    parquet_file = pq.ParquetFile(filename)
    metadata = parquet_file.metadata
    print(metadata)
    print(f"Parquet file: {filename} Schema")
    print(parquet_file.schema)


if __name__ == "__main__":
    read_metadata("languages.parquet")
There are two ways to get the Parquet file’s metadata:
- Use pq.ParquetFile to read the file and then access the metadata property
- Use pq.read_metadata(filename) instead
The benefit of the former method is that you can also access the schema property of the ParquetFile object.
When you run this code, you will see this output:
<pyarrow._parquet.FileMetaData object at 0x000002312C1355D0>
  created_by: parquet-cpp-arrow version 15.0.2
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2682
Parquet file: languages.parquet Schema
<pyarrow._parquet.ParquetSchema object at 0x000002312BBFDF00>
required group field_id=-1 schema {
  optional binary field_id=-1 Languages (String);
  optional int64 field_id=-1 Users;
  optional boolean field_id=-1 Dynamic;
  optional int64 field_id=-1 __index_level_0__;
}
Nice! You can read the output above to learn the number of rows and columns of data and the serialized size of the metadata. The column count is 4 rather than 3 because the DataFrame’s index was saved as the extra __index_level_0__ column. The schema tells you what the field types are.
Wrapping Up
Parquet files are becoming more popular in big data and data science-related fields. Python’s pyarrow package makes working with Parquet files easy. You should spend some time experimenting with the code in this tutorial and using it for some of your own Parquet files.
When you want to learn more, check out the Parquet documentation.