Pandas Parquet Feather
Parquet and Feather are two columnar storage formats designed for large-scale data analysis and fast reads and writes. Compared to CSV, they typically produce smaller files, load faster, and preserve column data types.
* * *
## Why Use Parquet and Feather
The main advantages of columnar storage formats:
| Feature | CSV | Parquet | Feather |
| --- | --- | --- | --- |
| Storage Format | Row-based | Column-based | Column-based |
| Read Speed | Slow | Very Fast | Extremely Fast |
| Write Speed | Medium | Fast | Extremely Fast |
| Compression Ratio | Low | High | Medium |
| Cross-language Support | Universal | Broad (Java, C++, Python, R, Rust, and others) | Languages with an Arrow implementation (Python, R, C++, Java, and others) |
> Parquet is an Apache Software Foundation project widely used in the big data ecosystem (Hadoop, Spark, DuckDB, etc.). Feather comes from the Apache Arrow project: Feather V2 is the Arrow IPC file format stored on disk, and it prioritizes very fast reads and writes.
* * *
## Parquet File Operations
### Install Dependencies
pip install pyarrow fastparquet
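PyArrow and fastparquet are two Python libraries that can read and write Parquet. With the default `engine="auto"`, pandas tries PyArrow first and falls back to fastparquet if PyArrow is unavailable. fastparquet can be faster in certain scenarios, so benchmark both on your own data if performance matters.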
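To check which engines are importable before calling pandas, one option is to probe for both packages. This is a minimal sketch; the helper name `available_parquet_engines` is hypothetical and not part of pandas.

```python
import importlib.util


def available_parquet_engines():
    """Return the Parquet engines importable here, in the order pandas prefers them."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print(f"Installed engines: {engines}; engine='auto' will most likely use {engines[0]}")
else:
    print("No Parquet engine found; run: pip install pyarrow")
```

Passing `engine="pyarrow"` or `engine="fastparquet"` explicitly avoids depending on whichever package happens to be installed.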
### Reading Parquet Files
## Example
import pandas as pd
# Read a Parquet file (pyarrow is used by default when it is installed)
df = pd.read_parquet("data.parquet")
# Choose a specific engine
df = pd.read_parquet("data.parquet", engine="pyarrow")
df = pd.read_parquet("data.parquet", engine="fastparquet")
# Read a remote Parquet file (s3:// URLs require the s3fs package)
df = pd.read_parquet("s3://bucket/data.parquet")
# Read only specific columns (column pruning reduces I/O and memory use)
df = pd.read_parquet("data.parquet", columns=["name", "age", "city"])
print(df.head())
### Writing Parquet Files
## Example
import pandas as pd
# Prepare test data
df = pd.DataFrame({
    "id": range(1, 10001),
    "name": ["user" + str(i) for i in range(1, 10001)],
    "age": [20, 25, 30, 35] * 2500,
    "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"] * 2500,
    "score": [round(i * 0.1, 2) for i in range(1, 10001)]
})
# Write to a Parquet file (pyarrow is used by default when it is installed)
df.to_parquet("data.parquet", index=False)
# Specify the compression codec (snappy is fast, gzip usually compresses more)
df.to_parquet("data_snappy.parquet", compression="snappy", index=False)
df.to_parquet("data_gzip.parquet", compression="gzip", index=False)
df.to_parquet("data_none.parquet", compression=None, index=False)  # No compression
# Check file sizes
import os
# Measure the CSV size in memory instead of guessing a bytes-per-row figure
csv_size = len(df.to_csv(index=False).encode("utf-8"))
print("File size comparison:")
print(f"CSV (in memory): {csv_size / 1024 / 1024:.2f} MB")
print(f"Snappy compressed: {os.path.getsize('data_snappy.parquet') / 1024 / 1024:.2f} MB")
print(f"Gzip compressed: {os.path.getsize('data_gzip.parquet') / 1024 / 1024:.2f} MB")
print(f"No compression: {os.path.getsize('data_none.parquet') / 1024 / 1024:.2f} MB")
### Using the fastparquet Engine
## Example
import pandas as pd
import time
# Compare write performance across engines
# (single runs are noisy; repeat the measurement for reliable numbers)
df = pd.DataFrame({
    "id": range(1, 100001),
    "value": range(1, 100001)
})
# Write using the pyarrow engine
start = time.perf_counter()
df.to_parquet("test_pyarrow.parquet", engine="pyarrow", compression="snappy")
print(f"PyArrow write time: {time.perf_counter() - start:.3f}s")
# Write using the fastparquet engine
start = time.perf_counter()
df.to_parquet("test_fastparquet.parquet", engine="fastparquet", compression="snappy")
print(f"FastParquet write time: {time.perf_counter() - start:.3f}s")
* * *
## Feather File Operations
Feather is a file format from the Apache Arrow project, designed for very fast reads and writes of data frames. In Python it is read and written through the pyarrow library.
### Install Dependencies
pip install pyarrow
### Reading Feather Files
## Example
import pandas as pd
# Read a Feather file. The format version (V1 or V2) is detected automatically;
# read_feather has no version parameter
df = pd.read_feather("data.feather")
# Read specific columns
df = pd.read_feather("data.feather", columns=["name", "age"])
print(df.head())
### Writing Feather Files
## Example
import pandas as pd
# Prepare test data
df = pd.DataFrame({
    "id": range(1, 10001),
    "name": ["user" + str(i) for i in range(1, 10001)],
    "value": [round(i * 0.1, 2) for i in range(1, 10001)]
})
# Write to a Feather file (pyarrow applies LZ4 by default when it is available)
df.to_feather("data.feather")
# Explicitly disable compression
df.to_feather("data_raw.feather", compression="uncompressed")
# LZ4 compression (faster)
df.to_feather("data_lz4.feather", compression="lz4")
# ZSTD compression (higher compression ratio)
df.to_feather("data_zstd.feather", compression="zstd")
# Check file sizes
import os
print(f"Default: {os.path.getsize('data.feather') / 1024:.2f} KB")
print(f"No compression: {os.path.getsize('data_raw.feather') / 1024:.2f} KB")
print(f"LZ4: {os.path.getsize('data_lz4.feather') / 1024:.2f} KB")
print(f"ZSTD: {os.path.getsize('data_zstd.feather') / 1024:.2f} KB")
* * *
## Performance Comparison Test
## Example
import pandas as pd
import numpy as np
import time
import os
import tempfile
# Create test data
np.random.seed(42)
n_rows = 1_000_000
df = pd.DataFrame({
"id": range(n_rows),
"category": np.random.choice(["A","B","C","D"], n_rows),
"value": np.random.randn(n_rows),
"flag": np.random.choice([True,False], n_rows),
"date": pd.date_range("2020-01-01", periods=n_rows, freq="1min")
})
# Create a temporary directory; its files are deleted when the block exits
with tempfile.TemporaryDirectory() as tmpdir:
    # CSV write and read
    start = time.perf_counter()
    df.to_csv(f"{tmpdir}/data.csv", index=False)
    csv_write = time.perf_counter() - start
    start = time.perf_counter()
    df_csv = pd.read_csv(f"{tmpdir}/data.csv")
    csv_read = time.perf_counter() - start
    # Parquet write and read
    start = time.perf_counter()
    df.to_parquet(f"{tmpdir}/data.parquet", index=False)
    pq_write = time.perf_counter() - start
    start = time.perf_counter()
    df_pq = pd.read_parquet(f"{tmpdir}/data.parquet")
    pq_read = time.perf_counter() - start
    # Feather write and read
    start = time.perf_counter()
    df.to_feather(f"{tmpdir}/data.feather")
    feather_write = time.perf_counter() - start
    start = time.perf_counter()
    df_feather = pd.read_feather(f"{tmpdir}/data.feather")
    feather_read = time.perf_counter() - start
    # Output results (must run inside the block, while the files still exist)
    print(f"{'Format':<10} {'Write Time':<12} {'Read Time':<12} {'File Size':<12}")
    print("-" * 50)
    print(f"{'CSV':<10} {csv_write:.3f}s{'':<5} {csv_read:.3f}s{'':<5} {os.path.getsize(f'{tmpdir}/data.csv')/1024/1024:.2f}MB")
    print(f"{'Parquet':<10} {pq_write:.3f}s{'':<5} {pq_read:.3f}s{'':<5} {os.path.getsize(f'{tmpdir}/data.parquet')/1024/1024:.2f}MB")
    print(f"{'Feather':<10} {feather_write:.3f}s{'':<5} {feather_read:.3f}s{'':<5} {os.path.getsize(f'{tmpdir}/data.feather')/1024/1024:.2f}MB")
* * *
## Use Cases and Recommendations
| Scenario | Recommended Format | Reason |
| --- | --- | --- |
| Big Data Analysis | Parquet | High compression ratio, fast columnar queries, wide ecosystem |
| Data Transfer Between Python/R | Feather | Extremely fast read/write, native Arrow support |
| Spark/Hadoop Integration | Parquet | Standard format in big data ecosystem |
| Data Backup/Archiving | Parquet | High compression, supports schema evolution |
| Temporary Data Caching | Feather | Fastest read/write speed |
* * *
## Common Issues
**1. Error when reading Parquet**
Ensure pyarrow or fastparquet is installed: `pip install pyarrow`
**2. File corruption**
A write that is interrupted partway leaves a truncated, unreadable file. Make sure each write completes, and verify the result by reading the file back afterward.
**3. Version compatibility**
Files written with newer Parquet features may not be readable by older library versions; pin the pyarrow or fastparquet version in your project's dependencies.