Pandas Parquet Feather

Parquet and Feather are two efficient columnar storage formats designed for large-scale data analysis and fast read/write workloads. Compared to CSV, they typically offer better compression and faster queries.

* * *

## Why Use Parquet and Feather

The main advantages of columnar storage formats:

| Feature | CSV | Parquet | Feather |
| --- | --- | --- | --- |
| Storage Format | Row-based | Column-based | Column-based |
| Read Speed | Slow | Very Fast | Extremely Fast |
| Write Speed | Medium | Fast | Extremely Fast |
| Compression Ratio | Low | High | Medium |
| Cross-language Support | Universal | Java/Python/R and others | Python/R and other Arrow libraries |

> Parquet is an Apache Software Foundation project widely used in the big data ecosystem (Hadoop, Spark, DuckDB, etc.). Feather is the on-disk file format of the Apache Arrow project (Feather V2 is the Arrow IPC file format), focusing on extremely fast read/write performance.

* * *

## Parquet File Operations

### Install Dependencies

```bash
pip install pyarrow fastparquet
```

PyArrow and fastparquet are the two Parquet engines pandas supports. pandas uses PyArrow by default when it is installed and falls back to fastparquet otherwise; fastparquet can be faster in certain scenarios.
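Since pandas picks its Parquet engine from whatever is installed, it can help to check the environment before reading or writing. The following is a minimal sketch using only the standard library; `available_parquet_engines` is a hypothetical helper name, not part of pandas.

```python
import importlib.util


def available_parquet_engines():
    """Return the Parquet engine packages importable in this environment."""
    candidates = ("pyarrow", "fastparquet")
    # find_spec returns None when a top-level package cannot be found
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
print("Installed Parquet engines:", engines or "none")
```

If the list is empty, `pd.read_parquet` and `df.to_parquet` will raise an import error until one of the two packages is installed.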
### Reading Parquet Files

```python
import pandas as pd

# Read a Parquet file (uses pyarrow by default when it is installed)
df = pd.read_parquet("data.parquet")

# Select an engine explicitly
df = pd.read_parquet("data.parquet", engine="pyarrow")
df = pd.read_parquet("data.parquet", engine="fastparquet")

# Read a remote Parquet file (S3 paths require the s3fs package)
df = pd.read_parquet("s3://bucket/data.parquet")

# Read only specific columns (column pruning, improves performance)
df = pd.read_parquet("data.parquet", columns=["name", "age", "city"])
print(df.head())
```

### Writing Parquet Files

```python
import os

import pandas as pd

# Prepare test data
df = pd.DataFrame({
    "id": range(1, 10001),
    "name": ["user" + str(i) for i in range(1, 10001)],
    "age": [20, 25, 30, 35] * 2500,
    "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"] * 2500,
    "score": [round(i * 0.1, 2) for i in range(1, 10001)],
})

# Write to a Parquet file (pyarrow engine and snappy compression by default)
df.to_parquet("data.parquet", index=False)

# Specify the compression method (snappy is fast, gzip has a higher compression ratio)
df.to_parquet("data_snappy.parquet", compression="snappy", index=False)
df.to_parquet("data_gzip.parquet", compression="gzip", index=False)
df.to_parquet("data_none.parquet", compression=None, index=False)  # No compression

# Compare file sizes
print("File size comparison:")
# Rough guess at the CSV size, assuming about 50 bytes per row
print(f"Estimated original CSV: {len(df) * 50 / 1024 / 1024:.2f} MB")
print(f"Snappy compressed: {os.path.getsize('data_snappy.parquet') / 1024 / 1024:.2f} MB")
print(f"Gzip compressed: {os.path.getsize('data_gzip.parquet') / 1024 / 1024:.2f} MB")
print(f"No compression: {os.path.getsize('data_none.parquet') / 1024 / 1024:.2f} MB")
```

### Using the fastparquet Engine

```python
import time

import pandas as pd

# Compare write performance of the two engines
df = pd.DataFrame({
    "id": range(1, 100001),
    "value": range(1, 100001),
})

# Write using the pyarrow engine
start = time.perf_counter()
df.to_parquet("test_pyarrow.parquet", engine="pyarrow", compression="snappy")
print(f"PyArrow write time: {time.perf_counter() - start:.3f}s")

# Write using the fastparquet engine
start = time.perf_counter()
df.to_parquet("test_fastparquet.parquet", engine="fastparquet", compression="snappy")
print(f"FastParquet write time: {time.perf_counter() - start:.3f}s")
```

* * *

## Feather File Operations

Feather is the file format of the Arrow project (Feather V2 is the Arrow IPC file format written to disk), focusing on extremely fast read/write speeds.

### Install Dependencies

```bash
pip install pyarrow
```

### Reading Feather Files

```python
import pandas as pd

# Read a Feather file
# The format version is chosen at write time (V2 is the default),
# so read_feather takes no version argument
df = pd.read_feather("data.feather")

# Read only specific columns
df = pd.read_feather("data.feather", columns=["name", "age"])
print(df.head())
```

### Writing Feather Files

```python
import os

import pandas as pd

# Prepare test data
df = pd.DataFrame({
    "id": range(1, 10001),
    "name": ["user" + str(i) for i in range(1, 10001)],
    "value": [round(i * 0.1, 2) for i in range(1, 10001)],
})

# Write to a Feather file (pyarrow compresses with LZ4 by default when it is available)
df.to_feather("data.feather")

# No compression (must be requested explicitly)
df.to_feather("data_uncompressed.feather", compression="uncompressed")

# LZ4 compression (faster)
df.to_feather("data_lz4.feather", compression="lz4")

# ZSTD compression (higher compression ratio)
df.to_feather("data_zstd.feather", compression="zstd")

# Compare file sizes
print(f"Default: {os.path.getsize('data.feather') / 1024:.2f} KB")
print(f"No compression: {os.path.getsize('data_uncompressed.feather') / 1024:.2f} KB")
print(f"LZ4: {os.path.getsize('data_lz4.feather') / 1024:.2f} KB")
print(f"ZSTD: {os.path.getsize('data_zstd.feather') / 1024:.2f} KB")
```

* * *

## Performance Comparison Test

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Create test data
np.random.seed(42)
n_rows = 1000000
df = pd.DataFrame({
    "id": range(n_rows),
    "category": np.random.choice(["A", "B", "C", "D"], n_rows),
    "value": np.random.randn(n_rows),
    "flag": np.random.choice([True, False], n_rows),
    "date": pd.date_range("2020-01-01", periods=n_rows, freq="1min"),
})

# Create a temporary directory (removed automatically when the block exits)
with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = f"{tmpdir}/data.csv"
    pq_path = f"{tmpdir}/data.parquet"
    feather_path = f"{tmpdir}/data.feather"

    # CSV write and read
    start = time.perf_counter()
    df.to_csv(csv_path, index=False)
    csv_write = time.perf_counter() - start
    start = time.perf_counter()
    df_csv = pd.read_csv(csv_path)
    csv_read = time.perf_counter() - start

    # Parquet write and read
    start = time.perf_counter()
    df.to_parquet(pq_path, index=False)
    pq_write = time.perf_counter() - start
    start = time.perf_counter()
    df_pq = pd.read_parquet(pq_path)
    pq_read = time.perf_counter() - start

    # Feather write and read
    start = time.perf_counter()
    df.to_feather(feather_path)
    feather_write = time.perf_counter() - start
    start = time.perf_counter()
    df_feather = pd.read_feather(feather_path)
    feather_read = time.perf_counter() - start

    # Output results (file sizes must be read before the directory is deleted)
    print(f"{'Format':<10} {'Write Time':<12} {'Read Time':<12} {'File Size':<12}")
    print("-" * 50)
    print(f"{'CSV':<10} {csv_write:.3f}s{'':<5} {csv_read:.3f}s{'':<5} {os.path.getsize(csv_path) / 1024 / 1024:.2f}MB")
    print(f"{'Parquet':<10} {pq_write:.3f}s{'':<5} {pq_read:.3f}s{'':<5} {os.path.getsize(pq_path) / 1024 / 1024:.2f}MB")
    print(f"{'Feather':<10} {feather_write:.3f}s{'':<5} {feather_read:.3f}s{'':<5} {os.path.getsize(feather_path) / 1024 / 1024:.2f}MB")
```

* * *

## Use Cases and Recommendations

| Scenario | Recommended Format | Reason |
| --- | --- | --- |
| Big Data Analysis | Parquet | High compression ratio, fast columnar queries, wide ecosystem |
| Data Transfer Between Python/R | Feather | Extremely fast read/write, native Arrow support |
| Spark/Hadoop Integration | Parquet | Standard format in big data ecosystem |
| Data Backup/Archiving | Parquet | High compression, supports schema evolution |
| Temporary Data Caching | Feather | Fastest read/write speed |

* * *

## Common Issues

**1. Error when reading Parquet**

Ensure pyarrow or fastparquet is installed: `pip install pyarrow`

**2. File corruption**

Ensure the data is fully written to disk before the file is used; it is recommended to verify that the file can be read back after writing.

**3. Version compatibility**

Parquet files written by different library versions may have compatibility issues; it is recommended to pin the pyarrow or fastparquet version in your project.
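For the file-corruption issue above, one cheap first check is to look at the magic bytes: a finished Parquet file begins and ends with the four bytes `PAR1`, so a write that was cut short usually lacks the trailing marker. The sketch below uses only the standard library; `looks_like_complete_parquet` is a hypothetical helper, and the two test files contain stand-in bytes rather than real Parquet data. It does not validate the contents, so reading the file back with `pd.read_parquet` remains the thorough check.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(path):
    """Cheap sanity check: the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


with tempfile.TemporaryDirectory() as tmpdir:
    complete = os.path.join(tmpdir, "complete.parquet")
    truncated = os.path.join(tmpdir, "truncated.parquet")

    # Stand-in bytes, not real Parquet data: only the magic markers matter here
    with open(complete, "wb") as f:
        f.write(PARQUET_MAGIC + b"placeholder payload" + PARQUET_MAGIC)
    # Simulate a write that stopped before the trailing marker was written
    with open(truncated, "wb") as f:
        f.write(PARQUET_MAGIC + b"placeholder pay")

    complete_ok = looks_like_complete_parquet(complete)
    truncated_ok = looks_like_complete_parquet(truncated)

print(complete_ok, truncated_ok)  # True False
```

A related design choice is to write to a temporary name and rename the file only after the check passes, so readers never see a half-written file.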
