
# Pandas DataFrame to_parquet()

`to_parquet()` is a DataFrame method that exports data to a Parquet file. Parquet is a columnar storage format designed for analytics workloads. It offers good compression, fast reads and writes, and support for complex data types, and it is widely used by big data frameworks such as Apache Hadoop and Apache Spark.

* * *

## Basic Syntax and Parameters

### Syntax Format

```
DataFrame.to_parquet(path=None, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None, ...)
```

### Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| path | str, path object, file-like object, or None | Output file path (the root directory when `partition_cols` is set). If `None`, the result is returned as bytes. | None |
| engine | str | Parquet library to use: 'auto', 'pyarrow', or 'fastparquet'. 'auto' tries `pyarrow` first and falls back to `fastparquet`. | 'auto' |
| compression | str or None | Compression codec: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd', or None for no compression | 'snappy' |
| index | bool or None | `True` writes the index, `False` omits it. `None` behaves like `True`, except that a RangeIndex is stored as metadata instead of as a column. | None |
| partition_cols | list | Columns by which to partition the dataset; each partition value gets its own subdirectory | None |

### Return Value Description

* **Return Type**: `bytes` if `path` is `None`, otherwise `None`.
* When a path is given, the data is written directly to the Parquet file and nothing is returned.

* * *

## Examples

The following examples walk through the common uses of `to_parquet()`.

### Example 1: Basic Usage - Export to Parquet File

First create a DataFrame, then use `to_parquet()` to export it as a Parquet file.
```python
import os

import pandas as pd

# Create an example DataFrame
data = {
    'name': ['Tom', 'Jerry', 'Mike', 'Lucy', 'John'],
    'age': [28, 35, 42, 26, 31],
    'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou'],
    'salary': [8000, 12000, 15000, 7000, 9000],
    'department': ['IT', 'HR', 'Sales', 'IT', 'HR']
}
df = pd.DataFrame(data)

# Example 1a: Basic export
# path: output file path
# Uses snappy compression by default
df.to_parquet('employees.parquet')
print("Exported to employees.parquet")

# Check file size
file_size = os.path.getsize('employees.parquet')
print(f"File size: {file_size} bytes")

# Example 1b: Read and verify
# Requires pyarrow or fastparquet to be installed
df_check = pd.read_parquet('employees.parquet')
print("\nVerification read:")
print(df_check)
```

**Output Result:**

```
Exported to employees.parquet
File size: <size> bytes

Verification read:
    name  age       city  salary  department
0    Tom   28    Beijing    8000          IT
1  Jerry   35   Shanghai   12000          HR
2   Mike   42  Guangzhou   15000       Sales
3   Lucy   26   Shenzhen    7000          IT
4   John   31   Hangzhou    9000          HR
```

The exact file size depends on the engine and library versions.

**Code Explanation:**

* `to_parquet()` exports the DataFrame to Parquet format.
* By default, `snappy` compression is used, which trades a moderate compression ratio for speed.
* For a table this small, the Parquet file is usually larger than the equivalent CSV because it also stores schema and metadata. The size and read-speed advantages over CSV appear on larger datasets.

### Example 2: Selecting Engine and Compression Method

`to_parquet()` supports multiple engines and compression methods, which can be selected based on requirements.
```python
import os

import numpy as np
import pandas as pd

# Create a larger DataFrame to observe compression effects
np.random.seed(42)
df_large = pd.DataFrame({
    'id': range(10000),
    'value': np.random.randn(10000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'name': np.random.choice(['Tom', 'Jerry', 'Mike', 'Lucy', 'John'], 10000)
})

# Example 2a: Using different compression methods
# snappy: fast, moderate compression ratio (default)
df_large.to_parquet('output_snappy.parquet', compression='snappy')

# gzip: higher compression ratio, smaller file size
df_large.to_parquet('output_gzip.parquet', compression='gzip')

# brotli: typically an even higher compression ratio
df_large.to_parquet('output_brotli.parquet', compression='brotli')

# No compression
df_large.to_parquet('output_none.parquet', compression=None)

# Compare file sizes
print("Comparison of file sizes with different compression methods:")
for name in ['snappy', 'gzip', 'brotli', 'none']:
    size = os.path.getsize(f'output_{name}.parquet')
    print(f"  {name}: {size:,} bytes")
print()

# Example 2b: Selecting engine
# auto: try pyarrow first, then fall back to fastparquet (default)
# pyarrow: implementation from Apache Arrow, complete features, good performance
# fastparquet: alternative Python implementation
print("Available engines: auto, pyarrow, fastparquet")

# Check which engines are installed
installed = []
try:
    import pyarrow
    installed.append("pyarrow")
except ImportError:
    pass
try:
    import fastparquet
    installed.append("fastparquet")
except ImportError:
    pass
print("Installed engines:", ", ".join(installed))
```

**Output Result:**

```
Comparison of file sizes with different compression methods:
  snappy: About 100KB
  gzip: About 80KB
  brotli: About 70KB
  none: About 200KB
```

The sizes above are approximate; exact values depend on the engine and library versions. The last two lines of output list the available engine names and whichever engines are installed on your machine.

Each compression method has its pros and cons:

* snappy: fast, moderate compression ratio (recommended default)
* gzip: higher compression ratio, suitable for storage
* brotli: usually the highest compression ratio of the three, suitable for cold data
* none: no compression overhead, but the largest files

**Code Explanation:**

* The `compression` parameter allows choosing
different compression methods.
* `snappy` is the default option, balancing speed and compression ratio.
* `gzip` offers higher compression but is somewhat slower.
* The `engine` parameter selects which Parquet library performs the write.

### Example 3: Partitioned Storage

Parquet datasets can be partitioned by column values, an important feature in big data analysis.

```python
import os
import shutil

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name': ['Tom', 'Jerry', 'Mike', 'Lucy', 'John', 'Mary', 'Bob', 'Alice'],
    'age': [28, 35, 42, 26, 31, 29, 38, 24],
    'department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'IT', 'Sales', 'HR'],
    'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou',
             'Beijing', 'Shanghai', 'Beijing']
})

# Example 3a: Partition by single column
# partition_cols specifies the partition columns; path is treated as a
# directory and one subdirectory is created per partition value
if os.path.exists('partitioned'):
    shutil.rmtree('partitioned')  # start from a clean directory

df.to_parquet('partitioned', partition_cols=['department'])
print("Exported with department partitioning")

# View partition directory structure
print("Partition directory structure:")
for root, dirs, files in os.walk('partitioned'):
    level = root.replace('partitioned', '').count(os.sep)
    indent = ' ' * 2 * level
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        print(f'{subindent}{file}')
```

**Output Result:**

```
Exported with department partitioning
Partition directory structure:
partitioned/
  department=HR/
    xxx.parquet
  department=IT/
    xxx.parquet
  department=Sales/
    xxx.parquet
```

Here `xxx.parquet` stands for a file name generated by the engine; the order of the subdirectories may differ.

**Code Explanation:**

* The `partition_cols` parameter partitions storage by the specified columns.
* Partitioned storage creates subdirectories in the file system, one directory per partition value, named `column=value`.
* Partitioning benefits queries on large data because a reader can load only the partitions it needs.

### Example 4: Handling Indexes

When exporting to Parquet, you can choose whether to include the index.
```python
import pandas as pd

# Create DataFrame with a custom index
df = pd.DataFrame({
    'name': ['Tom', 'Jerry', 'Mike', 'Lucy'],
    'age': [28, 35, 42, 26],
    'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen']
})
df.index = ['A001', 'A002', 'A003', 'A004']

# Example 4a: Default (index=None) keeps a non-range index
df.to_parquet('with_index.parquet')
print("Exported (index included by default)")

# Example 4b: Exclude index
df.to_parquet('without_index.parquet', index=False)
print("Exported (index excluded)")

# Example 4c: Explicitly include index
df.to_parquet('explicit_index.parquet', index=True)
print("Exported (index explicitly included)")

# Read and compare
print("\nRead and compare differences:")
print("\nOriginal data:")
print(df)

print("\nReading with_index.parquet:")
print(pd.read_parquet('with_index.parquet'))

print("\nReading without_index.parquet:")
print(pd.read_parquet('without_index.parquet'))

print("\nReading explicit_index.parquet:")
print(pd.read_parquet('explicit_index.parquet'))
```

**Output Result:**

```
Exported (index included by default)
Exported (index excluded)
Exported (index explicitly included)

Read and compare differences:

Original data:
       name  age       city
A001    Tom   28    Beijing
A002  Jerry   35   Shanghai
A003   Mike   42  Guangzhou
A004   Lucy   26   Shenzhen

Reading with_index.parquet:
       name  age       city
A001    Tom   28    Beijing
A002  Jerry   35   Shanghai
A003   Mike   42  Guangzhou
A004   Lucy   26   Shenzhen

Reading without_index.parquet:
    name  age       city
0    Tom   28    Beijing
1  Jerry   35   Shanghai
2   Mike   42  Guangzhou
3   Lucy   26   Shenzhen

Reading explicit_index.parquet:
       name  age       city
A001    Tom   28    Beijing
A002  Jerry   35   Shanghai
A003   Mike   42  Guangzhou
A004   Lucy   26   Shenzhen
```

**Code Explanation:**

* `index=True` stores the index as a column in the file; `pd.read_parquet()` restores it as the index when reading.
* `index=False` excludes the index, so reading the file back yields a fresh default integer index.
* The default `index=None` behaves like `index=True`, except that a plain RangeIndex is stored as metadata rather than as a column.

* * *

## Notes

* Using `to_parquet()` requires installing `pyarrow` or `fastparquet`.
* `pyarrow` is the recommended engine: `pip install pyarrow`.
* Parquet is columnar storage and suits big data analytics. For very small datasets its metadata overhead can outweigh the benefits.
* Partitioned storage can significantly improve query performance on large data.
* Parquet supports complex data types (nested structures), though these are rarely used in pandas DataFrames.

* * *

## Summary

`to_parquet()` exports DataFrame data to Parquet format. Parquet is a widely used columnar storage format in big data, offering good compression, high performance, and support for partitioning.

In big data processing scenarios, Parquet is often the preferred data format, and it integrates well with frameworks such as Apache Spark and Apache Hive. For large-scale data processing, consider using Parquet instead of CSV or Excel.