Pandas Data Io

Pandas provides a rich set of functions for reading and writing various data formats. In addition to commonly used CSV and Excel, it also supports SQL databases, HTML tables, Parquet, and other formats. This section will introduce these additional I/O features to help you flexibly handle data import and export in different scenarios. * * * ## Overview of Core I/O Functions Pandas's I/O functionality is very powerful, supporting read and write operations for multiple data formats. The following are commonly used read and write functions: | Read Function | Write Function | Supported Formats | Typical Use Cases | | --- | --- | --- | --- | | `pd.read_csv()` | `to_csv()` | CSV, TSV | Log files, tabular data | | `pd.read_excel()` | `to_excel()` | Excel (.xlsx, .xls) | Business reports, financial data | | `pd.read_sql()` | `to_sql()` | SQL Databases | Enterprise database interaction | | `pd.read_html()` | - | HTML Tables | Web data scraping | | `pd.read_parquet()` | `to_parquet()` | Apache Parquet | Big data analysis, storage | | `pd.read_feather()` | `to_feather()` | Feather format | Fast read/write, memory data | | `pd.read_json()` | `to_json()` | JSON | API data, web services | | `pd.read_pickle()` | `to_pickle()` | Python pickle | Serialization of Python objects | > Different formats have different application scenarios. CSV is a universal format but results in larger file sizes; Parquet is columnar storage, suitable for big data analysis; Excel is good for human readability but not ideal for large datasets. * * * ## CSV and Text Files CSV (Comma-Separated Values) is the most common data exchange format, and Pandas offers the most comprehensive support for it. ### Common Read Parameters ## Example import pandas as pd # Basic reading df = pd.read_csv("data.csv") # Specify delimiter (CSV defaults to comma, TSV to tab) df_tsv = pd.read_csv("data.tsv", sep="t") # Specify encoding (commonly used for Chinese files) df_utf8 = pd.read_csv("data.csv", encoding="utf-8") df_gbk = pd.read_csv("data.csv", encoding="gbk") # Skip rows (skip header or comment lines) df = pd.read_csv("data.csv", skiprows=3) # Skip first 3 rows df = pd.read_csv("data.csv", skiprows=[2,4]) # Skip row 2 and 4 # Use specific header row (default is first row) df = pd.read_csv("data.csv", header=0) # Use row 0 as header df = pd.read_csv("data.csv", header=None) # No header, auto-generate 0,1,2... # Specify column names df = pd.read_csv("data.csv", names=["ID","Name","Age","City"]) # Specify index column df = pd.read_csv("data.csv", index_col=0) # First column as index df = pd.read_csv("data.csv", index_col=) # Multiple columns as composite index ### Handling Missing Values ## Example import pandas as pd import numpy as np # Specify which values should be treated as missing df = pd.read_csv( "data.csv", na_values=["NA","null","NULL","N/A",""," "] # These values will be recognized as NaN ) # Different columns can use different missing value markers df = pd.read_csv( "data.csv", na_values={ "Age": ["unknown","0"], # Missing values in "Age" column "City": # Missing values in "City" column } ) # Keep certain values as valid (not recognized as missing) df = pd.read_csv("data.csv", keep_default_na=False) ### Writing CSV Files ## Example import pandas as pd # Prepare test data df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu"], "Age": [25,30,28], "City": ["Beijing","Shanghai","Guangzhou"] }) # Write CSV (default includes index) df.to_csv("output.csv") # Do not write index df.to_csv("output.csv", index=False) # Specify delimiter df.to_csv("output.csv", sep="t") # Do not write header df.to_csv("output.csv", header=False) # Specify encoding df.to_csv("output.csv", encoding="utf-8-sig") # With BOM, suitable for Excel opening * * * ## JSON Files JSON (JavaScript Object Notation) is the most common data format in web applications, and Pandas offers flexible read and write capabilities. ### JSON Structure and Reading Methods JSON data structures come in various forms, and you need to choose appropriate parameters based on the actual structure when reading: ## Example import pandas as pd import json # Prepare test JSON data data_records =''' [ {"name": "Zhang San", "age": 25, "city": "Beijing"}, {"name": "Li Si", "age": 30, "city": "Shanghai"}, {"name": "Wang Wu", "age": 28, "city": "Guangzhou"} ] ''' # Method 1: JSON array (one object per line) -> DataFrame df = pd.read_json(data_records, orient="records") print("orient='records':") print(df) print() # Method 2: JSON object (key-value pairs) -> Series data_dict ='{"name": "Zhang San", "age": 25, "city": "Beijing"}' s = pd.read_json(data_dict, typ="series") print("Read as Series:") print(s) ### JSON Lines Format JSON Lines (.jsonl) is a format where each line contains a complete JSON object, often used in logs and big data scenarios: ## Example import pandas as pd # JSON Lines format data jsonl_data ='''{"name": "Zhang San", "age": 25} {"name": "Li Si", "age": 30} {"name": "Wang Wu", "age": 28} {"name": "Zhao Liu", "age": 35} ''' # Write to JSON Lines file with open("data.jsonl","w", encoding="utf-8")as f: f.write(jsonl_data) # Read JSON Lines (each line is a JSON object) df = pd.read_json("data.jsonl", lines=True) print(df) ### Writing JSON ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu"], "Age": [25,30,28], "City": ["Beijing","Shanghai","Guangzhou"] }) # Write JSON with different orientations df.to_json("output_records.json", orient="records", force_ascii=False, indent=2) df.to_json("output_index.json", orient="index", force_ascii=False, indent=2) df.to_json("output_columns.json", orient="columns", force_ascii=False, indent=2) # JSON Lines format df.to_json("output.jsonl", orient="records", lines=True, force_ascii=False) * * * ## Pickle Serialization Pickle is Python's native object serialization format that can save any Python object, including DataFrames and complex nested structures. ## Example import pandas as pd import pickle # Create a DataFrame with complex structure df = pd.DataFrame({ "A": range(10), "B": range(10,20), "C": ["foo","bar"] * 5 }) # Write to Pickle file df.to_pickle("data.pkl") # Read from Pickle file df_loaded = pd.read_pickle("data.pkl") print(df_loaded) # Compressed write (gzip compression, smaller file size) df.to_pickle("data.pkl.gz", compression="gzip") df_loaded = pd.read_pickle("data.pkl.gz", compression="gzip") > The Pickle format can only be read by Python and is unsuitable for cross-language data exchange. Additionally, loading Pickle files from untrusted sources poses security risks; do not load .pkl files from unknown sources. * * * ## Performance and Scenario Selection Different data formats have varying performance characteristics. Choosing the right format can significantly improve efficiency: | Format | Read Speed | Write Speed | File Size | Suitable Scenarios | | --- | --- | --- | --- | --- | | CSV | Slow | Medium | Large | Universal exchange format, human-readable | | Parquet | Very Fast | Fast | Very Small | Big data analysis, columnar queries | | Feather | Extremely Fast | Extremely Fast | Large | Fast transfer of memory data | | Pickle | Fast | Fast | Medium | Python object persistence | | JSON | Slow | Medium | Very Large | Web APIs, cross-language | * * * ## Common Issues and Considerations **1. Insufficient Memory When Reading Large Files** For extremely large CSV or JSON files, use the `chunksize` parameter to read in chunks, avoiding loading everything into memory at once. **2. Chinese Encoding Issues** When handling Chinese files, ensure correct encoding is specified. Common encodings include utf-8, gbk, gb2312, and utf-8-sig (with BOM). **3. Corrupted Files After Writing** Before writing important data, it's recommended to read and verify it. Using compressed formats when writing large files reduces disk I/O and lowers the risk of data corruption. **4. Excel Format Limitations** Excel 2003 (.xls) supports up to 65536 rows per sheet, while Excel 2007 (.xlsx) supports up to 1048576 rows. For larger datasets, consider alternative formats.

YouTip

Pandas Data Io

📂 Categories