Pandas Data Io
Pandas provides a rich set of functions for reading and writing various data formats. In addition to commonly used CSV and Excel, it also supports SQL databases, HTML tables, Parquet, and other formats. This section will introduce these additional I/O features to help you flexibly handle data import and export in different scenarios.
* * *
## Overview of Core I/O Functions
Pandas's I/O functionality is very powerful, supporting read and write operations for multiple data formats. The following are commonly used read and write functions:
| Read Function | Write Function | Supported Formats | Typical Use Cases |
| --- | --- | --- | --- |
| `pd.read_csv()` | `to_csv()` | CSV, TSV | Log files, tabular data |
| `pd.read_excel()` | `to_excel()` | Excel (.xlsx, .xls) | Business reports, financial data |
| `pd.read_sql()` | `to_sql()` | SQL Databases | Enterprise database interaction |
| `pd.read_html()` | - | HTML Tables | Web data scraping |
| `pd.read_parquet()` | `to_parquet()` | Apache Parquet | Big data analysis, storage |
| `pd.read_feather()` | `to_feather()` | Feather format | Fast read/write, memory data |
| `pd.read_json()` | `to_json()` | JSON | API data, web services |
| `pd.read_pickle()` | `to_pickle()` | Python pickle | Serialization of Python objects |
> Different formats have different application scenarios. CSV is a universal format but results in larger file sizes; Parquet is columnar storage, suitable for big data analysis; Excel is good for human readability but not ideal for large datasets.
* * *
## CSV and Text Files
CSV (Comma-Separated Values) is the most common data exchange format, and Pandas offers the most comprehensive support for it.
### Common Read Parameters
## Example
import pandas as pd
# Basic reading
df = pd.read_csv("data.csv")
# Specify delimiter (CSV defaults to comma, TSV to tab)
df_tsv = pd.read_csv("data.tsv", sep="t")
# Specify encoding (commonly used for Chinese files)
df_utf8 = pd.read_csv("data.csv", encoding="utf-8")
df_gbk = pd.read_csv("data.csv", encoding="gbk")
# Skip rows (skip header or comment lines)
df = pd.read_csv("data.csv", skiprows=3) # Skip first 3 rows
df = pd.read_csv("data.csv", skiprows=[2,4]) # Skip row 2 and 4
# Use specific header row (default is first row)
df = pd.read_csv("data.csv", header=0) # Use row 0 as header
df = pd.read_csv("data.csv", header=None) # No header, auto-generate 0,1,2...
# Specify column names
df = pd.read_csv("data.csv", names=["ID","Name","Age","City"])
# Specify index column
df = pd.read_csv("data.csv", index_col=0) # First column as index
df = pd.read_csv("data.csv", index_col=) # Multiple columns as composite index
### Handling Missing Values
## Example
import pandas as pd
import numpy as np
# Specify which values should be treated as missing
df = pd.read_csv(
"data.csv",
na_values=["NA","null","NULL","N/A",""," "] # These values will be recognized as NaN
)
# Different columns can use different missing value markers
df = pd.read_csv(
"data.csv",
na_values={
"Age": ["unknown","0"], # Missing values in "Age" column
"City": # Missing values in "City" column
}
)
# Keep certain values as valid (not recognized as missing)
df = pd.read_csv("data.csv", keep_default_na=False)
### Writing CSV Files
## Example
import pandas as pd
# Prepare test data
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu"],
"Age": [25,30,28],
"City": ["Beijing","Shanghai","Guangzhou"]
})
# Write CSV (default includes index)
df.to_csv("output.csv")
# Do not write index
df.to_csv("output.csv", index=False)
# Specify delimiter
df.to_csv("output.csv", sep="t")
# Do not write header
df.to_csv("output.csv", header=False)
# Specify encoding
df.to_csv("output.csv", encoding="utf-8-sig") # With BOM, suitable for Excel opening
* * *
## JSON Files
JSON (JavaScript Object Notation) is the most common data format in web applications, and Pandas offers flexible read and write capabilities.
### JSON Structure and Reading Methods
JSON data structures come in various forms, and you need to choose appropriate parameters based on the actual structure when reading:
## Example
import pandas as pd
import json
# Prepare test JSON data
data_records ='''
[
{"name": "Zhang San", "age": 25, "city": "Beijing"},
{"name": "Li Si", "age": 30, "city": "Shanghai"},
{"name": "Wang Wu", "age": 28, "city": "Guangzhou"}
]
'''
# Method 1: JSON array (one object per line) -> DataFrame
df = pd.read_json(data_records, orient="records")
print("orient='records':")
print(df)
print()
# Method 2: JSON object (key-value pairs) -> Series
data_dict ='{"name": "Zhang San", "age": 25, "city": "Beijing"}'
s = pd.read_json(data_dict, typ="series")
print("Read as Series:")
print(s)
### JSON Lines Format
JSON Lines (.jsonl) is a format where each line contains a complete JSON object, often used in logs and big data scenarios:
## Example
import pandas as pd
# JSON Lines format data
jsonl_data ='''{"name": "Zhang San", "age": 25}
{"name": "Li Si", "age": 30}
{"name": "Wang Wu", "age": 28}
{"name": "Zhao Liu", "age": 35}
'''
# Write to JSON Lines file
with open("data.jsonl","w", encoding="utf-8")as f:
f.write(jsonl_data)
# Read JSON Lines (each line is a JSON object)
df = pd.read_json("data.jsonl", lines=True)
print(df)
### Writing JSON
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu"],
"Age": [25,30,28],
"City": ["Beijing","Shanghai","Guangzhou"]
})
# Write JSON with different orientations
df.to_json("output_records.json", orient="records", force_ascii=False, indent=2)
df.to_json("output_index.json", orient="index", force_ascii=False, indent=2)
df.to_json("output_columns.json", orient="columns", force_ascii=False, indent=2)
# JSON Lines format
df.to_json("output.jsonl", orient="records", lines=True, force_ascii=False)
* * *
## Pickle Serialization
Pickle is Python's native object serialization format that can save any Python object, including DataFrames and complex nested structures.
## Example
import pandas as pd
import pickle
# Create a DataFrame with complex structure
df = pd.DataFrame({
"A": range(10),
"B": range(10,20),
"C": ["foo","bar"] * 5
})
# Write to Pickle file
df.to_pickle("data.pkl")
# Read from Pickle file
df_loaded = pd.read_pickle("data.pkl")
print(df_loaded)
# Compressed write (gzip compression, smaller file size)
df.to_pickle("data.pkl.gz", compression="gzip")
df_loaded = pd.read_pickle("data.pkl.gz", compression="gzip")
> The Pickle format can only be read by Python and is unsuitable for cross-language data exchange. Additionally, loading Pickle files from untrusted sources poses security risks; do not load .pkl files from unknown sources.
* * *
## Performance and Scenario Selection
Different data formats have varying performance characteristics. Choosing the right format can significantly improve efficiency:
| Format | Read Speed | Write Speed | File Size | Suitable Scenarios |
| --- | --- | --- | --- | --- |
| CSV | Slow | Medium | Large | Universal exchange format, human-readable |
| Parquet | Very Fast | Fast | Very Small | Big data analysis, columnar queries |
| Feather | Extremely Fast | Extremely Fast | Large | Fast transfer of memory data |
| Pickle | Fast | Fast | Medium | Python object persistence |
| JSON | Slow | Medium | Very Large | Web APIs, cross-language |
* * *
## Common Issues and Considerations
**1. Insufficient Memory When Reading Large Files**
For extremely large CSV or JSON files, use the `chunksize` parameter to read in chunks, avoiding loading everything into memory at once.
**2. Chinese Encoding Issues**
When handling Chinese files, ensure correct encoding is specified. Common encodings include utf-8, gbk, gb2312, and utf-8-sig (with BOM).
**3. Corrupted Files After Writing**
Before writing important data, it's recommended to read and verify it. Using compressed formats when writing large files reduces disk I/O and lowers the risk of data corruption.
**4. Excel Format Limitations**
Excel 2003 (.xls) supports up to 65536 rows per sheet, while Excel 2007 (.xlsx) supports up to 1048576 rows. For larger datasets, consider alternative formats.
YouTip