Pandas Sql
Pandas provides a set of functions that directly interact with SQL databases, allowing query results to be read into DataFrames directly, or DataFrames to be written back to databases. This greatly simplifies the interaction process between databases and Python, eliminating the need for data analysts to manually handle database connections and result parsing.
* * *
## Core Function Overview
| Function | Purpose | Return Type |
| --- | --- | --- |
| `pd.read_sql()` | Execute SQL queries or read entire tables (universal function compatible with both scenarios) | DataFrame |
| `pd.read_sql_query()` | Execute SQL query statements, suitable for complex queries | DataFrame |
| `pd.read_sql_table()` | Read entire tables directly, only supports SQLAlchemy connections | DataFrame |
| `DataFrame.to_sql()` | Write DataFrame to database table | None / int |
> In actual work, it is recommended to use `pd.read_sql()`, which automatically determines whether to execute a query or read an entire table based on the input parameters, offering the best compatibility. `pd.read_sql_table()` only supports SQLAlchemy engine connections and does not support native `sqlite3` or other DB-API connections.
* * *
## Establishing Database Connections
Pandas itself does not connect to databases directly; it requires third-party libraries to establish connections first before passing them in. There are two main approaches: **SQLAlchemy engines** (recommended) and **native DB-API connections** (lightweight and simple).
### Method 1: SQLAlchemy (Recommended)
SQLAlchemy is the most popular database toolkit for Python, supporting all mainstream databases and offering the best compatibility with Pandas:
pip install sqlalchemy
## Example
from sqlalchemy import create_engine
# SQLAlchemy connection string format: database type + driver://username:password@host:port/database_name
# SQLite (file-based database, no username/password required)
engine = create_engine("sqlite:///mydata.db")
# MySQL
engine = create_engine("mysql+pymysql://root:password@localhost:3306/mydb")
# PostgreSQL
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")
# SQL Server
engine = create_engine("mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server")
# Verify connection success
with engine.connect()as conn:
print("Connection successful")
### Method 2: Native sqlite3 Connection (SQLite Only)
Python's built-in `sqlite3` module requires no additional installation, suitable for lightweight local SQLite databases:
## Example
import sqlite3
# Connect to SQLite file database (automatically creates file if it doesn't exist)
conn = sqlite3.connect("mydata.db")
# Connect to in-memory database (data disappears after program exits, suitable for testing)
conn_memory = sqlite3.connect(":memory:")
# Close connection manually when done
# conn.close()
* * *
## Reading Data with pd.read_sql()
### Basic Syntax
pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Main parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| `sql` | str | SQL query statement, or table name (depends on `con` type) |
| `con` | Connection object | SQLAlchemy engine or DB-API connection object |
| `index_col` | str or list | Set specified column(s) as row index of DataFrame |
| `params` | list or dict | Parameter values for parameterized queries, preventing SQL injection |
| `parse_dates` | list or dict | Parse specified columns into datetime type |
| `chunksize` | int | Read in chunks, returns an iterator of DataFrames with specified number of rows per chunk |
### 1. Read Entire Table
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Read entire employees table
df = pd.read_sql("employees", con=engine)
print(df.head())
print(f"Total {len(df)} rows, {len(df.columns)} columns")
### 2. Execute SQL Query
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Query with conditions
df = pd.read_sql("SELECT * FROM employees WHERE department = 'IT'", con=engine)
# Multi-table join query
sql ="""
SELECT e.name, e.salary, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 10000
ORDER BY e.salary DESC
"""
df = pd.read_sql(sql, con=engine)
print(df.head(10))
### 3. Parameterized Queries (Prevent SQL Injection)
When query conditions come from user input, **always use parameterized queries**, never concatenate strings to form SQL:
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# β Dangerous approach: string concatenation, vulnerable to SQL injection
dept ="IT"
# df = pd.read_sql(f"SELECT * FROM employees WHERE department = '{dept}'", engine)
# β
Safe approach: use ? placeholders (sqlite3) or :name named parameters (SQLAlchemy)
# SQLite / DB-API style (use ?)
df = pd.read_sql(
"SELECT * FROM employees WHERE department = ? AND salary > ?",
con=engine,
params=["IT",8000]# params is a list, replaces ? by position
)
# SQLAlchemy named parameter style (use :param_name)
from sqlalchemy import text
with engine.connect()as conn:
df = pd.read_sql(
text("SELECT * FROM employees WHERE department = :dept AND salary > :min_salary"),
con=conn,
params={"dept": "IT","min_salary": 8000}
)
print(df)
### 4. Set Index Column
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Set id column from database as DataFrame row index
df = pd.read_sql("SELECT * FROM employees", con=engine, index_col="id")
print(df.head())
# Use multiple columns as composite index
df = pd.read_sql(
"SELECT * FROM orders",
con=engine,
index_col=["year","month"]# Composite index
)
### 5. Parse Date Columns
Date fields stored in databases are read as strings by default. Using the `parse_dates` parameter can directly parse them into `datetime` types:
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Parse created_at and updated_at columns into datetime type
df = pd.read_sql(
"SELECT * FROM orders",
con=engine,
parse_dates=["created_at","updated_at"]
)
print(df.dtypes)
# created_at datetime64
# updated_at datetime64
# You can also specify parsing formats (for non-standard date formats)
df = pd.read_sql(
"SELECT * FROM orders",
con=engine,
parse_dates={"created_at": "%Y%m%d"}# Parse "20240115" format into date
)
* * *
## Chunked Reading of Large Datasets (chunksize)
When database tables contain large amounts of data, loading everything into memory at once can cause OOM (out-of-memory) errors. Using the `chunksize` parameter allows reading data in batches, processing only part of the data at a time:
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://root:password@localhost/bigdata")
# chunksize=10000 means reading 10000 rows each time, returning an iterator
chunks = pd.read_sql("SELECT * FROM large_table", con=engine, chunksize=10000)
# Process chunks one by one (only 10000 rows in memory at a time)
result_list =[]
for i,chunk in enumerate(chunks):
# Process each chunk (e.g., filtering, aggregation, etc.)
processed =chunk[chunk=="active"]
result_list.append(processed)
print(f"Processed chunk {i+1}, current block valid rows: {len(processed)}")
# Combine all processed chunks into one DataFrame
final_df = pd.concat(result_list, ignore_index=True)
print(f"Final total valid rows: {len(final_df)}")
* * *
## pd.read_sql_query() and pd.read_sql_table()
### pd.read_sql_query(): Only Execute Query
Functionally similar to `pd.read_sql()`, but only accepts SQL query statements, not table names. Parameters are identical, suitable for scenarios where you want to clearly distinguish between "query" and "read table" operations:
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
df = pd.read_sql_query(
"SELECT name, salary FROM employees WHERE salary > 5000",
con=engine
)
print(df)
### pd.read_sql_table(): Read Entire Table (SQLAlchemy Only)
`pd.read_sql_table()` is specifically used to read entire tables, supporting filtering of columns and rows via parameters, but **only supports SQLAlchemy engine connections**, not native connections like sqlite3:
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Read specific columns from employees table
df = pd.read_sql_table(
"employees",
con=engine,
columns=["id","name","salary","department"]# Only read these columns
)
# Also supports specifying schema (database schema)
df = pd.read_sql_table(
"employees",
con=engine,
schema="hr"# Read employees table under hr schema
)
print(df.head())
* * *
## Writing DataFrame to Database (to_sql)
Use `DataFrame.to_sql()` to write DataFrames to databases, supporting creating new tables, appending data, or replacing existing tables:
### Basic Syntax
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)
The `if_exists` parameter determines behavior when the target table already exists:
| if_exists Value | Behavior | Use Case |
| --- | --- | --- |
| `'fail'` (default) | Raise error if table exists | Prevent accidental overwriting |
| `'replace'` | Drop and recreate table | Full data refresh |
| `'append'` | Append data to existing table without changing structure | Incremental data writing |
## Example
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydata.db")
# Prepare sample data
df = pd.DataFrame({
"name": ["Zhang San","Li Si","Wang Wu"],
"department": ["IT","HR","IT"],
"salary": [12000,8000,15000]
})
# Write DataFrame to employees table, replace if exists
df.to_sql(
"employees",
con=engine,
if_exists="replace",# Replace existing data
index=False# Don't write DataFrame row index to database (usually unnecessary)
)
# Append new data to existing table (note: column names and data types must match)
new_employees = pd.DataFrame({
"name": ,
"department": ,
"salary":
})
new_employees.to_sql("employees", con=engine, if_exists="append", index=False)
# Verify write result
result = pd.read_sql("SELECT * FROM employees", con=engine)
print(result)
### Specify Column Data Types
You can explicitly specify column data types in the database using the `dtype` parameter during writing:
## Example
import pandas as pd
from sqlalchemy import create_engine, Integer, String, Float, DateTime
engine = create_engine("sqlite:///mydata.db")
df = pd.DataFrame({
"id": [1,2,3],
"name": ["Alice","Bob","Charlie"],
"score": [92.5,88.0,95.3],
})
df.to_sql(
"students",
con=engine,
if_exists="replace",
index=False,
dtype={
"id": Integer(),# Integer type
"name": String(50),# Variable-length string, max 50 characters
"score": Float()# Floating point number
}
)
* * *
## Complete Usage Example
Here is a complete example of reading sales data from an SQLite database and performing analysis:
## Example
import pandas as pd
import sqlite3
from sqlalc
YouTip