Pandas Pd Read Sql
[ Pandas Common Functions](#)
* * *
`read_sql()` is a function in the pandas library used to read data from databases, supporting both SQL queries and direct reading of database tables.
Databases are core components of enterprise-level applications that store large amounts of business data. `read_sql()` can connect to various types of databases (such as MySQL, PostgreSQL, SQLite, etc.), execute SQL queries, and convert the results into DataFrames for convenient subsequent data analysis and processing.
* * *
## Basic Syntax and Parameters
### Syntax Format
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, ...)
### Parameter Description
| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| sql | str or SQLAlchemy Selectable | SQL query statement or table name | Required |
| con | sqlalchemy engine, or sqlite3 connection | Database connection object | Required |
| index_col | str, list of str | Column name(s) to use as row index | None |
| coerce_float | bool | Whether to attempt converting numeric strings to floats | True |
| params | list, tuple, dict | Parameters for parameterized SQL queries | None |
| parse_dates | list, dict | Columns to parse as dates | None |
| chunksize | int | Size of chunks returned by iterator | None |
### Return Value
* **Return Type**: `pd.DataFrame` or Iterator
* When `chunksize` is None, returns a single DataFrame.
* When `chunksize` is set, returns an iterator that yields DataFrames on each iteration.
* * *
## Examples
Through the following examples, master various uses of `read_sql()`.
### Example 1: Using SQLite Database
SQLite is a lightweight embedded database that does not require a separate database server, making it ideal for learning and testing.
## Example
import pandas as pd
import sqlite3
# Create an in-memory SQLite database and insert test data
conn = sqlite3.connect(':memory:')
# Create table and insert data
cursor = conn.cursor()
# Create employee table
cursor.execute('''
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
age INTEGER,
city TEXT,
salary INTEGER
)
''')
# Insert test data
employees_data =[
(1,'Tom',28,'Beijing',8000),
(2,'Jerry',35,'Shanghai',12000),
(3,'Mike',42,'Guangzhou',15000),
(4,'Lucy',26,'Shenzhen',7000),
(5,'John',31,'Beijing',9500)
]
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', employees_data)
conn.commit()
# Example 1a: Execute SQL query and read data
# sql: SQL query statement (required)
# con: Database connection object (required)
query ="SELECT name, age, city, salary FROM employees WHERE salary > 8000"
df_query = pd.read_sql(query, conn)
print("Executing SQL query:")
print(df_query)
print()
# Example 1b: Read entire table directly
# Table name can be passed as sql parameter
df_table = pd.read_sql('employees', conn)
print("Reading entire table:")
print(df_table)
print()
# Example 1c: Set index using index_col
df_indexed = pd.read_sql('employees', conn, index_col='id')
print("Setting index column:")
print(df_indexed)
**Expected Output:**
Executing SQL query: name age city salary 0 Tom 28 Beijing 80001 Jerry 35 Shanghai 120002 Mike 42 Guangzhou 15000 chunksize empty string NaN NaN NaN NaN3 Lucy 26 Shenzhen 70004 John 31 Beijing 9500Reading entire table: id name age city salary 0 1 Tom 28 Beijing 80001 2 Jerry 35 Shanghai 120002 3 Mike 42 Guangzhou 150003 4 Lucy 26 Shenzhen 70004 5 John 31 Beijing 9500Setting index column: name age city salary id 1 Tom 28 Beijing 80002 Jerry 35 Shanghai 120003 Mike 42 Guangzhou 150004 Lucy 26 Shenzhen 70005 John 31 Beijing 9500
**Code Explanation:**
* `read_sql()` accepts SQL query statements as the first argument.
* The `con` parameter requires a database connection object, which can be from sqlite3, SQLAlchemy, etc.
* You can pass a table name directly to read the entire table's data.
* The `index_col` parameter can specify a column to be used as row index.
### Example 2: Using SQLAlchemy to Connect to Database
SQLAlchemy is the most popular ORM framework in Python, supporting connections to multiple databases.
## Example
import pandas as pd
from sqlalchemy import create_engine, text
# Use SQLAlchemy to create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')
# Create test data
with engine.connect()as conn:
# Create product table
conn.execute(text('''
CREATE TABLE products (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
category TEXT,
price REAL,
stock INTEGER
)
'''))
# Insert data
products_data =[
('A','Electronics',999.99,50),
('B','Electronics',499.99,100),
('C','Clothing',29.99,200),
('D','Food',5.99,500),
('E','Books',19.99,150)
]
conn.execute(text('''
INSERT INTO products (name, category, price, stock) VALUES (?, ?, ?, ?)
'''), products_data)
conn.commit()
# Example 2a: Read data using SQLAlchemy engine
query ="SELECT * FROM products WHERE category = :category"
df_params = pd.read_sql(
query,
engine,
params={'category': 'Electronics'}# Parameterized query to prevent SQL injection
)
print("Using parameterized query:")
print(df_params)
print()
# Example 2b: Read entire table directly
df_all = pd.read_sql('products', engine)
print("Reading entire table:")
print(df_all)
print()
# Example 2c: Try connecting to other databases (example with MySQL)
# Note: Need to install corresponding database drivers
# pip install pymysql # MySQL
# pip install psycopg2 # PostgreSQL
# MySQL connection example
# engine_mysql = create_engine('mysql+pymysql://username:password@host/database')
# df_mysql = pd.read_sql('SELECT * FROM table_name', engine_mysql)
print("Note: Connecting to MySQL/PostgreSQL requires installing respective drivers")
**Expected Output:**
Using parameterized query: id name category price stock 0 1 A Electronics 999.99 50 2 2 B Electronics 499.99 100Reading entire table: id name category price stock 0 1 A Electronics 999.99 501 2 B ... 499.99 1002 3 C Clothing 29.99 2003 4 D Food 5.99 5004 5 E Books 19.99 150
**Code Explanation:**
* SQLAlchemy provides a unified interface for database connections, supporting multiple databases such as MySQL, PostgreSQL, Oracle, etc.
* The `params` parameter is used for parameterized queries, helping to prevent SQL injection attacks.
* Use SQLAlchemy's `create_engine()` to create a database engine.
### Example 3: Chunk Reading and Processing Large Data
When query results are very large, you can read them in chunks to save memory.
## Example
import pandas as pd
import sqlite3
# Create a large amount of test data
conn = sqlite3.connect('test_large.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE large_table (
id INTEGER PRIMARY KEY,
value INTEGER,
category TEXT
)
''')
# Insert 10000 rows of test data
import random
large_data =[(i,random.randint(1,1000),random.choice(['A','B','C']))
for i in range(1,10001)]
cursor.executemany('INSERT INTO large_table VALUES (?, ?, ?)', large_data)
conn.commit()
# Example 3a: Read data in chunks
# chunksize parameter specifies how many rows to return at a time
print("Reading data in chunks:")
chunks = pd.read_sql('SELECT * FROM large_table', conn, chunksize=2000)
# Process each chunk using generator
for i,chunk in enumerate(chunks):
print(f"Chunk {i+1}: {len(chunk)} rows")
print(f"Number of rows where category='A': {len(chunk[chunk['category'] == 'A'])}")
print()
# Example 3b: Aggregation calculation (merge all chunks)
# If global statistics are needed, merge all chunks
all_chunks =[]
chunks = pd.read_sql('SELECT category, SUM(value) as total FROM large_table GROUP BY category',
conn, chunksize=1000)
for chunk in chunks:
all_chunks.append(chunk)
# Merge results
result = pd.concat(all_chunks, ignore_index=True)
print("Aggregated results:")
print(result)
print()
# Close connection
conn.close()
# Clean up test file
import os
os.remove('test_large.db')
print("Test database cleaned")
**Expected Output:**
Reading data in chunks:Chunk 1: 2000 rowsChunk 2: 2000 rowsChunk 3: 2000 rowsChunk 4: 2000 rowsChunk 5: 2000 rowsNumber of rows where category='A': about 667 rowsAggregated results: category total 0 A 16875001 B 16782502 C 1675000
**Code Explanation:**
* The `chunksize` parameter returns an iterator, yielding a DataFrame on each iteration.
* Chunk reading is suitable for large datasets, avoiding memory issues.
* For aggregation queries, you can use `pd.concat()` to merge results from all chunks.
* * *
## Notes
* Using `read_sql()` requires installing the corresponding database driver.
* SQLite doesn't need additional installation and is ready to use.
* MySQL requires `pymysql` or `mysql-connector-python`: `pip install pymysql`.
* PostgreSQL requires `psycopg2`: `pip install psycopg2`.
* Using parameterized queries (`params` parameter) helps prevent SQL injection attacks.
* When reading large tables, use the `chunksize` parameter for chunked reading.
* * *
## Summary
`read_sql()` is the core function in pandas for connecting to databases and reading data. It can execute any SQL query and convert the results into DataFrame format.
In real data analysis work, databases are the core storage method for enterprise data. Mastering `read_sql()` allows you to directly fetch data from databases for analysis. Pay attention to using parameterized queries for security and chunked reading for handling large datasets.
[ Pandas Common Functions](#)
YouTip