YouTip LogoYouTip

Pandas Sql

Pandas provides a set of functions that directly interact with SQL databases, allowing query results to be read into DataFrames directly, or DataFrames to be written back to databases. This greatly simplifies the interaction process between databases and Python, eliminating the need for data analysts to manually handle database connections and result parsing. * * * ## Core Function Overview | Function | Purpose | Return Type | | --- | --- | --- | | `pd.read_sql()` | Execute SQL queries or read entire tables (universal function compatible with both scenarios) | DataFrame | | `pd.read_sql_query()` | Execute SQL query statements, suitable for complex queries | DataFrame | | `pd.read_sql_table()` | Read entire tables directly, only supports SQLAlchemy connections | DataFrame | | `DataFrame.to_sql()` | Write DataFrame to database table | None / int | > In actual work, it is recommended to use `pd.read_sql()`, which automatically determines whether to execute a query or read an entire table based on the input parameters, offering the best compatibility. `pd.read_sql_table()` only supports SQLAlchemy engine connections and does not support native `sqlite3` or other DB-API connections. * * * ## Establishing Database Connections Pandas itself does not connect to databases directly; it requires third-party libraries to establish connections first before passing them in. There are two main approaches: **SQLAlchemy engines** (recommended) and **native DB-API connections** (lightweight and simple). ### Method 1: SQLAlchemy (Recommended) SQLAlchemy is the most popular database toolkit for Python, supporting all mainstream databases and offering the best compatibility with Pandas: pip install sqlalchemy ## Example from sqlalchemy import create_engine # SQLAlchemy connection string format: database type + driver://username:password@host:port/database_name # SQLite (file-based database, no username/password required) engine = create_engine("sqlite:///mydata.db") # MySQL engine = create_engine("mysql+pymysql://root:password@localhost:3306/mydb") # PostgreSQL engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb") # SQL Server engine = create_engine("mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server") # Verify connection success with engine.connect()as conn: print("Connection successful") ### Method 2: Native sqlite3 Connection (SQLite Only) Python's built-in `sqlite3` module requires no additional installation, suitable for lightweight local SQLite databases: ## Example import sqlite3 # Connect to SQLite file database (automatically creates file if it doesn't exist) conn = sqlite3.connect("mydata.db") # Connect to in-memory database (data disappears after program exits, suitable for testing) conn_memory = sqlite3.connect(":memory:") # Close connection manually when done # conn.close() * * * ## Reading Data with pd.read_sql() ### Basic Syntax pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None) Main parameter descriptions: | Parameter | Type | Description | | --- | --- | --- | | `sql` | str | SQL query statement, or table name (depends on `con` type) | | `con` | Connection object | SQLAlchemy engine or DB-API connection object | | `index_col` | str or list | Set specified column(s) as row index of DataFrame | | `params` | list or dict | Parameter values for parameterized queries, preventing SQL injection | | `parse_dates` | list or dict | Parse specified columns into datetime type | | `chunksize` | int | Read in chunks, returns an iterator of DataFrames with specified number of rows per chunk | ### 1. Read Entire Table ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Read entire employees table df = pd.read_sql("employees", con=engine) print(df.head()) print(f"Total {len(df)} rows, {len(df.columns)} columns") ### 2. Execute SQL Query ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Query with conditions df = pd.read_sql("SELECT * FROM employees WHERE department = 'IT'", con=engine) # Multi-table join query sql =""" SELECT e.name, e.salary, d.department_name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE e.salary > 10000 ORDER BY e.salary DESC """ df = pd.read_sql(sql, con=engine) print(df.head(10)) ### 3. Parameterized Queries (Prevent SQL Injection) When query conditions come from user input, **always use parameterized queries**, never concatenate strings to form SQL: ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # ❌ Dangerous approach: string concatenation, vulnerable to SQL injection dept ="IT" # df = pd.read_sql(f"SELECT * FROM employees WHERE department = '{dept}'", engine) # βœ… Safe approach: use ? placeholders (sqlite3) or :name named parameters (SQLAlchemy) # SQLite / DB-API style (use ?) df = pd.read_sql( "SELECT * FROM employees WHERE department = ? AND salary > ?", con=engine, params=["IT",8000]# params is a list, replaces ? by position ) # SQLAlchemy named parameter style (use :param_name) from sqlalchemy import text with engine.connect()as conn: df = pd.read_sql( text("SELECT * FROM employees WHERE department = :dept AND salary > :min_salary"), con=conn, params={"dept": "IT","min_salary": 8000} ) print(df) ### 4. Set Index Column ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Set id column from database as DataFrame row index df = pd.read_sql("SELECT * FROM employees", con=engine, index_col="id") print(df.head()) # Use multiple columns as composite index df = pd.read_sql( "SELECT * FROM orders", con=engine, index_col=["year","month"]# Composite index ) ### 5. Parse Date Columns Date fields stored in databases are read as strings by default. Using the `parse_dates` parameter can directly parse them into `datetime` types: ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Parse created_at and updated_at columns into datetime type df = pd.read_sql( "SELECT * FROM orders", con=engine, parse_dates=["created_at","updated_at"] ) print(df.dtypes) # created_at datetime64 # updated_at datetime64 # You can also specify parsing formats (for non-standard date formats) df = pd.read_sql( "SELECT * FROM orders", con=engine, parse_dates={"created_at": "%Y%m%d"}# Parse "20240115" format into date ) * * * ## Chunked Reading of Large Datasets (chunksize) When database tables contain large amounts of data, loading everything into memory at once can cause OOM (out-of-memory) errors. Using the `chunksize` parameter allows reading data in batches, processing only part of the data at a time: ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("mysql+pymysql://root:password@localhost/bigdata") # chunksize=10000 means reading 10000 rows each time, returning an iterator chunks = pd.read_sql("SELECT * FROM large_table", con=engine, chunksize=10000) # Process chunks one by one (only 10000 rows in memory at a time) result_list =[] for i,chunk in enumerate(chunks): # Process each chunk (e.g., filtering, aggregation, etc.) processed =chunk[chunk=="active"] result_list.append(processed) print(f"Processed chunk {i+1}, current block valid rows: {len(processed)}") # Combine all processed chunks into one DataFrame final_df = pd.concat(result_list, ignore_index=True) print(f"Final total valid rows: {len(final_df)}") * * * ## pd.read_sql_query() and pd.read_sql_table() ### pd.read_sql_query(): Only Execute Query Functionally similar to `pd.read_sql()`, but only accepts SQL query statements, not table names. Parameters are identical, suitable for scenarios where you want to clearly distinguish between "query" and "read table" operations: ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") df = pd.read_sql_query( "SELECT name, salary FROM employees WHERE salary > 5000", con=engine ) print(df) ### pd.read_sql_table(): Read Entire Table (SQLAlchemy Only) `pd.read_sql_table()` is specifically used to read entire tables, supporting filtering of columns and rows via parameters, but **only supports SQLAlchemy engine connections**, not native connections like sqlite3: ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Read specific columns from employees table df = pd.read_sql_table( "employees", con=engine, columns=["id","name","salary","department"]# Only read these columns ) # Also supports specifying schema (database schema) df = pd.read_sql_table( "employees", con=engine, schema="hr"# Read employees table under hr schema ) print(df.head()) * * * ## Writing DataFrame to Database (to_sql) Use `DataFrame.to_sql()` to write DataFrames to databases, supporting creating new tables, appending data, or replacing existing tables: ### Basic Syntax DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None) The `if_exists` parameter determines behavior when the target table already exists: | if_exists Value | Behavior | Use Case | | --- | --- | --- | | `'fail'` (default) | Raise error if table exists | Prevent accidental overwriting | | `'replace'` | Drop and recreate table | Full data refresh | | `'append'` | Append data to existing table without changing structure | Incremental data writing | ## Example import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///mydata.db") # Prepare sample data df = pd.DataFrame({ "name": ["Zhang San","Li Si","Wang Wu"], "department": ["IT","HR","IT"], "salary": [12000,8000,15000] }) # Write DataFrame to employees table, replace if exists df.to_sql( "employees", con=engine, if_exists="replace",# Replace existing data index=False# Don't write DataFrame row index to database (usually unnecessary) ) # Append new data to existing table (note: column names and data types must match) new_employees = pd.DataFrame({ "name": , "department": , "salary": }) new_employees.to_sql("employees", con=engine, if_exists="append", index=False) # Verify write result result = pd.read_sql("SELECT * FROM employees", con=engine) print(result) ### Specify Column Data Types You can explicitly specify column data types in the database using the `dtype` parameter during writing: ## Example import pandas as pd from sqlalchemy import create_engine, Integer, String, Float, DateTime engine = create_engine("sqlite:///mydata.db") df = pd.DataFrame({ "id": [1,2,3], "name": ["Alice","Bob","Charlie"], "score": [92.5,88.0,95.3], }) df.to_sql( "students", con=engine, if_exists="replace", index=False, dtype={ "id": Integer(),# Integer type "name": String(50),# Variable-length string, max 50 characters "score": Float()# Floating point number } ) * * * ## Complete Usage Example Here is a complete example of reading sales data from an SQLite database and performing analysis: ## Example import pandas as pd import sqlite3 from sqlalc
← Pandas Parquet FeatherTs Vue3 β†’