Pandas Functions
The following lists some common functions in Pandas and their usage examples:
## Reading Data
| Function | Description |
| --- | --- |
| pd.read_csv(filename) | Read CSV files; |
| pd.read_excel(filename) | Read Excel files; |
| pd.read_sql(query, connection_object) | Read data from SQL database; |
| pd.read_json(json_string) | Read data from a JSON string; |
| pd.read_html(url) | Read data from an HTML page. |
## Examples
import pandas as pd
# Read data from a CSV file
df = pd.read_csv('data.csv')
# Read data from an Excel file
df = pd.read_excel('data.xlsx')
# Read data from a SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# Read data from a JSON string
json_string ='{"name": "John", "age": 30, "city": "New York"}'
df = pd.read_json(json_string)
# Read data from an HTML page
url =''
dfs = pd.read_html(url)
df = dfs# Select the first dataframe
* * *
## Viewing Data
| Function | Description |
| --- | --- |
| df.head(n) | Display the first n rows of data; |
| df.tail(n) | Display the last n rows of data; |
| df.info() | Display information about the data, including column names, data types, missing values, etc.; |
| df.describe() | Display basic statistical information about the data, including mean, variance, maximum, minimum, etc.; |
| df.shape | Display the number of rows and columns in the data. |
## Examples
# Display the first five rows of data
df.head()
# Display the last five rows of data
df.tail()
# Display data information
df.info()
# Display basic statistical information
df.describe()
# Display the number of rows and columns in the data
df.shape
## Example
import pandas as pd
data =[
{"name": "Google","likes": 25,"url": "https://www.google.com"},
{"name": "","likes": 30,"url": ""},
{"name": "Taobao","likes": 35,"url": "https://www.taobao.com"}
]
df = pd.DataFrame(data)
# Display the first two rows of data
print(df.head(2))
# Display the last row of data
print(df.tail(1))
The output of the above example is:
name likes url 0 Google 25 https://www.google.com1 30 name likes url 2 Taobao 35 https://www.taobao.com
* * *
## Data Cleaning
| Function | Description |
| --- | --- |
| df.dropna() | Remove rows or columns containing missing values; |
| df.fillna(value) | Replace missing values with a specified value; |
| df.replace(old_value, new_value) | Replace a specified value with a new value; |
| df.duplicated() | Check for duplicate data; |
| df.drop_duplicates() | Remove duplicate data. |
## Examples
# Remove rows or columns containing missing values
df.dropna()
# Replace missing values with a specified value
df.fillna(0)
# Replace a specified value with a new value
df.replace('old_value','new_value')
# Check for duplicate data
df.duplicated()
# Remove duplicate data
df.drop_duplicates()
* * *
## Data Selection and Slicing
| Function | Description |
| --- | --- |
| df | Select a specified column; |
| df.loc[row_index, column_name] | Select data by label; |
| df.iloc[row_index, column_index] | Select data by position; |
| df.ix[row_index, column_name] | Select data by label or position; |
| df.filter(items=[column_name1, column_name2]) | Select specified columns; |
| df.filter(regex='regex') | Select columns whose names match a regular expression; |
| df.sample(n) | Randomly select n rows of data. |
## Examples
# Select a specified column
df['column_name']
# Select data by label
df.loc[row_index, column_name]
# Select data by position
df.iloc[row_index, column_index]
# Select data by label or position
df.ix[row_index, column_name]
# Select specified columns
df.filter(items=['column_name1','column_name2'])
# Select columns whose names match a regular expression
df.filter(regex='regex')
# Randomly select n rows of data
df.sample(n=5)
* * *
## Sorting Data
| Function | Description |
| --- | --- |
| df.sort_values(column_name) | Sort by the values in a specified column; |
| df.sort_values([column_name1, column_name2], ascending=[True, False]) | Sort by multiple column values; |
| df.sort_index() | Sort by index. |
## Examples
# Sort by the values in a specified column
df.sort_values('column_name')
# Sort by multiple column values
df.sort_values(['column_name1','column_name2'], ascending=[True,False])
# Sort by index
df.sort_index()
* * *
## Grouping and Aggregation
| Function | Description |
| --- | --- |
| df.groupby(column_name) | Group by a specified column; |
| df.aggregate(function_name) | Perform aggregation operations on grouped data; |
| df.pivot_table(values, index, columns, aggfunc) | Generate a pivot table. |
## Examples
# Group by a specified column
df.groupby('column_name')
# Perform aggregation operations on grouped data
df.aggregate('function_name')
# Generate a pivot table
df.pivot_table(values='value', index='index_column', columns='column_name', aggfunc='function_name')
* * *
## Merging Data
| Function | Description |
| --- | --- |
| pd.concat([df1, df2]) | Concatenate multiple dataframes along rows or columns; |
| pd.merge(df1, df2, on=column_name) | Merge two dataframes based on a specified column. |
## Examples
# Concatenate multiple dataframes along rows or columns
df = pd.concat([df1, df2])
# Merge two dataframes based on a specified column
df = pd.merge(df1, df2, on='column_name')
* * *
## Data Selection and Filtering
| Function | Description |
| --- | --- |
| df.loc[row_indexer, column_indexer] | Select rows and columns by label. |
| df.iloc[row_indexer, column_indexer] | Select rows and columns by position. |
| df[df['column_name'] > value] | Select rows where the column meets a condition. |
| df.query('column_name > value') | Use a string expression to select rows where the column meets a condition. |
* * *
## Data Statistics and Description
| Function | Description |
| --- | --- |
| df.describe() | Calculate basic statistics such as mean, standard deviation, minimum, maximum, etc. |
| df.mean() | Calculate the mean of each column. |
| df.median() | Calculate the median of each column. |
| df.mode() | Calculate the mode of each column. |
| df.count() | Count the number of non-null values in each column. |
* * *
## Example
Suppose we have the following JSON data, saved in a file called data.json:
## data.json File
[
{
"name":"Alice",
"age":25,
"gender":"female",
"score":80
},
{
"name":"Bob",
"age":null,
"gender":"male",
"score":90
},
{
"name":"Charlie",
"age":30,
"gender":"male",
"score":null
},
{
"name":"David",
"age":35,
"gender":"male",
"score":70
}
]
We can use Pandas to read the JSON data and perform data cleaning and processing, data selection and filtering, data statistics and description, etc., as follows:
## Example
import pandas as pd
# Read JSON data
df = pd.read_json('data.json')
# Drop missing values
df = df.dropna()
# Fill missing values with specified values
df = df.fillna({'age': 0,'score': 0})
# Rename columns
df = df.rename(columns={'name': 'Name','age': 'Age','gender': 'Gender','score': 'Score'})
# Sort by score
df = df.sort_values(by='Score', ascending=False)
# Group by gender and calculate average age and score
grouped = df.groupby('Gender').agg({'Age': 'mean','Score': 'mean'})
# Select rows where score is greater than or equal to 90, keeping only name and score columns
df = df.loc[df['Score']>=90,['Name','Score']]
# Calculate basic statistics for each column
stats = df.describe()
# Calculate the mean of each column
mean = df.mean()
# Calculate the median of each column
median = df.median()
# Calculate the mode of each column
mode = df.mode()
# Count the number of non-null values in each column
count = df.count()
The output results are as follows:
# df Name Age Gender Score1 Bob 0 male 90# grouped Age ScoreGender female 25.000000 80 male 27.500000 80# stats Score count 1.0 mean 90.0 std NaN min 90.025% 90.050% 90.075% 90.0 max 90.0# meanScore 90.0 dtype: float64 # medianScore 90.0 dtype: float64 # mode Name Score0 Bob 90.0# countName 1Score 1 dtype: int64
YouTip