Pandas Groupby
`groupby` is one of the most powerful features in Pandas, allowing you to group data by one or more columns and then perform aggregation, transformation, or filtering operations on each group.
* * *
## Basic Usage of groupby
### Single Column Grouping
## Example
import pandas as pd
# Create sample data
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations","Sales","Tech"],
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi","Sun Ba"],
"Salary": [12000,15000,11000,18000,14000,13000]
})
print("Original Data:")
print(df)
print()
# Group by department
grouped = df.groupby("Department")
print(f"Group object: {type(grouped)}")
print(f"Number of groups: {len(grouped)}")
print()
# View groups
print("Data for each group:")
for name, group in grouped:
print(f"n Department: {name}")
print(group)
### Grouping and Aggregation
## Example
import pandas as pd
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations","Sales"],
"Salary": [12000,15000,11000,18000,14000]
})
print("Average salary by department:")
print(df.groupby("Department").mean())
print()
print("Sum by department:")
print(df.groupby("Department").sum())
print()
print("Count by department:")
print(df.groupby("Department").size())
* * *
## Multiple Column Grouping
## Example
import pandas as pd
df = pd.DataFrame({
"Year": ["2023","2023","2024","2024"],
"Department": ["Tech","Sales","Tech","Sales"],
"Salary": [12000,15000,13000,16000]
})
print("Group by year and department:")
print(df.groupby(["Year","Department"]).sum())
print()
# Output as DataFrame
result = df.groupby(["Year","Department"]).agg({
"Salary": "sum"
}).reset_index()
print("Result DataFrame:")
print(result)
* * *
## Common Aggregation Functions
### Single Aggregation
| Function | Description |
| --- | --- |
| `sum()` | Sum |
| `mean()` | Mean |
| `median()` | Median |
| `std()` | Standard deviation |
| `min() / max()` | Minimum / Maximum |
| `count()` | Count |
| `first() / last()` | First / Last value |
## Example
import pandas as pd
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations"],
"Salary": [12000,15000,11000,18000]
})
# Chain multiple aggregations
print("Chained aggregation:")
print(df.groupby("Department").agg(["sum","mean","max","min"]))
### Custom Aggregation
## Example
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations"],
"Salary": [12000,15000,11000,18000]
})
# Custom aggregation function
def range_func(x):
return x.max() - x.min()
print("Using custom function:")
print(df.groupby("Department").agg(range_func))
print()
# Named aggregation (recommended)
print("Named aggregation:")
print(df.groupby("Department").agg(
Min_Salary=("Salary","min"),
Max_Salary=("Salary","max"),
Avg_Salary=("Salary","mean")
))
* * *
## transform
transform allows you to compute values for each group while maintaining the original data shape and returning results with the same shape as the original data.
## Example
import pandas as pd
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations","Sales"],
"Salary": [12000,15000,11000,18000,14000]
})
print("Original data:")
print(df)
print()
# Calculate average salary per department and broadcast to each row
df= df.groupby("Department").transform("mean")
print("Added department average salary:")
print(df)
print()
# Calculate salary proportion per person
df= df / df
print("Salary proportion:")
print(df)
* * *
## filter
filter can be used to filter data based on group conditions.
## Example
import pandas as pd
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations","Sales","Tech"],
"Salary": [12000,15000,11000,18000,14000,13000]
})
print("Original data:")
print(df)
print()
# Filter departments with average salary greater than 13000
print("Average salary > 13000:")
filtered = df.groupby("Department").filter(lambda x: x.mean()>13000)
print(filtered)
print()
# Filter groups with more than 2 members
print("Member count > 2:")
filtered2 = df.groupby("Department").filter(lambda x: len(x)>2)
print(filtered2)
* * *
## apply
apply allows applying a custom function to each group.
## Example
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Department": ["Tech","Sales","Tech","Operations"],
"Salary": [12000,15000,11000,18000]
})
# Apply custom function to each group
result = df.groupby("Department").apply(
lambda x: pd.Series({
"Total": x.sum(),
"Mean": x.mean()
})
)
print("Custom aggregation result:")
print(result)
* * *
## Practical Example: Sales Data Analysis
## Example
import pandas as pd
# Simulate sales data
sales = pd.DataFrame({
"Date": pd.date_range("2024-01-01", periods=30, freq="D"),
"Product": ["Phone","Computer","Tablet"] * 10,
"Channel": * 15 + * 15,
"Sales": [1000,2000,1500] * 10
})
print("=== Sales Data Analysis ===n")
# 1. Summary by product
print("1. Summary by product:")
product_summary = sales.groupby("Product").agg(["sum","mean","count"])
print(product_summary)
print()
# 2. Summary by channel
print("2. Summary by channel:")
channel_summary = sales.groupby("Channel").sum()
print(channel_summary)
print()
# 3. Cross-tabulation by product and channel
print("3. Cross-tabulation by product and channel:")
cross_summary = sales.groupby(["Product","Channel"]).sum().unstack()
print(cross_summary)
print()
# 4. Calculate percentage of sales by product
print("4. Sales percentage:")
total = sales.sum()
product_pct = sales.groupby("Product").sum() / total * 100
print(product_pct.round(2))
> groupby is a core operation in data analysis; mastering it enables efficient completion of various statistical analysis tasks.
YouTip