# Pandas DataFrame groupby()
* * *
`groupby()` is one of the most powerful grouping operations in Pandas. It allows you to split data into different groups based on the values of one or more columns, and then perform various operations on each group.
Simply put, `groupby()` implements the "Split-Apply-Combine" workflow: first split the data by conditions, apply the corresponding function to each group, and finally merge the results together.
This is very common in data analysis, such as calculating average employee salary by department, monthly sales statistics, user count by region, etc.
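The Split-Apply-Combine workflow described above can be sketched by hand and compared with what `groupby()` returns. This is only an illustration: the `dept` and `salary` columns and their values are made up for the sketch.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "dept": ["HR", "IT", "HR", "IT", "Ops"],
    "salary": [5000, 8000, 6000, 9000, 7000],
})

# Split: one sub-DataFrame per distinct value of "dept"
pieces = {key: df[df["dept"] == key] for key in df["dept"].unique()}

# Apply: compute the mean salary inside each piece
means = {key: part["salary"].mean() for key, part in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby() performs the same three steps in a single expression
auto = df.groupby("dept")["salary"].mean()

# Both approaches should yield the same per-department means
print(manual.to_dict() == auto.to_dict())
```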
* * *
## Basic Syntax and Parameters
`groupby()` is a DataFrame method. Calling it returns a `GroupBy` object that records how the rows are grouped but does not display any result by itself; you get results by applying an aggregation function (or another group operation) to it.
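As a small sketch of what the returned object looks like before any aggregation, the following inspects it with `ngroups`, `groups`, and `get_group()`. The `Region` and `Sales` values here are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Region": ["North", "East", "North", "East"],
    "Sales": [100, 200, 300, 400],
})

gb = df.groupby("Region")          # a grouping object, not a result table
print(type(gb).__name__)           # class name of the grouping object
print(gb.ngroups)                  # number of distinct groups
print(gb.groups)                   # mapping: group key -> row labels
east_rows = gb.get_group("East")   # the rows that belong to one group
print(east_rows)

# Only an aggregation produces the per-group numbers
totals = gb["Sales"].sum()
print(totals)
```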
### Syntax Format
```python
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
```

Older releases also accepted `squeeze=False`; that parameter was removed in pandas 2.0.
### Parameter Description
| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| by | str, list, dict, or function | Column name or list of column names used for grouping. If a dict or function is passed, it is applied to the index labels and rows are grouped by the results. | None |
| axis | int | Axis to split along: 0 groups rows (default), 1 groups columns. Deprecated since pandas 2.1. | 0 |
| level | int or str | If it's a MultiIndex, group by the specified level. | None |
| as_index | bool | If True, the grouping column will be used as the index of the returned result; if False, the grouping column will be retained as a regular column. | True |
| sort | bool | Whether to sort the grouping labels. Setting to False can improve performance. | True |
| group_keys | bool | When calling `apply()`, whether to add grouping keys as index in the result. | True |
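| observed | bool | Only relevant when a grouping key is categorical. If True, only categories that actually occur in the data appear in the result; if False, every declared category appears. From pandas 2.1 on, relying on the default triggers a warning that it will change to True. | False |
| dropna | bool | If True, rows whose group key is NA/null are left out of the result; if False, NA is kept as its own group. | True |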
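The effect of `sort`, `dropna`, and `observed` is easiest to see side by side. The sketch below uses made-up `team`/`score` and `size`/`qty` data, and passes `observed` explicitly so it does not depend on the default of the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing group key
df = pd.DataFrame({
    "team": ["b", "a", np.nan, "b", np.nan],
    "score": [1, 2, 3, 4, 5],
})

# Default: keys are sorted and rows with a NaN key are excluded
default = df.groupby("team")["score"].sum()

# sort=False keeps groups in order of first appearance
unsorted = df.groupby("team", sort=False)["score"].sum()

# dropna=False keeps NaN as its own group
with_na = df.groupby("team", dropna=False)["score"].sum()

# Hypothetical categorical data: "L" is declared but never occurs
sizes = pd.DataFrame({
    "size": pd.Categorical(["S", "M", "S"], categories=["S", "M", "L"]),
    "qty": [1, 2, 3],
})

# observed=False lists every declared category, including the empty "L"
all_cats = sizes.groupby("size", observed=False)["qty"].sum()

# observed=True lists only the categories present in the data
seen_only = sizes.groupby("size", observed=True)["qty"].sum()

for result in (default, unsorted, with_na, all_cats, seen_only):
    print(result, end="\n\n")
```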
### Return Value
* **Return Type**: `DataFrameGroupBy` or `SeriesGroupBy` object
* **Description**: Returns a grouping object, not the final result. You need to call aggregation functions (such as `sum()`, `mean()`, `count()`, etc.) to get the specific calculation results.
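To illustrate how a grouping object turns into results, the sketch below applies single aggregation functions and then several at once through `agg()` with named aggregation. The `dept` and `salary` data and the output column names (`total`, `average`, `headcount`) are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "dept": ["HR", "IT", "HR", "IT"],
    "salary": [5000, 8000, 6000, 9000],
})

gb = df.groupby("dept")["salary"]   # SeriesGroupBy: one column selected

# Each aggregation call returns a Series indexed by group key
total = gb.sum()
average = gb.mean()
headcount = gb.count()

# Several aggregations in one call; keyword names become column names
summary = df.groupby("dept").agg(
    total=("salary", "sum"),
    average=("salary", "mean"),
    headcount=("salary", "count"),
)
print(summary)
```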
* * *
## Examples
Let's thoroughly master the usage of `groupby()` through a series of examples from simple to complex.
### Example 1: Group by Single Column
The most basic usage is to group by the values of a certain column. Suppose we have a sales data table and need to calculate total sales by region.
```python
import pandas as pd

# Create a simple sales data DataFrame
# Simulate a table containing region, product, and sales amount
data = {
    'Region': ['North China', 'East China', 'South China', 'North China',
               'East China', 'South China', 'North China', 'East China'],
    'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],
    'Sales': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100]
}

# Create DataFrame
df = pd.DataFrame(data)
print("Original data:")
print(df)
print()

# Group by "Region" column and calculate total sales for each region
# as_index=True means region is returned as index
grouped = df.groupby('Region', as_index=True)['Sales'].sum()
print("Total sales after grouping by region:")
print(grouped)
print()

# When as_index=False, the grouping column is retained as a regular column
grouped_df = df.groupby('Region', as_index=False)['Sales'].sum()
print("Result when as_index=False:")
print(grouped_df)
```
**Expected Output:**
```
Original data:
        Region Product  Sales
0  North China       A   1000
1   East China       B   2000
2  South China       C   1500
3  North China       B   1800
4   East China       A   2200
5  South China       C   1600
6  North China       A   1200
7   East China       B   2100

Total sales after grouping by region:
Region
East China     6300
North China    4000
South China    3100
Name: Sales, dtype: int64

Result when as_index=False:
        Region  Sales
0   East China   6300
1  North China   4000
2  South China   3100
```
**Code Analysis:**
1. `df.groupby('Region')` splits the data into three groups based on the values of the "Region" column: East China, South China, North China.
2. `['Sales'].sum()` means only perform sum aggregation on the "Sales" column.
3. When `as_index=True` (default), the returned Series uses region as index; when `as_index=False`, the returned DataFrame retains region as a regular column, which is more suitable for subsequent processing.
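The relationship between the two `as_index` settings in point 3 can also be checked directly: calling `reset_index()` on the indexed result should give the same table as `as_index=False`. This is a minimal sketch on a shortened, made-up version of the sales data.

```python
import pandas as pd

# Shortened hypothetical sales data
df = pd.DataFrame({
    "Region": ["North China", "East China", "South China", "North China"],
    "Sales": [1000, 2000, 1500, 1800],
})

# Default as_index=True: a Series with Region as its index
as_series = df.groupby("Region")["Sales"].sum()

# as_index=False: a DataFrame with Region as an ordinary column
flat = df.groupby("Region", as_index=False)["Sales"].sum()

# Moving the index back into a column gives the same layout
via_reset = as_series.reset_index()

print(flat.to_dict("list") == via_reset.to_dict("list"))
```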
### Example 2: Group by Multiple Columns
Sometimes you need to group by multiple columns simultaneously, such as calculating sales by region and product.
```python
import pandas as pd

# Create sales data
data = {
    'Region': ['North China', 'East China', 'South China', 'North China',
               'East China', 'South China', 'North China', 'East China'],
    'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],
    'Sales': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print()

# Group by "Region" and "Product" columns and calculate total sales
# Use a list to specify multiple grouping columns
grouped = df.groupby(['Region', 'Product'], as_index=False)['Sales'].sum()
print("Total sales after grouping by region and product:")
print(grouped)
print()

# Use pivot_table to display results more intuitively
pivot = df.pivot_table(values='Sales', index='Region', columns='Product',
                       aggfunc='sum', fill_value=0)
print("Display using pivot_table:")
print(pivot)
```
**Expected Output:**
```
Original data:
        Region Product  Sales
0  North China       A   1000
1   East China       B   2000
2  South China       C   1500
3  North China       B   1800
4   East China       A   2200
5  South China       C   1600
6  North China       A   1200
7   East China       B   2100

Total sales after grouping by region and product:
        Region Product  Sales
0   East China       A   2200
1   East China       B   4100
2  North China       A   2200
3  North China       B   1800
4  South China       C   3100

Display using pivot_table:
Product         A     B     C
Region
East China   2200  4100     0
North China  2200  1800     0
South China     0     0  3100
```
**Code Analysis:**
* Passing the list `['Region', 'Product']` groups by both columns at once. With the default `as_index=True`, the result is indexed by a two-level MultiIndex built from those columns.
* With `as_index=False`, as used here, the grouping columns stay as regular columns and the result has a plain integer index, which is easier to view and process further.
* `pivot_table()` provides similar cross-tabulation functionality, presenting grouping results in a more intuitive way.
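The points above can be tied together in a short sketch: grouping by two columns with the default `as_index=True` yields a MultiIndex, and unstacking the inner level produces the same cross-tabulated layout as `pivot_table()`. The data is a shortened, made-up version of the sales table.

```python
import pandas as pd

# Shortened hypothetical sales data
df = pd.DataFrame({
    "Region": ["North China", "East China", "South China",
               "North China", "East China"],
    "Product": ["A", "B", "C", "B", "A"],
    "Sales": [1000, 2000, 1500, 1800, 2200],
})

# Default as_index=True: the two grouping columns form a two-level MultiIndex
by_both = df.groupby(["Region", "Product"])["Sales"].sum()
print(by_both.index.nlevels)

# Move the inner level ("Product") into columns; missing combinations become 0
wide = by_both.unstack("Product", fill_value=0)

# The same layout produced directly by pivot_table()
pivot = df.pivot_table(values="Sales", index="Region", columns="Product",
                       aggfunc="sum", fill_value=0)

print(wide)
print(wide.to_dict() == pivot.to_dict())
```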
### Example 3: Grouping Using Dictionaries and Functions
`groupby()` not only supports grouping by column names, but also allows defining grouping rules through dictionaries or functions, which is very useful when you need custom grouping logic.
```python
import pandas as pd

# Create student score data
data = {
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Sun Qi', 'Zhou Ba'],
    'Chinese': [85, 92, 78, 88, 95, 82],
    'Math': [90, 85, 92, 78, 88, 91],
    'English': [88, 90, 85, 92, 87, 89]
}
df = pd.DataFrame(data)
print("Original student score data:")
print(df)
print()

# 1. Use dictionary for custom grouping
# Suppose we want to group by "surname" (Zhang, Li, Wang in one group, Zhao, Sun, Zhou in another)
def get_surname_group(name):
    """Return group name based on surname"""
    if name in ['Zhang San', 'Li Si', 'Wang Wu']:
        return 'Group 1'
    else:
        return 'Group 2'

# Use apply function for grouping
grouped = df.groupby(get
```
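The listing above breaks off before the grouping call, so the following is only a sketch of how function-based and dictionary-based grouping can work on this kind of data, not a reconstruction of the original code. It relies on the documented behavior that a function or dict passed to `groupby()` is applied to the index labels, which is why `Name` is set as the index first. The helper `name_to_group` and the choice to average the `Math` column are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Sun Qi", "Zhou Ba"],
    "Chinese": [85, 92, 78, 88, 95, 82],
    "Math": [90, 85, 92, 78, 88, 91],
})

def get_surname_group(name):
    """Return a group label for a student name."""
    return "Group 1" if name in ["Zhang San", "Li Si", "Wang Wu"] else "Group 2"

# A function passed to groupby() is called on each index label,
# so the names are moved into the index first (an assumption of this sketch).
by_func = df.set_index("Name").groupby(get_surname_group)["Math"].mean()

# A dict works the same way: index label -> group label.
# name_to_group is a hypothetical mapping built for illustration.
name_to_group = {name: get_surname_group(name) for name in df["Name"]}
by_dict = df.set_index("Name").groupby(name_to_group)["Math"].mean()

# Alternatively, build a Series of labels aligned with the rows
# and group by it without touching the index.
labels = df["Name"].map(get_surname_group)
by_series = df.groupby(labels)["Math"].mean()

print(by_func)
```

All three variants should produce the same two groups; which one to use mostly depends on whether the grouping key already lives in the index or in a column.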