
Pandas Groupby Mean

Pandas groupby.mean() Function


groupby.mean() is an aggregate function in Pandas that computes averages after grouping data. It is used together with groupby(): first, the data is grouped by the values of a specific column, and then the arithmetic mean of the numeric columns within each group is calculated.


In data analysis, computing averages is a very common requirement, for instance calculating average salaries per department, average sales per region, or average scores per class. mean() handles such tasks in a single call.


Basic Syntax and Parameters


mean() is a member function of the GroupBy object and must be called after using groupby() to group the data.
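The call order can be sketched as follows. This is a minimal illustration on made-up data; the column names team and score are assumptions for the sketch and do not come from the examples in this article.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch
df = pd.DataFrame({"team": ["x", "x", "y"], "score": [1.0, 3.0, 5.0]})

# Step 1: groupby() returns a GroupBy object; no averages are computed yet
grouped = df.groupby("team")

# Step 2: mean() is called on the GroupBy object and performs the aggregation
team_means = grouped["score"].mean()
print(team_means)
```

Calling mean() directly on df instead would average the whole column without any grouping.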


Syntax

    GroupBy.mean(numeric_only=False, engine=None, engine_kwargs=None)

Parameter Description

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| numeric_only | bool | If True, only numeric columns are averaged; if False, attempts to average all columns. | False |
| engine | str | Specifies the computation engine: 'cython' or 'numba'. None lets Pandas choose automatically. | None |
| engine_kwargs | dict | Dictionary of additional arguments passed to the underlying engine. | None |
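The effect of numeric_only can be sketched on a small frame that mixes a string column with a numeric one. The data and column names here are made up for illustration. The behavior with numeric_only=False depends on the pandas version: recent releases are expected to raise an error for the string column, so the sketch catches the exception rather than assuming it.

```python
import pandas as pd

# Illustrative data; "label" is a non-numeric column
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "label": ["p", "q", "r"],
    "value": [1.0, 2.0, 6.0],
})

# numeric_only=True: the string column "label" is left out of the result
numeric_means = df.groupby("group").mean(numeric_only=True)
print(numeric_means)

# numeric_only=False (the default in the signature above): pandas attempts
# to average "label" as well, which recent versions reject with an error
try:
    df.groupby("group").mean()
    raised = False
except Exception as exc:  # typically a TypeError
    raised = True
    print(f"mean() without numeric_only failed: {type(exc).__name__}")
```

Passing numeric_only=True, or selecting the numeric columns before calling mean(), avoids the problem either way.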

Return Value

  • Return Type: Series or DataFrame
  • Description: Returns the group-wise averages. If applied to a single column, returns a Series; if applied to multiple columns, returns a DataFrame.
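A short sketch of how the column selection determines the return type, using made-up columns g, x, and y. Note that selecting with a list, even a one-element list, yields a DataFrame.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch
df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

single = df.groupby("g")["x"].mean()            # one column label -> Series
multiple = df.groupby("g")[["x", "y"]].mean()   # list of labels -> DataFrame
one_col_frame = df.groupby("g")[["x"]].mean()   # one-element list -> DataFrame

print(type(single).__name__)
print(type(multiple).__name__)
print(type(one_col_frame).__name__)
```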

Examples


Let’s master the usage of groupby.mean() through a series of examples, from simple to complex.


Example 1: Basic Usage - Compute Averages After Grouping by a Single Column


The most basic usage is grouping by one column and computing the average of another column.

    import pandas as pd

    # Create a DataFrame of student scores
    # Contains: student name, class, and Chinese, Math, English scores
    data = {
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Sun Qi', 'Zhou Ba', 'Wu Jiu', 'Zheng Shi'],
        'Class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
        'Chinese': [85, 92, 78, 88, 95, 82, 90, 87],
        'Math': [90, 85, 92, 78, 88, 91, 85, 89],
        'English': [88, 90, 85, 92, 87, 89, 91, 86]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    print("Student grade data:")
    print(df)
    print()

    # Group by "Class", compute average scores per class
    avg_by_class = df.groupby('Class').mean(numeric_only=True)

    print("Average Score per Class:")
    print(avg_by_class)
    print()

    # Alternatively, compute the average for a specific column only
    math_avg_by_class = df.groupby('Class')['Math'].mean()

    print("Average Math Score per Class:")
    print(math_avg_by_class)

Expected Output:

    Student grade data:
            Name  Class  Chinese  Math  English
    0  Zhang San      A       85    90       88
    1      Li Si      A       92    85       90
    2    Wang Wu      A       78    92       85
    3   Zhao Liu      B       88    78       92
    4     Sun Qi      B       95    88       87
    5    Zhou Ba      B       82    91       89
    6     Wu Jiu      B       90    85       91
    7  Zheng Shi      A       87    89       86

    Average Score per Class:
           Chinese  Math  English
    Class
    A        85.50  89.0    87.25
    B        88.75  85.5    89.75

    Average Math Score per Class:
    Class
    A    89.0
    B    85.5
    Name: Math, dtype: float64

Code Explanation:

  1. df.groupby('Class') groups students into two groups, A and B, based on the "Class" column.
  2. .mean(numeric_only=True) computes the average of every numeric column (Chinese, Math, English); the non-numeric Name column is excluded.
  3. In the result, the class serves as the index, and the subject averages are the columns.

Example 2: Compute Averages After Grouping by Multiple Columns


You can group by multiple columns simultaneously and compute averages for numeric columns.

    import pandas as pd

    # Create sales data
    data = {
        'Region': ['North China', 'East', 'South China', 'North China', 'East', 'South China', 'North China', 'East'],
        'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],
        'SalesAmount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],
        'Profit': [200, 400, 300, 360, 440, 320, 240, 420]
    }

    df = pd.DataFrame(data)

    print("Sales data:")
    print(df)
    print()

    # Group by "Region" and "Product", compute average sales and profit
    avg_grouped = df.groupby(['Region', 'Product'], as_index=False).mean(numeric_only=True)

    print("Mean SalesAmount and Profit after grouping by Region and Product:")
    print(avg_grouped)
    print()

    # Keep the multi-level index format
    avg_indexed = df.groupby(['Region', 'Product']).mean(numeric_only=True)

    print("Result in MultiIndex format:")
    print(avg_indexed)

Expected Output (groups are sorted by their keys, so East comes first):

    Sales data:
            Region  Product  SalesAmount  Profit
    0  North China        A         1000     200
    1         East        B         2000     400
    2  South China        C         1500     300
    3  North China        B         1800     360
    4         East        A         2200     440
    5  South China        C         1600     320
    6  North China        A         1200     240
    7         East        B         2100     420

    Mean SalesAmount and Profit after grouping by Region and Product:
            Region  Product  SalesAmount  Profit
    0         East        A       2200.0   440.0
    1         East        B       2050.0   410.0
    2  North China        A       1100.0   220.0
    3  North China        B       1800.0   360.0
    4  South China        C       1550.0   310.0

    Result in MultiIndex format:
                         SalesAmount  Profit
    Region      Product
    East        A             2200.0   440.0
                B             2050.0   410.0
    North China A             1100.0   220.0
                B             1800.0   360.0
    South China C             1550.0   310.0

Code Explanation:

  • Passing the list ['Region', 'Product'] to groupby() groups by multiple columns.
  • With as_index=False, the result is a DataFrame with the grouping columns retained as regular columns.
  • Without as_index=False, the grouping columns become a MultiIndex, which is more compact and convenient for selecting groups by key in later analysis.
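To illustrate why the MultiIndex form can be convenient for follow-up analysis, here is a minimal sketch on a reduced, made-up version of the sales data (the region name "North" and the values are illustrative). It selects groups by key with .loc and converts the result back to regular columns with reset_index().

```python
import pandas as pd

# Reduced, illustrative sales data
df = pd.DataFrame({
    "Region": ["East", "East", "North", "North"],
    "Product": ["A", "B", "A", "A"],
    "SalesAmount": [2200, 2000, 1000, 1200],
})

indexed = df.groupby(["Region", "Product"]).mean(numeric_only=True)

# Select one group by its (Region, Product) key
north_a = indexed.loc[("North", "A"), "SalesAmount"]
print(north_a)

# Select every product of one region through the first index level
east_rows = indexed.loc["East"]
print(east_rows)

# reset_index() turns the index levels back into ordinary columns,
# giving the same layout as as_index=False
flat = indexed.reset_index()
print(flat)
```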

Example 3: Computing Averages with Missing Values


When data contains missing values (NaN), mean() automatically ignores them during computation.

    import pandas as pd
    import numpy as np

    # Create employee salary data with missing values
    data = {
        'Department': ['Sales', 'Sales', 'Sales', 'Technology', 'Technology', 'Technology', 'Admin', 'Admin'],
        'Salary': [5000, 6000, np.nan, 8000, 9000, np.nan, 4500, np.nan]
    }

    df = pd.DataFrame(data)

    print("Employee Salary Data (including missing values):")
    print(df)
    print()

    # By default, mean() ignores NaN values
    avg_with_nan = df.groupby('Department')['Salary'].mean()

    print("Default mean calculation (ignoring NaN):")
    print(avg_with_nan)
    print()

    # To treat NaN as 0, fill missing values first
    avg_filled = df.groupby('Department')['Salary'].apply(lambda x: x.fillna(0).mean())

    print("Mean salary after treating NaN as 0:")
    print(avg_filled)
    print()

    # Note: the signature shown above has no skipna parameter;
    # to count missing values as 0, fill them with fillna() first

Expected Output:

    Employee Salary Data (including missing values):
       Department  Salary
    0       Sales  5000.0
    1       Sales  6000.0
    2       Sales     NaN
    3  Technology  8000.0
    4  Technology  9000.0
    5  Technology     NaN
    6       Admin  4500.0
    7       Admin     NaN

    Default mean calculation (ignoring NaN):
    Department
    Admin         4500.0
    Sales         5500.0
    Technology    8500.0
    Name: Salary, dtype: float64

    Mean salary after treating NaN as 0:
    Department
    Admin         2250.000000
    Sales         3666.666667
    Technology    5666.666667
    Name: Salary, dtype: float64

Code Explanation:

  • By default, mean() ignores NaN values in calculations.
  • The Sales department has two valid values (5000, 6000), so the average is 5500.
  • To treat NaN as 0 before computing the average, use fillna(0) to fill missing values first.
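The arithmetic behind the two results can be made explicit. In this sketch, which uses only the Sales rows from the example, count() counts the non-missing values while size() counts all rows, so dividing the sum by each gives the NaN-skipping mean and the NaN-as-zero mean respectively.

```python
import pandas as pd
import numpy as np

# The three Sales rows from the example, one of them missing
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales"],
    "Salary": [5000, 6000, np.nan],
})
grouped = df.groupby("Department")["Salary"]

# sum() skips NaN, and count() counts only non-missing values: 11000 / 2
mean_skipping_nan = grouped.sum() / grouped.count()

# size() counts every row, including the NaN one: 11000 / 3
mean_nan_as_zero = grouped.sum() / grouped.size()

print(mean_skipping_nan)
print(mean_nan_as_zero)
print(grouped.mean())  # matches mean_skipping_nan
```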

Example 4: Combining with transform to Compute Within-Group Proportions


The transform method broadcasts group-wise averages back to each row of the original data, which is useful for computing within-group proportions or deviations from group averages.

    import pandas as pd

    # Create student score data
    data = {
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Sun Qi', 'Zhou Ba'],
        'Class': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Chinese': [85, 92, 78, 88, 95, 82],
        'Math': [90, 85, 92, 78, 88, 91]
    }

    df = pd.DataFrame(data)

    print("Student grade data:")
    print(df)
    print()

    # Compute the class-level Math average, broadcast to each row
    df['ClassMathAvg'] = df.groupby('Class')['Math'].transform('mean')

    # Compute the deviation of each student's score from the class average
    df['DiffFromAvg'] = df['Math'] - df['ClassMathAvg']

    print("Data after adding the class average and the difference:")
    print(df)
    print()

    # Compute each student's Math score as a percentage of the class average
    df['PctOfClassAvg'] = (df['Math'] / df['ClassMathAvg'] * 100).round(2)

    print("Data after adding percentages:")
    print(df)

Expected Output:

    Student grade data:
            Name  Class  Chinese  Math
    0  Zhang San      A       85    90
    1      Li Si      A       92    85
    2    Wang Wu      A       78    92
    3   Zhao Liu      B       88    78
    4     Sun Qi      B       95    88
    5    Zhou Ba      B       82    91

    Data after adding the class average and the difference:
            Name  Class  Chinese  Math  ClassMathAvg  DiffFromAvg
    0  Zhang San      A       85    90     89.000000     1.000000
    1      Li Si      A       92    85     89.000000    -4.000000
    2    Wang Wu      A       78    92     89.000000     3.000000
    3   Zhao Liu      B       88    78     85.666667    -7.666667
    4     Sun Qi      B       95    88     85.666667     2.333333
    5    Zhou Ba      B       82    91     85.666667     5.333333

    Data after adding percentages:
            Name  Class  Chinese  Math  ClassMathAvg  DiffFromAvg  PctOfClassAvg
    0  Zhang San      A       85    90     89.000000     1.000000         101.12
    1      Li Si      A       92    85     89.000000    -4.000000          95.51
    2    Wang Wu      A       78    92     89.000000     3.000000         103.37
    3   Zhao Liu      B       88    78     85.666667    -7.666667          91.05
    4     Sun Qi      B       95    88     85.666667     2.333333         102.72
    5    Zhou Ba      B       82    91     85.666667     5.333333         106.23

Code Explanation:

  • transform('mean') computes the average for each group and broadcasts it back to every row in the original DataFrame.
  • Each row thus carries its class's average score, which makes comparison easy.
  • This approach is useful whenever the analysis concerns an individual's position within a group.
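The difference between mean() and transform('mean') can be sketched on a tiny made-up frame: mean() returns one value per group, whereas transform('mean') returns one value per original row. Mapping the per-group result back through the grouping column is one way to check that both carry the same numbers.

```python
import pandas as pd

# Illustrative data, smaller than the example above
df = pd.DataFrame({"Class": ["A", "A", "B"], "Math": [90, 80, 70]})

per_group = df.groupby("Class")["Math"].mean()            # one value per class
per_row = df.groupby("Class")["Math"].transform("mean")   # one value per row

print(len(per_group), len(per_row))

# Looking up each row's class in the per-group result reproduces transform
mapped = df["Class"].map(per_group)
print((mapped == per_row).all())
```

Because per_row shares the index of df, it can be assigned directly as a new column, which is what the example above relies on.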

Notes: By default, the mean() function ignores NaN values. If all values in a group are missing, it returns NaN. Unlike sum(), mean() does not have a min_count parameter.
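These notes can be checked with a short sketch on made-up data in which one group contains only missing values. It also contrasts mean() with sum(), whose min_count parameter controls what an all-NaN group returns.

```python
import pandas as pd
import numpy as np

# Illustrative data; group "b" has only missing values
df = pd.DataFrame({
    "g": ["a", "a", "b"],
    "v": [1.0, 3.0, np.nan],
})
grouped = df.groupby("g")["v"]

# mean() skips NaN; a group with no valid values yields NaN
means = grouped.mean()
print(means)

# sum() returns 0.0 for the all-NaN group by default;
# min_count=1 requires at least one valid value and yields NaN instead
default_sum = grouped.sum()
strict_sum = grouped.sum(min_count=1)
print(default_sum)
print(strict_sum)
```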
