
Pandas groupby.agg() Function


groupby.agg() is one of the most powerful aggregation functions in Pandas, allowing you to apply multiple different aggregation functions simultaneously to grouped data.

Unlike single-purpose aggregation methods such as sum() or mean(), agg() can compute several statistics in one call, for example the sum, average, maximum, and minimum of each group.

In data analysis, it's often necessary to perform multiple statistical analyses on data simultaneously. The agg() function makes this process concise and efficient.


Basic Syntax and Parameters


agg() is a member function of the GroupBy object, and must be called after using groupby() to group the data.

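As a minimal sketch of that call order (the column names team and score are made up for illustration), groupby() first returns a GroupBy object, and agg() is then called on it. Passing a single function name to agg() should give the same result as calling the matching method directly:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b'],
    'score': [10, 20, 30, 50],
})

grouped = df.groupby('team')            # a GroupBy object; no statistics yet
via_agg = grouped['score'].agg('sum')   # aggregate through agg()
via_method = grouped['score'].sum()     # the equivalent single-function call

print(type(grouped).__name__)
print(via_agg.equals(via_method))
```

The advantage of agg() only shows once more than one function is needed, as in the examples later in this article.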

Syntax Format

GroupBy.agg(func=None, *args, engine=None, engine_kwargs=None, **kwargs)

Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| func | str, list, dict or callable | The aggregation function. Can be a single function or function name, a list of functions, or a dictionary specifying different functions for each column. | None |
| *args | tuple | Additional positional arguments passed to the aggregation function. | () |
| engine | str | Specifies the computation engine, either 'cython' or 'numba'. None means Pandas uses its default engine. | None |
| engine_kwargs | dict | A dictionary of additional arguments passed to the underlying engine. | None |
| **kwargs | dict | Keyword arguments passed to the aggregation function. When func is None, they define named aggregations instead. | {} |
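The sketch below illustrates how extra arguments reach a custom aggregation function. The function count_above and the columns team and score are hypothetical names chosen for this illustration; the assumption is that func is a single callable applied to one column.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'team': ['a', 'a', 'a', 'b', 'b'],
    'score': [10, 20, 30, 5, 25],
})

def count_above(s, threshold):
    """Count the values in one group that exceed `threshold`."""
    return int((s > threshold).sum())

# 15 is forwarded to count_above as a positional argument (*args)
by_position = df.groupby('team')['score'].agg(count_above, 15)

# The same call with the argument forwarded by keyword (**kwargs)
by_keyword = df.groupby('team')['score'].agg(count_above, threshold=15)

print(by_position)
```

Both calls should produce the same Series, with one count per team.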

Return Value

  • Return Type: Series or DataFrame
  • Description: Returns the result of the grouped aggregation. The structure of the result depends on the func parameter and on how the data was grouped.
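To make the dependence on func concrete, here is a small sketch with made-up columns (team, score, bonus). It prints the type and shape of the result for three common call patterns:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'team': ['a', 'a', 'b'],
    'score': [10, 20, 30],
    'bonus': [1, 2, 3],
})

one_col_one_func = df.groupby('team')['score'].agg('mean')             # one column, one function
one_col_many_funcs = df.groupby('team')['score'].agg(['mean', 'max'])  # one column, several functions
many_cols = df.groupby('team').agg('sum')                              # all columns, one function

for result in (one_col_one_func, one_col_many_funcs, many_cols):
    print(type(result).__name__, result.shape)
```

In this sketch only the first call yields a Series; the other two yield DataFrames.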

Examples

The examples below progress from simple to more complex uses of groupby.agg().

Example 1: Using Built-in Aggregation Functions

The simplest approach is to pass the names of Pandas' built-in aggregation functions as strings, such as 'sum', 'mean', 'max', 'min', and 'count'.

Example

import pandas as pd

# Create a sales DataFrame
data = {
    'Region': ['North China', 'East China', 'South China', 'North China',
               'East China', 'South China', 'North China', 'East China'],
    'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],
    'SalesAmount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],
    'Quantity': [10, 20, 15, 18, 22, 16, 12, 21]
}
df = pd.DataFrame(data)

print("Original sales data:")
print(df)
print()

# Group by "Region" and apply several aggregation functions to "SalesAmount",
# passing the function names as a list of strings
result = df.groupby('Region')['SalesAmount'].agg(['sum', 'mean', 'max', 'min', 'count'])

print("Summary statistics of sales amount by region:")
print(result)
print()

# Aggregate both the "SalesAmount" and "Quantity" columns
result_all = df.groupby('Region').agg({
    'SalesAmount': ['sum', 'mean'],
    'Quantity': ['sum', 'mean']
})

print("Sales amount and quantity statistics by region:")
print(result_all)

Output:

Original sales data:
        Region Product  SalesAmount  Quantity
0  North China       A         1000        10
1   East China       B         2000        20
2  South China       C         1500        15
3  North China       B         1800        18
4   East China       A         2200        22
5  South China       C         1600        16
6  North China       A         1200        12
7   East China       B         2100        21

Summary statistics of sales amount by region:
              sum         mean   max   min  count
Region
East China   6300  2100.000000  2200  2000      3
North China  4000  1333.333333  1800  1000      3
South China  3100  1550.000000  1600  1500      2

Sales amount and quantity statistics by region:
            SalesAmount              Quantity
                    sum         mean      sum       mean
Region
East China         6300  2100.000000       63  21.000000
North China        4000  1333.333333       40  13.333333
South China        3100  1550.000000       31  15.500000

Code Explanation:

  1. Passing a list such as ['sum', 'mean', 'max', 'min', 'count'] applies several aggregation functions at once.
  2. When a single column is aggregated with a list, the result is a DataFrame with one column per function. When a dictionary maps columns to lists of functions, the result has a multi-level column index made up of the original column names and the function names.
  3. Using a dictionary allows specifying different aggregation functions for different columns.
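A multi-level column index changes how individual statistics are selected. The following sketch uses a smaller, made-up version of the sales data to show one way to pick values out of such a result; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['North', 'East', 'North', 'East'],
    'SalesAmount': [1000, 2000, 1800, 2200],
    'Quantity': [10, 20, 18, 22],
})

summary = df.groupby('Region').agg({
    'SalesAmount': ['sum', 'mean'],
    'Quantity': ['sum'],
})

print(summary.columns.tolist())               # list of (column, function) tuples
sales_sum = summary[('SalesAmount', 'sum')]   # one statistic, as a Series
sales_stats = summary['SalesAmount']          # all statistics for one column
print(sales_sum)
print(sales_stats)
```

Selecting with a (column, function) tuple returns a single Series, while selecting by the top-level name alone returns a DataFrame of that column's statistics.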

Example 2: Using Custom Aggregation Functions

Besides built-in functions, agg() also accepts custom functions, which lets you compute statistics that Pandas does not provide directly.

Example

import pandas as pd

# Create student score data
data = {
    'Class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu',
             'Sun Qi', 'Zhou Ba', 'Wu Jiu', 'Zheng Shi'],
    'Chinese': [85, 92, 78, 88, 95, 82, 90, 87],
    'Math': [90, 85, 92, 78, 88, 91, 85, 89]
}
df = pd.DataFrame(data)

print("Student grade data:")
print(df)
print()

# Define custom aggregation functions
def range_func(x):
    """Calculate the difference between max and min values (range)"""
    return x.max() - x.min()

def coefficient_of_variation(x):
    """Calculate coefficient of variation (standard deviation / mean), in percent"""
    return x.std() / x.mean() * 100

# Built-in functions are passed by name as strings;
# custom functions are passed as function objects
result = df.groupby('Class').agg({
    'Chinese': ['sum', 'mean', range_func],  # Chinese: sum, mean, range
    'Math': ['sum', 'mean', coefficient_of_variation]  # Math: sum, mean, coefficient of variation
})

print("Custom statistics for each class:")
print(result)
print()

# Custom functions can also be mixed into a list directly
result2 = df.groupby('Class')['Math'].agg(['mean', range_func, 'std'])

print("Math score statistics for each class (including custom range and standard deviation):")
print(result2)

Output:

Student grade data:
  Class       Name  Chinese  Math
0     A  Zhang San       85    90
1     A      Li Si       92    85
2     A    Wang Wu       78    92
3     B   Zhao Liu       88    78
4     B     Sun Qi       95    88
5     B    Zhou Ba       82    91
6     B     Wu Jiu       90    85
7     A  Zheng Shi       87    89

Custom statistics for each class:
      Chinese                   Math
          sum   mean range_func  sum  mean coefficient_of_variation
Class
A         342  85.50         14  356  89.0                 3.307776
B         355  88.75         13  342  85.5                 6.512005

Math score statistics for each class (including custom range and standard deviation):
       mean  range_func       std
Class
A      89.0           7  2.943920
B      85.5          13  5.567764

Code Explanation:

  • Custom functions take a Series as input and return a scalar value.
  • Built-in functions can be passed by name as strings (e.g., 'sum'); custom functions are passed directly as function objects.
  • Custom functions make it possible to express aggregation logic that the built-in functions do not cover.
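The Series-in, scalar-out contract can be checked with a small sketch. The function top_two_mean and the list seen are hypothetical helpers written for this illustration; seen simply records the type and length of what agg() passes in.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'B'],
    'Math': [90, 85, 78, 88, 91],
})

seen = []  # records what agg() hands to the custom function

def top_two_mean(scores):
    """Mean of the two highest scores in one group."""
    seen.append((type(scores).__name__, len(scores)))
    return scores.nlargest(2).mean()

result = df.groupby('Class')['Math'].agg(top_two_mean)
print(result)
print(seen)
```

Each recorded entry should name the Series type, one group at a time, and the result holds one scalar per class.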

Example 3: Using String Aliases and Lambda Functions

agg() accepts functions in several forms, including built-in string aliases and lambda expressions, and it lets you name the resulting columns.

Example

import pandas as pd

# Create employee salary data
data = {
    'Department': ['Sales', 'Sales', 'Technology', 'Technology',
                   'Admin', 'Admin', 'Sales', 'Technology'],
    'Position': ['Specialist', 'Manager', 'Specialist', 'Manager',
                 'Specialist', 'Manager', 'Specialist', 'Manager'],
    'Salary': [5000, 8000, 7000, 12000, 4500, 9000, 5500, 11000],
    'Tenure': [2, 5, 3, 8, 1, 6, 2, 7]
}
df = pd.DataFrame(data)

print("Employee salary data:")
print(df)
print()

# Use (alias, function) tuples to name each result column,
# and a lambda to compute the salary range for each department
result = df.groupby('Department').agg({
    'Salary': [
        ('AverageSalary', 'mean'),
        ('MaxSalary', 'max'),
        ('MinSalary', 'min'),
        ('SalaryRange', lambda x: x.max() - x.min())
    ],
    'Tenure': [
        ('AverageTenure', 'mean'),
        ('MaxTenure', 'max'),
        ('MinTenure', 'min')
    ]
})

print("Salary and tenure statistics for each department:")
print(result)
print()

# Use keyword arguments to name the result columns
result2 = df.groupby('Department')['Salary'].agg(
    Average='mean',
    Sum='sum',
    Count='count'
)

print("Aggregation results using named parameter form:")
print(result2)

Output:

Employee salary data:
   Department    Position  Salary  Tenure
0       Sales  Specialist    5000       2
1       Sales     Manager    8000       5
2  Technology  Specialist    7000       3
3  Technology     Manager   12000       8
4       Admin  Specialist    4500       1
5       Admin     Manager    9000       6
6       Sales  Specialist    5500       2
7  Technology     Manager   11000       7

Salary and tenure statistics for each department:
                  Salary                                        Tenure
           AverageSalary MaxSalary MinSalary SalaryRange AverageTenure MaxTenure MinTenure
Department
Admin        6750.000000      9000      4500        4500           3.5         6         1
Sales        6166.666667      8000      5000        3000           3.0         5         2
Technology  10000.000000     12000      7000        5000           6.0         8         3

Aggregation results using named parameter form:
                 Average    Sum  Count
Department
Admin        6750.000000  13500      2
Sales        6166.666667  18500      3
Technology  10000.000000  30000      3

Code Explanation:

  • Using tuples ('alias', 'function') gives custom names to aggregated result columns.
  • Lambda expressions can be used directly inside agg() for simple custom logic.
  • Named parameters (keyword arguments) allow intuitive naming of result columns.
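Named parameters also work when aggregating several columns of a DataFrame at once. In that form each keyword maps to a (column, function) pair, which can be written as a plain tuple or as pd.NamedAgg. The sketch below uses made-up data and output names (avg_salary, max_tenure, salary_range) purely for illustration:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Admin', 'Admin'],
    'Salary': [5000, 8000, 4500, 9000],
    'Tenure': [2, 5, 1, 6],
})

summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),                            # (column, function) tuple
    max_tenure=pd.NamedAgg(column='Tenure', aggfunc='max'),   # same idea, spelled out
    salary_range=('Salary', lambda s: s.max() - s.min()),     # lambdas are allowed too
)
print(summary)
```

A practical benefit of this form is that the result has a single-level column index, so no flattening step is needed afterwards.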

Example 4: Grouping by Multiple Columns and Using agg

agg() also works with multi-column grouping, which covers more complex data analysis needs.

Example

import pandas as pd

# Create sales data
data = {
    'Region': ['North China', 'East China', 'South China', 'North China', 'East China',
               'South China', 'North China', 'East China', 'South China'],
    'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
    'SalesAmount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100, 1700],
    'Profit': [200, 400, 300, 360, 440, 320, 240, 420, 340]
}
df = pd.DataFrame(data)

print("Sales data:")
print(df)
print()

# Group by region and product, aggregate sales and profit with multiple functions
result = df.groupby(['Region', 'Product']).agg({
    'SalesAmount': ['sum', 'mean', 'count'],
    'Profit': ['sum', 'mean']
})

print("Aggregation results grouped by region and product:")
print(result)
print()

# Keep the grouping keys as regular columns with as_index=False
result_flat = df.groupby(['Region', 'Product'], as_index=False).agg({
    'SalesAmount': ['sum', 'mean'],
    'Profit': ['sum', 'mean']
})

# Flatten the multi-level column index
result_flat.columns = ['_'.join(col).strip('_') for col in result_flat.columns]

print("Flattened result:")
print(result_flat)

Output:

Sales data:
        Region Product  SalesAmount  Profit
0  North China       A         1000     200
1   East China       B         2000     400
2  South China       C         1500     300
3  North China       B         1800     360
4   East China       A         2200     440
5  South China       C         1600     320
6  North China       A         1200     240
7   East China       B         2100     420
8  South China       C         1700     340

Aggregation results grouped by region and product:
                    SalesAmount               Profit
                            sum    mean count    sum   mean
Region      Product
East China  A              2200  2200.0     1    440  440.0
            B              4100  2050.0     2    820  410.0
North China A              2200  1100.0     2    440  220.0
            B              1800  1800.0     1    360  360.0
South China C              4800  1600.0     3    960  320.0

Flattened result:
        Region Product  SalesAmount_sum  SalesAmount_mean  Profit_sum  Profit_mean
0   East China       A             2200            2200.0         440        440.0
1   East China       B             4100            2050.0         820        410.0
2  North China       A             2200            1100.0         440        220.0
3  North China       B             1800            1800.0         360        360.0
4  South China       C             4800            1600.0         960        320.0

Code Explanation:

  • Grouping by multiple columns produces a result whose row index is a MultiIndex.
  • Using as_index=False keeps the grouping columns as regular columns.
  • Joining the two levels of each column name flattens the multi-level column index into a single level.
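An alternative to as_index=False is to aggregate normally and call reset_index() afterwards. The sketch below, on made-up data, flattens the column names first and then moves the group keys out of the row index; this ordering is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Region': ['North', 'North', 'East', 'East'],
    'Product': ['A', 'B', 'A', 'A'],
    'SalesAmount': [1000, 1800, 2200, 2100],
})

result = df.groupby(['Region', 'Product']).agg({'SalesAmount': ['sum', 'mean']})
print(result.index.nlevels)   # number of levels in the row MultiIndex

# Join the two column levels first, then move the group keys out of the index
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
flat = flat.reset_index()
print(flat)
```

Flattening before reset_index() means every column tuple has two non-empty parts, so no strip('_') step is required.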

Tip: agg() is one of the most commonly used functions for grouped aggregation. It computes multiple statistics in one call and accepts custom functions, so it is worth practicing each of these usage patterns in real projects.
