Pandas groupby.agg() Function
groupby.agg() is one of the most versatile aggregation methods in Pandas: it applies several aggregation functions to grouped data in a single call.
Unlike single-purpose aggregation methods such as sum() or mean(), agg() can compute several statistics for each group at once, for example the sum, mean, maximum, and minimum.
Data analysis often calls for several summary statistics over the same groups, and agg() keeps that code concise.
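
The contrast between one call per statistic and a single agg() call can be sketched as follows. The DataFrame and its column names (group, value) are made up for illustration and do not come from the examples later in this article.

```python
import pandas as pd

# Hypothetical toy data: two groups with one numeric column.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 3, 10, 30],
})

grouped = df.groupby("group")["value"]

# One statistic per call: each call returns a separate Series.
totals = grouped.sum()
averages = grouped.mean()

# Several statistics in one call: a DataFrame with one column per function.
summary = grouped.agg(["sum", "mean", "max", "min"])
print(summary)
```

The single agg() call returns all four statistics side by side, indexed by group, instead of four separate Series that would need to be combined afterwards.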
Basic Syntax and Parameters
agg() is a method of the GroupBy object, so it must be called after groupby() has grouped the data.
Syntax Format

    GroupBy.agg(func=None, *args, engine=None, engine_kwargs=None, **kwargs)

Parameter Description

| Parameter | Type | Description | Default Value |
|---|---|---|---|
| func | str, list, dict or callable | The aggregation function. Can be a single function or function name, a list of them, or a dictionary mapping column names to functions. It can be left as None when named aggregation is given through keyword arguments. | None |
| args | tuple | Additional positional arguments passed to the aggregation function. | () |
| engine | str | The computation engine, either 'cython' or 'numba'. None defaults to 'cython' unless the global option compute.use_numba is enabled. | None |
| engine_kwargs | dict | A dictionary of additional arguments passed to the underlying engine. | None |
| kwargs | dict | Additional keyword arguments passed to the aggregation function, or the named-aggregation specification when func is None. | {} |

Unlike DataFrame.agg(), GroupBy.agg() does not take an axis parameter: the aggregation always runs over the rows of each group.
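
How extra arguments reach the aggregation function can be sketched as below. This assumes func is a single callable; the helper quantile_at and the toy data are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 30],
})


def quantile_at(values, q):
    """Return the q-th quantile of one group's values (illustrative helper)."""
    return values.quantile(q)


grouped = df.groupby("group")["value"]

# The 0.5 after the function is forwarded to quantile_at as `q`.
by_position = grouped.agg(quantile_at, 0.5)

# The same argument passed by keyword.
by_keyword = grouped.agg(quantile_at, q=0.5)

print(by_position)
```

Both calls compute the median of each group; which form to use is mostly a matter of readability.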
Return Value

- Return Type: Series or DataFrame
- Description: Returns the result of grouped aggregation. The structure of the result depends on the func parameter and the grouping method.

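
As a rough illustration of how the return type follows from func, the sketch below runs three common call shapes on a made-up DataFrame (the column names group, x, and y are assumptions for this example only).

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "x": [1, 2, 3],
    "y": [10, 20, 30],
})

# One column, one function: the result is a Series indexed by group.
single = df.groupby("group")["x"].agg("sum")

# One column, several functions: a DataFrame with one column per function.
multi = df.groupby("group")["x"].agg(["sum", "mean"])

# Several columns, dictionary of functions: a DataFrame with one column per key.
per_column = df.groupby("group").agg({"x": "sum", "y": "max"})

print(type(single).__name__, type(multi).__name__, type(per_column).__name__)
```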
Examples

Let's go through a series of examples, from simple to complex, that cover the main ways to use groupby.agg().
Example 1: Using Built-in Aggregation Functions

The simplest approach is to pass the names of Pandas' built-in aggregation functions as strings, such as 'sum', 'mean', 'max', 'min', and 'count'.

Example

    import pandas as pd

    # Create a sales DataFrame
    data = {
        'Region': ['North China', 'East China', 'South China', 'North China',
                   'East China', 'South China', 'North China', 'East China'],
        'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],
        'SalesAmount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],
        'Quantity': [10, 20, 15, 18, 22, 16, 12, 21]
    }
    df = pd.DataFrame(data)

    print("Original sales data:")
    print(df)
    print()

    # Group by "Region" and apply several aggregation functions to the "SalesAmount" column.
    # A list of strings specifies multiple aggregation functions.
    result = df.groupby('Region')['SalesAmount'].agg(['sum', 'mean', 'max', 'min', 'count'])

    print("Summary statistics of sales amount by region:")
    print(result)
    print()

    # Aggregate both the "SalesAmount" and "Quantity" columns
    result_all = df.groupby('Region').agg({
        'SalesAmount': ['sum', 'mean'],
        'Quantity': ['sum', 'mean']
    })

    print("Sales and quantity statistics by region:")
    print(result_all)

Output:

    Original sales data:
            Region Product  SalesAmount  Quantity
    0  North China       A         1000        10
    1   East China       B         2000        20
    2  South China       C         1500        15
    3  North China       B         1800        18
    4   East China       A         2200        22
    5  South China       C         1600        16
    6  North China       A         1200        12
    7   East China       B         2100        21

    Summary statistics of sales amount by region:
                  sum         mean   max   min  count
    Region
    East China   6300  2100.000000  2200  2000      3
    North China  4000  1333.333333  1800  1000      3
    South China  3100  1550.000000  1600  1500      2

    Sales and quantity statistics by region:
                SalesAmount              Quantity
                        sum         mean      sum       mean
    Region
    East China         6300  2100.000000       63  21.000000
    North China        4000  1333.333333       40  13.333333
    South China        3100  1550.000000       31  15.500000

Code Explanation:

- Passing a list such as ['sum', 'mean', 'max', 'min', 'count'] applies several aggregation functions at once.
- Aggregating a single column with a list returns a DataFrame whose columns are the function names. Aggregating several columns with a dictionary of lists returns a DataFrame with a two-level column index made of the original column name and the function name.
- Using a dictionary allows specifying different aggregation functions for different columns.
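
Once the result has a two-level column index, individual statistics can be selected by label. The sketch below uses a smaller made-up dataset than Example 1, purely to keep the output short.

```python
import pandas as pd

# Hypothetical, shortened sales data.
df = pd.DataFrame({
    "Region": ["North", "East", "North", "East"],
    "SalesAmount": [1000, 2000, 1800, 2200],
    "Quantity": [10, 20, 18, 22],
})

result = df.groupby("Region").agg({
    "SalesAmount": ["sum", "mean"],
    "Quantity": ["sum"],
})

# Column labels are (original column, function name) tuples.
print(list(result.columns))

# Select every statistic computed for one original column...
sales_stats = result["SalesAmount"]

# ...or a single statistic by its full two-part label.
sales_total = result[("SalesAmount", "sum")]
print(sales_total)
```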
Example 2: Using Custom Aggregation Functions

Besides built-in functions, agg() also accepts custom functions, which greatly expands its flexibility.
Example

    import pandas as pd

    # Create student score data (the subject names are illustrative)
    data = {
        'Class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Sun Qi', 'Zhou Ba', 'Wu Jiu', 'Zheng Shi'],
        'Chinese': [85, 92, 78, 88, 95, 82, 90, 87],
        'Math': [90, 85, 92, 78, 88, 91, 85, 89]
    }
    df = pd.DataFrame(data)

    print("Student grade data:")
    print(df)
    print()

    # Define custom aggregation functions
    def range_func(x):
        """Calculate the difference between max and min values (range)"""
        return x.max() - x.min()

    def coefficient_of_variation(x):
        """Calculate coefficient of variation (standard deviation / mean), in percent"""
        return x.std() / x.mean() * 100

    # Use custom functions for aggregation.
    # Built-in functions are given by name; custom functions are passed as function objects.
    result = df.groupby('Class').agg({
        'Chinese': ['sum', 'mean', range_func],            # For Chinese: sum, mean, range
        'Math': ['sum', 'mean', coefficient_of_variation]  # For Math: sum, mean, coefficient of variation
    })

    print("Custom statistics for each class:")
    print(result)
    print()

    # Custom functions can also be placed directly in a list
    result2 = df.groupby('Class')['Math'].agg(['mean', range_func, 'std'])

    print("Math score statistics by class (including custom range and standard deviation):")
    print(result2)

Output:

    Student grade data:
      Class       Name  Chinese  Math
    0     A  Zhang San       85    90
    1     A      Li Si       92    85
    2     A    Wang Wu       78    92
    3     B   Zhao Liu       88    78
    4     B     Sun Qi       95    88
    5     B    Zhou Ba       82    91
    6     B     Wu Jiu       90    85
    7     A  Zheng Shi       87    89

    Custom statistics for each class:
           Chinese                     Math
               sum   mean  range_func   sum  mean  coefficient_of_variation
    Class
    A          342  85.50          14   356  89.0                  3.307776
    B          355  88.75          13   342  85.5                  6.512005

    Math score statistics by class (including custom range and standard deviation):
           mean  range_func       std
    Class
    A      89.0           7  2.943920
    B      85.5          13  5.567764

Code Explanation:
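
- A custom function receives a Series (one group's values for one column) and returns a scalar value.
- Built-in aggregations can be passed as strings (e.g., 'sum'); custom functions are passed directly as function objects.
- In a result built from a list, a custom function's column is labeled with the function's name (here range_func and coefficient_of_variation).
- Custom functions make it possible to express any aggregation logic that reduces a group to a single value.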
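
As a further sketch of the "Series in, scalar out" contract, the function below uses ordinary Series operations on each group. The metric share_at_least_88 and the data are hypothetical and chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical score data.
df = pd.DataFrame({
    "Class": ["A", "A", "A", "B", "B"],
    "Math": [90, 85, 92, 78, 88],
})


def share_at_least_88(scores):
    """Fraction of a group's scores that are 88 or higher (illustrative metric)."""
    # `scores` holds one group's values, so Series comparisons and mean() apply.
    return (scores >= 88).mean()


result = df.groupby("Class")["Math"].agg(["mean", share_at_least_88])
print(result)
```

Because the function returns one number per group, it can sit in the same list as built-in names such as 'mean'.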
Example 3: Using String Aliases and Lambda Functions

agg() supports various ways to specify functions, including lambda expressions and built-in string aliases.
Example

    import pandas as pd

    # Create employee salary data
    data = {
        'Department': ['Sales', 'Sales', 'Technology', 'Technology', 'Admin', 'Admin', 'Sales', 'Technology'],
        'Position': ['Specialist', 'Manager', 'Specialist', 'Manager', 'Specialist', 'Manager', 'Specialist', 'Manager'],
        'Salary': [5000, 8000, 7000, 12000, 4500, 9000, 5500, 11000],
        'Tenure': [2, 5, 3, 8, 1, 6, 2, 7]
    }
    df = pd.DataFrame(data)

    print("Employee salary data:")
    print(df)
    print()

    # Use (alias, function) tuples to name each result column.
    # A lambda computes the salary range (max minus min) for each department.
    result = df.groupby('Department').agg({
        'Salary': [
            ('AverageSalary', 'mean'),  # Give the aggregation result an alias
            ('Max Salary', 'max'),
            ('Min Salary', 'min'),
            ('SalaryRange', lambda x: x.max() - x.min())
        ],
        'Tenure': [
            ('AverageTenure', 'mean'),
            ('Max Tenure', 'max'),
            ('Min Tenure', 'min')
        ]
    })

    print("Salary and tenure statistics by department:")
    print(result)
    print()

    # Use the keyword-argument (named aggregation) form of agg
    result2 = df.groupby('Department')['Salary'].agg(
        Average='mean',
        Sum='sum',
        Count='count'
    )

    print("Aggregation results using named parameter form:")
    print(result2)

Output:

    Employee salary data:
       Department    Position  Salary  Tenure
    0       Sales  Specialist    5000       2
    1       Sales     Manager    8000       5
    2  Technology  Specialist    7000       3
    3  Technology     Manager   12000       8
    4       Admin  Specialist    4500       1
    5       Admin     Manager    9000       6
    6       Sales  Specialist    5500       2
    7  Technology     Manager   11000       7

    Salary and tenure statistics by department:
                       Salary                                              Tenure
                AverageSalary  Max Salary  Min Salary  SalaryRange  AverageTenure  Max Tenure  Min Tenure
    Department
    Admin         6750.000000        9000        4500         4500            3.5           6           1
    Sales         6166.666667        8000        5000         3000            3.0           5           2
    Technology   10000.000000       12000        7000         5000            6.0           8           3

    Aggregation results using named parameter form:
                     Average    Sum  Count
    Department
    Admin        6750.000000  13500      2
    Sales        6166.666667  18500      3
    Technology  10000.000000  30000      3

Code Explanation:

- Using tuples ('alias', 'function') gives custom names to aggregated result columns.
- Lambda expressions can be used directly inside agg() for simple custom logic.
- Named parameters (keyword arguments) allow intuitive naming of result columns.
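
Keyword-argument naming also works when aggregating several columns of a grouped DataFrame: each keyword becomes an output column, and its value names the input column and the function. The sketch below uses a shortened, made-up version of the salary data; the output names avg_salary, salary_range, and max_tenure are chosen here for illustration.

```python
import pandas as pd

# Hypothetical, shortened employee data.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Admin", "Admin"],
    "Salary": [5000, 8000, 4500, 9000],
    "Tenure": [2, 5, 1, 6],
})

# Each keyword is an output column name; each value is (input column, function).
result = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    salary_range=("Salary", lambda s: s.max() - s.min()),
    max_tenure=pd.NamedAgg(column="Tenure", aggfunc="max"),
)
print(result)
```

This form yields a single-level column index, so no flattening step is needed afterwards. pd.NamedAgg is an explicit spelling of the same (column, function) pair.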
Example 4: Grouping by Multiple Columns and Using agg

agg() can also be used with multi-column grouping to handle more complex data analysis needs.
Example

    import pandas as pd

    # Create sales data
    data = {
        'Region': ['North China', 'East China', 'South China', 'North China', 'East China',
                   'South China', 'North China', 'East China', 'South China'],
        'Product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
        'SalesAmount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100, 1700],
        'Profit': [200, 400, 300, 360, 440, 320, 240, 420, 340]
    }
    df = pd.DataFrame(data)

    print("Sales data:")
    print(df)
    print()

    # Group by region and product, aggregate sales and profit with multiple functions
    result = df.groupby(['Region', 'Product']).agg({
        'SalesAmount': ['sum', 'mean', 'count'],
        'Profit': ['sum', 'mean']
    })

    print("Aggregation results grouped by region and product:")
    print(result)
    print()

    # Keep the grouping keys as regular columns by passing as_index=False
    result_flat = df.groupby(['Region', 'Product'], as_index=False).agg({
        'SalesAmount': ['sum', 'mean'],
        'Profit': ['sum', 'mean']
    })

    # Flatten the multi-level column index, e.g. ('SalesAmount', 'sum') -> 'SalesAmount_sum'
    result_flat.columns = ['_'.join(col).strip('_') for col in result_flat.columns.values]

    print("Flattened result:")
    print(result_flat)

Output:

    Sales data:
            Region Product  SalesAmount  Profit
    0  North China       A         1000     200
    1   East China       B         2000     400
    2  South China       C         1500     300
    3  North China       B         1800     360
    4   East China       A         2200     440
    5  South China       C         1600     320
    6  North China       A         1200     240
    7   East China       B         2100     420
    8  South China       C         1700     340

    Aggregation results grouped by region and product:
                        SalesAmount                 Profit
                                sum    mean  count     sum   mean
    Region      Product
    East China  A               2200  2200.0      1     440  440.0
                B               4100  2050.0      2     820  410.0
    North China A               2200  1100.0      2     440  220.0
                B               1800  1800.0      1     360  360.0
    South China C               4800  1600.0      3     960  320.0

    Flattened result:
            Region Product  SalesAmount_sum  SalesAmount_mean  Profit_sum  Profit_mean
    0   East China       A             2200            2200.0         440        440.0
    1   East China       B             4100            2050.0         820        410.0
    2  North China       A             2200            1100.0         440        220.0
    3  North China       B             1800            1800.0         360        360.0
    4  South China       C             4800            1600.0         960        320.0

Code Explanation:
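
- Multi-column grouping creates a MultiIndex on the rows of the result.
- Using as_index=False keeps grouping columns as normal columns.
- Joining the two levels of each column label with an underscore flattens the multi-level column index into a single level. The strip('_') call removes the trailing underscore left on the grouping columns, whose second level is empty.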
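
An alternative to as_index=False is to flatten the column labels first and then call reset_index(), which avoids the need for strip('_'). The sketch below assumes a small made-up dataset and is one possible approach rather than the only way to do it.

```python
import pandas as pd

# Hypothetical, shortened sales data.
df = pd.DataFrame({
    "Region": ["North", "North", "East"],
    "Product": ["A", "A", "B"],
    "SalesAmount": [1000, 1200, 2000],
})

result = df.groupby(["Region", "Product"]).agg({"SalesAmount": ["sum", "mean"]})

# Join the two levels of every column label,
# e.g. ("SalesAmount", "sum") becomes "SalesAmount_sum".
result.columns = ["_".join(label) for label in result.columns.to_flat_index()]

# Move the Region/Product row index back into ordinary columns.
flat = result.reset_index()
print(flat)
```

Flattening before reset_index() means every label being joined has two non-empty parts, so no cleanup of stray underscores is required.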
Tip: agg() is one of the most commonly used methods for grouped aggregation. It computes several statistics in one call and accepts custom functions, so it is worth practicing each of the calling patterns above in real projects.