Pandas groupby.sum() Function |
\n\n\n\n\n\n
groupby.sum() is an aggregation function in Pandas used for summing after grouping. It is usually used together with the groupby() function, first grouping the data by the values of a certain column, and then performing sum calculations on the numeric columns within each group.
In practical data analysis, sum() is one of the most commonly used aggregation functions. For example, calculating total salary per department, total sales per region, or total quantity per category.
\n\n
Basic Syntax and Parameters
\n\nsum() is a member function of the GroupBy object, and it needs to be called after using groupby() for grouping.
Syntax Format
\n\nGroupBy.sum(numeric_only=False, min_count=-1, engine=None, engine_kwargs=None)\n\n\nParameter Description
\n\n| Parameter | \nType | \nDescription | \nDefault Value | \n
|---|---|---|---|
| numeric_only | \nbool | \nIf True, only numeric columns will be summed; if False, it attempts to sum all columns. | \nFalse | \n
| min_count | \nint | \nThe minimum number of valid values required for summing. If the number of valid values in a group is less than this, NaN is returned. | \n-1 | \n
| engine | \nstr | \nSpecifies the calculation engine, can be 'cython' or 'numba'. None means Pandas chooses automatically. | \nNone | \n
| engine_kwargs | \ndict | \nA dictionary of additional arguments passed to the underlying engine. | \nNone | \n
Return Value
\n\n- \n
- Return Type:
SeriesorDataFrame\n - Description: Returns the result after grouping and summing. If applied to a single column, returns a Series; if applied to multiple columns, returns a DataFrame. \n
\n\n
Examples
\n\nLet's go through a series of examples from simple to complex to fully master the usage of groupby.sum().
Example 1: Basic Usage - Sum by Single Column Grouping
\n\nThe most basic usage is to group by the value of one column and then sum another column.
\n\nExample
\n\nimport pandas as pd\n\n# Create a sales data DataFrame\n\n# Simulate a table containing region, product, and sales amount\n\ndata = {\n 'region': ['North China', 'East China', 'South China', 'North China', 'East China', 'South China', 'North China', 'East China'],\n 'product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],\n 'sales_amount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],\n 'quantity': [10, 20, 15, 18, 22, 16, 12, 21]\n}\n\n# Create DataFrame\ndf = pd.DataFrame(data)\n\nprint("Original Sales Data:")\nprint(df)\nprint()\n\n# Group by "region", calculate total sales amount for each region\nsales_by_region = df.groupby('region')['sales_amount'].sum()\n\nprint("Total Sales Amount by Region:")\nprint(sales_by_region)\nprint()\n\n# Group by "product", calculate total sales quantity for each product\nquantity_by_product = df.groupby('product')['quantity'].sum()\n\nprint("Total Sales Quantity by Product:")\nprint(quantity_by_product)\n\n\nOutput:
\n\nOriginal Sales Data:\n region product sales_amount quantity\n0 North China A 1000 10\n1 East China B 2000 20\n2 South China C 1500 15\n3 North China B 1800 18\n4 East China A 2200 22\n5 South China C 1600 16\n6 North China A 1200 12\n7 East China B 2100 21\n\nTotal Sales Amount by Region:\nregion\nEast China 7100\nNorth China 4000\nSouth China 3100\nName: sales_amount, dtype: int64\n\nTotal Sales Quantity by Product:\nproduct\nA 44\nB 59\nC 31\nName: quantity, dtype: int64\n\n\nCode Explanation:
\n\n- \n
df.groupby('region')['sales_amount']means first group by the "region" column, then select the "sales_amount" column for operation. \n.sum()sums up the sales amounts in each group. \n- The result is a Series where the index is the values of the grouping column (i.e., region names). \n
Example 2: Multi-column Grouping and Summing
\n\nYou can group by multiple columns and then sum the numeric columns.
\n\nExample
\n\nimport pandas as pd\n\n# Create sales data\ndata = {\n 'region': ['North China', 'East China', 'South China', 'North China', 'East China', 'South China', 'North China', 'East China'],\n 'product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],\n 'sales_amount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],\n 'quantity': [10, 20, 15, 18, 22, 16, 12, 21]\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Sales Data:")\nprint(df)\nprint()\n\n# Group by both "region" and "product", calculate total sales amount and quantity\n# as_index=False means that the grouping columns are retained as regular columns in the result\ngrouped = df.groupby(['region', 'product'], as_index=False).sum()\n\nprint("Total by Region and Product:")\nprint(grouped)\nprint()\n\n# You can also not specify columns to sum all numeric columns\ngrouped_all = df.groupby(['region', 'product']).sum()\n\nprint("Sum of All Numeric Columns (returns DataFrame):")\nprint(grouped_all)\n\n\nOutput:
\n\nOriginal Sales Data:\n region product sales_amount quantity\n0 North China A 1000 10\n1 East China B 2000 20\n2 South China C 1500 15\n3 North China B 1800 18\n4 East China A 2200 22\n5 South China C 1600 16\n6 North China A 1200 12\n7 East China B 2100 21\n\nTotal by Region and Product:\n region product sales_amount quantity\n0 East China A 2200 22\n1 East China B 4100 41\n2 South China C 3100 31\n3 North China A 2200 22\n4 North China B 1800 18\n\nSum of All Numeric Columns (returns DataFrame):\n sales_amount quantity\nregion product\nEast China A 2200 22\n B 4100 41\nSouth China C 3100 31\nNorth China A 2200 22\n B 1800 18\n\n\nCode Explanation:
\n\n- \n
['region', 'product']uses a list to group by multiple columns at once. \n- When
as_index=False, the result is in DataFrame format with grouping columns preserved as regular columns. \n - When no columns are specified (e.g.,
.sum()), it sums all numeric columns. \n
Example 3: Using the min_count Parameter
\n\nThe min_count parameter controls the minimum number of valid values required for summing within groups. If a group has fewer valid values than this number, it returns NaN.
Example
\n\nimport pandas as pd\nimport numpy as np\n\n# Create data with missing values\ndata = {\n 'department': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],\n 'salary': [5000, 6000, np.nan, 7000, np.nan, np.nan, 8000, 9000, 10000]\n}\n\ndf = pd.DataFrame(data)\n\nprint("Employee Salary Data (with missing values):")\nprint(df)\nprint()\n\n# By default, sum() ignores NaN values when summing\nsum_default = df.groupby('department')['salary'].sum()\n\nprint("Default Sum (ignoring NaN):")\nprint(sum_default)\nprint()\n\n# Set min_count=3, requiring at least 3 valid values for summing\nsum_min_count = df.groupby('department')['salary'].sum(min_count=3)\n\nprint("Sum Result with min_count=3:")\nprint(sum_min_count)\nprint()\n\n# Set min_count=2\nsum_min_count2 = df.groupby('department')['salary'].sum(min_count=2)\n\nprint("Sum Result with min_count=2:")\nprint(sum_min_count2)\n\n\nOutput:
\n\nEmployee Salary Data (with missing values):\n department salary\n0 A 5000.0\n1 A 6000.0\n2 A NaN\n3 B 7000.0\n4 B NaN\n5 B NaN\n6 C 8000.0\n7 C 9000.0\n8 C 10000.0\n\nDefault Sum (ignoring NaN):\ndepartment\nA 11000.0\nB 7000.0\nC 27000.0\nName: salary, dtype: float64\n\nSum Result with min_count=3:\ndepartment\nA 11000.0\nB NaN\nC 27000.0\nName: salary, dtype: float64\n\nSum Result with min_count=2:\ndepartment\nA 11000.0\nB 7000.0\nC 27000.0\nName: salary, dtype: float64\n\n\nCode Explanation:
\n\n- \n
- By default,
sum()ignores NaN values and only sums valid values. \n min_count=3means each group must have at least 3 valid values to compute the sum; otherwise, it returns NaN. \n- Department A has only 2 valid values (5000 and 6000), which does not meet min_count=3, so it returns NaN; Department C has 3 valid values and can sum normally. \n
\n\n
\n\n\nNote: By default,
\nsum()ignores NaN values during summation, which aligns with our common understanding of handling "empty values". If you want to treat NaN as 0 before summing, you can first use thefillna(0)method to fill missing values.
\n\n
YouTip