Pandas groupby.sum() Function |

\n\n

Image 1: Pandas Common functions Pandas Common Functions

\n\n

groupby.sum() is an aggregation function in Pandas used for summing after grouping. It is usually used together with the groupby() function, first grouping the data by the values of a certain column, and then performing sum calculations on the numeric columns within each group.

\n\n

In practical data analysis, sum() is one of the most commonly used aggregation functions. For example, calculating total salary per department, total sales per region, or total quantity per category.

\n\n

Basic Syntax and Parameters

\n\n

sum() is a member function of the GroupBy object, and it needs to be called after using groupby() for grouping.

\n\n

Syntax Format

\n\n

GroupBy.sum(numeric_only=False, min_count=-1, engine=None, engine_kwargs=None)\n

\n\n

Parameter Description

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

Parameter	Type	Description	Default Value
numeric_only	bool	If True, only numeric columns will be summed; if False, it attempts to sum all columns.	False
min_count	int	The minimum number of valid values required for summing. If the number of valid values in a group is less than this, NaN is returned.	-1
engine	str	Specifies the calculation engine, can be 'cython' or 'numba'. None means Pandas chooses automatically.	None
engine_kwargs	dict	A dictionary of additional arguments passed to the underlying engine.	None

\n\n

Return Value

\n\n

Return Type: Series or DataFrame
Description: Returns the result after grouping and summing. If applied to a single column, returns a Series; if applied to multiple columns, returns a DataFrame.

\n\n

Examples

\n\n

Let's go through a series of examples from simple to complex to fully master the usage of groupby.sum().

\n\n

Example 1: Basic Usage - Sum by Single Column Grouping

\n\n

The most basic usage is to group by the value of one column and then sum another column.

\n\n

Example

\n\n

import pandas as pd\n\n# Create a sales data DataFrame\n\n# Simulate a table containing region, product, and sales amount\n\ndata = {\n    'region': ['North China', 'East China', 'South China', 'North China', 'East China', 'South China', 'North China', 'East China'],\n    'product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],\n    'sales_amount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],\n    'quantity': [10, 20, 15, 18, 22, 16, 12, 21]\n}\n\n# Create DataFrame\ndf = pd.DataFrame(data)\n\nprint("Original Sales Data:")\nprint(df)\nprint()\n\n# Group by "region", calculate total sales amount for each region\nsales_by_region = df.groupby('region')['sales_amount'].sum()\n\nprint("Total Sales Amount by Region:")\nprint(sales_by_region)\nprint()\n\n# Group by "product", calculate total sales quantity for each product\nquantity_by_product = df.groupby('product')['quantity'].sum()\n\nprint("Total Sales Quantity by Product:")\nprint(quantity_by_product)\n

\n\n

Output:

\n\n

Original Sales Data:\n   region product  sales_amount  quantity\n0  North China       A          1000        10\n1  East China        B          2000        20\n2  South China       C          1500        15\n3  North China       B          1800        18\n4  East China        A          2200        22\n5  South China       C          1600        16\n6  North China       A          1200        12\n7  East China        B          2100        21\n\nTotal Sales Amount by Region:\nregion\nEast China         7100\nNorth China        4000\nSouth China        3100\nName: sales_amount, dtype: int64\n\nTotal Sales Quantity by Product:\nproduct\nA    44\nB    59\nC    31\nName: quantity, dtype: int64\n

\n\n

Code Explanation:

\n\n

df.groupby('region')['sales_amount'] means first group by the "region" column, then select the "sales_amount" column for operation.
.sum() sums up the sales amounts in each group.
The result is a Series where the index is the values of the grouping column (i.e., region names).

\n\n

Example 2: Multi-column Grouping and Summing

\n\n

You can group by multiple columns and then sum the numeric columns.

\n\n

Example

\n\n

import pandas as pd\n\n# Create sales data\ndata = {\n    'region': ['North China', 'East China', 'South China', 'North China', 'East China', 'South China', 'North China', 'East China'],\n    'product': ['A', 'B', 'C', 'B', 'A', 'C', 'A', 'B'],\n    'sales_amount': [1000, 2000, 1500, 1800, 2200, 1600, 1200, 2100],\n    'quantity': [10, 20, 15, 18, 22, 16, 12, 21]\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Sales Data:")\nprint(df)\nprint()\n\n# Group by both "region" and "product", calculate total sales amount and quantity\n# as_index=False means that the grouping columns are retained as regular columns in the result\ngrouped = df.groupby(['region', 'product'], as_index=False).sum()\n\nprint("Total by Region and Product:")\nprint(grouped)\nprint()\n\n# You can also not specify columns to sum all numeric columns\ngrouped_all = df.groupby(['region', 'product']).sum()\n\nprint("Sum of All Numeric Columns (returns DataFrame):")\nprint(grouped_all)\n

\n\n

Output:

\n\n

Original Sales Data:\n   region product  sales_amount  quantity\n0  North China       A          1000        10\n1  East China        B          2000        20\n2  South China       C          1500        15\n3  North China       B          1800        18\n4  East China        A          2200        22\n5  South China       C          1600        16\n6  North China       A          1200        12\n7  East China        B          2100        21\n\nTotal by Region and Product:\n   region product  sales_amount  quantity\n0  East China       A          2200        22\n1  East China       B          4100        41\n2  South China       C          3100        31\n3  North China       A          2200        22\n4  North China       B          1800        18\n\nSum of All Numeric Columns (returns DataFrame):\n            sales_amount  quantity\nregion     product\nEast China A          2200        22\n           B          4100        41\nSouth China C          3100        31\nNorth China A          2200        22\n           B          1800        18\n

\n\n

Code Explanation:

\n\n

['region', 'product'] uses a list to group by multiple columns at once.
When as_index=False, the result is in DataFrame format with grouping columns preserved as regular columns.
When no columns are specified (e.g., .sum()), it sums all numeric columns.

\n\n

Example 3: Using the min_count Parameter

\n\n

The min_count parameter controls the minimum number of valid values required for summing within groups. If a group has fewer valid values than this number, it returns NaN.

\n\n

Example

\n\n

import pandas as pd\nimport numpy as np\n\n# Create data with missing values\ndata = {\n    'department': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],\n    'salary': [5000, 6000, np.nan, 7000, np.nan, np.nan, 8000, 9000, 10000]\n}\n\ndf = pd.DataFrame(data)\n\nprint("Employee Salary Data (with missing values):")\nprint(df)\nprint()\n\n# By default, sum() ignores NaN values when summing\nsum_default = df.groupby('department')['salary'].sum()\n\nprint("Default Sum (ignoring NaN):")\nprint(sum_default)\nprint()\n\n# Set min_count=3, requiring at least 3 valid values for summing\nsum_min_count = df.groupby('department')['salary'].sum(min_count=3)\n\nprint("Sum Result with min_count=3:")\nprint(sum_min_count)\nprint()\n\n# Set min_count=2\nsum_min_count2 = df.groupby('department')['salary'].sum(min_count=2)\n\nprint("Sum Result with min_count=2:")\nprint(sum_min_count2)\n

\n\n

Output:

\n\n

Employee Salary Data (with missing values):\n  department  salary\n0          A  5000.0\n1          A  6000.0\n2          A     NaN\n3          B  7000.0\n4          B     NaN\n5          B     NaN\n6          C  8000.0\n7          C  9000.0\n8          C 10000.0\n\nDefault Sum (ignoring NaN):\ndepartment\nA    11000.0\nB     7000.0\nC    27000.0\nName: salary, dtype: float64\n\nSum Result with min_count=3:\ndepartment\nA    11000.0\nB         NaN\nC    27000.0\nName: salary, dtype: float64\n\nSum Result with min_count=2:\ndepartment\nA    11000.0\nB     7000.0\nC    27000.0\nName: salary, dtype: float64\n

\n\n

Code Explanation:

\n\n

By default, sum() ignores NaN values and only sums valid values.
min_count=3 means each group must have at least 3 valid values to compute the sum; otherwise, it returns NaN.
Department A has only 2 valid values (5000 and 6000), which does not meet min_count=3, so it returns NaN; Department C has 3 valid values and can sum normally.

\n\n

\n
Note: By default, sum() ignores NaN values during summation, which aligns with our common understanding of handling "empty values". If you want to treat NaN as 0 before summing, you can first use the fillna(0) method to fill missing values.
\n

\n\n

Image 2: Pandas Common functions Pandas Common Functions

YouTip