YouTip LogoYouTip

Pandas Pd Crosstab

[![Image 1: PandasCommon Functions](#) PandasCommon Functions](#) * * * `pd.crosstab()` is a function in the Pandas library used for **calculating cross-tabulations**. It can compute frequencies or aggregate statistics of two or more variables, generating tables similar to Excel pivot tables. Cross-tabs are very useful tools in statistical analysis and are often used to analyze relationships between two or more categorical variables, such as counting male and female employees in different departments. **Word Definition**: `crosstab` is an abbreviation for "cross tabulation", a statistical table that displays joint distributions of two or more variables. * * * ## Basic Syntax and Parameters `pd.crosstab()` is a top-level function in the Pandas library used for calculating cross-tabulations. ### Syntax Format pd.crosstab(index, columns, values=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False) ### Parameter Description * **Parameter**: `index` * Type: array-like, Series, or list. * Description: Data used as row index. Can be a single column or multiple columns (list). * **Parameter**: `columns` * Type: array-like, Series, or list. * Description: Data used as column index. Can be a single column or multiple columns (list). * **Parameter**: `values` * Type: array-like or None. * Description: Values to be aggregated. If specified, it needs to be used with the `aggfunc` parameter. * **Parameter**: `aggfunc` * Type: function or string. * Description: Aggregation function (such as 'sum', 'mean', 'count', etc.). Required when `values` is specified. * **Parameter**: `margins` * Type: boolean. * Description: If `True`, add row and column totals. Default is `False`. * **Parameter**: `normalize` * Type: boolean or 'all', 'index', 'columns'. * Description: If `True`, divide values by the total to generate proportions (0-1). Can also specify 'index' or 'columns' to normalize by rows or columns respectively. ### Function Description * **Return Value**: Returns a DataFrame representing the cross relationship between two or more variables. * **Effect**: Generates a table where rows and columns represent different categorical variables, and cells display corresponding frequencies or aggregated values. * * * ## Examples Let's go through a series of examples from simple to complex to fully master the usage of `pd.crosstab()`. ### Example 1: Basic Usage - Cross-tabulation of Two Categorical Variables ## Example import pandas as pd # 1. Create employee data employees = pd.DataFrame({ 'department': ['Sales','Sales','Sales','Engineering','Engineering','HR','HR','Sales','Engineering','HR'], 'gender': ['Male','Female','Female','Male','Male','Female','Male','Female','Female','Male'] }) print("=== Employee Data ===") print(employees) # 2. Calculate cross-tabulation of department and gender (frequency) result = pd.crosstab(employees['department'], employees['gender']) print("n=== pd.crosstab() Department vs Gender Cross-tabulation ===") print(result) **Expected Output:** === Employee Data === department gender 0 Sales Male1 Sales Female2 Sales Female3 Engineering Male4 Engineering Male5 HR Female6 HR Male7 Sales Female8 Engineering Female9 HR Male=== pd.crosstab() Department vs Gender Cross-tabulation === gender Female Male department Engineering 1 2 HR 1 2Sales 3 1 **Code Explanation:** 1. The result table shows the number of male and female employees in each department. ### Example 2: Adding Margins and Normalization Using the `margins` parameter adds totals, and the `normalize` parameter allows viewing proportional distributions. ## Example import pandas as pd # Using the above employee data employees = pd.DataFrame({ 'department': ['Sales','Sales','Sales','Engineering','Engineering','HR','HR','Sales','Engineering','HR'], 'gender': ['Male','Female','Female','Male','Male','Female','Male','Female','Female','Male'] }) # 1. Add row/column totals result_margins = pd.crosstab(employees['department'], employees['gender'], margins=True) print("=== Adding Totals (margins=True) ===") print(result_margins) # 2. Normalize - convert to proportions print("n=== Overall Normalization (normalize=True) ===") result_norm = pd.crosstab(employees['department'], employees['gender'], normalize=True) print(result_norm.round(3)) # 3. Normalize by row (each row sums to 1) print("n=== Row-wise Normalization (normalize='index') ===") result_norm_index = pd.crosstab(employees['department'], employees['gender'], normalize='index') print(result_norm_index.round(3)) # 4. Normalize by column (each column sums to 1) print("n=== Column-wise Normalization (normalize='columns') ===") result_norm_col = pd.crosstab(employees['department'], employees['gender'], normalize='columns') print(result_norm_col.round(3)) **Expected Output:** === Adding Totals (margins=True) === gender Female Male All department Engineering 1 2 3 HR 1 2 3Sales 3 1 4All 5 5 10=== Overall Normalization (normalize=True) === gender Female Male department Engineering 0.1 0.2 HR 0.1 0.2Sales 0.3 0.1=== Row-wise Normalization (normalize='index') === gender Female Male department Engineering 0.333 0.667 HR 0.333 0.667Sales 0.75 0.25=== Column-wise Normalization (normalize='columns') === gender Female Male department Engineering 0.2 0.4 HR 0.2 0.4Sales 0.6 0.2 **Code Explanation:** * Normalization helps analyze proportional relationships and understand relative distributions of genders within departments. * Row-wise normalization shows percentage distribution of genders within each department. * Column-wise normalization shows the proportion of each gender across departments. ### Example 3: Multiple Variables and Aggregation Functions Multiple row/column variables can be used along with aggregation functions to calculate statistics. ## Example import pandas as pd import numpy as np # 1. Create sales data sales = pd.DataFrame({ 'region': ['North','North','South','South','East','East','West','West','North','South'], 'product': ['A','B','A','B','A','B','A','B','B','A'], 'quarter': ['Q1','Q1','Q1','Q1','Q2','Q2','Q2','Q2','Q2','Q2'], 'sales': [100,150,200,180,120,160,190,170,140,210] }) print("=== Sales Data ===") print(sales) # 2. Use multiple row variables print("n=== Region and Product Cross-tabulation (Frequency) ===") result_multi = pd.crosstab([sales['region'], sales['product']], sales['quarter']) print(result_multi) # 3. Use values and aggfunc to calculate sales amounts print("n=== Total Sales by Region and Quarter ===") result_sum = pd.crosstab(sales['region'], sales['quarter'], values=sales['sales'], aggfunc='sum') print(result_sum) # 4. Calculate average sales print("n=== Average Sales by Region and Quarter ===") result_mean = pd.crosstab(sales['region'], sales['quarter'], values=sales['sales'], aggfunc='mean') print(result_mean.round(1)) **Expected Output:** === Sales Data === region product quarter sales 0 North A Q1 1001 North B Q1 1502 South A Q1 2003 South B Q1 1804 East A Q2 1205 East B Q2 1606 West A Q2 1907 West B Q2 1708 North B Q2 1409 South A Q2 210=== Region and Product Cross-tabulation (Frequency) === quarter Q1 Q2 region product East A 0 1 B 0 1North A 1 0 B 1 1South A 1 1 B 1 0West A 0 1 B 0 1=== Total Sales by Region and Quarter === quarter Q1 Q2 region East NaN 280North 250 140South 380 210West NaN 360=== Average Sales by Region and Quarter === quarter Q1 Q2 region East NaN 140.0North 125.0 140.0South 190.0 210.0West NaN 180.0 **Code Explanation:** * Using multiple row variables (in list form) creates multi-level indexed cross-tabulations. * Each cell shows the frequency of specific region and product combinations in a particular quarter. * Using the `values` and `aggfunc` parameters allows calculation of actual numerical aggregations (sums, averages, etc.). ### Example 4: Handling Missing Values The `dropna` parameter controls whether missing categories are included. ## Example import pandas as pd import numpy as np # 1. Data with missing values data = pd.DataFrame({ 'A': ['a','a','b', np.nan,'a','b'], 'B': ['x','y','x','x', np.nan,'y'] }) print("=== Data with Missing Values ===") print(data) # 2. Default removes combinations with missing values result_drop = pd.crosstab(data['A'], data['B']) print("n=== Default Remove Missing Values (dropna=True) ===") print(result_drop) # 3. Include missing categories result_keep = pd.crosstab(data['A'], data['B'], dropna=False) print("n=== Include Missing Categories (dropna=False) ===") print(result_keep) **Expected Output:** === Data with Missing Values === A B 0 a x 1 a y 2 b x 3 NaN x 4 a NaN5 b y === Default Remove Missing Values (dropna=True) === B x y A a 1 1 b 1 1=== Include Missing Categories (dropna=False) === B x y NaN A a 1 1 1 b 1 1 0NaN 1 0 0 **Code Explanation:** * By default, `dropna=True` removes rows and columns containing missing values, showing only valid data. * Using `dropna=False` allows viewing the frequency of combinations including missing values. * * * > **Tip:** `pd.crosstab()` is suitable for analyzing relationships between categorical variables. For more complex data summaries (e.g., calculating multiple aggregation metrics), consider using the `pd.pivot_table()` function. [![Image 2: PandasCommon Functions](#) PandasCommon Functions](#)
← Pandas Pd Read CsvPandas Pd Value Counts β†’