Pandas pd.get_dummies() Function

pd.get_dummies() is a function in the Pandas library used for one-hot encoding of categorical variables. It converts each categorical variable into a set of binary indicator columns, one column per category. Since pandas 2.0 these columns are boolean (True/False) by default; pass dtype=int to get 0/1 values instead.

One-hot encoding is a common technique in machine learning preprocessing because most algorithms cannot work with categorical data directly and require it to be converted into numerical form first.

Word Definition: The "dummy" in get_dummies refers to a "dummy variable": a binary variable used in statistics and econometrics to represent membership in a category.

Basic Syntax and Parameters

pd.get_dummies() is a top-level function in the Pandas library: it is called on the pandas module itself and receives the data to encode as its first argument.

Syntax Format

    pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Parameter Description

  • Parameter: data
    • Type: Series, DataFrame, or array-like object.
    • Description: The data to be one-hot encoded. Usually a Series or DataFrame containing categorical variables.
  • Parameter: prefix
    • Type: String, list of strings, or dictionary.
    • Description: Prefix for the new column names. Pass a list with one entry per encoded column, or a dictionary mapping column names to prefixes. If not specified, a DataFrame's original column names are used as prefixes.
  • Parameter: prefix_sep
    • Type: String.
    • Description: Separator between prefix and category name. Default is underscore '_'.
  • Parameter: dummy_na
    • Type: Boolean.
    • Description: If True, creates a separate column for missing values (NaN). If False, missing values are ignored and the row is False in every dummy column. Default is False.
  • Parameter: columns
    • Type: List-like or None.
    • Description: Column names to encode. If not specified, all columns with object, string, or category dtype are encoded.
  • Parameter: sparse
    • Type: Boolean.
    • Description: If True, the dummy columns are backed by a SparseArray instead of a regular NumPy array. Default is False.
  • Parameter: drop_first
    • Type: Boolean.
    • Description: If True, drops the first category of each encoded variable, leaving k-1 columns for k categories. This avoids perfect multicollinearity (the dummy variable trap) in models such as linear or logistic regression. Default is False.
  • Parameter: dtype
    • Type: dtype.
    • Description: Data type of the new columns. Only a single dtype is allowed. Default is bool in pandas 2.0 and later.
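
The default column selection described for the columns parameter can be checked with a small sketch. The DataFrame below is hypothetical and exists only for illustration; the printed column lists assume a recent pandas release.

```python
import pandas as pd

# Hypothetical data: text, categorical, numeric, and boolean columns
df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing"],
    "grade": pd.Categorical(["A", "B", "A"]),
    "age": [25, 30, 35],
    "member": [True, False, True],
})

# columns=None (the default): only the text and category columns are encoded;
# the numeric and boolean columns pass through unchanged
encoded = pd.get_dummies(df)
print(list(encoded.columns))
# ['age', 'member', 'city_Beijing', 'city_Shanghai', 'grade_A', 'grade_B']

# Naming a numeric column in `columns` encodes it explicitly
encoded_age = pd.get_dummies(df, columns=["age"])
print(list(encoded_age.columns))
# ['city', 'grade', 'member', 'age_25', 'age_30', 'age_35']
```

Columns that are not encoded keep their original order and come first, followed by the new dummy columns.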

Function Description

  • Return Value: Returns a DataFrame in which each column corresponds to one category. Values are True/False by default in pandas 2.0 and later, or 0/1 when an integer dtype is requested.
  • Effect: Converts categorical variables into binary indicator columns, making them suitable for machine learning algorithms.
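
If a model or downstream tool expects 0/1 integers rather than booleans, the dtype parameter can request them directly. A minimal sketch, using a made-up colors Series:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "red"])

# dtype=int requests integer 0/1 columns regardless of the pandas default
as_int = pd.get_dummies(colors, dtype=int)
print(as_int)
print(as_int.dtypes)
```

Passing dtype explicitly also keeps the result the same across pandas versions whose default dtype differs.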

Examples

The following examples progress from simple to more complex uses of pd.get_dummies().

Example 1: Basic Usage - One-Hot Encoding a Series

Example

    import pandas as pd

    # 1. Create a Series with categorical variables
    colors = pd.Series(['red', 'blue', 'green', 'red', 'green', 'blue'])

    print("=== Original Series ===")
    print(colors)

    # 2. Use pd.get_dummies() for one-hot encoding
    result = pd.get_dummies(colors)

    print("\n=== pd.get_dummies() One-Hot Encoding Result ===")
    print(result)

Expected Output:

    === Original Series ===
    0      red
    1     blue
    2    green
    3      red
    4    green
    5     blue
    dtype: object

    === pd.get_dummies() One-Hot Encoding Result ===
        blue  green    red
    0  False  False   True
    1   True  False  False
    2  False   True  False
    3  False  False   True
    4  False   True  False
    5   True  False  False

Code Explanation:

  1. The original Series contains three colors: red, blue, green.
  2. After one-hot encoding, each color becomes a column, with True/False indicating whether that row belongs to that category.
  3. Each row has exactly one True, corresponding to the original color value.
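
The "exactly one True per row" property can be verified directly, and it also makes the encoding reversible. The sketch below uses idxmax(axis=1), which is one possible way to recover the labels, not the only one:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "red", "green", "blue"])
encoded = pd.get_dummies(colors)

# Each row should contain exactly one active column
row_totals = encoded.sum(axis=1)
print(row_totals.tolist())  # [1, 1, 1, 1, 1, 1]

# idxmax(axis=1) returns the name of the active column in each row,
# which recovers the original labels
recovered = encoded.idxmax(axis=1)
print(recovered.tolist())  # ['red', 'blue', 'green', 'red', 'green', 'blue']
```

This only holds when there are no missing values: a row whose original value was NaN has no active column at all.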

Example 2: Encoding Specific Columns in a DataFrame

In data analysis, we often want to encode only specific categorical columns while keeping the other columns unchanged.

Example

    import pandas as pd

    # 1. Create a DataFrame with numerical and categorical variables
    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'age': [25, 30, 35, 28],
        'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou'],
        'department': ['Sales', 'Engineering', 'Sales', 'HR']
    })

    print("=== Original DataFrame ===")
    print(df)

    # 2. Perform one-hot encoding on specific categorical columns
    result = pd.get_dummies(df, columns=['city', 'department'])

    print("\n=== One-Hot Encoding on city and department columns ===")
    print(result)

Expected Output:

    === Original DataFrame ===
          name  age       city   department
    0    Alice   25    Beijing        Sales
    1      Bob   30   Shanghai  Engineering
    2  Charlie   35    Beijing        Sales
    3    Diana   28  Guangzhou           HR

    === One-Hot Encoding on city and department columns ===
          name  age  city_Beijing  city_Guangzhou  city_Shanghai  department_Engineering  department_HR  department_Sales
    0    Alice   25          True           False          False                   False          False              True
    1      Bob   30         False           False           True                    True          False             False
    2  Charlie   35          True           False          False                   False          False              True
    3    Diana   28         False            True          False                   False           True             False

Code Explanation:

  • The columns parameter restricts encoding to the city and department columns.
  • The numerical column age and the text column name are left unchanged.
  • New column names join the original column name and the category with the default separator, an underscore, as in city_Beijing.
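
To see why columns is worth specifying here, compare it with the default call on the same data. This is a small sketch; the variable names selected and everything are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["Beijing", "Shanghai", "Beijing", "Guangzhou"],
    "department": ["Sales", "Engineering", "Sales", "HR"],
})

# Encode only the two listed columns
selected = pd.get_dummies(df, columns=["city", "department"])

# columns=None: every text column is encoded, including name
everything = pd.get_dummies(df)

print(selected.shape)    # (4, 8)
print(everything.shape)  # (4, 11)
print([c for c in everything.columns if c.startswith("name_")])
# ['name_Alice', 'name_Bob', 'name_Charlie', 'name_Diana']
```

Without columns, the name column turns into one dummy column per person, which is rarely useful for an identifier-like field.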

Example 3: Custom Prefixes and Separators

You can customize new column names using the prefix and prefix_sep parameters.

Example

    import pandas as pd

    # 1. Create a DataFrame
    df = pd.DataFrame({
        'color': ['red', 'blue', 'green', 'red'],
        'size': ['S', 'M', 'L', 'XL']
    })

    print("=== Original DataFrame ===")
    print(df)

    # 2. Use prefix parameter to customize prefixes
    result_prefix = pd.get_dummies(df, prefix=['color', 'size'])

    print("\n=== Using Custom Prefixes ===")
    print(result_prefix)

    # 3. Use prefix_sep to customize separator
    result_sep = pd.get_dummies(df, prefix=['color', 'size'], prefix_sep='-')

    print("\n=== Using Custom Separator '-' ===")
    print(result_sep)

Expected Output:

    === Original DataFrame ===
       color size
    0    red    S
    1   blue    M
    2  green    L
    3    red   XL

    === Using Custom Prefixes ===
       color_blue  color_green  color_red  size_L  size_M  size_S  size_XL
    0       False        False       True   False   False    True    False
    1        True        False      False   False    True   False    False
    2       False         True      False    True   False   False    False
    3       False        False       True   False   False   False     True

    === Using Custom Separator '-' ===
       color-blue  color-green  color-red  size-L  size-M  size-S  size-XL
    0       False        False       True   False   False    True    False
    1        True        False      False   False    True   False    False
    2       False         True      False    True   False   False    False
    3       False        False       True   False   False   False     True

Code Explanation:

  • prefix=['color', 'size'] assigns one prefix per encoded column, in column order. Here the prefixes match the original column names, so the first result is identical to the default output.
  • prefix_sep='-' changes the default underscore to a hyphen, resulting in column names like color-red.
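
Besides a list, prefix also accepts a dictionary that maps each source column to its prefix, which avoids depending on column order. A minimal sketch; the prefixes "c" and "sz" and the "." separator are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "size": ["S", "M", "L", "XL"],
})

# A dict maps each source column to its own prefix
result = pd.get_dummies(
    df,
    prefix={"color": "c", "size": "sz"},
    prefix_sep=".",
)
print(list(result.columns))
# ['c.blue', 'c.green', 'c.red', 'sz.L', 'sz.M', 'sz.S', 'sz.XL']
```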

Example 4: Handling Missing Values and drop_first Parameter

Example

    import pandas as pd
    import numpy as np

    # 1. Data with missing values
    df = pd.DataFrame({
        'color': ['red', 'blue', np.nan, 'red', 'green'],
        'size': ['S', 'M', 'L', np.nan, 'XL']
    })

    print("=== DataFrame with Missing Values ===")
    print(df)

    # 2. Default behavior: missing values get no column of their own
    result_default = pd.get_dummies(df, columns=['color'])

    print("\n=== Default Behavior (No Missing Value Handling) ===")
    print(result_default)

    # 3. dummy_na=True creates a separate column for missing values
    result_na = pd.get_dummies(df, columns=['color'], dummy_na=True)

    print("\n=== dummy_na=True Creates Column for Missing Values ===")
    print(result_na)

    # 4. drop_first=True removes the first category's column to avoid multicollinearity
    result_drop = pd.get_dummies(df, columns=['size'], drop_first=True)

    print("\n=== drop_first=True Removes First Column ===")
    print(result_drop)

Expected Output:

    === DataFrame with Missing Values ===
       color size
    0    red    S
    1   blue    M
    2    NaN    L
    3    red  NaN
    4  green   XL

    === Default Behavior (No Missing Value Handling) ===
      size  color_blue  color_green  color_red
    0    S       False        False       True
    1    M        True        False      False
    2    L       False        False      False
    3  NaN       False        False       True
    4   XL       False         True      False

    === dummy_na=True Creates Column for Missing Values ===
      size  color_blue  color_green  color_red  color_nan
    0    S       False        False       True      False
    1    M        True        False      False      False
    2    L       False        False      False       True
    3  NaN       False        False       True      False
    4   XL       False         True      False      False

    === drop_first=True Removes First Column ===
       color  size_M  size_S  size_XL
    0    red   False    True    False
    1   blue    True   False    False
    2    NaN   False   False    False
    3    red   False   False    False
    4  green   False   False     True
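
One consequence of combining drop_first with missing data is easy to overlook: after the first category is dropped, a row of all False can mean either the dropped category or a missing value. The sketch below illustrates this on the size values alone; the Series and variable names are chosen for illustration.

```python
import pandas as pd
import numpy as np

sizes = pd.Series(["S", "M", "L", np.nan, "XL"], name="size")

full = pd.get_dummies(sizes, prefix="size")
reduced = pd.get_dummies(sizes, prefix="size", drop_first=True)
print(list(full.columns))     # ['size_L', 'size_M', 'size_S', 'size_XL']
print(list(reduced.columns))  # ['size_M', 'size_S', 'size_XL']

# Rows with no active column: the dropped category "L" (row 2)
# and the missing value (row 3) look identical
all_false = ~reduced.any(axis=1)
print(all_false.tolist())  # [False, False, True, True, False]

# dummy_na=True gives the missing value its own column,
# so only the dropped category "L" remains all False
with_na = pd.get_dummies(sizes, prefix="size", drop_first=True, dummy_na=True)
print(list(with_na.columns))  # ['size_M', 'size_S', 'size_XL', 'size_nan']
print((~with_na.any(axis=1)).tolist())  # [False, False, True, False, False]
```

Categories are ordered alphabetically here, so "L" is the first category and the one that drop_first removes.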