Pandas pd.get_dummies() Function

pd.get_dummies() is a function in the Pandas library for one-hot encoding categorical variables. It converts each categorical variable into a set of binary indicator columns, one column per category.
One-hot encoding is a common preprocessing step in machine learning because many algorithms cannot handle categorical data directly, so it must first be converted into numerical form.

Word Definition: In get_dummies, "dummy" means "dummy variable," the term used in statistics and econometrics for a binary variable that represents a category.

Basic Syntax and Parameters

pd.get_dummies() is a top-level function, so it is called on the pandas module itself, with the data to encode passed as the first argument.
Syntax Format

    pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Parameter Description

- Parameter: data
  - Type: Series, DataFrame, or array-like object.
  - Description: The data to be one-hot encoded. Usually a Series or DataFrame containing categorical variables.
- Parameter: prefix
  - Type: String, list of strings, or dictionary.
  - Description: Prefix for the new column names. For a DataFrame, the original column names are used as prefixes if this is not specified. For a Series, no prefix is added by default.
- Parameter: prefix_sep
  - Type: String.
  - Description: Separator between prefix and category name. Default is underscore '_'.
- Parameter: dummy_na
  - Type: Boolean.
  - Description: If True, creates a separate column for missing values (NaN). Default is False, in which case rows with NaN get False in every dummy column.
- Parameter: columns
  - Type: List-like or None.
  - Description: Column names in the DataFrame to encode. If not specified, all columns with object, string, or category dtype are encoded.
- Parameter: sparse
  - Type: Boolean.
  - Description: If True, the dummy columns are backed by a SparseArray instead of regular NumPy arrays. Default is False.
- Parameter: drop_first
  - Type: Boolean.
  - Description: If True, drops the first category level of each variable, so k categories produce k-1 columns. This avoids perfect multicollinearity (the dummy variable trap) in models such as linear or logistic regression. Default is False.
- Parameter: dtype
  - Type: Data type.
  - Description: Data type of the new columns. Default is bool in pandas 2.0 and later (earlier versions defaulted to uint8).

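As a quick illustration of how the columns parameter interacts with the default column selection, here is a minimal sketch. The DataFrame and the names df, auto, and only_city are hypothetical, made up for this illustration only.

```python
import pandas as pd

# Hypothetical data: one numeric column and two text columns
df = pd.DataFrame({
    'age': [25, 30, 35],
    'city': ['Beijing', 'Shanghai', 'Beijing'],
    'team': ['A', 'B', 'A'],
})

# columns=None (the default): every text column is encoded, numeric columns are kept
auto = pd.get_dummies(df)
print(list(auto.columns))
# ['age', 'city_Beijing', 'city_Shanghai', 'team_A', 'team_B']

# columns=['city']: only city is encoded, team stays as ordinary text
only_city = pd.get_dummies(df, columns=['city'])
print(list(only_city.columns))
# ['age', 'team', 'city_Beijing', 'city_Shanghai']
```

Note that the untouched columns come first in the result and the generated dummy columns are appended after them.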
Function Description

- Return Value: Returns a DataFrame in which each new column corresponds to one category. In pandas 2.0 and later the values are True/False by default; pass dtype=int to get 0/1 instead.
- Effect: Converts categorical variables into binary indicator columns, making them suitable for machine learning algorithms.

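If a model or library expects numeric 0/1 values rather than booleans, the dtype parameter from the signature can be used. The following is a minimal sketch; the variable names as_int and converted are chosen for this illustration only.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red'])

# dtype=int asks for integer 0/1 columns instead of the default element type
as_int = pd.get_dummies(colors, dtype=int)
print(as_int)

# Converting an already-encoded result afterwards gives the same values
converted = pd.get_dummies(colors).astype(int)
print(as_int.equals(converted))
```

Passing dtype up front avoids a second pass over the data, but astype(int) is handy when the encoded frame already exists.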
Examples

Let's go through a series of examples, from simple to complex, to cover the main uses of pd.get_dummies().
Example 1: Basic Usage - One-Hot Encoding a Series

Example

    import pandas as pd

    # 1. Create a Series with categorical variables
    colors = pd.Series(['red', 'blue', 'green', 'red', 'green', 'blue'])

    print("=== Original Series ===")
    print(colors)

    # 2. Use pd.get_dummies() for one-hot encoding
    result = pd.get_dummies(colors)

    print("\n=== pd.get_dummies() One-Hot Encoding Result ===")
    print(result)

Expected Output:

    === Original Series ===
    0      red
    1     blue
    2    green
    3      red
    4    green
    5     blue
    dtype: object

    === pd.get_dummies() One-Hot Encoding Result ===
        blue  green    red
    0  False  False   True
    1   True  False  False
    2  False   True  False
    3  False  False   True
    4  False   True  False
    5   True  False  False

Code Explanation:

- The original Series contains three colors: red, blue, green.
- After one-hot encoding, each color becomes a column, with True/False indicating whether that row belongs to that category.
- Each row has exactly one True, corresponding to the original color value.

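The "exactly one True per row" property can be checked directly. The sketch below is one way to do so, not the only one; the names encoded, row_sums, and recovered are introduced here for illustration.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'green', 'blue'])
encoded = pd.get_dummies(colors)

# True counts as 1 when summed, so each row should total exactly 1
row_sums = encoded.sum(axis=1)
print((row_sums == 1).all())

# The name of the single "hot" column in each row recovers the original value
recovered = encoded.astype(int).idxmax(axis=1)
print(recovered.tolist() == colors.tolist())
```

This round trip only works when there are no missing values, since a NaN row has no hot column by default.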
Example 2: Encoding Specific Columns in a DataFrame

In data analysis, we often want to encode only specific categorical columns while keeping the other columns unchanged.

Example

    import pandas as pd

    # 1. Create a DataFrame with numerical and categorical variables
    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'age': [25, 30, 35, 28],
        'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou'],
        'department': ['Sales', 'Engineering', 'Sales', 'HR']
    })

    print("=== Original DataFrame ===")
    print(df)

    # 2. Perform one-hot encoding on specific categorical columns
    result = pd.get_dummies(df, columns=['city', 'department'])

    print("\n=== One-Hot Encoding on city and department columns ===")
    print(result)

Expected Output:

    === Original DataFrame ===
          name  age       city   department
    0    Alice   25    Beijing        Sales
    1      Bob   30   Shanghai  Engineering
    2  Charlie   35    Beijing        Sales
    3    Diana   28  Guangzhou           HR

    === One-Hot Encoding on city and department columns ===
          name  age  city_Beijing  city_Guangzhou  city_Shanghai  department_Engineering  department_HR  department_Sales
    0    Alice   25          True           False          False                   False          False              True
    1      Bob   30         False           False           True                    True          False             False
    2  Charlie   35          True           False          False                   False          False              True
    3    Diana   28         False            True          False                   False           True             False

Code Explanation:

- The columns parameter specifies that only the city and department columns are encoded.
- The numerical column age and the text column name remain unchanged.
- New column names use the default separator underscore, such as city_Beijing.

Example 3: Custom Prefixes and Separators

You can customize new column names using the prefix and prefix_sep parameters.

Example

    import pandas as pd

    # 1. Create a DataFrame
    df = pd.DataFrame({
        'color': ['red', 'blue', 'green', 'red'],
        'size': ['S', 'M', 'L', 'XL']
    })

    print("=== Original DataFrame ===")
    print(df)

    # 2. Use prefix parameter to customize prefixes
    result_prefix = pd.get_dummies(df, prefix=['color', 'size'])

    print("\n=== Using Custom Prefixes ===")
    print(result_prefix)

    # 3. Use prefix_sep to customize separator
    result_sep = pd.get_dummies(df, prefix=['color', 'size'], prefix_sep='-')

    print("\n=== Using Custom Separator '-' ===")
    print(result_sep)

Expected Output:

    === Original DataFrame ===
       color  size
    0    red     S
    1   blue     M
    2  green     L
    3    red    XL

    === Using Custom Prefixes ===
       color_blue  color_green  color_red  size_L  size_M  size_S  size_XL
    0       False        False       True   False   False    True    False
    1        True        False      False   False    True   False    False
    2       False         True      False    True   False   False    False
    3       False        False       True   False   False   False     True

    === Using Custom Separator '-' ===
       color-blue  color-green  color-red  size-L  size-M  size-S  size-XL
    0       False        False       True   False   False    True    False
    1        True        False      False   False    True   False    False
    2       False         True      False    True   False   False    False
    3       False        False       True   False   False   False     True

Code Explanation:

- prefix=['color', 'size'] assigns one prefix to each encoded column, in order. Here the prefixes happen to match the column names, so the result is the same as the default.
- prefix_sep='-' changes the default underscore to a hyphen, resulting in column names like color-red.

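The prefix parameter also accepts a dictionary, which makes the mapping from column to prefix explicit. A minimal sketch follows; the short prefixes c and s and the name result_dict are arbitrary choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'L', 'XL'],
})

# A dictionary maps each encoded column to its own prefix (c and s are arbitrary)
result_dict = pd.get_dummies(df, prefix={'color': 'c', 'size': 's'})
print(list(result_dict.columns))
# ['c_blue', 'c_green', 'c_red', 's_L', 's_M', 's_S', 's_XL']
```

The dictionary form does not depend on column order, which can be less error-prone than a list when the DataFrame layout changes.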
Example 4: Handling Missing Values and drop_first Parameter

Example

    import pandas as pd
    import numpy as np

    # 1. Data with missing values
    df = pd.DataFrame({
        'color': ['red', 'blue', np.nan, 'red', 'green'],
        'size': ['S', 'M', 'L', np.nan, 'XL']
    })

    print("=== DataFrame with Missing Values ===")
    print(df)

    # 2. Default behavior: a row with NaN gets False in every dummy column
    result_default = pd.get_dummies(df, columns=['color'])

    print("\n=== Default Behavior (No Missing Value Handling) ===")
    print(result_default)

    # 3. dummy_na=True creates a separate column for missing values
    result_na = pd.get_dummies(df, columns=['color'], dummy_na=True)

    print("\n=== dummy_na=True Creates Column for Missing Values ===")
    print(result_na)

    # 4. drop_first=True removes the first category (size_L) to avoid multicollinearity
    result_drop = pd.get_dummies(df, columns=['size'], drop_first=True)

    print("\n=== drop_first=True Removes First Column ===")
    print(result_drop)

Expected Output:

    === DataFrame with Missing Values ===
       color  size
    0    red     S
    1   blue     M
    2    NaN     L
    3    red   NaN
    4  green    XL

    === Default Behavior (No Missing Value Handling) ===
       size  color_blue  color_green  color_red
    0     S       False        False       True
    1     M        True        False      False
    2     L       False        False      False
    3   NaN       False        False       True
    4    XL       False         True      False

    === dummy_na=True Creates Column for Missing Values ===
       size  color_blue  color_green  color_red  color_nan
    0     S       False        False       True      False
    1     M        True        False      False      False
    2     L       False        False      False       True
    3   NaN       False        False       True      False
    4    XL       False         True      False      False

    === drop_first=True Removes First Column ===
       color  size_M  size_S  size_XL
    0    red   False    True    False
    1   blue    True   False    False
    2    NaN   False   False    False
    3    red   False   False    False
    4  green   False   False     True
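
The effect of dummy_na and drop_first is easiest to see by counting how many indicator columns are set in each row. The sketch below uses a reduced, single-column version of the data; the names default, with_na, and dropped are chosen for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red', 'green']})

default = pd.get_dummies(df)                    # the NaN row is all False
with_na = pd.get_dummies(df, dummy_na=True)     # extra indicator column for NaN
dropped = pd.get_dummies(df, drop_first=True)   # first category (blue) removed

print(default.sum(axis=1).tolist())   # [1, 1, 0, 1, 1]
print(with_na.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]
print(list(dropped.columns))          # ['color_green', 'color_red']
```

With drop_first=True, an all-False row can mean either the dropped category or a missing value, so it may be worth handling NaN explicitly before combining the two options.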