Pandas Categorical
Categorical is a pandas data type for data that takes a limited, fixed set of possible values, such as gender, education level, or grades. When a column contains many repeated values, the categorical type can significantly reduce memory usage and can improve the performance of some operations.
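The memory saving comes from how the values are stored: each distinct value is kept once, and every row holds only a small integer code. The following minimal sketch (the sample data is illustrative) inspects both parts through the `.cat` accessor:

```python
import pandas as pd

s = pd.Series(["male", "female", "male", "female", "male"], dtype="category")

# Each distinct value is stored once, in the categories index
categories = s.cat.categories.tolist()

# Each row holds a small integer code pointing into that index
codes = s.cat.codes.tolist()

print("Categories:", categories)
print("Codes:", codes)
```

When the categories are inferred from the data, as here, pandas sorts them, so `"female"` receives code 0 and `"male"` code 1.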
* * *
## Creating Categorical Data
### Creating from Series
## Example
import pandas as pd
# Create categorical data
s = pd.Series(["male","female","male","female","male"], dtype="category")
print("Basic categorical data:")
print(s)
print(f"Type: {s.dtype}")
print()
# Specify category order
s2 = pd.Series(
    ["low", "medium", "high", "medium", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True),
)
print("Ordered categories:")
print(s2)
print(f"Category order: {s2.dtype.categories.tolist()}")
### Creating from List
## Example
import pandas as pd
# Using pd.Categorical
categories = pd.Categorical(
    ["A", "B", "A", "C", "B"],
    categories=["A", "B", "C"],
)
s = pd.Series(categories)
print("Created from Categorical:")
print(s)
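One consequence of passing `categories` explicitly is worth knowing: a value that is not in the declared list cannot be represented and is stored as a missing value. A small sketch, using an illustrative value `"D"` that is deliberately left out of the categories:

```python
import pandas as pd

# "D" is not among the declared categories, so it cannot be represented
values = pd.Categorical(["A", "B", "D"], categories=["A", "B", "C"])
s = pd.Series(values)

missing = s.isna().tolist()
codes = s.cat.codes.tolist()  # missing values get the code -1

print(s)
print("Missing values:", missing)
print("Codes:", codes)
```

No error is raised in this case, so it can be worth checking `isna()` after building categorical data from a fixed category list.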
* * *
## Category Attributes and Methods
### Accessing Category Information
## Example
import pandas as pd
s = pd.Series(
    ["low", "medium", "high", "medium", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True),
)
# Category attributes
print(f"Categories: {s.dtype.categories.tolist()}")
print(f"Is ordered: {s.dtype.ordered}")
print()
# Count occurrences of each category
print("Category counts:")
print(s.value_counts())
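On categorical data, `value_counts()` reports every declared category, including those that never occur. The sketch below uses illustrative data in which `"medium"` is declared but absent:

```python
import pandas as pd

s = pd.Series(
    ["low", "high", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True),
)

# All three categories appear in the result, even the unused one
counts = s.value_counts()

print(counts)
print("Count for 'medium':", counts["medium"])
```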
### Modifying Categories
## Example
import pandas as pd
s = pd.Series(["A", "B", "C", "A", "B"], dtype="category")
# Rename categories
print("Renaming categories:")
s2 = s.cat.rename_categories({"A": "Excellent", "B": "Good", "C": "Pass"})
print(s2)
print()
# Add new categories (the names "D" and "E" are illustrative)
print("Adding categories:")
s3 = s.cat.add_categories(["D", "E"])
print(s3.dtype.categories.tolist())
print()
# Remove a category ("C" is an illustrative choice); values in it become NaN
print("Removing categories:")
s4 = s.cat.remove_categories(["C"])
print(s4.dtype.categories.tolist())
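The `.cat` accessor offers a few related methods that may also be useful when adjusting categories. The following is a brief sketch with illustrative data, not an exhaustive tour:

```python
import pandas as pd

s = pd.Series(["A", "B", "A"], dtype=pd.CategoricalDtype(["A", "B", "C"]))

# Drop categories that never occur in the data ("C" here)
trimmed = s.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())

# Change the category order and mark the result as ordered
reordered = s.cat.reorder_categories(["C", "B", "A"], ordered=True)
print(reordered.cat.categories.tolist(), reordered.cat.ordered)

# Replace the category set in one step; values not listed ("B") become NaN
replaced = s.cat.set_categories(["A", "D"])
print(replaced.tolist())
```

Each of these methods returns a new Series and leaves `s` unchanged.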
* * *
## Ordered Categories
Ordered categories support comparison operations and sort according to the declared category order, which makes them suitable for data with a natural ranking or hierarchy.
## Example
import pandas as pd
# Create ordered categories
s = pd.Series(
    ["low", "medium", "high", "high", "low", "medium"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True),
)
# Comparison operations return a boolean Series, one result per element
print("Values above medium:")
print(s > "medium")
print("Values above low:")
print(s > "low")
print()
# Sorting
print("After sorting:")
print(s.sort_values())
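The declared order also drives `min()` and `max()`, and ordering comparisons are rejected when the categories are unordered. A minimal sketch with illustrative data:

```python
import pandas as pd

levels = ["low", "medium", "high"]
ordered = pd.Series(
    ["medium", "high", "low"],
    dtype=pd.CategoricalDtype(levels, ordered=True),
)
unordered = pd.Series(["medium", "high", "low"], dtype="category")

# min/max follow the declared category order, not alphabetical order
lowest = ordered.min()
highest = ordered.max()
print("Lowest:", lowest)
print("Highest:", highest)

# Without an order, only equality comparisons are allowed
try:
    unordered > "low"
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Alphabetically, `"high"` would sort before `"low"`, so the result shows that the category order is what counts.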
* * *
## Practical Example: Data Grouping
## Example
import pandas as pd
# Simulate user level data
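users = pd.DataFrame({
    "UserID": range(1, 1001),
    "Level": pd.Categorical(
        # 100 + 300 + 400 + 200 = 1000 users; the level assigned to each
        # block is assumed to follow the category order
        ["Regular"] * 100 + ["Silver"] * 300 + ["Gold"] * 400 + ["VIP"] * 200,
        categories=["Regular", "Silver", "Gold", "VIP"],
        ordered=True,
    ),
    "Spending": [1000, 500, 100, 50] * 250,
})
print("User level distribution:")
print(users["Level"].value_counts().sort_index())
print()
# Group by level and total the spending in each group
print("Summary by level:")
print(users.groupby("Level", observed=True)["Spending"].sum())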
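The `observed` argument controls whether categories that do not occur in the data appear in the grouped result. The sketch below uses a small illustrative frame in which `"Silver"` and `"VIP"` are declared but unused:

```python
import pandas as pd

df = pd.DataFrame({
    "Level": pd.Categorical(
        ["Regular", "Gold", "Regular"],
        categories=["Regular", "Silver", "Gold", "VIP"],
        ordered=True,
    ),
    "Spending": [100, 500, 50],
})

# observed=True: only levels that appear in the data
seen = df.groupby("Level", observed=True)["Spending"].sum()

# observed=False: every declared level, with empty groups summing to 0
all_levels = df.groupby("Level", observed=False)["Spending"].sum()

print(seen)
print(all_levels)
```

Passing `observed` explicitly keeps the behavior the same regardless of the default in the installed pandas version.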
* * *
## Memory Optimization
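The categorical type can greatly reduce memory usage for string data with many repeated values, because each distinct string is stored only once and every row holds a small integer code instead.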
## Example
import pandas as pd
import numpy as np
# Create large amounts of repeated string data
np.random.seed(42)
s_str = pd.Series(np.random.choice(["Beijing","Shanghai","Guangzhou","Shenzhen"],1000000))
# Convert to categorical type
s_cat = s_str.astype("category")
# Memory comparison
print(f"String type memory: {s_str.memory_usage(deep=True) / 1024 / 1024:.2f} MB")
print(f"Categorical type memory: {s_cat.memory_usage(deep=True) / 1024 / 1024:.2f} MB")
print(f"Saved: {(1 - s_cat.memory_usage(deep=True) / s_str.memory_usage(deep=True)) * 100:.1f}%")
> When a column contains many repeated string values, converting it to a categorical type can significantly reduce memory usage, which matters most for large datasets.
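The saving depends on repetition. As a counterexample, the following sketch builds a column in which every value is distinct (the `user_{i}` strings are made up for illustration) and compares the two representations:

```python
import pandas as pd

n = 100_000
# Every value is distinct, so the categorical type has nothing to deduplicate
unique_str = pd.Series([f"user_{i}" for i in range(n)])
unique_cat = unique_str.astype("category")

str_bytes = unique_str.memory_usage(deep=True)
cat_bytes = unique_cat.memory_usage(deep=True)

print(f"String type:      {str_bytes / 1024 / 1024:.2f} MB")
print(f"Categorical type: {cat_bytes / 1024 / 1024:.2f} MB")
print("Categorical is larger:", cat_bytes > str_bytes)
```

Here the categorical version has to store every string plus one code per row, so it takes more memory than the plain strings. Identifier-like columns are usually poor candidates for conversion.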