
Pandas Categorical

Categorical is a pandas data type for columns that take a limited, fixed set of values, such as gender, education level, or grades. Because each distinct value is stored only once, converting repetitive data to a categorical type can significantly reduce memory usage and can speed up operations such as grouping and sorting.

* * *

## Creating Categorical Data

### Creating from Series

Example:

```python
import pandas as pd

# Create categorical data
s = pd.Series(["male", "female", "male", "female", "male"], dtype="category")
print("Basic categorical data:")
print(s)
print(f"Type: {s.dtype}")
print()

# Specify category order
s2 = pd.Series(
    ["low", "medium", "high", "medium", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)
print("Ordered categories:")
print(s2)
print(f"Category order: {s2.dtype.categories.tolist()}")
```

### Creating from List

Example:

```python
import pandas as pd

# Using pd.Categorical with an explicit list of categories
categories = pd.Categorical(
    ["A", "B", "A", "C", "B"],
    categories=["A", "B", "C"]
)
s = pd.Series(categories)
print("Created from Categorical:")
print(s)
```

* * *

## Category Attributes and Methods

### Accessing Category Information

Example:

```python
import pandas as pd

s = pd.Series(
    ["low", "medium", "high", "medium", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

# Category attributes
print(f"Categories: {s.dtype.categories.tolist()}")
print(f"Is ordered: {s.dtype.ordered}")
print()

# Count occurrences of each category
print("Category counts:")
print(s.value_counts())
```

### Modifying Categories

The methods on the `.cat` accessor return a new Series and leave the original unchanged.

Example:

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A", "B"], dtype="category")

# Rename categories
print("Renaming categories:")
s2 = s.cat.rename_categories({"A": "Excellent", "B": "Good", "C": "Pass"})
print(s2)
print()

# Add new categories ("D" is an illustrative value, assumed for this example)
print("Adding categories:")
s3 = s.cat.add_categories(["D"])
print(s3.dtype.categories.tolist())
print()

# Remove categories ("C" is an illustrative choice, assumed for this example);
# values that belonged to a removed category become NaN
print("Removing categories:")
s4 = s.cat.remove_categories(["C"])
print(s4.dtype.categories.tolist())
```

* * *

## Ordered Categories

Ordered categories support comparison operations, which makes them suitable for data with a natural order or hierarchy.

Example:

```python
import pandas as pd

# Create ordered categories
s = pd.Series(
    ["low", "medium", "high", "high", "low", "medium"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

# Comparison operations return a boolean Series, one result per element
print("s > 'medium':")
print(s > "medium")
print("s > 'low':")
print(s > "low")
print()

# Sorting follows the category order, not alphabetical order
print("After sorting:")
print(s.sort_values())
```

* * *

## Practical Example: Data Grouping

Example:

```python
import pandas as pd

# Simulate user level data
# Assumed mapping of levels to counts: 100 Regular, 300 Silver, 400 Gold, 200 VIP
users = pd.DataFrame({
    "UserID": range(1, 1001),
    "Level": pd.Categorical(
        ["Regular"] * 100 + ["Silver"] * 300 + ["Gold"] * 400 + ["VIP"] * 200,
        categories=["Regular", "Silver", "Gold", "VIP"],
        ordered=True
    ),
    "Spending": [1000, 500, 100, 50] * 250
})

print("User level distribution:")
print(users["Level"].value_counts().sort_index())
print()

# Group by level and total the spending (column selection assumed here,
# so that UserID is not summed as well)
print("Summary by level:")
print(users.groupby("Level", observed=True)["Spending"].sum())
```

* * *

## Memory Optimization

The categorical type can greatly reduce memory usage for string data that contains many repeated values.

Example:

```python
import pandas as pd
import numpy as np

# Create a large amount of repeated string data
np.random.seed(42)
s_str = pd.Series(np.random.choice(["Beijing", "Shanghai", "Guangzhou", "Shenzhen"], 1000000))

# Convert to categorical type
s_cat = s_str.astype("category")

# Memory comparison
print(f"String type memory: {s_str.memory_usage(deep=True) / 1024 / 1024:.2f} MB")
print(f"Categorical type memory: {s_cat.memory_usage(deep=True) / 1024 / 1024:.2f} MB")
print(f"Saved: {(1 - s_cat.memory_usage(deep=True) / s_str.memory_usage(deep=True)) * 100:.1f}%")
```

> When a column contains many repeated string values, converting it to a categorical type can significantly reduce memory usage, which is especially useful for large datasets.
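
The saving comes from how categorical data is stored: each distinct value is kept once in the categories index, and the column itself holds one small integer code per row that points into that index. The following is a minimal sketch that inspects both pieces through the `.cat` accessor. The `cities` Series and the other variable names are made up for illustration.

```python
import pandas as pd

# Small illustrative Series (values chosen only for this example)
cities = pd.Series(["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai"])
cat = cities.astype("category")

# Each distinct string is stored once; by default the categories are sorted
categories = cat.cat.categories.tolist()

# The column itself stores one integer code per row
codes = cat.cat.codes.tolist()

print("Categories:", categories)
print("Codes:", codes)
print("Code dtype:", cat.cat.codes.dtype)

# Looking each code up in the categories recovers the original values
decoded = [categories[c] for c in codes]
print("Round trip OK:", decoded == cities.tolist())
```

Because the codes are small integers, the per-row cost stays low however long the strings are. The benefit shrinks as the number of distinct values approaches the number of rows, so it is worth checking `memory_usage(deep=True)` before and after converting a column.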