Pandas MultiIndex
MultiIndex is a powerful indexing feature in Pandas that allows creating multi-level hierarchies on rows or columns. It is particularly useful when dealing with high-dimensional data, group statistics, and panel data.
* * *
## Creating MultiIndex
### Creating from Lists
## Example
import pandas as pd
# Method 1: Create using arrays
arrays =[
["A","A","B","B","C","C"],
[1,2,1,2,1,2]
]
# Using pd.MultiIndex.from_arrays
index = pd.MultiIndex.from_arrays(arrays, names=["Category","Number"])
print("Created from arrays:")
print(index)
print()
# Method 2: Create using tuple list
tuples =[
("A",1),("A",2),("B",1),("B",2),("C",1),("C",2)
]
index = pd.MultiIndex.from_tuples(tuples, names=["Category","Number"])
print("Created from tuples:")
print(index)
print()
# Method 3: Using product
index = pd.MultiIndex.from_product(
[["A","B","C"],[1,2,3]],
names=["Category","Number"]
)
print("Created from product:")
print(index)
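Once an index has been built, it helps to check what it actually contains. The following is a minimal sketch, using a small two-by-two product index chosen only for illustration, of the attributes commonly used to inspect levels and names.

```python
import pandas as pd

# Small illustrative index: 2 categories x 2 numbers = 4 entries
index = pd.MultiIndex.from_product(
    [["A", "B"], [1, 2]],
    names=["Category", "Number"],
)

# Number of levels and their names
n_levels = index.nlevels
level_names = list(index.names)

# Unique values stored for each level
categories = list(index.levels[0])
numbers = list(index.levels[1])

# Labels of a single level, one per entry of the index
category_per_row = list(index.get_level_values("Category"))

print(n_levels)
print(level_names)
print(category_per_row)
```

`levels` holds the unique values per level, while `get_level_values` returns one label per row, which is usually what is needed for filtering.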
### Creating DataFrame with MultiIndex
## Example
import pandas as pd
import numpy as np
# Create DataFrame with MultiIndex
df = pd.DataFrame(
np.random.randn(6,4),
index=pd.MultiIndex.from_tuples(
[("2024","Q1"),("2024","Q2"),("2024","Q3"),
("2025","Q1"),("2025","Q2"),("2025","Q3")]
),
columns=["Beijing","Shanghai","Guangzhou","Shenzhen"]
)
df.index.names=["Year","Quarter"]
print("DataFrame with MultiIndex:")
print(df)
* * *
## Accessing MultiIndex
### Using loc/iloc
## Example
import pandas as pd
# Create sample data
df = pd.DataFrame({
"Chinese": [85,92,78,88],
"Math": [90,88,95,82]
}, index=pd.MultiIndex.from_tuples(
[("Grade 1","Class A"),("Grade 1","Class B"),("Grade 2","Class A"),("Grade 2","Class B")],
names=["Grade","Class"]
))
print("Original data:")
print(df)
print()
# Access outer index
print("Access all Grade 1:")
print(df.loc["Grade 1"])
print()
# Access inner index (slice(None) selects every value of the outer level)
print("Access Class A:")
print(df.loc[(slice(None),"Class A"),:])
print()
# Access multi-level index
print("Access Grade 1 Class A:")
print(df.loc[("Grade 1","Class A")])
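Selecting on an inner level with `loc` requires a slicer that also covers the outer level. One readable way to write this is `pd.IndexSlice`; the sketch below rebuilds the same sample data and assumes the index is sorted, as it is here.

```python
import pandas as pd

df = pd.DataFrame(
    {"Chinese": [85, 92, 78, 88], "Math": [90, 88, 95, 82]},
    index=pd.MultiIndex.from_tuples(
        [("Grade 1", "Class A"), ("Grade 1", "Class B"),
         ("Grade 2", "Class A"), ("Grade 2", "Class B")],
        names=["Grade", "Class"],
    ),
)

idx = pd.IndexSlice

# Every grade, Class A only; the trailing ":" selects all columns
class_a = df.loc[idx[:, "Class A"], :]

# A single cell: row (Grade 2, Class B), column "Math"
math_score = df.loc[("Grade 2", "Class B"), "Math"]

print(class_a)
print(math_score)
```

Unlike `xs`, this form keeps both index levels in the result, which is convenient when the selection is written back or joined later.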
### Using xs
## Example
import pandas as pd
# Create sample data
df = pd.DataFrame({
"Chinese": [85,92,78,88],
"Math": [90,88,95,82]
}, index=pd.MultiIndex.from_tuples(
[("Grade 1","Class A"),("Grade 1","Class B"),("Grade 2","Class A"),("Grade 2","Class B")],
names=["Grade","Class"]
))
# Use xs to access specific level values
print("Using xs to access Grade 1:")
print(df.xs("Grade 1", level="Grade"))
print()
print("Using xs to access Class A:")
print(df.xs("Class A", level="Class"))
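By default, `xs` removes the level it selected on. If the full index should be preserved, the `drop_level` parameter can be set to `False`. A short sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Chinese": [85, 92, 78, 88], "Math": [90, 88, 95, 82]},
    index=pd.MultiIndex.from_tuples(
        [("Grade 1", "Class A"), ("Grade 1", "Class B"),
         ("Grade 2", "Class A"), ("Grade 2", "Class B")],
        names=["Grade", "Class"],
    ),
)

# Default: the "Class" level is dropped, leaving only "Grade"
dropped = df.xs("Class A", level="Class")

# drop_level=False keeps both levels in the result's index
kept = df.xs("Class A", level="Class", drop_level=False)

print(dropped)
print(kept)
```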
* * *
## Transforming MultiIndex
### Stacking and Unstacking
## Example
import pandas as pd
import numpy as np
# Create wide format DataFrame
df = pd.DataFrame(
np.arange(12).reshape(3,4),
index=pd.MultiIndex.from_tuples(
[("Beijing","2024"),("Shanghai","2024"),("Guangzhou","2024")]
),
columns=pd.MultiIndex.from_tuples(
[("Q1","Revenue"),("Q1","Profit"),("Q2","Revenue"),("Q2","Profit")]
)
)
print("Original data (nested columns):")
print(df)
print()
# unstack: convert inner index to columns
df_unstacked = df.unstack()
print("After unstack:")
print(df_unstacked)
print()
# stack: convert columns back to inner index
df_stacked = df_unstacked.stack()
print("After stack:")
print(df_stacked)
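Both `unstack` and `stack` operate on the innermost level unless told otherwise. The sketch below uses a small Series with made-up revenue figures to show how the `level` argument chooses which index level moves to the columns.

```python
import pandas as pd

# Illustrative data: revenue per city and quarter
sales = pd.Series(
    [10, 20, 30, 40],
    index=pd.MultiIndex.from_product(
        [["Beijing", "Shanghai"], ["Q1", "Q2"]],
        names=["City", "Quarter"],
    ),
    name="Revenue",
)

# Move the "Quarter" level from the rows to the columns
wide = sales.unstack(level="Quarter")

# Move the "City" level instead: cities become the columns
wide_by_city = sales.unstack(level="City")

# stack() turns the columns back into the innermost row level
round_trip = wide.stack()

print(wide)
print(wide_by_city)
print(round_trip)
```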
* * *
## Sorting MultiIndex
## Example
import pandas as pd
# Create DataFrame with shuffled indices
df = pd.DataFrame({
"Value": [1,2,3,4,5,6]
}, index=pd.MultiIndex.from_tuples(
[("C",2),("A",1),("B",2),("A",2),("C",1),("B",1)],
names=["Letter","Number"]
))
print("Shuffled data:")
print(df)
print()
# Default sort: by all levels, outermost level first
df_sorted1 = df.sort_index()
print("Sorted by all levels (default):")
print(df_sorted1)
print()
# Sort by inner level
df_sorted2 = df.sort_index(level=1)
print("Sorted by inner level:")
print(df_sorted2)
print()
# Multi-level sort
df_sorted3 = df.sort_index(level=[0,1])
print("Sorted by multiple levels:")
print(df_sorted3)
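Sort direction can also differ per level, and a sorted index is what makes label slicing on a MultiIndex reliable. A minimal sketch on the same shuffled data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Value": [1, 2, 3, 4, 5, 6]},
    index=pd.MultiIndex.from_tuples(
        [("C", 2), ("A", 1), ("B", 2), ("A", 2), ("C", 1), ("B", 1)],
        names=["Letter", "Number"],
    ),
)

# The shuffled index is not in sorted order
was_sorted = df.index.is_monotonic_increasing

# Letters ascending, numbers descending within each letter
mixed = df.sort_index(level=["Letter", "Number"], ascending=[True, False])

# After a full ascending sort, label slices on the outer level work
sorted_df = df.sort_index()
slice_a_to_b = sorted_df.loc["A":"B"]

print(mixed)
print(slice_a_to_b)
```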
* * *
## Practical Example: Group Statistics
MultiIndex is ideal for group statistics and pivot analysis.
## Example
import pandas as pd
import numpy as np
# Create sales data
np.random.seed(42)
df = pd.DataFrame({
"Year": [2024] * 6 + [2025] * 6,  # year values assumed for illustration
"Quarter": ["Q1","Q2","Q3","Q4"] * 3,
"Product": ["Phone","Phone","Computer","Computer"] * 3,
"Region": ["East China","South China","North China"] * 4,
"Sales": np.random.randint(100,500,12)
})
print("Original sales data:")
print(df)
print()
# Set MultiIndex and perform group statistics
df_grouped = df.set_index(["Year","Quarter","Product","Region"])
print("Grouped by Year, Quarter, Product, Region:")
print(df_grouped)
# Summarize by year
yearly = df_grouped.groupby(level="Year").sum()
print("\nAnnual Sales:")
print(yearly)
# Summarize by year and quarter
quarterly = df_grouped.groupby(level=["Year","Quarter"]).sum()
print("\nQuarterly Sales:")
print(quarterly)
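For the pivot side of the analysis, a grouped result can be reshaped with `unstack` so that one level becomes the columns. The sketch below uses a small hypothetical sales table (the figures are invented for illustration) rather than the random data above.

```python
import pandas as pd

# Hypothetical sales records; some (Year, Quarter) pairs appear twice
df = pd.DataFrame({
    "Year": [2024, 2024, 2024, 2025, 2025, 2025],
    "Quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "Sales": [100, 50, 150, 120, 80, 100],
}).set_index(["Year", "Quarter"])

# Total sales per (Year, Quarter) pair
quarterly = df.groupby(level=["Year", "Quarter"])["Sales"].sum()

# Pivot: years as rows, quarters as columns
pivot = quarterly.unstack("Quarter")

# Add a row total across quarters
pivot["Total"] = pivot.sum(axis=1)

print(pivot)
```

Grouping by index level and then unstacking gives a result similar to `pivot_table`, while keeping the intermediate MultiIndex Series available for further grouping.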
* * *
## Common Issues and Considerations
**1. Confused Index Levels**
Be careful when manipulating data to avoid level confusion. It's recommended to check `df.index.names` before and after operations.
**2. Incorrect Index Access**
`loc` uses label-based access, while `iloc` uses position-based access. Do not mix them.
**3. Duplicate Indices During concat**
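When concatenating data with `concat`, inputs that share index labels produce duplicate entries in the result, so later label-based lookups may return several rows instead of one. Passing `ignore_index=True` replaces the index with a fresh integer range; note that this also discards the MultiIndex levels.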
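The concat issue can be reproduced in a few lines. The sketch below uses two small illustrative frames with identical MultiIndex labels and compares three options: resetting the index, adding a distinguishing outer level with `keys`, and asking pandas to raise on duplicates with `verify_integrity=True`.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2)], names=["Letter", "Number"]
)
first = pd.DataFrame({"Value": [10, 20]}, index=index)
second = pd.DataFrame({"Value": [30, 40]}, index=index)

# Plain concat keeps both copies of every label
combined = pd.concat([first, second])
has_duplicates = not combined.index.is_unique

# ignore_index=True replaces the MultiIndex with 0..n-1
flat = pd.concat([first, second], ignore_index=True)

# keys= adds an outer level recording which input each row came from
labelled = pd.concat([first, second], keys=["first", "second"])

# verify_integrity=True raises instead of silently keeping duplicates
try:
    pd.concat([first, second], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(has_duplicates, raised)
print(labelled)
```

Using `keys` keeps the original levels and makes every row uniquely addressable, so it is often a better fit than `ignore_index=True` when the MultiIndex carries meaning.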
> MultiIndex is a core feature of Pandas for handling high-dimensional data. Proper use of MultiIndex can make data organization clearer and statistical analysis more convenient.