Pandas Index
The index is a core component of pandas data structures: it determines how data is labeled, organized, and accessed. pandas supports several index types, from the simple RangeIndex to the hierarchical MultiIndex. This section covers how to use the most common ones.
Basic Concepts of Index
An index is similar to a primary key in a database or the row numbers in Excel: it labels each row of data. Unlike a primary key, its values are not required to be unique. A DataFrame has a row index (index) and a column index (columns).
Example
import pandas as pd

# Create a simple DataFrame (a RangeIndex is used by default)
df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu"],
    "Age": [25, 30, 28],
    "City": ["Beijing", "Shanghai", "Guangzhou"]
})

print("DataFrame info:")
print(f"Row index: {df.index.tolist()}")
print(f"Column index: {df.columns.tolist()}")
print("\nData:")
print(df)
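Because the row index supplies labels, a row can be reached either by its label or by its integer position. The following is a minimal sketch of that difference; the labels "a", "b", and "c" are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical string labels, used only to make labels and positions differ
df = pd.DataFrame({"Age": [25, 30, 28]}, index=["a", "b", "c"])

# Label-based access goes through the index
by_label = df.loc["b", "Age"]

# Position-based access ignores the labels (row 1, column 0)
by_position = df.iloc[1, 0]

print(by_label, by_position)
```

Both lookups reach the same cell here; they diverge as soon as the labels are reordered or filtered.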
RangeIndex
RangeIndex is the default integer index. Like Python's range(n), it starts at 0 and increases by 1.
Creation and Usage
Example
import pandas as pd

# Create a RangeIndex explicitly
idx = pd.RangeIndex(start=0, stop=10, step=1)
print(f"RangeIndex: {idx}")
print(f"Type: {type(idx)}")

# A DataFrame created without an explicit index uses a RangeIndex
df = pd.DataFrame({"A": [1, 2, 3]})
print(f"\nDefault index type: {type(df.index)}")

# reset_index() also produces a RangeIndex
df_reset = df.reset_index()
print(f"Index type after reset: {type(df_reset.index)}")
Characteristics of RangeIndex
- Minimal memory usage, because only start, stop, and step are stored
- Supports access by default integer position
- Easy to convert to other index types
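The memory characteristic can be checked directly. The sketch below compares a RangeIndex with an integer index holding the same values; the size n is an arbitrary choice for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # arbitrary size, chosen only for illustration

# RangeIndex keeps start, stop, and step rather than every value
range_idx = pd.RangeIndex(n)

# An index built from an array stores each value explicitly
int_idx = pd.Index(np.arange(n))

print(f"RangeIndex: {range_idx.memory_usage()} bytes")
print(f"Integer index: {int_idx.memory_usage()} bytes")
```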
Index Type Conversion
An Index can be converted to plain Python and NumPy types, and reset back to the default RangeIndex.
Converting to Other Index Types
Example
import pandas as pd

# Create an example DataFrame with a custom integer index
df = pd.DataFrame({"Value": [1, 2, 3, 4]}, index=[10, 20, 30, 40])

# Inspect the index and its type
print("Original index:", df.index)
print("Index type:", type(df.index))

# Convert to a Python list
idx_list = df.index.tolist()
print(f"As a list: {idx_list}")

# Convert to a NumPy array
idx_array = df.index.values
print(f"As an array: {idx_array}")

# Reset the index to a RangeIndex
df = df.reset_index(drop=True)
print(f"Reset to RangeIndex: {df.index}")
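Beyond lists and arrays, an Index can also be turned into other pandas objects. This is a small sketch of a few such conversions; the index name "key" is a hypothetical name introduced for the example.

```python
import pandas as pd

# "key" is a hypothetical index name used for illustration
idx = pd.Index([10, 20, 30, 40], name="key")

as_str = idx.astype(str)               # integer labels become string labels
as_series = idx.to_series()            # Series whose index and values are the labels
as_frame = idx.to_frame(index=False)   # one-column DataFrame with a fresh RangeIndex
as_array = idx.to_numpy()              # NumPy array of the labels

print(as_str.tolist())
print(as_frame)
```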
Setting Custom Index
You can use one or more DataFrame columns as the row index.
Using set_index
Example
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    "Student ID": ["S001", "S002", "S003", "S004"],
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"],
    "Score": [85, 92, 78, 90]
})

print("Original data:")
print(df)
print()

# Use the "Student ID" column as the index
df1 = df.set_index("Student ID")
print("Student ID as index:")
print(df1)
print()

# Use several columns as the index (creates a MultiIndex)
df2 = df.set_index(["Student ID", "Name"])
print("Multiple columns as index:")
print(df2)
Using the index Parameter When Creating
Example
import pandas as pd

# Specify the index directly when creating the DataFrame
df = pd.DataFrame(
    {"Name": ["Zhang San", "Li Si", "Wang Wu"], "Age": [25, 30, 28]},
    index=["A001", "A002", "A003"]
)

print(df)
print(f"\nIndex: {df.index.tolist()}")

# Use a DatetimeIndex
dates = pd.date_range("2024-01-01", periods=3, freq="D")
df_date = pd.DataFrame({"Value": [100, 200, 300]}, index=dates)

print("\nDate index:")
print(df_date)
print(f"Index type: {type(df_date.index)}")
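One practical benefit of a DatetimeIndex is that rows can be selected with date strings. The sketch below assumes a small daily series that spans a month boundary; the dates and values are made up for illustration.

```python
import pandas as pd

# Four consecutive days crossing from January into February (illustrative data)
dates = pd.date_range("2024-01-30", periods=4, freq="D")
df_date = pd.DataFrame({"Value": [100, 200, 300, 400]}, index=dates)

# A month-level string selects every row in that month
february = df_date.loc["2024-02"]

# A string slice selects a date range; both endpoints are included
window = df_date.loc["2024-01-31":"2024-02-01"]

print(february)
print(window)
```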
Index Operations
Resetting the Index
Example
import pandas as pd

# Create a DataFrame with a custom index
df = pd.DataFrame(
    {"Name": ["Zhang San", "Li Si"], "Score": [85, 92]},
    index=["A001", "A002"]
)

# Reset to the default RangeIndex; the old index becomes a column
df_reset = df.reset_index()
print("Reset index:")
print(df_reset)

# drop=True discards the old index instead of keeping it as a column
df_reset2 = df.reset_index(drop=True)
print("\nReset and drop the original index:")
print(df_reset2)
Reindexing
Example
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 2, 3])

# Reindex to a new set of labels; labels not in the original get NaN
df_reindex = df.reindex([1, 2, 3, 4, 5])
print("Reindex (missing rows filled with NaN):")
print(df_reindex)

# Use fill_value to fill the missing rows
df_reindex2 = df.reindex([1, 2, 3, 4, 5], fill_value=0)
print("\nReindex (missing rows filled with 0):")
print(df_reindex2)
Index Level Operations
Example
import pandas as pd

# Create a DataFrame with a MultiIndex
df = pd.DataFrame({
    "Chinese": [85, 92, 78],
    "Math": [90, 88, 95]
}, index=pd.MultiIndex.from_tuples(
    [("Grade 10", "Class A"), ("Grade 10", "Class B"), ("Grade 11", "Class A")],
    names=["Grade", "Class"]
))

print("Multi-level index DataFrame:")
print(df)
print()

# Get the values of each index level
print(f"Outer index (Grade): {df.index.get_level_values(0).tolist()}")
print(f"Inner index (Class): {df.index.get_level_values(1).tolist()}")
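Index levels can also be used for selection and aggregation, not only inspected. The following sketch reuses the same Grade/Class data and shows a few level-based operations; it illustrates one possible approach rather than a complete list.

```python
import pandas as pd

df = pd.DataFrame(
    {"Chinese": [85, 92, 78], "Math": [90, 88, 95]},
    index=pd.MultiIndex.from_tuples(
        [("Grade 10", "Class A"), ("Grade 10", "Class B"), ("Grade 11", "Class A")],
        names=["Grade", "Class"],
    ),
)

# Select by the outer level; the result is indexed by the remaining level
grade10 = df.loc["Grade 10"]

# Take a cross-section on the inner level
class_a = df.xs("Class A", level="Class")

# Aggregate over one level
by_grade = df.groupby(level="Grade").mean()

# Swap the level order, then sort so the new outer level is grouped
swapped = df.swaplevel().sort_index()

print(grade10)
print(class_a)
print(by_grade)
```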
Index Attributes and Methods
| Attribute/Method | Description | Example |
|---|---|---|
| index.tolist() | Convert to a Python list | df.index.tolist() |
| index.values | Convert to a NumPy array | df.index.values |
| index.unique() | Get the unique values | df.index.unique() |
| index.is_unique | Check whether all index values are unique | df.index.is_unique |
| index.astype() | Convert the index to another data type | df.index.astype(str) |
Index Attribute Usage Examples
Example
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame(
    {"Value": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"]
)

# Inspect index attributes
print(f"Index is unique: {df.index.is_unique}")
print(f"Index length: {len(df.index)}")
print(f"Index data type: {df.index.dtype}")

# Convert to string type
str_index = df.index.astype(str)
print(f"Type after conversion: {str_index.dtype}")
Practical: Using Index to Improve Query Efficiency
In real-world data analysis, setting an appropriate index can make label-based lookups noticeably faster, especially when the same columns are queried repeatedly.
Example
import pandas as pd

# Simulated business data
df = pd.DataFrame({
    "Order ID": range(1000),
    "Customer ID": [f"C{i % 100:03d}" for i in range(1000)],
    "Product": [f"Product{i % 20}" for i in range(1000)],
    "Amount": [round(i * 1.5, 2) for i in range(1000)]
})

# Use the frequently queried columns as the index;
# sorting it helps pandas look up MultiIndex labels efficiently
df_indexed = df.set_index(["Customer ID", "Product"]).sort_index()

# Look up rows by index label (similar to a database primary-key query)
result = df_indexed.loc[("C001", "Product1")]
print("Orders for customer C001 and Product1:")
print(result)

# Total spending per customer (sum only the numeric Amount column)
print("\nTotal spending per customer:")
customer_total = df.groupby("Customer ID")["Amount"].sum()
print(customer_total.head(10))
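An index lookup is an alternative to filtering with a boolean mask, and both should return the same rows. The sketch below checks that equivalence on a reduced version of the simulated data; it makes no timing claims, since actual speed depends on data size, index type, and pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [f"C{i % 100:03d}" for i in range(1000)],
    "Amount": [i * 1.5 for i in range(1000)],
})

# Option 1: a boolean mask compares every row on each query
mask_result = df[df["Customer ID"] == "C001"]["Amount"]

# Option 2: build a sorted index once, then look rows up by label
indexed = df.set_index("Customer ID").sort_index(kind="mergesort")
loc_result = indexed.loc["C001", "Amount"]

print(len(mask_result), len(loc_result))
```

The second option pays the cost of set_index and sort_index up front, so it tends to be worthwhile only when many lookups follow.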
Important Notes
1. Index values should usually be unique
pandas does not require index values to be unique, but duplicates change how some operations behave: a loc lookup returns every matching row, and some operations (such as reindex) raise an error.
2. Indexes can have names
Naming an index can improve code readability: df.index.name = "Student ID"
3. Indexes follow data operations
Slicing, filtering, and similar operations preserve the original index labels, so the result may no longer start at 0 or be contiguous. Pay attention to which label corresponds to which row.
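The three notes above can be illustrated with a short sketch. The student IDs and scores are made-up values, and the variable names are chosen only for this example.

```python
import pandas as pd

# Note 1: duplicate labels are allowed, and loc returns every match
dup = pd.DataFrame({"Score": [85, 92, 78]}, index=["S001", "S001", "S002"])
print(dup.index.is_unique)  # False
matches = dup.loc["S001"]   # DataFrame with both "S001" rows
duplicated_labels = dup.index[dup.index.duplicated()].tolist()

# Note 2: a named index keeps its name when it becomes a column again
named = pd.DataFrame({"Score": [85, 92]}, index=["S001", "S002"])
named.index.name = "Student ID"
restored = named.reset_index()

# Note 3: filtering keeps the original labels
df = pd.DataFrame({"Score": [85, 92, 78, 90]})
high = df[df["Score"] >= 90]
print(high.index.tolist())  # [1, 3]

# Position 0 of the filtered frame is the row labeled 1
first_high = high.iloc[0]["Score"]
```

After filtering, high.loc[0] would raise a KeyError because label 0 is no longer present; reset_index(drop=True) restores a 0-based index when positions and labels should match again.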
Choosing a good index matters for pandas performance. Setting a frequently queried column as the index can speed up lookups, but building or resetting an index has its own cost, so the trade-off needs to be weighed: set the index once and reuse it rather than re-indexing for every query.