YouTip LogoYouTip

Pandas Duplicate

Duplicate data is a common issue in data analysis that can affect the accuracy of statistical results. Pandas provides complete functionality for detecting and handling duplicate data. * * * ## Detecting Duplicate Data ### duplicated Method ## Example import pandas as pd # Create a DataFrame with duplicate rows df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhang San","Li Si","Zhao Liu"], "Age": [25,30,28,25,30,35], "City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen","Beijing"] }) print("Original Data:") print(df) print() # Detect duplicate rows (default marks from first row, keep='first') print("Detecting duplicate rows:") print(df.duplicated()) print() # Count duplicate rows print(f"Number of duplicate rows: {df.duplicated().sum()}") print() # Detect duplicates in specific columns print("Detecting duplicates by 'Name' column:") print(df.duplicated(subset=)) ### keep Parameter ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"], "Age": [25,30,25,28,30] }) print("Original Data:") print(df) print() # keep='first': keep the first occurrence (default) print("keep='first' (default):") print(df.duplicated(keep="first")) print() # keep='last': keep the last occurrence print("keep='last':") print(df.duplicated(keep="last")) print() # keep=False: mark all duplicates (don't keep any) print("keep=False (mark all duplicates):") print(df.duplicated(keep=False)) * * * ## Removing Duplicate Data ### drop_duplicates Basic Usage ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"], "Age": [25,30,25,28,30], "City": ["Beijing","Shanghai","Beijing","Guangzhou","Shenzhen"] }) print("Original Data:") print(df) print() # Remove duplicate rows (default keeps first) print("Removing duplicate rows (keeping first):") print(df.drop_duplicates()) print() # Keep last print("Removing duplicate rows (keeping last):") print(df.drop_duplicates(keep="last")) print() # Don't keep any duplicates print("Removing all duplicate rows:") print(df.drop_duplicates(keep=False)) ### Remove Duplicates by Column ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Zhang San","Wang Wu"], "Age": [25,30,35,28], "City": ["Beijing","Shanghai","Beijing","Guangzhou"] }) print("Original Data:") print(df) print() # Remove duplicates by single column print("Remove duplicates by 'Name' column:") print(df.drop_duplicates(subset=)) print() # Remove duplicates by combination of multiple columns print("Remove duplicates by 'Name' + 'City' combination:") print(df.drop_duplicates(subset=["Name","City"])) print() # Keep the row with maximum value in a specific column print("Group by 'Name', keep the one with highest age:") print(df.sort_values("Age", ascending=False).drop_duplicates(subset=, keep="first")) * * * ## Practical Application: Data Cleaning ### Comprehensive Case Study ## Example import pandas as pd import numpy as np # Create simulated order data (with duplicates) orders = pd.DataFrame({ "Order ID": ["O001","O002","O001","O003","O002","O004"], "Customer ID": ["C001","C002","C001","C003","C002","C004"], "Product": ["iPhone","MacBook","iPhone","iPad","AirPods","Apple Watch"], "Amount": [8000,12000,8000,5000,1500,3000], "Date": ["2024-01-01","2024-01-02","2024-01-01","2024-01-03","2024-01-02","2024-01-04"] }) print("Original Order Data:") print(orders) print() # 1. Detect duplicate orders print("Duplicate Order Detection:") duplicates = orders[orders.duplicated(subset=, keep=False)] print(duplicates) print() # 2. Remove completely duplicated orders orders_clean = orders.drop_duplicates() print("After removing completely duplicated entries:") print(f"Original {len(orders)} rows -> {len(orders_clean)} rows") print() # 3. Remove duplicates by order ID, keeping the latest # Assume the last entry is the latest orders_clean = orders.sort_values("Date").drop_duplicates( subset=, keep="last" ).sort_index() print("Remove duplicates by order ID (keeping latest):") print(orders_clean) * * * ## Advanced Usage ### Keep Specific Rows Per Group ## Example import pandas as pd df = pd.DataFrame({ "Customer": ["A","A","A","B","B","B"], "Month": ["01","02","03","01","02","03"], "Spending": [100,200,150,300,250,280] }) print("Original Data:") print(df) print() # Keep the month with highest spending for each customer print("Month with highest spending per customer:") result = df.sort_values("Spending", ascending=False).drop_duplicates( subset=, keep="first" ).sort_values("Customer") print(result) print() # Keep the first record for each customer (by month) print("First record for each customer:") result = df.sort_values("Month").drop_duplicates( subset=, keep="first" ) print(result) ### Mark Duplicates Instead of Removing ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"], "City": ["Beijing","Shanghai","Beijing","Guangzhou","Shenzhen"] }) # Add a duplicate marking column df= df.duplicated(subset=, keep=False) print("Marking duplicate rows:") print(df) print() # Or use groupby to mark df= df.groupby("Name").transform("count") df=~df.duplicated(subset=, keep="first") print("Grouped statistics on duplicate counts:") print(df) * * * ## Important Notes **1. Distinguish Between "Fully Duplicate" and "Partially Duplicate"** Fully duplicate means all column values are identical; partially duplicate means specific columns are the same. Use the `subset` parameter to specify which columns to check for duplicates. **2. Choose Retention Strategy** Choose to keep the first, last, or none based on business needs. Which row to keep may impact subsequent analysis results. **3. Impact of Data Types** When checking for duplicates, data types are considered. For example, integer 1 and float 1.0 are treated as different values. > Before removing duplicate data, it's recommended to analyze the cause and distribution of duplicates to ensure that the removal operation does not lose important information.
← Pandas DatetimePandas Filter β†’