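`df.fillna()` is the pandas method for filling in missing values.
Unlike `dropna()`, which removes rows or columns that contain missing values, `fillna()` replaces them with values you choose: a constant, a per-column statistic such as the mean or median, or a neighboring value (forward or backward fill). Keeping the rows instead of dropping them preserves the size of the dataset, but the filled values are estimates, so the fill strategy should match the data.
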
Basic Syntax and Parameters

`fillna()` is a method of `DataFrame`, so it is called on a DataFrame object with the dot operator, for example `df.fillna(0)`.

Syntax Format

    DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Parameter Description

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| value | scalar, dict, Series, DataFrame | Optional | The value(s) used to fill missing values. Can be a constant, a dictionary (different values for different columns), a Series, or a DataFrame. Either `value` or `method` must be given. | None |
| method | str | Optional | Fill method. 'ffill' or 'pad' means forward fill (use the previous valid value); 'bfill' or 'backfill' means backward fill (use the next valid value). Deprecated since pandas 2.1 in favor of `ffill()` and `bfill()`. | None |
| axis | int or str | Optional | Axis along which to fill. 0 or 'index' fills along the index (down each column); 1 or 'columns' fills along the columns (across each row). Mainly relevant to forward and backward fill. | None (treated as 0) |
| inplace | bool | Optional | If True, modifies the original DataFrame directly without returning a new object; if False, returns a new DataFrame and leaves the original unchanged. | False |
| limit | int | Optional | With forward or backward fill, the maximum number of consecutive missing values to fill in each gap. With `value` only, the maximum number of missing values to fill along the axis. For example, forward fill with limit=1 fills only the first missing value of each gap. | None |
| downcast | dict or str | Optional | Rules for downcasting data types where possible, such as converting float64 to int64. Deprecated since pandas 2.2. | None |

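
The two `axis` directions are easy to confuse, so here is a minimal sketch on a small hypothetical DataFrame (the column names `a` and `b` are illustrative assumptions, not part of the examples that follow). It uses `ffill()`, where the direction of filling is most visible.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 DataFrame with one missing value per column
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# axis=0: propagate values down each column
down = df.ffill(axis=0)

# axis=1: propagate values across each row, from left to right
across = df.ffill(axis=1)

print(down)
print(across)
```

With `axis=0`, the missing value in `a` is filled from the row above, while the missing value in `b` stays missing because nothing precedes it in that column. With `axis=1`, the roles are reversed.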
Return Value Description

- Returns a new DataFrame (if `inplace=False`) or `None` (if `inplace=True`).
- The returned DataFrame has its missing values filled.

Examples

Let's work through several examples to see how `fillna()` is used.

Example 1: Fill All Missing Values with a Constant

The simplest usage is to fill all missing values with a constant.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with missing values
    data = {
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
        'Age': [25, np.nan, 35, np.nan],                      # Missing values
        'Salary': [5000, 6000, np.nan, 8000],                 # Missing value
        'Department': ['Tech', 'Marketing', 'Tech', np.nan]   # Missing value
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Fill all missing values with the constant 0
    df_filled = df.fillna(0)
    print("Data after filling with 0:")
    print(df_filled)

Expected Output:

    Original Data:
            Name   Age  Salary Department
    0  Zhang San  25.0  5000.0       Tech
    1      Li Si   NaN  6000.0  Marketing
    2    Wang Wu  35.0     NaN       Tech
    3   Zhao Liu   NaN  8000.0        NaN
    ==================================================
    Data after filling with 0:
            Name   Age  Salary Department
    0  Zhang San  25.0  5000.0       Tech
    1      Li Si   0.0  6000.0  Marketing
    2    Wang Wu  35.0     0.0       Tech
    3   Zhao Liu   0.0  8000.0          0

Code Explanation:

- We created a DataFrame with several missing values.
- `df.fillna(0)` replaces every missing value with 0.
- This approach is simple but does not suit every scenario: an age or salary of 0 is rarely meaningful, and the text column Department now contains the number 0.

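
One way to avoid writing 0 into a text column is to restrict the fill to numeric columns. The following is a minimal sketch of that idea, reusing the data from Example 1; selecting columns with `select_dtypes` is an assumption about how you might do this, not the only option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"],
    "Age": [25, np.nan, 35, np.nan],
    "Salary": [5000, 6000, np.nan, 8000],
    "Department": ["Tech", "Marketing", "Tech", np.nan],
})

# Pick out the numeric columns only (here: Age and Salary)
numeric_cols = df.select_dtypes(include="number").columns

# Fill missing values in those columns; leave text columns untouched
df_filled = df.copy()
df_filled[numeric_cols] = df_filled[numeric_cols].fillna(0)

print(df_filled)
```

The missing Department value is still missing afterwards, so it can be handled separately with a value that makes sense for text.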
Example 2: Specify Different Fill Values for Different Columns Using a Dictionary

You can specify a different fill value for each column so that the filled data is more plausible.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with missing values
    data = {
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
        'Age': [25, np.nan, 35, np.nan],
        'Salary': [5000, 6000, np.nan, 8000],
        'Department': ['Tech', 'Marketing', 'Tech', np.nan],
        'Performance': [85, 90, np.nan, 95]
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Use a dictionary to specify a fill value per column
    fill_values = {
        'Age': 30,                    # Fill missing age with 30
        'Salary': 6500,               # Fill missing salary with 6500
        'Department': 'Unassigned',   # Fill missing department with "Unassigned"
        'Performance': 80             # Fill missing performance with 80
    }
    df_filled = df.fillna(fill_values)
    print("Data after filling different columns with different values:")
    print(df_filled)

Expected Output:

    Original Data:
            Name   Age  Salary Department Performance
    0  Zhang San  25.0  5000.0       Tech        85.0
    1      Li Si   NaN  6000.0  Marketing        90.0
    2    Wang Wu  35.0     NaN       Tech         NaN
    3   Zhao Liu   NaN  8000.0        NaN        95.0
    ==================================================
    Data after filling different columns with different values:
            Name   Age  Salary  Department Performance
    0  Zhang San  25.0  5000.0        Tech        85.0
    1      Li Si  30.0  6000.0   Marketing        90.0
    2    Wang Wu  35.0  6500.0        Tech        80.0
    3   Zhao Liu  30.0  8000.0  Unassigned        95.0

Code Explanation:

- The dictionary `fill_values` assigns a fill value to each column by name; columns that are not listed (here Name) are left unchanged.
- This method is more flexible and lets you set appropriate default values based on business logic.

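
When filling by dictionary, it helps to check how many missing values remain, because columns that are absent from the dictionary are not filled. A minimal sketch, using a hypothetical two-column DataFrame and `isna().sum()` to count missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has one missing value
df = pd.DataFrame({
    "Age": [25, np.nan],
    "Department": ["Tech", np.nan],
})

# Only 'Age' appears in the dictionary, so only 'Age' is filled
df_filled = df.fillna({"Age": 30})

missing_before = df.isna().sum()
missing_after = df_filled.isna().sum()

print(missing_before)
print(missing_after)
```

Comparing the two counts shows at a glance which columns still need a fill rule.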
Example 3: Fill Numeric Columns with Mean or Median

For numeric data, filling with the column mean or median is a common approach.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with missing values
    data = {
        'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
        'Math': [85, 90, np.nan, 78, 92],
        'English': [88, np.nan, 82, 85, 90],
        'Physics': [np.nan, 85, 90, 88, 95]
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Calculate the mean of each column (missing values are skipped)
    math_mean = df['Math'].mean()
    english_mean = df['English'].mean()
    physics_mean = df['Physics'].mean()
    print(f"Math Mean: {math_mean:.2f}")
    print(f"English Mean: {english_mean:.2f}")
    print(f"Physics Mean: {physics_mean:.2f}")
    print("=" * 50)

    # Fill each column with its own mean
    df_filled = df.fillna({
        'Math': math_mean,
        'English': english_mean,
        'Physics': physics_mean
    })
    print("Data after filling with means:")
    print(df_filled)

Expected Output:

    Original Data:
            Name  Math English Physics
    0  Zhang San  85.0    88.0     NaN
    1      Li Si  90.0     NaN    85.0
    2    Wang Wu   NaN    82.0    90.0
    3   Zhao Liu  78.0    85.0    88.0
    4    Qian Qi  92.0    90.0    95.0
    ==================================================
    Math Mean: 86.25
    English Mean: 86.25
    Physics Mean: 89.50
    ==================================================
    Data after filling with means:
            Name   Math English Physics
    0  Zhang San  85.00   88.00    89.5
    1      Li Si  90.00   86.25    85.0
    2    Wang Wu  86.25   82.00    90.0
    3   Zhao Liu  78.00   85.00    88.0
    4    Qian Qi  92.00   90.00    95.0

Code Explanation:

- `df['column_name'].mean()` calculates the mean of a column, skipping missing values.
- Filling with the column mean leaves that column's mean unchanged, but it reduces the column's variance because every filled value sits exactly at the center.

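
The per-column means do not have to be computed one by one. As a sketch of a more compact variant, `fillna()` also accepts a Series whose index holds column names, which is exactly what `df.mean(numeric_only=True)` and `df.median(numeric_only=True)` return. The data below repeats Example 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
    "Math": [85, 90, np.nan, 78, 92],
    "English": [88, np.nan, 82, 85, 90],
    "Physics": [np.nan, 85, 90, 88, 95],
})

# Series of column means, indexed by column name (text columns are skipped)
means = df.mean(numeric_only=True)
df_mean_filled = df.fillna(means)

# Same pattern with the median, which is less sensitive to outliers
medians = df.median(numeric_only=True)
df_median_filled = df.fillna(medians)

print(df_mean_filled)
print(df_median_filled)
```

Each column is filled with its own statistic because the Series is matched to the DataFrame by column name.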
Example 4: Forward Fill (ffill) and Backward Fill (bfill)

Forward fill replaces each missing value with the previous valid value in the same column; backward fill uses the next valid value.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with missing values
    data = {
        'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
        'Temperature': [20, np.nan, 22, np.nan, 25],
        'Humidity': [60, 65, np.nan, 70, 75]
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Forward fill: fill missing values with the previous valid value.
    # Equivalent to the older df.fillna(method='ffill'), deprecated since pandas 2.1.
    df_ffill = df.ffill()
    print("Data after forward fill (ffill):")
    print(df_ffill)
    print("=" * 50)

    # Backward fill: fill missing values with the next valid value.
    # Equivalent to the older df.fillna(method='bfill').
    df_bfill = df.bfill()
    print("Data after backward fill (bfill):")
    print(df_bfill)

Expected Output (the # comments are annotations, not part of the printed output):

    Original Data:
             Date Temperature Humidity
    0  2024-01-01        20.0     60.0
    1  2024-01-02         NaN     65.0
    2  2024-01-03        22.0      NaN
    3  2024-01-04         NaN     70.0
    4  2024-01-05        25.0     75.0
    ==================================================
    Data after forward fill (ffill):
             Date Temperature Humidity
    0  2024-01-01        20.0     60.0
    1  2024-01-02        20.0     65.0   # Filled with previous value 20
    2  2024-01-03        22.0     65.0   # Filled with previous value 65
    3  2024-01-04        22.0     70.0   # Filled with previous value 22
    4  2024-01-05        25.0     75.0
    ==================================================
    Data after backward fill (bfill):
             Date Temperature Humidity
    0  2024-01-01        20.0     60.0
    1  2024-01-02        22.0     65.0   # Filled with next value 22
    2  2024-01-03        22.0     70.0   # Filled with next value 70
    3  2024-01-04        25.0     70.0   # Filled with next value 25
    4  2024-01-05        25.0     75.0

Code Explanation:

- Forward fill (`ffill`): The missing temperature at row 1 is filled with 20 from row 0; the missing temperature at row 3 is filled with 22 from row 2.
- Backward fill (`bfill`): The missing temperature at row 1 is filled with 22 from row 2; the missing temperature at row 3 is filled with 25 from row 4.
- This method is suitable for time series data such as stock prices or temperature records.

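
Forward fill has nothing to copy when the very first value is missing, and backward fill has the same problem at the end. A common workaround, sketched here on a hypothetical temperature Series, is to chain the two so that any leading gap left by `ffill()` is closed by `bfill()`.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a missing value at the start and in the middle
temps = pd.Series([np.nan, 20.0, np.nan, 25.0])

# Forward fill alone leaves the first value missing
forward_only = temps.ffill()

# Chaining a backward fill closes the remaining leading gap
both = temps.ffill().bfill()

print(forward_only.tolist())
print(both.tolist())
```

Whether copying a later value backwards is acceptable depends on the data; for forecasting tasks it can leak future information into earlier rows.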
Example 5: Limit the Number of Fills Using the limit Parameter

The `limit` parameter caps how many consecutive missing values are filled in each gap.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with runs of consecutive missing values
    data = {
        'A': [1, np.nan, np.nan, np.nan, 5],
        'B': [1, 2, np.nan, np.nan, 5]
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Forward fill, but fill at most 1 consecutive missing value per gap.
    # Equivalent to the older df.fillna(method='ffill', limit=1).
    df_filled = df.ffill(limit=1)
    print("Data after limiting consecutive fills to 1:")
    print(df_filled)

Expected Output (the # comments are annotations, not part of the printed output):

    Original Data:
         A    B
    0  1.0  1.0
    1  NaN  2.0
    2  NaN  NaN
    3  NaN  NaN
    4  5.0  5.0
    ==================================================
    Data after limiting consecutive fills to 1:
         A    B
    0  1.0  1.0
    1  1.0  2.0   # A filled with 1
    2  NaN  2.0   # A reached the limit; B filled with 2
    3  NaN  NaN   # No fill
    4  5.0  5.0

Code Explanation:

- For column A, the missing value at row 1 is filled with 1, but rows 2 and 3 are not filled because `limit=1` has been reached.
- For column B, the missing value at row 2 is filled with 2, but row 3 is not filled because `limit=1` has been reached.

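
`limit` also works when filling with a constant, but it then counts differently: instead of capping each gap, it caps the total number of missing values filled in each column. A minimal sketch, reusing column A from Example 5 (the behavior described is that of pandas 2.x as documented; treat it as an assumption for other versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, np.nan, np.nan, 5]})

# Fill with a constant, but stop after 2 filled values in the column
df_filled = df.fillna(0, limit=2)

print(df_filled)
```

The first two missing values become 0 and the third stays missing, even though all three belong to the same gap.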
Example 6: Fill Missing Values Using Interpolation

Besides `fillna()`, pandas provides the `interpolate()` method, which estimates missing numeric values from their neighbors.

Example

    import pandas as pd
    import numpy as np

    # Create a DataFrame with missing values
    data = {
        'x': [1, 2, 3, 4, 5],
        'y': [10, np.nan, 30, np.nan, 50]
    }
    df = pd.DataFrame(data)

    print("Original Data:")
    print(df)
    print("=" * 50)

    # Fill column y by linear interpolation, working on a copy
    df_interpolated = df.copy()
    df_interpolated['y'] = df_interpolated['y'].interpolate(method='linear')
    print("Data after linear interpolation:")
    print(df_interpolated)

Expected Output (the # comments are annotations, not part of the printed output):

    Original Data:
       x     y
    0  1  10.0
    1  2   NaN
    2  3  30.0
    3  4   NaN
    4  5  50.0
    ==================================================
    Data after linear interpolation:
       x     y
    0  1  10.0
    1  2  20.0   # Interpolation: 10 + (30-10)/(3-1)*(2-1) = 20
    2  3  30.0
    3  4  40.0   # Interpolation: 30 + (50-30)/(5-3)*(4-3) = 40
    4  5  50.0

Code Explanation:
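
- Linear interpolation estimates each missing value from the known values before and after it.
  - Row 1: 10 + (30-10)/(3-1)*(2-1) = 20.
  - Row 3: 30 + (50-30)/(5-3)*(4-3) = 40.
- `method='linear'` treats the rows as equally spaced and does not look at column x; the formulas above agree with it because the x values are evenly spaced.
- Interpolation is suitable for data that changes approximately linearly between known points.
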
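
When the rows are not evenly spaced, the choice of interpolation method changes the result. The sketch below uses a hypothetical Series whose index values (0, 1, 10) stand in for uneven x positions, and compares `method='linear'`, which ignores the index, with `method='index'`, which uses the numeric index values as positions.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements at uneven positions 0, 1 and 10
s = pd.Series([10.0, np.nan, 30.0], index=[0, 1, 10])

# 'linear' assumes equal spacing: the gap is the midpoint of 10 and 30
by_position = s.interpolate(method="linear")

# 'index' weights by index distance: 10 + (30 - 10) * (1 - 0) / (10 - 0)
by_index = s.interpolate(method="index")

print(by_position)
print(by_index)
```

If the x values live in a column, one option is to move them into the index with `set_index()` before interpolating.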

Notes

- `fillna()` does not modify the original DataFrame by default. To modify it in place, pass `inplace=True`.
- The `method` parameter of `fillna()` is deprecated since pandas 2.1. Use the `ffill()` and `bfill()` methods instead.
- When filling with the mean, consider the median instead if the data contains outliers, since the median is more robust to them.
- For categorical data (e.g., department, position), it's recommended to fill with the most frequent value (mode) or an explicit category (e.g., "Unknown").
- Before filling missing values, analyze why and where the data is missing so that you can choose the most appropriate filling method.

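
The advice on medians and modes can be sketched as follows. The data is hypothetical: the salary column contains one deliberately extreme value so that the mean and the median differ noticeably, and the department column has a clear most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an outlier salary and a missing department
df = pd.DataFrame({
    "Department": ["Tech", "Marketing", "Tech", np.nan],
    "Salary": [5000, 6000, np.nan, 100000],
})

# mode() returns a Series (there can be ties), so take the first entry
dept_mode = df["Department"].mode().iloc[0]

# The median ignores how extreme the outlier is; the mean does not
salary_median = df["Salary"].median()
salary_mean = df["Salary"].mean()

df_filled = df.fillna({"Department": dept_mode, "Salary": salary_median})

print(f"Mean: {salary_mean:.0f}, Median: {salary_median:.0f}")
print(df_filled)
```

Here the mean is pulled far above the typical salaries by the single outlier, while the median stays at a representative value.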