
Pandas df.fillna() Function


df.fillna() is a Pandas function for filling missing values.

Unlike dropna(), which removes rows or columns that contain missing values, fillna() replaces them with a specified constant, a per-column statistic such as the mean or median, or a neighboring value (forward or backward fill). This preserves the size of the dataset, which is useful when dropping rows would discard too much information.

Basic Syntax and Parameters

fillna() is a DataFrame method, so it is called on a DataFrame object with the dot operator, for example df.fillna(0).

Syntax Format

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Parameter Description

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| value | scalar, dict, Series, DataFrame | Optional | The value(s) used to fill missing values. Can be a constant, a dictionary or Series (mapping column names to fill values), or a DataFrame. | None |
| method | str | Optional | Fill method. 'ffill' or 'pad' means forward fill (use the previous valid value); 'bfill' or 'backfill' means backward fill (use the next valid value). Deprecated since pandas 2.1 in favor of the ffill() and bfill() methods. | None |
| axis | int or str | Optional | Axis along which to fill. 0 or 'index' fills down each column; 1 or 'columns' fills across each row. Mainly relevant for forward and backward fills. | None (treated as 0) |
| inplace | bool | Optional | If True, modifies the original DataFrame directly without returning a new object; if False, returns a new DataFrame, leaving the original unchanged. | False |
| limit | int | Optional | With forward or backward fill, the maximum number of consecutive missing values to fill in each gap. With a fill value, the maximum number of missing values to fill along the axis. For example, forward fill with limit=1 fills only the first missing value in each run. | None |
| downcast | dict or str | Optional | Rules for downcasting data types where possible, such as converting float64 to int64. Deprecated since pandas 2.2. | None |
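The axis description is easier to see on a small frame. The following is a minimal sketch, not taken from the original page: the column names Q1 to Q3 are invented for illustration, and it uses the ffill() method rather than the method parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows, three columns, with gaps in both directions
df = pd.DataFrame({
    "Q1": [1.0, np.nan],
    "Q2": [np.nan, 5.0],
    "Q3": [3.0, np.nan],
})

# axis=0: within each column, copy the value from the row above
down = df.ffill(axis=0)

# axis=1: within each row, copy the value from the column to the left
across = df.ffill(axis=1)

print(down)
print(across)
```

A missing value with no earlier neighbor along the chosen axis (Q2 in row 0 for axis=0, Q1 in row 1 for axis=1) stays missing.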

Return Value Description

  • With inplace=False (the default), returns a new DataFrame whose missing values are filled.
  • With inplace=True, returns None and modifies the original DataFrame instead.
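The difference between the two modes can be checked directly. This is a small sketch with an invented single-column frame; it deliberately ignores the return value of the in-place call and only inspects the DataFrame afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Default (inplace=False): a new DataFrame comes back, df is untouched
new_df = df.fillna(0)
missing_in_original = int(df["A"].isna().sum())

# inplace=True: df itself is modified
df.fillna(0, inplace=True)
missing_after_inplace = int(df["A"].isna().sum())

print(missing_in_original, missing_after_inplace)
```

Assigning the result back (df = df.fillna(0)) achieves the same effect as inplace=True and tends to be easier to follow in longer pipelines.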

Examples

Let's explore several examples to fully understand how to use fillna().

Example 1: Fill All Missing Values with a Constant

The simplest usage is to fill all missing values with a constant.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
    'Age': [25, np.nan, 35, np.nan],                      # Missing values
    'Salary': [5000, 6000, np.nan, 8000],                 # Missing value
    'Department': ['Tech', 'Marketing', 'Tech', np.nan]   # Missing value
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Fill all missing values with constant 0
df_filled = df.fillna(0)
print("Data after filling with 0:")
print(df_filled)

Expected Output:

Original Data:
        Name   Age  Salary Department
0  Zhang San  25.0  5000.0       Tech
1      Li Si   NaN  6000.0  Marketing
2    Wang Wu  35.0     NaN       Tech
3   Zhao Liu   NaN  8000.0        NaN
==================================================
Data after filling with 0:
        Name   Age  Salary Department
0  Zhang San  25.0  5000.0       Tech
1      Li Si   0.0  6000.0  Marketing
2    Wang Wu  35.0     0.0       Tech
3   Zhao Liu   0.0  8000.0          0

Code Explanation:

  1. We created a DataFrame with missing values in three columns.
  2. df.fillna(0) replaces every missing value with 0.
  3. This approach is simple but does not suit every scenario: an age or salary of 0 is rarely meaningful, and the text column Department now contains the number 0.
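One way to avoid putting a number into a text column is to restrict the constant fill to numeric columns. The sketch below is an illustration rather than part of the original example; it selects the numeric columns with select_dtypes() and fills only those.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"],
    "Age": [25, np.nan, 35, np.nan],
    "Salary": [5000, 6000, np.nan, 8000],
    "Department": ["Tech", "Marketing", "Tech", np.nan],
})

# Pick out the numeric columns only (Age and Salary here)
numeric_cols = df.select_dtypes(include="number").columns

# Fill just those columns; Department keeps its missing value
df_numeric_filled = df.copy()
df_numeric_filled[numeric_cols] = df_numeric_filled[numeric_cols].fillna(0)

print(df_numeric_filled)
```

The missing Department entry is left for a separate, more suitable fill such as a placeholder category.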

Example 2: Specify Different Fill Values for Different Columns Using a Dictionary

Passing a dictionary that maps column names to fill values lets each column receive a default that suits its meaning.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
    'Age': [25, np.nan, 35, np.nan],
    'Salary': [5000, 6000, np.nan, 8000],
    'Department': ['Tech', 'Marketing', 'Tech', np.nan],
    'Performance': [85, 90, np.nan, 95]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Use a dictionary to specify different fill values for different columns
fill_values = {
    'Age': 30,                    # Fill missing age with 30
    'Salary': 6500,               # Fill missing salary with 6500
    'Department': 'Unassigned',   # Fill missing department with "Unassigned"
    'Performance': 80             # Fill missing performance with 80
}
df_filled = df.fillna(fill_values)
print("Data after filling different columns with different values:")
print(df_filled)

Expected Output:

Original Data:
        Name   Age  Salary Department  Performance
0  Zhang San  25.0  5000.0       Tech         85.0
1      Li Si   NaN  6000.0  Marketing         90.0
2    Wang Wu  35.0     NaN       Tech          NaN
3   Zhao Liu   NaN  8000.0        NaN         95.0
==================================================
Data after filling different columns with different values:
        Name   Age  Salary  Department  Performance
0  Zhang San  25.0  5000.0        Tech         85.0
1      Li Si  30.0  6000.0   Marketing         90.0
2    Wang Wu  35.0  6500.0        Tech         80.0
3   Zhao Liu  30.0  8000.0  Unassigned         95.0

Code Explanation:

  • The dictionary fill_values assigns a sensible fill value to each column.
  • This method is more flexible than a single constant and lets you set default values based on business logic.
  • Columns that contain a missing value are stored as floats, which is why Performance prints as 85.0 rather than 85.
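A dictionary only affects the columns it names. The following sketch, using a reduced and partly invented version of the data above, leaves Salary out of the dictionary to show that its missing value survives.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 35],
    "Salary": [5000, np.nan, 7000],
    "Department": ["Tech", np.nan, "Tech"],
})

# Only Age and Department appear in the dictionary
partial = df.fillna({"Age": 30, "Department": "Unassigned"})

# Count the missing values that remain in each column
remaining = partial.isna().sum()
print(remaining)
```

Checking isna().sum() after a dictionary fill is a quick way to spot columns that were forgotten.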

Example 3: Fill Numeric Columns with Mean or Median

For numeric data, filling with the column mean or median is a common approach.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'],
    'Math': [85, 90, np.nan, 78, 92],
    'English': [88, np.nan, 82, 85, 90],
    'Physics': [np.nan, 85, 90, 88, 95]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Calculate the mean for each column (missing values are skipped)
math_mean = df['Math'].mean()
english_mean = df['English'].mean()
physics_mean = df['Physics'].mean()
print(f"Math Mean: {math_mean:.2f}")
print(f"English Mean: {english_mean:.2f}")
print(f"Physics Mean: {physics_mean:.2f}")
print("=" * 50)

# Fill with means
df_filled = df.fillna({
    'Math': math_mean,
    'English': english_mean,
    'Physics': physics_mean
})
print("Data after filling with means:")
print(df_filled)

Expected Output:

Original Data:
        Name  Math  English  Physics
0  Zhang San  85.0     88.0      NaN
1      Li Si  90.0      NaN     85.0
2    Wang Wu   NaN     82.0     90.0
3   Zhao Liu  78.0     85.0     88.0
4    Qian Qi  92.0     90.0     95.0
==================================================
Math Mean: 86.25
English Mean: 86.25
Physics Mean: 89.50
==================================================
Data after filling with means:
        Name   Math  English  Physics
0  Zhang San  85.00    88.00     89.5
1      Li Si  90.00    86.25     85.0
2    Wang Wu  86.25    82.00     90.0
3   Zhao Liu  78.00    85.00     88.0
4    Qian Qi  92.00    90.00     95.0

Code Explanation:

  • df['column_name'].mean() calculates the mean of a column, ignoring missing values.
  • Filling with the mean leaves each column's mean unchanged, although it reduces the column's variance because the filled entries all sit at the center.
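The same pattern works with the median and can be written more compactly. As a sketch under the assumption that every numeric column should be filled with its own median, median(numeric_only=True) returns a Series indexed by column name, and fillna() accepts that Series directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
    "Math": [85, 90, np.nan, 78, 92],
    "English": [88, np.nan, 82, 85, 90],
    "Physics": [np.nan, 85, 90, 88, 95],
})

# One median per numeric column; the text column Name is skipped
medians = df.median(numeric_only=True)

# fillna() matches the Series index against the column names
df_median = df.fillna(medians)

print(medians)
print(df_median)
```

Replacing median with mean in the sketch reproduces the result of Example 3 without listing the columns by hand.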

Example 4: Forward Fill (ffill) and Backward Fill (bfill)

Forward fill uses the previous valid value to fill a missing value; backward fill uses the next valid value.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'Temperature': [20, np.nan, 22, np.nan, 25],
    'Humidity': [60, 65, np.nan, 70, 75]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Forward fill: fill missing values with previous values.
# df.ffill() replaces the deprecated spelling df.fillna(method='ffill').
df_ffill = df.ffill()
print("Data after forward fill (ffill):")
print(df_ffill)
print("=" * 50)

# Backward fill: fill missing values with next values.
# df.bfill() replaces the deprecated spelling df.fillna(method='bfill').
df_bfill = df.bfill()
print("Data after backward fill (bfill):")
print(df_bfill)

Expected Output:

Original Data:
         Date  Temperature  Humidity
0  2024-01-01         20.0      60.0
1  2024-01-02          NaN      65.0
2  2024-01-03         22.0       NaN
3  2024-01-04          NaN      70.0
4  2024-01-05         25.0      75.0
==================================================
Data after forward fill (ffill):
         Date  Temperature  Humidity
0  2024-01-01         20.0      60.0
1  2024-01-02         20.0      65.0  # Temperature filled with previous value 20
2  2024-01-03         22.0      65.0  # Humidity filled with previous value 65
3  2024-01-04         22.0      70.0  # Temperature filled with previous value 22
4  2024-01-05         25.0      75.0
==================================================
Data after backward fill (bfill):
         Date  Temperature  Humidity
0  2024-01-01         20.0      60.0
1  2024-01-02         22.0      65.0  # Temperature filled with next value 22
2  2024-01-03         22.0      70.0  # Humidity filled with next value 70
3  2024-01-04         25.0      70.0  # Temperature filled with next value 25
4  2024-01-05         25.0      75.0

(The # comments are annotations and are not part of the printed output.)

Code Explanation:

  • Forward fill (ffill): The missing temperature at row 1 is filled with 20 from row 0, and the missing temperature at row 3 is filled with 22 from row 2. The missing humidity at row 2 is filled with 65 from row 1.
  • Backward fill (bfill): The missing temperature at row 1 is filled with 22 from row 2, and the missing temperature at row 3 is filled with 25 from row 4. The missing humidity at row 2 is filled with 70 from row 3.
  • Directional fills suit ordered data such as time series (stock prices, temperature records), where a neighboring observation is a reasonable stand-in.
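Forward fill has nothing to copy when the very first value is missing, and backward fill has the same problem at the end. The sketch below uses an invented Temperature column to show the leftover gap and one common workaround, chaining the two fills.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Temperature": [np.nan, 20.0, np.nan, 22.0, np.nan]})

# Row 0 has no earlier value, so forward fill leaves it missing
forward = df.ffill()

# A backward fill afterwards covers the leading gap
both = df.ffill().bfill()

print(forward["Temperature"].tolist())
print(both["Temperature"].tolist())
```

Whether the chained fill is appropriate depends on the data: it copies a later observation backwards in time, which may not be acceptable for forecasting tasks.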

Example 5: Limit the Number of Fills Using the limit Parameter

With forward or backward fill, the limit parameter caps how many consecutive missing values are filled in each gap.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with multiple consecutive missing values
data = {
    'A': [1, np.nan, np.nan, np.nan, 5],
    'B': [1, 2, np.nan, np.nan, 5]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Forward fill, but fill at most 1 consecutive missing value per gap.
# df.ffill(limit=1) replaces the deprecated df.fillna(method='ffill', limit=1).
df_filled = df.ffill(limit=1)
print("Data after limiting consecutive fills to 1:")
print(df_filled)

Expected Output:

Original Data:
     A    B
0  1.0  1.0
1  NaN  2.0
2  NaN  NaN
3  NaN  NaN
4  5.0  5.0
==================================================
Data after limiting consecutive fills to 1:
     A    B
0  1.0  1.0
1  1.0  2.0  # A filled with 1
2  NaN  2.0  # A: limit reached; B filled with 2
3  NaN  NaN  # Limit reached in both columns
4  5.0  5.0

(The # comments are annotations and are not part of the printed output.)

Code Explanation:

  • For column A, the missing value at row 1 is filled with 1, but rows 2 and 3 stay missing because the limit of 1 has been reached.
  • For column B, the missing value at row 2 is filled with 2, but row 3 stays missing because the limit of 1 has been reached.
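The limit parameter behaves differently when a fill value is given instead of a direction. As a sketch reusing the same data, and assuming the documented behavior that limit then caps the number of filled entries along the axis, only the first missing value in each column is replaced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan, 5],
    "B": [1, 2, np.nan, np.nan, 5],
})

# With a fill value, limit counts filled entries per column
df_limited = df.fillna(0, limit=1)

print(df_limited)
```

Here the count applies to the whole column rather than to each gap, so a column with several separate gaps would still have only its first missing value filled.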

Example 6: Fill Missing Values Using Interpolation

Pandas also provides the interpolate() method, which estimates missing numeric values from the surrounding valid values.

Example

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [10, np.nan, 30, np.nan, 50]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
print("=" * 50)

# Use linear interpolation to fill
df_interpolated = df.copy()
df_interpolated['y'] = df_interpolated['y'].interpolate(method='linear')
print("Data after linear interpolation:")
print(df_interpolated)

Expected Output:

Original Data:
   x     y
0  1  10.0
1  2   NaN
2  3  30.0
3  4   NaN
4  5  50.0
==================================================
Data after linear interpolation:
   x     y
0  1  10.0
1  2  20.0  # Interpolation: 10 + (30-10)/(3-1)*(2-1) = 20
2  3  30.0
3  4  40.0  # Interpolation: 30 + (50-30)/(5-3)*(4-3) = 40
4  5  50.0

(The # comments are annotations and are not part of the printed output.)

Code Explanation:

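  • Linear interpolation calculates each missing value from the known values before and after it.
  • Row 1: 10 + (30-10)/(3-1)*(2-1) = 20.
  • Row 3: 30 + (50-30)/(5-3)*(4-3) = 40.
  • method='linear' ignores the index and treats the rows as equally spaced. The formulas above use the x values, which gives the same result here only because x is evenly spaced.
  • Linear interpolation suits data that changes roughly linearly between neighboring observations.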
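When the positions are not evenly spaced, the choice of interpolation method matters. The following sketch uses an invented Series whose index holds the x positions 1, 2 and 5, and compares method='linear' with method='index', which weights by the index values.

```python
import numpy as np
import pandas as pd

# The index holds unevenly spaced x positions: 1, 2, 5
s = pd.Series([10.0, np.nan, 50.0], index=[1, 2, 5])

# 'linear' treats the three rows as equally spaced: midpoint of 10 and 50
by_position = s.interpolate(method="linear")

# 'index' uses the x positions: 10 + (50-10)/(5-1)*(2-1)
by_index = s.interpolate(method="index")

print(by_position.loc[2])
print(by_index.loc[2])
```

The first result is 30 and the second is 20, so for irregularly sampled data it is worth deciding explicitly which of the two is intended.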

Notes

  • fillna() does not modify the original DataFrame by default. To keep the result, assign it back (df = df.fillna(...)) or pass inplace=True.
  • The method parameter is deprecated since pandas 2.1 and may be unavailable in newer releases. Use the ffill() and bfill() methods instead.
  • When filling with the mean, consider the median instead if the data contains outliers, since the median is less sensitive to extreme values.
  • For categorical data (e.g., department, position), it's recommended to fill with the most frequent value (mode) or an explicit category (e.g., "Unknown").
  • Before filling missing values, analyze why the data is missing and how the gaps are distributed, then choose the filling method that fits.
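The notes on categorical data and on inspecting missing values first can be combined in a few lines. This is a minimal sketch with an invented Department column, assuming the most frequent value is an acceptable stand-in.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Department": ["Tech", "Marketing", "Tech", np.nan, "Tech", np.nan],
})

# Inspect how much is missing before choosing a strategy
missing_before = int(df["Department"].isna().sum())

# mode() returns a Series because ties are possible; take the first entry
most_common = df["Department"].mode()[0]

# Assign the filled column back instead of using inplace=True
df["Department"] = df["Department"].fillna(most_common)

print(missing_before, most_common)
print(df)
```

If a large share of the column is missing, an explicit category such as "Unknown" may be safer than the mode, since it does not inflate the most frequent group.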
