Overfitting, Underfitting, Bias and Variance

In the world of machine learning, building a model is like training a student. Our goal is for this student to not only memorize the examples in the textbook (training data), but to deeply understand the underlying principles, so they can also perform well on new, never-before-seen exam questions (test data). However, this student may encounter two typical problems during the learning process:

One is learning too crudely, picking up only a rough rule that misses what the examples actually teach (underfitting);
The other is learning too literally, memorizing even the punctuation marks and handwriting quirks of the examples, which leads to confusion when facing new problems (overfitting).

Understanding overfitting and underfitting, along with the theory behind them (bias and variance), is a crucial step for every machine learning practitioner on the way from beginner to expert. These concepts explain why models make mistakes and point toward ways to improve them.

I. Core Concepts: Model Performance and "Fitting" States

First, let's understand what fitting means through an intuitive example. Suppose we want to use a mathematical model to fit a set of scattered data points.

Example

import numpy as np
import matplotlib.pyplot as plt

# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif'] = [
    # Windows priority
    'SimHei', 'Microsoft YaHei',
    # macOS priority
    'PingFang SC', 'Heiti TC',
    # Linux priority
    'WenQuanYi Micro Hei', 'DejaVu Sans',
]
# Fix issue where minus signs display as squares
plt.rcParams['axes.unicode_minus'] = False
# -------------------------- Set Chinese font end --------------------------

# Generate simulated data: add some random noise to a sine curve
np.random.seed(42)
X = np.linspace(0, 10, 20)
y_true = np.sin(X)                    # True underlying pattern (unknown to us)
y_noise = np.random.randn(20) * 0.3   # Random noise
y = y_true + y_noise                  # Data we actually observe

plt.scatter(X, y, label='Observed data (with noise)', color='blue', alpha=0.6)
plt.plot(X, y_true, label='True pattern (y=sin(x))', color='green', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Data and underlying pattern')
plt.legend()
plt.grid(True)
plt.show()

Our goal is to find a curve (model) that best describes the pattern reflected by these blue scattered points (data).

How closely a model describes the data is what we call its fit.

1. Underfitting

Underfitting occurs when a model is too simple to capture the basic patterns or trends in the data. It's like a student who has only learned addition being asked to solve calculus problems.

Signs: The model performs poorly even on the training data (e.g., low accuracy, high error), and just as poorly on new data.
Causes: Model complexity is too low, the features are insufficient, or training was inadequate.
Analogy: Using a straight line (first-degree polynomial) to fit data with an obviously curved trend.

Example

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Try fitting with a 1st-degree polynomial (straight line)
# X and y come from the data-generation example above
poly = PolynomialFeatures(degree=1)
X_poly1 = poly.fit_transform(X.reshape(-1, 1))
model_under = LinearRegression()
model_under.fit(X_poly1, y)
y_pred_under = model_under.predict(X_poly1)

mse_train_under = mean_squared_error(y, y_pred_under)
print(f"Underfitting: Mean Squared Error (MSE) on the training set: {mse_train_under:.4f}")

Output:

Underfitting: Mean Squared Error (MSE) on the training set: 0.4402

2. Good Fit

This is the ideal state. The model is complex enough to learn the key patterns in the data, but not so complex that it learns random noise. It performs well on both the training set and unknown test sets.

Signs: Error is low on both the training set and the test set, and the two errors are close to each other.
Analogy: Using a polynomial of appropriate degree (e.g., 3rd degree) to fit the data.

Example

# Try fitting with a 3rd-degree polynomial
# (reuses X, y and the imports from the previous examples)
poly = PolynomialFeatures(degree=3)
X_poly3 = poly.fit_transform(X.reshape(-1, 1))
model_good = LinearRegression()
model_good.fit(X_poly3, y)
y_pred_good = model_good.predict(X_poly3)

mse_train_good = mean_squared_error(y, y_pred_good)
print(f"Good fit: Mean Squared Error (MSE) on the training set: {mse_train_good:.4f}")

Output:

Good fit: Mean Squared Error (MSE) on the training set: 0.3988

3. Overfitting

Overfitting occurs when a model is too complex: it learns not only the true patterns in the data but also "memorizes" the random noise and outliers in the training data.

Signs: The model performs extremely well on the training data (very low error), but performance drops sharply on new, unseen data; it generalizes poorly.
Causes: Excessive model complexity or too little training data.
Analogy: Using a very high-degree polynomial (e.g., 15th degree) to fit the data, so that the curve passes through almost every data point and becomes extremely distorted.

Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif'] = [
    # Windows priority
    'SimHei', 'Microsoft YaHei',
    # macOS priority
    'PingFang SC', 'Heiti TC',
    # Linux priority
    'WenQuanYi Micro Hei', 'DejaVu Sans',
]
# Fix issue where minus signs display as squares
plt.rcParams['axes.unicode_minus'] = False
# -------------------------- Set Chinese font end --------------------------

# Generate simulated data: add some random noise to a sine curve
np.random.seed(42)
X = np.linspace(0, 10, 20)
y_true = np.sin(X)                    # True underlying pattern (unknown to us)
y_noise = np.random.randn(20) * 0.3   # Random noise
y = y_true + y_noise                  # Data we actually observe

# Try fitting with a 1st-degree polynomial (straight line)
poly = PolynomialFeatures(degree=1)
X_poly1 = poly.fit_transform(X.reshape(-1, 1))
model_under = LinearRegression()
model_under.fit(X_poly1, y)
y_pred_under = model_under.predict(X_poly1)

mse_train_under = mean_squared_error(y, y_pred_under)
print(f"Underfitting: Mean Squared Error (MSE) on the training set: {mse_train_under:.4f}")

# Try fitting with a 3rd-degree polynomial
poly = PolynomialFeatures(degree=3)
X_poly3 = poly.fit_transform(X.reshape(-1, 1))
model_good = LinearRegression()
model_good.fit(X_poly3, y)
y_pred_good = model_good.predict(X_poly3)

mse_train_good = mean_squared_error(y, y_pred_good)
print(f"Good fit: Mean Squared Error (MSE) on the training set: {mse_train_good:.4f}")

# Try fitting with a 15th-degree polynomial (highly prone to overfitting)
poly = PolynomialFeatures(degree=15)
X_poly15 = poly.fit_transform(X.reshape(-1, 1))
model_over = LinearRegression()
model_over.fit(X_poly15, y)
y_pred_over = model_over.predict(X_poly15)

mse_train_over = mean_squared_error(y, y_pred_over)
print(f"Overfitting: Mean Squared Error (MSE) on the training set: {mse_train_over:.4f}")

# Visualize the three fitting states
plt.figure(figsize=(15, 4))

# Underfitting
plt.subplot(1, 3, 1)
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred_under, color='red', linewidth=2, label='Underfitting (degree 1)')
plt.plot(X, y_true, color='green', linestyle='--', label='True pattern')
plt.title(f'Underfitting\nTraining MSE: {mse_train_under:.4f}')
plt.legend()
plt.grid(True)

# Good fit
plt.subplot(1, 3, 2)
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred_good, color='red', linewidth=2, label='Good fit (degree 3)')
plt.plot(X, y_true, color='green', linestyle='--', label='True pattern')
plt.title(f'Good fit\nTraining MSE: {mse_train_good:.4f}')
plt.legend()
plt.grid(True)

# Overfitting
plt.subplot(1, 3, 3)
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred_over, color='red', linewidth=2, label='Overfitting (degree 15)')
plt.plot(X, y_true, color='green', linestyle='--', label='True pattern')
plt.title(f'Overfitting\nTraining MSE: {mse_train_over:.4f}')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

From the figure, we can clearly see:

Underfitting (left): The red straight line completely fails to capture the oscillating trend of the data.
Good fit (middle): The red curve is smooth and roughly follows the trend of the green true pattern. A cubic has at most two turning points, so it cannot trace every swing of the sine, and its training MSE of 0.3988 stays well above the noise variance of 0.3² = 0.09.
Overfitting (right): The red curve fluctuates violently as it tries to pass through every blue scatter point, noise included, and completely loses the smooth shape of the sine curve.
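
Training error alone cannot separate a good fit from an overfit, because it keeps falling as the degree rises. The sketch below is one way to check generalization instead. It rests on assumptions that are not part of the examples above: a fresh held-out sample (`X_test`, `y_test`) drawn from the same noisy sine, a hypothetical helper named `noisy_sine`, and a `MinMaxScaler` step added only to keep the 15th-degree features numerically well conditioned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(42)


def noisy_sine(x, rng, noise_std=0.3):
    """Hypothetical helper: y = sin(x) plus Gaussian noise."""
    return np.sin(x) + rng.normal(0.0, noise_std, size=x.shape)


# Small training set on an even grid, as in the examples above
x_train = np.linspace(0, 10, 20)
y_train = noisy_sine(x_train, rng)

# Held-out data the models never see during fitting (an added assumption)
x_test = rng.uniform(0, 10, size=200)
y_test = noisy_sine(x_test, rng)

X_train = x_train.reshape(-1, 1)
X_test = x_test.reshape(-1, 1)

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):
    model = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),  # keeps x**15 well conditioned
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The 15th-degree model always has the smallest training error of the three. What marks it as overfit is the gap between its training and test error, which is typically far larger than for the simpler models; the exact numbers depend on the random seed.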

II. Theoretical Foundation: Bias-Variance Decomposition

Bias and variance provide a theoretical framework for understanding overfitting and underfitting. They describe two different sources of model error.

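The expected squared error of a model's prediction can be decomposed as: Total Error = Bias² + Variance + Irreducible Error. The irreducible error is the noise inherent in the data, which no model can remove.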
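
Written out for squared loss, the decomposition at a fixed input x takes the form below. The notation is an assumption introduced here, not taken from the article: f is the true function, ε is noise with mean zero and variance σ², and f̂_D is the model obtained by training on a randomly drawn training set D.

```latex
y = f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Var}(\varepsilon) = \sigma^2

\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The cross terms vanish because the noise on the new observation has zero mean and is independent of the training set.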

1. Bias

Definition: The gap between the expected value of the model's predictions (i.e., the prediction averaged over many possible training sets) and the true value. It reflects the model's systematic error, i.e., whether the model's assumptions about the nature of the problem are correct.
Signs of high bias: The model is too simple to capture the characteristics of the data, leading to underfitting. Regardless of which data is used for training, the results deviate from the true values.
Example: Always using the simple linear model "house price = area × 1000" to predict the price of every house, ignoring important factors like location and floor level, is high bias.

2. Variance

Definition: How much the model's predictions fluctuate when it is trained on different training sets. It reflects the model's sensitivity to random noise in the training data.
Signs of high variance: The model is too complex and overreacts to small changes in the training data (including noise), leading to overfitting. Training on a different dataset may produce a completely different model.
Example: An unconstrained deep neural network might produce a completely different, extremely complex set of prediction rules for each training dataset it sees; this is high variance.
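
Both definitions can be checked numerically. The following sketch assumes the noisy-sine setup from Section I and uses `numpy.polynomial.Polynomial.fit` in place of the scikit-learn models, purely for brevity. It refits a 1st-degree and a 9th-degree polynomial on many freshly generated training sets and records each model's prediction at one query point. The query point `x0 = 5.0`, the degree 9, and the 500 repetitions are arbitrary illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 10, 20)   # same inputs each time; only the noise changes
x0 = 5.0                           # query point (arbitrary illustrative choice)
n_repeats = 500


def predictions_at_x0(degree):
    """Refit on n_repeats fresh noisy training sets; collect predictions at x0."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        y_train = np.sin(x_train) + rng.normal(0.0, 0.3, size=x_train.size)
        preds[i] = Polynomial.fit(x_train, y_train, deg=degree)(x0)
    return preds


stats = {}
for degree in (1, 9):
    preds = predictions_at_x0(degree)
    bias = preds.mean() - np.sin(x0)   # offset of the average prediction from the truth
    spread = preds.std()               # fluctuation across training sets
    stats[degree] = (bias, spread)
    print(f"degree {degree}: bias = {bias:+.3f}, std of predictions = {spread:.3f}")
```

The straight line should show a large bias with little spread, and the 9th-degree polynomial the reverse.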

3. Bias-Variance Tradeoff

This is a core tradeoff in machine learning: in general, we cannot drive both bias and variance to their minimum at the same time.

Increase model complexity: This usually reduces bias (the model becomes more capable) but increases variance (it is more likely to learn noise).
Decrease model complexity: This usually reduces variance (the model is more stable) but increases bias (the model becomes less capable).

Our goal is to find the "sweet spot" of model complexity, where total error is minimized.
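
In practice the sweet spot is usually located empirically. One common approach, sketched here as an assumption rather than as the author's method, is to sweep the polynomial degree and compare cross-validated error with scikit-learn's `validation_curve`. The data (60 noisy sine samples), the degree range 1 to 12, the shuffled 5-fold split, and the `MinMaxScaler` step are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=60)

model = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),  # keeps high powers well conditioned
    PolynomialFeatures(),
    LinearRegression(),
)
degrees = list(range(1, 13))

# Scores are negated MSE, so flip the sign to get errors
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree {d:2d}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")

best_degree = degrees[int(np.argmin(val_mse))]
print("degree with lowest validation MSE:", best_degree)
```

Training error falls steadily as the degree grows, while validation error usually falls at first and then stops improving or rises again. The degree at its minimum is a practical estimate of the sweet spot.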

III. Diagnosis and Response Strategies

How can we tell which state a model is in, and what can we do about it?

1. Diagnostic Method: Learning Curves

Learning curves plot the model's performance (e.g., error) on the training set and the validation set as a function of training set size. Plotting the same two errors against model complexity instead gives the closely related validation curve.
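
As a minimal sketch of how such curves are computed, the example below calls scikit-learn's `learning_curve` for a 1st-degree and a 6th-degree polynomial. It uses synthetic noisy-sine data, which is an assumption and not the dataset used in this article's own example. The helper name `mse_learning_curve`, the sample count, and the training sizes are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=200)


def mse_learning_curve(degree):
    """Return training sizes with mean train and validation MSE for one degree."""
    model = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE, so flip the sign to get errors
    return sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)


curves = {}
for degree in (1, 6):
    sizes, train_mse, val_mse = mse_learning_curve(degree)
    curves[degree] = (sizes, train_mse, val_mse)
    print(f"degree {degree}")
    for n, tr, va in zip(sizes, train_mse, val_mse):
        print(f"  n = {int(n):3d}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")
```

For the straight line, both errors should level off at a similarly high value, which is the signature of high bias: more data does not help. For the 6th-degree model, the validation error should move toward the training error as samples are added, which suggests that more data is what closes a variance gap.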

Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings('ignore')

# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif'] = [
    # Windows priority
    'SimHei', 'Microsoft YaHei',
    # macOS priority
    'PingFang SC', 'Heiti TC',
    # Linux priority
    'WenQuanYi Micro Hei', 'DejaVu Sans',
]
# Fix issue where minus signs display as squares
plt.rcParams['axes.unicode_minus'] = False
# Set plot style
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
# -------------------------- Set Chinese font end --------------------------

# Load data
data = load_diabetes()
X, y = data.data, data.target

# Use only one feature (more suitable for polynomial
