Regularization

Imagine learning to ride a bicycle. At first we are nervous: we grip the handlebars tightly, keep our body stiff, and try to remember every detail of each movement. In machine learning, this habit of fixating on details and trying to control every tiny movement perfectly is called overfitting.

An overfit model (like the nervous beginner) is too complex. It memorizes every sample in the training data, including noise and random fluctuations. As a result it performs poorly on new, unseen data (like actually riding on the road): it lacks generalization ability.

Regularization is a core technique for solving this problem. The idea is to temper the model's eagerness to fit, keep it from getting bogged down in details, and thereby improve its adaptability to new data. In its most common form, regularization adds an extra penalty term to the training objective (the loss function) to limit the model's complexity and keep it from relying too heavily on patterns specific to the training data.

This article covers the principles of regularization, common methods, and their use in engineering practice.

Basic Concepts: The Bias-Variance Tradeoff

Before diving into regularization, we need to understand the two core sources of error in machine learning models: bias and variance. They tell us exactly what regularization adjusts.

- Bias measures the model's systematic error. High bias means the model is too simple and fails to learn even the basic patterns in the training data (underfitting).
- Variance measures the model's sensitivity to random fluctuations in the training data. High variance means the model is too complex and treats noise in the training data as patterns to learn (overfitting).

Our goal is to find the point on the bias-variance tradeoff that minimizes total error. Regularization improves generalization by accepting a small increase in bias (a slightly simpler model) in exchange for a significant reduction in variance.
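
To make the tradeoff concrete, here is a minimal sketch on a toy problem that is not from the article: estimating a mean `mu` from `n` noisy samples with the shrunk estimator `sum(x) / (n + lam)`, which is what minimizing squared error plus `lam * w**2` gives for a single constant `w`. The function name and the numbers are illustrative assumptions. For this estimator the bias and variance have simple closed forms, so no simulation is needed.

```python
def shrunk_mean_errors(mu, sigma, n, lam):
    """Analytic bias^2, variance and MSE of the estimator sum(x) / (n + lam).

    Assumes n independent samples with mean mu and standard deviation sigma.
    """
    shrink = n / (n + lam)                 # factor applied to the plain sample mean
    bias_sq = (mu * shrink - mu) ** 2      # squared systematic error
    variance = shrink ** 2 * sigma ** 2 / n
    return bias_sq, variance, bias_sq + variance


for lam in (0.0, 2.0, 20.0):
    bias_sq, variance, total = shrunk_mean_errors(mu=2.0, sigma=3.0, n=5, lam=lam)
    print(f"lam={lam:5.1f}  bias^2={bias_sq:.3f}  variance={variance:.3f}  mse={total:.3f}")
```

In this toy setting a moderate `lam` adds some bias but removes more variance, so the total error falls; a very large `lam` adds so much bias that the total error rises again.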
L1 and L2 Regularization

The most classic regularization methods add a penalty term based on the model's weight parameters directly to the loss function. Depending on how the penalty is computed, they are divided mainly into L1 and L2 regularization.

Changes in Loss Function

Unregularized loss function (using mean squared error, MSE, as an example): Loss = (1/n) * Σ(True value - Predicted value)²

Loss function after adding a regularization term: Loss_Regularization = Loss + λ * Penalty(Weights)

Where:

- λ (lambda) is the regularization strength coefficient, a hyperparameter greater than 0. It controls the intensity of the penalty: the larger λ is, the heavier the penalty on model complexity and the simpler the model becomes.
- Penalty(Weights) is the penalty term, defined differently for L1 and L2.

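
The regularized loss above can be written out directly. The following is a minimal sketch in plain Python; the function names (`mse`, `penalty`, `regularized_loss`) and the sample numbers are illustrative assumptions, not part of any library.

```python
def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def penalty(weights, kind):
    """L1 penalty (sum of absolute values) or L2 penalty (sum of squares)."""
    if kind == "l1":
        return sum(abs(w) for w in weights)
    if kind == "l2":
        return sum(w * w for w in weights)
    raise ValueError(f"unknown penalty kind: {kind!r}")


def regularized_loss(y_true, y_pred, weights, lam, kind="l2"):
    """Loss + lam * Penalty(Weights), as in the formula above."""
    return mse(y_true, y_pred) + lam * penalty(weights, kind)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
weights = [3.0, -0.5, 0.0]

print(regularized_loss(y_true, y_pred, weights, lam=0.1, kind="l1"))
print(regularized_loss(y_true, y_pred, weights, lam=0.1, kind="l2"))
```

Note that the large weight `3.0` dominates the L2 penalty far more than it dominates the L1 penalty, because it is squared.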
L1 Regularization (Lasso Regression)

- Penalty Term: The sum of the absolute values of all model weight parameters.
- Formula: Penalty = Σ|w_i|, where w_i is the i-th weight.
- Loss Function: Loss_L1 = Loss + λ * Σ|w_i|

Core Characteristics and Effects:

- Feature Selection: L1 regularization tends to produce a sparse weight vector, meaning it pushes the weights of many unimportant features exactly to 0. This amounts to automatic feature selection: the model retains only the most important features.
- Geometric Interpretation: The constraint region is a diamond in 2D (and the analogous pointed shape in higher dimensions). The optimal solution tends to land on one of the corners of this diamond, and the corners lie on the coordinate axes, where some weights are exactly 0.

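
One way to see where the exact zeros come from is the one-dimensional case. Minimizing `0.5 * (w - z)**2 + lam * abs(w)` for a single weight, where `z` is the unpenalized value, has a closed-form solution known as soft-thresholding. The sketch below is an illustration under that simplified assumption (one weight at a time), not a full Lasso solver; the sample values are made up.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) for a single weight w."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # anything within [-lam, lam] is set exactly to zero


unpenalized = [3.0, -0.2, 0.05, -1.5, 0.4]
lam = 0.5
shrunk = [soft_threshold(z, lam) for z in unpenalized]
print(shrunk)  # [2.5, 0.0, 0.0, -1.0, 0.0]
```

Small weights are cut to exactly 0, while large weights are moved toward 0 by a fixed amount `lam`.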
Code Example:

    from sklearn.linear_model import Lasso
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # Generate simulated data
    X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create an L1-regularized model (Lasso); alpha is the regularization strength (i.e. λ)
    lasso_model = Lasso(alpha=0.1)  # the larger alpha is, the stronger the penalty and the more weights become 0
    lasso_model.fit(X_train, y_train)

    # View the model coefficients (weights) and observe sparsity
    print("Lasso Model coefficients:")
    for i, coef in enumerate(lasso_model.coef_):
        print(f"  Feature {i}: {coef:.4f}")

    # Count the number of non-zero weights
    non_zero_count = sum(lasso_model.coef_ != 0)
    print(f"\nNumber of features with non-zero weights: {non_zero_count} / {X.shape[1]}")

Output:

    Lasso Model coefficients:
      Feature 0: 16.6855
      Feature 1: 54.0447
      Feature 2: 5.0302
      Feature 3: 63.5492
      Feature 4: 93.4587
      Feature 5: 70.5421
      Feature 6: 86.9569
      Feature 7: 10.2711
      Feature 8: 3.0697
      Feature 9: 70.7835

    Number of features with non-zero weights: 10 / 10

Here every coefficient stays non-zero: make_regression makes all 10 features informative by default (n_informative=10), and alpha=0.1 is a light penalty relative to coefficients of this size. Sparsity appears when some features are uninformative or when alpha is larger.

L2 Regularization (Ridge Regression)

- Penalty Term: The sum of the squares of all model weight parameters.
- Formula: Penalty = Σ(w_i)²
- Loss Function: Loss_L2 = Loss + λ * Σ(w_i)²

Core Characteristics and Effects:

- Weight Decay: L2 regularization pulls all weight parameters toward 0, but usually not exactly to 0. Because the penalty grows with the square of each weight, large weights are penalized most heavily, which prevents any single weight from becoming too large.
- Improve Ill-posed Problems: When features are multicollinear (highly correlated), ordinary linear regression can be unstable. L2 regularization mitigates this and makes the solution more stable.
- Geometric Interpretation: The constraint region is a circle in 2D (a sphere in higher dimensions). A circle has no corners, so the optimal solution usually lands on a point of its boundary where no coordinate is exactly 0.

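
The stabilizing effect under multicollinearity can be shown on a tiny hand-made dataset. The sketch below assumes two nearly identical features, no intercept, and the objective "sum of squared errors + lam * (w1² + w2²)", whose solution satisfies (XᵀX + lam·I) w = Xᵀy. The helper `ridge_two_features` and the data are illustrative assumptions; the 2x2 system is solved with Cramer's rule to keep the example dependency-free.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for two features, without an intercept."""
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2) + lam
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * c - b * b  # close to zero when the two features are nearly identical
    return ((p * c - b * q) / det, (a * q - b * p) / det)


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.001, 1.999, 3.001, 3.999]   # almost a copy of x1
y = [2.1, 3.9, 6.2, 7.8]            # roughly 2 * x1

w_ols = ridge_two_features(x1, x2, y, lam=0.0)    # large weights of opposite sign
w_ridge = ridge_two_features(x1, x2, y, lam=0.1)  # both weights close to 1

print("lam=0.0:", w_ols)
print("lam=0.1:", w_ridge)
```

Without the penalty, the two weights blow up in opposite directions because the data cannot tell the features apart. A small `lam` splits the effect almost evenly between them.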
Code Example:

    from sklearn.linear_model import Ridge

    # Reuses X_train, y_train and lasso_model from the Lasso example above
    # Create an L2-regularized model (Ridge)
    ridge_model = Ridge(alpha=1.0)  # alpha, i.e. λ
    ridge_model.fit(X_train, y_train)

    # View the model coefficients and observe weight decay
    print("Ridge Model coefficients:")
    for i, coef in enumerate(ridge_model.coef_):
        print(f"  Feature {i}: {coef:.4f}")

    # Compare the coefficients of Lasso and Ridge
    print("\nCoefficient Comparison (Lasso vs Ridge):")
    print("Feature | Lasso Coefficients | Ridge Coefficients")
    print("-" * 35)
    for i in range(len(lasso_model.coef_)):
        print(f"{i:4d} | {lasso_model.coef_[i]:11.4f} | {ridge_model.coef_[i]:11.4f}")

Summary of L1 vs L2 Comparison

| Feature | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
|---|---|---|
| Penalty Term | Σ\|w_i\| | Σ(w_i)² |
| Solution Characteristics | Sparse solution, many weights are exactly 0 | Dense solution, weights are close to 0 but not 0 |
| Core Function | Feature selection | Weight decay, stable solution |
| Geometric Shape | Diamond / polyhedron | Circle / sphere |
| Computation | Optimization is more complex (the penalty is not differentiable at 0) | Optimization is simple (differentiable everywhere) |
| Applicable Scenarios | Many features, of which only a few are believed to be relevant | Features may all contribute, or multicollinearity exists |

Elastic Net

Elastic Net is a compromise between L1 and L2 regularization and contains both penalty terms:

Loss_ElasticNet = Loss + λ1 * Σ|w_i| + λ2 * Σ(w_i)²

It combines the feature selection capability of L1 with the stability of L2, and suits situations where the feature dimension is very high and the features are correlated.

Example

    from sklearn.linear_model import ElasticNet

    # Reuses X_train and y_train from the Lasso example above
    elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio controls the mixing ratio of L1 and L2
    elastic_model.fit(X_train, y_train)

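
scikit-learn's `ElasticNet` is parameterized by `alpha` and `l1_ratio` rather than by separate λ1 and λ2. Per its documentation, the objective is `1 / (2 * n_samples) * ||y - Xw||² + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||²`. The sketch below converts between the two views; the helper name is hypothetical. Because the data term there is scaled by 1/(2n) instead of the 1/n used in this article's MSE, the resulting values are not numerically identical to the λ1 and λ2 in the formula above.

```python
def elastic_net_penalty_weights(alpha, l1_ratio):
    """Return the multipliers of sum(|w_i|) and sum(w_i**2) implied by alpha and l1_ratio.

    Follows the scikit-learn ElasticNet objective; the function itself is
    only an illustration and is not part of scikit-learn.
    """
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_weight, l2_weight


print(elastic_net_penalty_weights(alpha=0.1, l1_ratio=0.5))  # equal mix of L1 and L2
print(elastic_net_penalty_weights(alpha=0.1, l1_ratio=1.0))  # pure L1 (Lasso-like)
```

Setting `l1_ratio=1.0` removes the squared term entirely, while values near 0 make the penalty almost purely L2.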
Other Regularization Techniques

Beyond directly modifying the loss function, other methods achieve regularization by changing the training process or the model structure.

Dropout (for Neural Networks)

Dropout is a highly effective regularization technique for neural networks. During training, it randomly "deactivates" a portion of the neurons in the network by temporarily setting their output to 0.

Working Principle:

- In each training batch, randomly drop each neuron with probability p (e.g., 0.5).
- Forward propagation and backpropagation are performed only on the remaining neurons.
- During testing or prediction, use all neurons, but multiply each neuron's output by (1-p) to keep its expected value consistent with training. Many frameworks, including Keras, instead scale the kept outputs by 1/(1-p) during training, so that no scaling is needed at test time.

Core Idea: Prevent complex co-adaptation between neurons, forcing the network to learn more robust and distributed feature representations. This is like a team; it cannot always rely on a few core members. Everyone needs to have the ability to work independently, so that even if someone is absent, the team can still function normally.
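
The working principle can be sketched in plain Python on a list of activations. This is an illustration only: the function names are made up, and real frameworks apply the same idea to tensors. It shows both the classic formulation (scale by 1-p at test time) and the "inverted" variant (scale by 1/(1-p) during training).

```python
import random


def dropout_train(activations, p, rng):
    """Classic dropout at training time: zero each unit with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p):
    """Classic dropout at test time: keep every unit, scale by the keep probability."""
    return [a * (1.0 - p) for a in activations]


def inverted_dropout_train(activations, p, rng):
    """Inverted dropout: rescale kept units during training (requires p < 1)."""
    keep = 1.0 - p
    return [0.0 if rng.random() < p else a / keep for a in activations]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(activations, 0.5, rng))           # some entries zeroed at random
print(dropout_test(activations, 0.5))                 # [0.5, 1.0, 1.5, 2.0]
print(inverted_dropout_train(activations, 0.5, rng))  # kept entries are doubled
```

Averaged over many random masks, the training-time output of `dropout_train` matches `dropout_test`, which is why the test-time scaling keeps the expected value consistent.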

Code Example (using TensorFlow/Keras):

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    input_dim = 20  # placeholder: number of input features; set this to match your data

    model = Sequential([
        Dense(128, activation='relu', input_shape=(input_dim,)),
        Dropout(0.5),  # add a Dropout layer after the previous layer, dropout rate 50%
        Dense(64, activation='relu'),
        Dropout(0.3),  # dropout rate 30%
        Dense(1, activation='sigmoid')  # output layer
    ])

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Early Stopping

Early Stopping is a simple yet efficient regularization strategy. It does not modify the loss function; instead it monitors the model's performance on the validation set.

Operation Steps:

- Split the data into a training set and a validation set.
- Train the model on the training set and periodically evaluate it on the validation set (e.g., after every epoch).
- Once the validation performance (such as the validation loss) stops improving, or starts to get worse, for several consecutive epochs, stop training.

Core Idea: Stop training at the moment the model is about to start overfitting the training data (i.e., when the validation error starts to rise), thereby keeping the model weights with the best generalization ability.

Example

    from tensorflow.keras.callbacks import EarlyStopping

    # Define the early stopping callback
    # monitor: metric to monitor, such as 'val_loss'
    # patience: number of epochs without improvement to wait before stopping
    # restore_best_weights: whether to restore the weights from the epoch with the best monitored metric
    early_stopping = EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    )

    # Use it in model.fit (X_val, y_val are a held-out validation split)
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=100,
        callbacks=[early_stopping]  # pass in a list of callbacks
    )

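
The bookkeeping behind the operation steps can be sketched without any framework. The function below takes a list of per-epoch validation losses and reports when training would stop and which epoch had the best loss. It is a simplified illustration of the patience idea, not a reproduction of the Keras implementation; the name `early_stopping_epoch` and the `min_delta` parameter are assumptions for this sketch.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0  # new best: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs


val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.64, 0.70, 0.72, 0.75, 0.80]
print(early_stopping_epoch(val_losses, patience=3))  # (9, 6)
print(early_stopping_epoch(val_losses, patience=2))  # (5, 3)
```

With `patience=2` training stops before the later improvement at epoch 6 is reached, which shows why a patience value that is too small can stop training prematurely.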
Practice Exercise: Comprehensive Comparison of Regularization Effects

Let's compare the effects of different regularization methods on a regression task through a complete example.

Example

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import mean_squared_error

    # 1. Generate non-linear data with noise
    np.random.seed(42)
    X = np.linspace(-3, 3, 100).reshape(-1, 1)
    y_true = 0.5 * X.ravel()**2 + X.ravel()  # true quadratic relationship
    y = y_true + np.random.randn(100) * 0.8  # add noise

    # 2. Create models of different complexities (using polynomial features)
    degree = 10  # a 10th-degree polynomial, which is prone to overfitting
    models = {
        'No Regularization': make_pipeline(PolynomialFeatures(degree), LinearRegression()),
        'L1 (Lasso)': make_pipeline(PolynomialFeatures(degree), Lasso(alpha=0.01, max_iter=10000)),
        'L2 (Ridge)': make_pipeline(PolynomialFeatures(degree), Ridge(alpha=0.1)),
        'ElasticNet': make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=0.01, l1_ratio=0.5))
    }

    # 3. Train and predict
    X_plot = np.linspace(-3.5, 3.5, 200).reshape(-1, 1)
    plt.figure(figsize=(12, 8))
    plt.scatter(X, y, s=20, alpha=0.6, label='Training data (With noise)')
    plt.plot(X, y_true, 'k-', linewidth=3, label='True function')

    for name, model in models.items():
        model.fit(X, y)
        y_plot = model.predict(X_plot)
        mse = mean_squared_error(y, model.predict(X))  # MSE on the training data
        plt.plot(X_plot, y_plot, '--', linewidth=2, label=f'{name} (MSE: {mse:.3f})')

    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('Comparison of the suppression effect of different Regularization methods on overfitting (10th degree polynomial)')
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.show()

Exercise Tasks:

- Run the code above and observe how the unregularized model oscillates to fit the noise (overfitting), while the curves of the regularized models are smoother and closer to the true function.
- Try adjusting degree (the polynomial order) and each model's alpha (regularization strength), and observe the effect on the fit.
- (Advanced) Split the data into a training set and a test set, calculate the MSE of each model on the test set, and verify that regularization improves generalization.

Summary and Engineering Suggestions

Regularization is an essential tool in the machine learning engineer's toolbox. To apply it effectively, keep the following points in mind:

- Understand the Nature of the Problem: First use learning curves and validation set performance to determine whether the model actually has an overfitting (high variance) problem.
- Start Simple: You can usually try L2 regularization first, because it is stable and easy to tune. If the feature dimension is extremely high and feature selection is needed, consider L1 or Elastic Net.
- Tuning is Key: The regularization strength λ (or alpha) is a crucial hyperparameter. Select it carefully through cross-validation.
- Combine and Use: In practice, regularization techniques are often combined. For example, when training deep neural networks, Dropout + L2 weight decay + Early Stopping is a very common combination.
- Adapt to the Domain: For computer vision tasks, Dropout and Batch Normalization (which also has some regularization effect) are very effective. For sequence models (such as RNNs and Transformers), Dropout and weight decay are commonly used.

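
As a rough illustration of selecting the regularization strength by cross-validation, here is a dependency-free sketch. It assumes a one-feature ridge model without an intercept, whose closed-form weight for the objective "sum of squared errors + lam * w²" is `sum(x*y) / (sum(x*x) + lam)`. The data, the candidate grid, and the function names are all made up for the example. In practice, scikit-learn provides utilities such as `RidgeCV`, `LassoCV`, and `GridSearchCV` for this purpose.

```python
import random


def ridge_fit_1d(xs, ys, lam):
    """Closed-form weight for y ≈ w * x with penalty lam * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=5):
    """Average validation MSE over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = ridge_fit_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((y - w * x) ** 2 for x, y in valid) / len(valid))
    return sum(fold_errors) / k


# Synthetic data: y = 1.5 * x + noise (illustrative only)
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(30)]
ys = [1.5 * x + rng.gauss(0.0, 1.0) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam in candidates:
    print(f"lam={lam:6.1f}  cv_mse={scores[lam]:.4f}")
print("selected lam:", best_lam)
```

The same pattern extends to any model: fit on the training folds for each candidate value, score on the held-out fold, and keep the value with the lowest average validation error.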
The ultimate goal of regularization is to guide the model from "rote memorization" of training data to "deep understanding" of the universal laws behind the data, thereby making more reliable predictions in the real world. Master it, and you master the key to improving model generalization ability.