## Multiple Linear Regression
In the previous article, we explored simple linear regression, which helped us understand how one feature (independent variable) affects a target (dependent variable). However, real-world problems are often more complex. For example, when predicting house prices, we cannot just look at the house area; we also need to consider the number of bedrooms, location, age of the house, and many other factors. At this point, we need **multiple linear regression**.
Simply put, multiple linear regression is a natural extension of simple linear regression. It allows us to analyze the impact of **multiple independent variables** on a dependent variable simultaneously. This article will take you from zero to fully understanding the core concepts, mathematical principles, implementation methods, and practical applications of multiple linear regression.
* * *
## 1. What is Multiple Linear Regression?
### 1.1 Core Concepts
Multiple linear regression is a statistical method used to establish a linear relationship between **multiple independent variables** (also called features, explanatory variables) and **one continuous dependent variable** (also called target, response variable).
**A vivid analogy:** Imagine you are a chef adjusting the taste of a soup (the target). Simple linear regression is like changing only the amount of salt and observing how the saltiness responds. Multiple linear regression is like controlling salt, sugar, pepper, soy sauce, and other seasonings (the features) at the same time, and then working out how much each seasoning contributes to the final taste.
### 1.2 Model Formula
The mathematical expression of the multiple linear regression model is as follows:
`y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε`
Let's break down each part of this formula:
| Symbol | Name | Meaning |
| --- | --- | --- |
| `y` | Dependent Variable | The target value we want to predict (e.g., house price). |
| `x₁, x₂, ..., xₙ` | Independent Variables | Features used to predict `y` (e.g., area, number of bedrooms, age). |
| `β₀` | Intercept | The predicted baseline value of `y` when all independent variables are 0. |
| `β₁, β₂, ..., βₙ` | Regression Coefficients | The weight of each independent variable `xᵢ`. It means: **when other features remain unchanged, for every 1 unit increase in `xᵢ`, `y` changes by an average of `βᵢ` units**. This is the core of multiple regression analysis. |
| `ε` | Error Term | Random fluctuations that the model cannot explain (e.g., measurement errors, unknown factors). |
**Formula Interpretation Example:** Suppose our model for predicting house price (`y`) is: `Price = 50000 + 3000 * Area + 10000 * Bedrooms - 2000 * Age`
* `β₀ = 50000`: Theoretically, the baseline price of a "house" with 0 area, 0 bedrooms, and 0 age.
* `β₁ = 3000`: When the number of bedrooms and age are the same, for every 1 square meter increase in area, the house price increases by an average of 3000 yuan.
* `β₂ = 10000`: When area and age are the same, for each additional bedroom, the house price increases by an average of 10000 yuan.
* `β₃ = -2000`: When area and number of bedrooms are the same, for every additional year of age, the house price decreases by an average of 2000 yuan.
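
The interpretation above can be checked with a few lines of arithmetic. The sketch below simply evaluates the example formula; the function name `predict_price` and the sample house (100 m², 3 bedrooms, 10 years old) are made up for illustration.

```python
# Coefficients taken from the example formula in the text.
INTERCEPT = 50000       # β₀
COEF_AREA = 3000        # β₁, per square meter
COEF_BEDROOMS = 10000   # β₂, per bedroom
COEF_AGE = -2000        # β₃, per year of age


def predict_price(area, bedrooms, age):
    """Evaluate the example linear model for one house (hypothetical helper)."""
    return INTERCEPT + COEF_AREA * area + COEF_BEDROOMS * bedrooms + COEF_AGE * age


# Hypothetical house: 100 m², 3 bedrooms, 10 years old.
base = predict_price(100, 3, 10)
# Same house with one extra bedroom; area and age are held fixed.
one_more_bedroom = predict_price(100, 4, 10)

print(base)                     # 360000
print(one_more_bedroom - base)  # 10000
```

Holding area and age fixed and adding one bedroom changes the prediction by exactly the bedroom coefficient, which is what "when other features remain unchanged" means.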
* * *
## 2. How to "Train" a Multiple Linear Regression Model?
"Training" a model essentially means finding a set of optimal regression coefficients `(β₀, β₁, ..., βₙ)` based on our existing data, so that the model's predicted values are as close as possible to the true values. This process is usually accomplished through the **least squares method**.
### 2.1 Goal: Minimize the Loss Function
We use the **residual sum of squares** (RSS) as the loss function to measure the model's prediction error: `RSS = Σ(yᵢ - ŷᵢ)²`, where `yᵢ` is the i-th true value and `ŷᵢ` is the model's predicted value for the i-th sample.
**The training goal is to find a set of coefficients that minimizes the RSS value.**
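As a small illustration of the loss, the following sketch computes RSS with NumPy. The four true values and predictions are invented for the example.

```python
import numpy as np

# Hypothetical true values and model predictions for four samples.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred          # yᵢ - ŷᵢ for each sample
rss = float(np.sum(residuals ** 2))  # sum of squared residuals

print(rss)  # 1.5
```

Squaring the residuals means positive and negative errors cannot cancel, and large errors are penalized more heavily than small ones.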
### 2.2 Solution Process (Matrix Form)
When the number of features is large, using matrix operations can represent and solve the problem more efficiently. The model can be written as: `Y = Xβ + ε` where:
* `Y` is a column vector containing all target values.
* `X` is the design matrix, with the first column usually all 1s (corresponding to the intercept term `β₀`), and each subsequent column corresponding to a feature.
* `β` is a column vector containing all regression coefficients.
* `ε` is the error term vector.
Setting the gradient of RSS with respect to `β` to zero yields the closed-form solution (the normal equation): `β = (XᵀX)⁻¹XᵀY`. This formula requires `XᵀX` to be invertible, and explicitly inverting it can be slow and numerically unstable. In practice, libraries such as `scikit-learn` solve the least squares problem with numerically stable linear algebra routines, so we rarely compute the inverse by hand.
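To make the closed-form solution concrete, here is a minimal NumPy sketch on synthetic data. The data, the "true" coefficients, and the variable names are assumptions for the demo. It solves the normal equation with `np.linalg.solve` (which avoids forming the inverse explicitly) and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, coefficients chosen for the demo.
n_samples = 200
features = rng.normal(size=(n_samples, 3))
true_beta = np.array([4.0, 2.0, -1.0, 0.5])  # [β₀, β₁, β₂, β₃]

# Design matrix: a leading column of ones for the intercept.
X = np.column_stack([np.ones(n_samples), features])
Y = X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# Normal equation: solve (XᵀX)β = XᵀY rather than inverting XᵀX.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# The same problem handed to NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("normal equation:", np.round(beta_normal, 2))
print("lstsq:          ", np.round(beta_lstsq, 2))
print("solutions agree:", np.allclose(beta_normal, beta_lstsq))
```

On well-conditioned data like this, both approaches recover coefficients close to the ones used to generate `Y`. The difference matters mainly when features are nearly collinear and `XᵀX` is close to singular.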
* * *
## 3. Implementing Multiple Linear Regression with Python
Let's build a multiple linear regression model using the popular machine learning library `scikit-learn` through a complete example.
### 3.1 Environment Preparation and Data Loading
First, ensure that the necessary libraries are installed, and load a sample dataset.
```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing  # A classic multivariate dataset

# Load California housing dataset
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target  # Add target column: median house value

print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nFeature description:")
print(california.DESCR[:500])  # Print partial description
```
### 3.2 Data Exploration and Preprocessing
Before modeling, it is important to understand the basic structure and statistics of the data.
```python
# 1. View basic data information
df.info()  # info() prints its summary directly and returns None
print("\nBasic statistical description:")
print(df.describe())

# 2. Split features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)  # Feature matrix: contains all columns except house price
y = df['MedHouseVal']  # Target vector: house price

# 3. Split training set and test set (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"\nTraining set shape: {X_train.shape}, Test set shape: {X_test.shape}")
```
### 3.3 Creating, Training, and Evaluating the Model
Now, we create a linear regression model, "fit" it with training data, and evaluate its performance on the test set.
```python
# 1. Create model instance
model = LinearRegression()

# 2. Train model (fit data)
model.fit(X_train, y_train)

# 3. Use trained model for prediction
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# 4. Evaluate model performance
# Mean Squared Error (MSE) - smaller is better
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
# Coefficient of Determination (R²) - closer to 1 is better, indicating stronger model explanatory power
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("=== Model Performance Evaluation ===")
print(f"Training set MSE: {train_mse:.4f}, R²: {train_r2:.4f}")
print(f"Test set MSE: {test_mse:.4f}, R²: {test_r2:.4f}")

# 5. View learned model parameters
print("\n=== Model Parameters ===")
print(f"Intercept (β₀): {model.intercept_:.4f}")
print("Regression coefficients (β₁, β₂, ...):")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:.4f}")
```
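The two metrics used above are easy to compute by hand, which helps when interpreting them. The sketch below uses invented values: MSE is RSS divided by the number of samples, and R² is one minus the ratio of RSS to the total sum of squares around the mean. These are the same quantities that `mean_squared_error` and `r2_score` report.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

mse_manual = float(rss / len(y_true))  # MSE = RSS / n
r2_manual = float(1 - rss / tss)       # R² = 1 - RSS / TSS

print(mse_manual)  # 0.125
print(r2_manual)   # 0.9
```

An R² of 0.9 here means the predictions account for 90% of the variation of `y_true` around its mean. A model that always predicts the mean would score 0.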
### 3.4 Interpreting Results and Visualization
With the model trained, we can visualize how well the predictions match the true values and compare the magnitudes of the learned coefficients.
```python
# Visualization: True values vs Predicted values (test set)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_test_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)  # Plot ideal diagonal line
plt.xlabel('True House Price')
plt.ylabel('Predicted House Price')
plt.title('Multiple Linear Regression: True vs Predicted Values (Test Set)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization: Coefficient magnitudes as a rough indicator of feature impact.
# Note: the features are not standardized, so each magnitude depends on the
# feature's units and the bars are not strictly comparable with one another.
features = X.columns
coefs = model.coef_
plt.figure(figsize=(10, 6))
bars = plt.barh(features, np.abs(coefs))  # Use absolute values to compare impact sizes
plt.xlabel('Absolute Value of Regression Coefficient')
plt.title('Feature Impact on House Price (Based on Absolute Coefficient Values)')

# Add numerical labels (signed coefficients) to bars
for bar, coef in zip(bars, coefs):
    width = bar.get_width()
    plt.text(width + 0.01, bar.get_y() + bar.get_height() / 2, f'{coef:.4f}', va='center')

plt.grid(True, axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
```
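
One caveat when reading coefficient magnitudes: a raw coefficient depends on the units of its feature. The sketch below uses synthetic data (all names and numbers are assumptions) with one feature on a scale of about 1 and another on a scale of about 1000. It compares coefficients fitted on the raw features with coefficients fitted after `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
small_scale = rng.normal(0, 1, n)      # standard deviation about 1
large_scale = rng.normal(0, 1000, n)   # standard deviation about 1000
X_demo = np.column_stack([small_scale, large_scale])

# The large-scale feature has a tiny coefficient but drives most of y.
y_demo = 1.0 * small_scale + 0.01 * large_scale + rng.normal(0, 0.1, n)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled.coef_)
```

On the raw features the first coefficient looks larger, while after standardization the second feature dominates. Standardized coefficients describe the change in `y` per one standard deviation of each feature, which makes them more suitable for comparing features with each other.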
* * *
## 4. Considerations and Challenges of Multiple Linear Regression
### 4.1 Key Assumptions
The validity of the linear regression model is based on several statistical assumptions:
1. **Linearity**: There is a linear relationship between independent variables and the dependent variable.
2. **Independence**: Observations are independent of each other.
3. **Homoscedasticity**: The variance of the error term remains constant across all observations.
4. **Normality**: The error term follows a normal distribution.
5. **No Multicollinearity**: Independent variables should not be highly correlated with each other.
### 4.2 Common Challenges and Solutions
| Challenge | Description | Possible Consequences | Solutions |
| --- | --- | --- | --- |
| **Multicollinearity** | Features are highly correlated with each other. | Coefficient estimates become unstable, making it difficult to interpret the impact of individual features. | 1. Use correlation matrix to check and remove highly correlated features. 2. Use Principal Component Analysis (PCA) for dimensionality reduction. 3. Use regularization methods (such as Ridge regression). |
| **Overfitting** | The model is too complex, perfectly fitting the noise in the training data, performing poorly on the test set. | Test set error is much greater than training set error. | 1. Collect more data. 2. Use fewer features (feature selection). 3. Use regularization. |
| **Nonlinear Relationships** | The data relationship is inherently nonlinear. | Poor model prediction, low R² value. | 1. Transform features (such as polynomial features, logarithmic transformation). 2. Use nonlinear models (such as decision trees, neural networks). |
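
To illustrate the multicollinearity row of the table, the following sketch fits ordinary least squares and Ridge regression to synthetic data in which two features are almost identical. The data and the choice `alpha=1.0` are assumptions for the demo, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)  # L2 penalty on the coefficients

print("corr(x1, x2):      ", np.corrcoef(x1, x2)[0, 1])
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

Because the two features carry nearly the same information, least squares can split the effect between them in an arbitrary way, sometimes with large values of opposite sign. The Ridge penalty shrinks the coefficient vector, which typically yields smaller and more stable estimates at the cost of some bias.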
* * *
## 5. Hands-on Practice: Consolidate Your Understanding
Now, it's your turn to practice! Please complete the following exercises in order to consolidate your understanding of multiple linear regression.
### Exercise 1: Model Interpretation
Using the model coefficients output by the code above, answer the following questions:
1. Which feature has the **largest positive impact** on California house prices? (Judge based on coefficient value)
2. Which feature has a **negative impact** on California house prices?
3. How would you interpret the coefficient of `AveRooms` (average number of rooms)? Describe it in one complete sentence.
### Exercise 2: Diagnosing Multicollinearity
1. Calculate the correlation matrix of the features `X` (`X.corr()`).
2. Find feature pairs with absolute correlation greater than 0.7. They may have multicollinearity.
3. (Optional) Try removing one of the highly correlated features from the model, retrain, and observe how the coefficients and R² score change.
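
As a starting hint for step 2, here is one possible way to list highly correlated pairs. It is shown on a small made-up DataFrame (`toy`, with columns `a`, `b`, `c`) rather than on the housing data, so the exercise itself is left to you.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                         # independent of the others
})

corr = toy.corr().abs()
cols = list(corr.columns)

# Visit each unordered pair once (upper triangle, diagonal excluded).
high_pairs = [
    (cols[i], cols[j], float(corr.iloc[i, j]))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.7
]
print(high_pairs)
```

Replacing `toy` with the feature matrix `X` from the article applies the same check to the California housing features.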
### Exercise 3: Try a New Dataset
1. Load another regression dataset from `sklearn.datasets`, such as `load_diabetes` (diabetes dataset).
2. Repeat the modeling steps in this article: data exploration, splitting, training, evaluation, visualization.
3. Compare the model performance on the two datasets and think about possible reasons.