Ml Multiple Linear Regression

## Multiple Linear Regression

In the previous article, we explored simple linear regression, which shows how one feature (independent variable) affects a target (dependent variable). Real-world problems are usually more complex. When predicting house prices, for example, we cannot look at the house area alone; we also need to consider the number of bedrooms, the location, the age of the house, and many other factors. This is where **multiple linear regression** comes in.

Simply put, multiple linear regression is a natural extension of simple linear regression. It lets us analyze the impact of **multiple independent variables** on a dependent variable simultaneously.

This article walks through the core concepts, mathematical principles, implementation, and practical applications of multiple linear regression, starting from the basics.

* * *

## 1. What is Multiple Linear Regression?

### 1.1 Core Concepts

Multiple linear regression is a statistical method for modeling a linear relationship between **multiple independent variables** (also called features or explanatory variables) and **one continuous dependent variable** (also called the target or response variable).

**A vivid analogy:** Imagine you are a chef adjusting the taste of a soup (the target). Simple linear regression is like adjusting the saltiness using only the amount of salt. Multiple linear regression is like controlling the amounts of salt, sugar, pepper, soy sauce, and other seasonings (the features) at the same time to determine the final taste. It also lets you estimate the specific contribution of each seasoning to the taste.

### 1.2 Model Formula

The multiple linear regression model is written as:

`y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε`

Let's break down each part of this formula:

| Symbol | Name | Meaning |
| --- | --- | --- |
| `y` | Dependent Variable | The target value we want to predict (e.g., house price). |
| `x₁, x₂, ..., xₙ` | Independent Variables | Features used to predict `y` (e.g., area, number of bedrooms, age). |
| `β₀` | Intercept | The predicted baseline value of `y` when all independent variables are 0. |
| `β₁, β₂, ..., βₙ` | Regression Coefficients | The weight of each independent variable `xᵢ`. It means: **when the other features remain unchanged, every 1-unit increase in `xᵢ` changes `y` by `βᵢ` units on average**. This is the core of multiple regression analysis. |
| `ε` | Error Term | Random fluctuations that the model cannot explain (e.g., measurement errors, unknown factors). |

**Formula Interpretation Example:** Suppose our model for predicting house price (`y`) is:

`Price = 50000 + 3000 * Area + 10000 * Bedrooms - 2000 * Age`

* `β₀ = 50000`: Theoretically, the baseline price of a "house" with 0 area, 0 bedrooms, and 0 age. This is an extrapolation and has little practical meaning on its own.
* `β₁ = 3000`: With the number of bedrooms and the age held fixed, every additional square meter of area increases the house price by 3000 yuan on average.
* `β₂ = 10000`: With area and age held fixed, each additional bedroom increases the house price by 10000 yuan on average.
* `β₃ = -2000`: With area and number of bedrooms held fixed, every additional year of age decreases the house price by 2000 yuan on average.

* * *

## 2. How to "Train" a Multiple Linear Regression Model?

"Training" a model means finding the set of regression coefficients `(β₀, β₁, ..., βₙ)` that, given our existing data, makes the model's predicted values as close as possible to the true values. This is usually done with the **least squares method**.

### 2.1 Goal: Minimize the Loss Function

We use the **residual sum of squares** (RSS) as the loss function to measure the model's prediction error.
`RSS = Σ(yᵢ - ŷᵢ)²`

where `yᵢ` is the i-th true value and `ŷᵢ` is the model's predicted value for the i-th sample. **The training goal is to find the set of coefficients that minimizes the RSS value.**

### 2.2 Solution Process (Matrix Form)

When the number of features is large, matrix notation represents and solves the problem more compactly. The model can be written as:

`Y = Xβ + ε`

where:

* `Y` is a column vector containing all target values.
* `X` is the design matrix. Its first column is usually all 1s (corresponding to the intercept term `β₀`), and each subsequent column corresponds to a feature.
* `β` is a column vector containing all regression coefficients.
* `ε` is the error term vector.

Minimizing the RSS with respect to `β` gives the closed-form least squares estimate (valid when `XᵀX` is invertible):

`β̂ = (XᵀX)⁻¹XᵀY`

This formula is exact in theory, but in practice we rarely invert `XᵀX` by hand. Libraries such as `scikit-learn` solve the same least squares problem with efficient, numerically stable linear algebra routines and handle these matrix operations for us.

* * *

## 3. Implementing Multiple Linear Regression with Python

Let's build a multiple linear regression model with the popular machine learning library `scikit-learn`, using a complete example.

### 3.1 Environment Preparation and Data Loading

First, make sure the necessary libraries are installed, then load a sample dataset.
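Before moving to real data, the closed-form solution from Section 2.2 can be sanity-checked on a small synthetic dataset. The sketch below is an illustration only: the data are generated from the house-price formula in Section 1.2, and the feature ranges, noise level, and names such as `X_demo` and `true_beta` are assumptions made for this example. It solves the normal equations with NumPy and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features (hypothetical ranges, chosen only for illustration)
area = rng.uniform(50, 200, size=n)                   # square meters
bedrooms = rng.integers(1, 6, size=n).astype(float)   # 1 to 5 bedrooms
age = rng.uniform(0, 50, size=n)                      # years

# Coefficients taken from the example formula in Section 1.2
true_beta = np.array([50000.0, 3000.0, 10000.0, -2000.0])

# Design matrix: the first column of 1s corresponds to the intercept
X_demo = np.column_stack([np.ones(n), area, bedrooms, age])

# Targets: linear signal plus random noise (the error term)
y_demo = X_demo @ true_beta + rng.normal(0.0, 1000.0, size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming an explicit inverse
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Reference solution from NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Residual sum of squares at the fitted coefficients
rss = float(np.sum((y_demo - X_demo @ beta_normal) ** 2))

print("Normal equation estimate:", np.round(beta_normal, 2))
print("lstsq estimate:          ", np.round(beta_lstsq, 2))
print("RSS:", round(rss, 2))
```

Using `np.linalg.solve` rather than computing `(XᵀX)⁻¹` explicitly is a common choice because it tends to be more stable numerically; both approaches target the same coefficients.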
**Example:**

    # Import necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.datasets import fetch_california_housing  # A classic multivariate dataset

    # Load California housing dataset (downloaded on first use)
    california = fetch_california_housing()
    df = pd.DataFrame(california.data, columns=california.feature_names)
    # Add target column: median house value, in units of $100,000
    df['MedHouseVal'] = california.target

    print("Dataset shape:", df.shape)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nFeature description:")
    print(california.DESCR[:500])  # Print partial description

### 3.2 Data Exploration and Preprocessing

Before modeling, it is important to understand the basic shape and content of the data.

**Example:**

    # 1. View basic data information
    df.info()  # info() prints its summary directly and returns None
    print("\nBasic statistical description:")
    print(df.describe())

    # 2. Split features (X) and target (y)
    X = df.drop('MedHouseVal', axis=1)  # Feature matrix: all columns except house price
    y = df['MedHouseVal']               # Target vector: house price

    # 3. Split training set and test set (70% training, 30% testing)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    print(f"\nTraining set shape: {X_train.shape}, Test set shape: {X_test.shape}")

### 3.3 Creating, Training, and Evaluating the Model

Now we create a linear regression model, "fit" it on the training data, and evaluate its performance on the test set.

**Example:**

    # 1. Create model instance
    model = LinearRegression()

    # 2. Train model (fit data)
    model.fit(X_train, y_train)

    # 3. Use trained model for prediction
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # 4. Evaluate model performance
    # Mean Squared Error (MSE) - smaller is better
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    # Coefficient of Determination (R²) - closer to 1 means the model explains more variance
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    print("=== Model Performance Evaluation ===")
    print(f"Training set MSE: {train_mse:.4f}, R²: {train_r2:.4f}")
    print(f"Test set MSE: {test_mse:.4f}, R²: {test_r2:.4f}")

    # 5. View learned model parameters
    print("\n=== Model Parameters ===")
    print(f"Intercept (β₀): {model.intercept_:.4f}")
    print("Regression coefficients (β₁, β₂, ...):")
    for feature, coef in zip(X.columns, model.coef_):
        print(f"  {feature}: {coef:.4f}")

### 3.4 Interpreting Results and Visualization

The plots below help interpret the coefficients and evaluation metrics. Keep in mind that raw coefficients depend on each feature's units, so comparing their absolute values is only a rough guide to importance unless the features are standardized first.

**Example:**

    # Visualization: True values vs Predicted values (test set)
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_test_pred, alpha=0.5)
    # Plot ideal diagonal line (perfect predictions would fall on it)
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
    plt.xlabel('True House Price')
    plt.ylabel('Predicted House Price')
    plt.title('Multiple Linear Regression: True vs Predicted Values (Test Set)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.show()

    # Visualization: Feature importance (approximated by absolute coefficient values)
    # Note: features are not standardized here, so magnitudes reflect units as well
    features = X.columns
    coefs = model.coef_
    plt.figure(figsize=(10, 6))
    bars = plt.barh(features, np.abs(coefs))  # Use absolute values to compare impact sizes
    plt.xlabel('Absolute Value of Regression Coefficient')
    plt.title('Feature Impact on House Price (Based on Absolute Coefficient Values)')

    # Add numerical labels (signed coefficients) to bars
    for bar, coef in zip(bars, coefs):
        width = bar.get_width()
        plt.text(width + 0.01, bar.get_y() + bar.get_height() / 2,
                 f'{coef:.4f}', va='center')

    plt.grid(True, axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

* * *

## 4. Considerations and Challenges of Multiple Linear Regression

### 4.1 Key Assumptions

The coefficient estimates, and any inference drawn from them, rely on several statistical assumptions:

1. **Linearity**: There is a linear relationship between the independent variables and the dependent variable.
2. **Independence**: Observations are independent of each other.
3. **Homoscedasticity**: The variance of the error term is constant across all observations.
4. **Normality**: The error term follows a normal distribution.
5. **No Multicollinearity**: Independent variables should not be highly correlated with each other.

### 4.2 Common Challenges and Solutions

| Challenge | Description | Possible Consequences | Solutions |
| --- | --- | --- | --- |
| **Multicollinearity** | Features are highly correlated with each other. | Coefficient estimates become unstable, making it difficult to interpret the impact of individual features. | 1. Use a correlation matrix to find and remove highly correlated features. 2. Use Principal Component Analysis (PCA) for dimensionality reduction. 3. Use regularization methods (such as Ridge regression). |
| **Overfitting** | The model is too complex and fits the noise in the training data, so it performs poorly on the test set. | Test set error is much greater than training set error. | 1. Collect more data. 2. Use fewer features (feature selection). 3. Use regularization. |
| **Nonlinear Relationships** | The relationship in the data is inherently nonlinear. | Poor predictions and a low R² value. | 1. Transform features (such as polynomial features or a logarithmic transformation). 2. Use nonlinear models (such as decision trees or neural networks). |

* * *

## 5. Hands-on Practice: Consolidate Your Understanding

Now it's your turn to practice! Complete the following exercises in order to consolidate your understanding of multiple linear regression.
### Exercise 1: Model Interpretation

Using the model coefficients printed by the code above, answer the following questions:

1. Which feature has the **largest positive impact** on California house prices? (Judge by the coefficient value.)
2. Which features have a **negative impact** on California house prices?
3. How would you interpret the coefficient of `AveRooms` (average number of rooms)? Describe it in one complete sentence.

### Exercise 2: Diagnosing Multicollinearity

1. Calculate the correlation matrix of the features `X` (`X.corr()`).
2. Find feature pairs whose absolute correlation is greater than 0.7. These pairs may indicate multicollinearity.
3. (Optional) Remove one feature from a highly correlated pair, retrain the model, and observe how the coefficients and the R² score change.

### Exercise 3: Try a New Dataset

1. Load another regression dataset from `sklearn.datasets`, such as `load_diabetes` (the diabetes dataset).
2. Repeat the modeling steps in this article: data exploration, splitting, training, evaluation, and visualization.
3. Compare the model performance on the two datasets and think about possible reasons for the difference.
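As a possible starting point for Exercise 2, the sketch below shows one way to list feature pairs whose absolute correlation exceeds a threshold. It is a hint rather than a reference solution: the helper `highly_correlated_pairs` and the synthetic columns (`rooms`, `bedrooms`, `income`) are hypothetical names made up for this illustration, and the data are generated so that one pair is strongly correlated by construction. With a pandas DataFrame, `X.corr()` yields the same kind of matrix that `np.corrcoef` computes here.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.7):
    """Return (name_a, name_b, r) for feature pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)  # rowvar=False: each column is a feature
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skip the diagonal
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic stand-in for a feature matrix (hypothetical columns)
rng = np.random.default_rng(0)
n = 500
rooms = rng.normal(5.0, 1.0, size=n)
bedrooms = 0.2 * rooms + rng.normal(0.0, 0.05, size=n)  # built to track rooms closely
income = rng.normal(4.0, 2.0, size=n)                   # generated independently

X_corr_demo = np.column_stack([rooms, bedrooms, income])
names = ["rooms", "bedrooms", "income"]

pairs = highly_correlated_pairs(X_corr_demo, names, threshold=0.7)
for name_a, name_b, r in pairs:
    print(f"{name_a} / {name_b}: r = {r:.2f}")
```

To apply the same idea to the California housing features, pass `X.to_numpy()` and `list(X.columns)` instead of the synthetic arrays.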
