## Forward and Backward Propagation
In deep learning, forward propagation and backward propagation are the two core pillars that support its operation. They are like two sides of the same coin, together forming the complete closed loop of neural networks from learning to application. Thoroughly understanding these two processes is the first key to opening the door to deep learning.
This article walks through these concepts step by step, using clear logic and concrete analogies, so that you understand not only what they are but also why they work.
* * *
## What are Forward and Backward Propagation?
Before diving into the details, let's establish a macro-level understanding.
Imagine you are teaching a child to recognize cats and dogs. You show him a picture (**input**), he makes a judgment based on the knowledge already in his brain (**network parameters**, i.e., weights and biases), and then tells you it's a cat (**output**). This process of looking at the picture -> brain processing -> giving an answer is **forward propagation**.
But the child's judgment may be wrong. You tell him: no, this is a dog. The difference between this correct answer and the child's answer is the **error**. The child needs to reflect on this error: which pieces of knowledge (parameters) in my brain caused this misjudgment? How should I adjust them so I can get it right next time? This process of adjusting knowledge from back to front based on the error is **backward propagation**.
In neural networks:
* **Forward propagation**: The process where data flows from the input layer, through hidden layers, and finally reaches the output layer, producing prediction results. This is an **inference** process.
* **Backward propagation**: Starting from the output layer and working backwards layer by layer, this process takes the error between the prediction produced by forward propagation and the true value, and calculates the "contribution" of each parameter (weight and bias) to that error (i.e., its gradient), so that the parameters can be updated accordingly. This is a **learning** process.
Their relationship forms a simple learning loop: forward propagation -> loss calculation -> backward propagation -> parameter update, and then back to forward propagation.
* * *
## Forward Propagation: The Inference Path of Neural Networks
Forward propagation is the path along which a neural network makes predictions. Let's walk through it with the simplest possible three-layer network (input layer, one hidden layer, output layer).
### Core Concepts and Calculations
Suppose we want to predict house prices, with house area `x` as input. Our miniature network structure is as follows:
* **Input layer**: One neuron, receiving `x`.
* **Hidden layer**: One neuron, with weight `w1` and bias `b1`.
* **Output layer**: One neuron, with weight `w2` and bias `b2`, outputting predicted house price `y_pred`.
Forward propagation calculations are divided into two steps:
**1. Hidden layer calculation**: Input `x` is combined with weight `w1` and bias `b1`, then passed through an activation function (e.g., Sigmoid, denoted `σ`), producing the hidden layer output `a1`.

    z1 = w1 * x + b1
    a1 = σ(z1) = 1 / (1 + exp(-z1))

* `z1` is the linear transformation result.
* `a1` is the output after non-linear activation, which gives the network the ability to learn complex patterns.
**2. Output layer calculation**: The hidden layer output `a1` serves as input and is combined with the output layer weight `w2` and bias `b2` to produce the final prediction `y_pred`. For simplicity, we assume the output layer uses no activation function (i.e., linear output).

    y_pred = w2 * a1 + b2

**Code Example: Manual Implementation of Forward Propagation**

    import numpy as np

    def sigmoid(x):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-x))

    # Initialize network parameters (usually randomly initialized, here specified for demonstration)
    w1, b1 = 2.0, -1.0  # Hidden layer parameters
    w2, b2 = 1.5, 0.5   # Output layer parameters

    def forward_pass(x):
        """Execute one forward propagation"""
        # Hidden layer calculation
        z1 = w1 * x + b1
        a1 = sigmoid(z1)  # Apply activation function
        # Output layer calculation
        y_pred = w2 * a1 + b2  # Linear output
        # Return intermediate results and the final prediction for later use
        return {'z1': z1, 'a1': a1, 'y_pred': y_pred}

    # Assume house area is 3 (unit: hundred square meters)
    x_input = 3.0
    result = forward_pass(x_input)
    print(f"Input x = {x_input}")
    print(f"Hidden layer linear output z1 = w1*x + b1 = {result['z1']:.4f}")
    print(f"Hidden layer activation output a1 = sigmoid(z1) = {result['a1']:.4f}")
    print(f"Final predicted house price y_pred = w2*a1 + b2 = {result['y_pred']:.4f}")

**Output Example:**

    Input x = 3.0
    Hidden layer linear output z1 = w1*x + b1 = 5.0000
    Hidden layer activation output a1 = sigmoid(z1) = 0.9933
    Final predicted house price y_pred = w2*a1 + b2 = 1.9900

This `y_pred` is the network's predicted price for a house with area 3. But obviously, this predicted value (based on our arbitrarily set parameters) is likely far from the true house price. How do we measure this gap and improve it? This requires the **loss function** and the following **backward propagation**.
* * *
## Loss Function: The Yardstick for Prediction Quality
Before backward propagation begins, we must first quantify the gap between the predicted value `y_pred` and the true value `y_true`. This is the role of the loss function.
**Common Loss Functions**:
* **Mean Squared Error**: Suitable for regression problems (e.g., predicting house prices, temperature). `Loss = (1/N) * Σ (y_true - y_pred)^2`
* **Cross-Entropy Loss**: Suitable for classification problems (e.g., image classification, spam detection).
Taking mean squared error as an example, for a single sample:

    Loss = (y_true - y_pred)^2

Our goal is to adjust `w1, b1, w2, b2` to make this `Loss` value as small as possible.
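As a minimal sketch of the two losses named above, the NumPy functions below compute mean squared error and binary cross-entropy. The function names `mse_loss` and `binary_cross_entropy`, the sample values, and the `eps` clipping constant are illustrative assumptions rather than part of the house-price example.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Regression: hypothetical true prices vs. predictions
print(mse_loss([2.5, 3.0], [2.0, 3.5]))  # 0.25
# Classification: hypothetical labels vs. predicted probabilities
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

Both functions return a single non-negative number that shrinks as predictions approach the targets, which is the property the rest of the article relies on.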
* * *
## Backward Propagation: The Learning Engine of Neural Networks
Backward propagation is the core of how deep learning models learn. In essence, it is an efficient application of the **chain rule** to neural networks. The goal is to calculate the **partial derivatives (gradients)** of the loss function `L` with respect to each parameter (`w1, b1, w2, b2`), i.e., `∂L/∂w1`, `∂L/∂b1`, etc. These gradients indicate in which direction and by how much each parameter should be adjusted to reduce the loss.
### Understanding Gradients: Direction and Step Size for Going Downhill
Imagine you are standing blindfolded on a mountain (the **loss surface**), trying to reach the lowest point of the valley (the **minimum loss**). Before each step, you feel with your feet for the steepest slope around you. The **gradient** points in the steepest uphill direction, so the steepest way down is the opposite (negative) gradient direction. Backward propagation is what lets you calculate this gradient precisely at whatever point you are standing on (that is, for the current set of parameters).
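To make the downhill intuition concrete, here is a small sketch on a one-parameter loss. The function `toy_loss`, its minimum at `p = 3`, and the step size `0.1` are hypothetical choices for illustration only. It shows that stepping against the derivative lowers the loss, while stepping along it raises the loss.

```python
def toy_loss(p):
    """A hypothetical one-parameter loss surface with its minimum at p = 3."""
    return (p - 3.0) ** 2

def toy_loss_grad(p):
    """Analytic derivative of toy_loss with respect to p."""
    return 2.0 * (p - 3.0)

p = 0.0                  # starting position on the "mountain"
step = 0.1               # assumed step size
grad = toy_loss_grad(p)  # negative here: the loss falls as p increases

p_downhill = p - step * grad  # move against the gradient
p_uphill = p + step * grad    # move along the gradient

print(f"gradient at p = {p}: {grad}")
print(f"loss at start:            {toy_loss(p):.2f}")
print(f"loss after downhill step: {toy_loss(p_downhill):.2f}")
print(f"loss after uphill step:   {toy_loss(p_uphill):.2f}")
```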
### Backward Propagation Calculation Steps (Chain Rule Derivation)
We continue with our previous miniature network, and assume the true house price `y_true = 2.5`, with loss function mean squared error `L = (y_true - y_pred)^2`.
Backward propagation starts from the output layer, calculating gradients **backwards** layer by layer:
**Calculate Gradients for Output Layer Parameters**
* Gradient of loss `L` with respect to the predicted value `y_pred`: `∂L/∂y_pred = -2 * (y_true - y_pred)`
* Because `y_pred = w2 * a1 + b2`:
    * `∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2) = (∂L/∂y_pred) * a1`
    * `∂L/∂b2 = (∂L/∂y_pred) * (∂y_pred/∂b2) = (∂L/∂y_pred) * 1`
**Calculate Gradients for Hidden Layer Parameters**
* First, we need the gradient of loss `L` with respect to the hidden layer output `a1`. `a1` affects `L` through `y_pred`: `∂L/∂a1 = (∂L/∂y_pred) * (∂y_pred/∂a1) = (∂L/∂y_pred) * w2`
* Then, since `a1 = σ(z1)`, we need the derivative of the Sigmoid function: `σ'(z) = σ(z) * (1 - σ(z))`.
* Finally, calculate the gradients of `L` with respect to the hidden layer parameters `w1` and `b1`:
    * `∂L/∂w1 = (∂L/∂a1) * (∂a1/∂z1) * (∂z1/∂w1) = (∂L/∂a1) * σ'(z1) * x`
    * `∂L/∂b1 = (∂L/∂a1) * (∂a1/∂z1) * (∂z1/∂b1) = (∂L/∂a1) * σ'(z1) * 1`
**Code Example: Manual Implementation of Backward Propagation**

    # Continue from forward propagation code and results
    y_true = 2.5
    y_pred = result['y_pred']
    a1 = result['a1']
    z1 = result['z1']
    x = x_input
    print(f"True value y_true = {y_true}")
    print(f"Predicted value y_pred = {y_pred:.4f}")
    print(f"Initial loss Loss = {(y_true - y_pred)**2:.4f}")
    print("\n--- Starting backward propagation gradient calculation ---")

    # 1. Calculate gradient of loss with respect to y_pred
    dL_dy_pred = -2 * (y_true - y_pred)
    print(f"Gradient ∂L/∂y_pred = -2*(y_true - y_pred) = {dL_dy_pred:.4f}")

    # 2. Calculate gradients for output layer parameters w2, b2
    dL_dw2 = dL_dy_pred * a1
    dL_db2 = dL_dy_pred * 1
    print(f"Gradient ∂L/∂w2 = (∂L/∂y_pred) * a1 = {dL_dw2:.4f}")
    print(f"Gradient ∂L/∂b2 = (∂L/∂y_pred) * 1 = {dL_db2:.4f}")

    # 3. Calculate gradient of loss with respect to hidden layer output a1
    dL_da1 = dL_dy_pred * w2
    print(f"Gradient ∂L/∂a1 = (∂L/∂y_pred) * w2 = {dL_da1:.4f}")

    # 4. Calculate derivative of Sigmoid function at z1
    def sigmoid_derivative(x):
        """Derivative of Sigmoid function"""
        s = sigmoid(x)
        return s * (1 - s)

    sigma_prime_z1 = sigmoid_derivative(z1)
    print(f"Sigmoid derivative σ'(z1) = σ(z1)*(1-σ(z1)) = {sigma_prime_z1:.4f}")

    # 5. Calculate gradients for hidden layer parameters w1, b1
    dL_dw1 = dL_da1 * sigma_prime_z1 * x
    dL_db1 = dL_da1 * sigma_prime_z1 * 1
    print(f"Gradient ∂L/∂w1 = (∂L/∂a1) * σ'(z1) * x = {dL_dw1:.4f}")
    print(f"Gradient ∂L/∂b1 = (∂L/∂a1) * σ'(z1) * 1 = {dL_db1:.4f}")

**Output Example:**

    True value y_true = 2.5
    Predicted value y_pred = 1.9900
    Initial loss Loss = 0.2601

    --- Starting backward propagation gradient calculation ---
    Gradient ∂L/∂y_pred = -2*(y_true - y_pred) = -1.0201
    Gradient ∂L/∂w2 = (∂L/∂y_pred) * a1 = -1.0133
    Gradient ∂L/∂b2 = (∂L/∂y_pred) * 1 = -1.0201
    Gradient ∂L/∂a1 = (∂L/∂y_pred) * w2 = -1.5301
    Sigmoid derivative σ'(z1) = σ(z1)*(1-σ(z1)) = 0.0066
    Gradient ∂L/∂w1 = (∂L/∂a1) * σ'(z1) * x = -0.0305
    Gradient ∂L/∂b1 = (∂L/∂a1) * σ'(z1) * 1 = -0.0102

We now have gradients for all parameters. Every gradient here is negative, which means that **increasing** any of these parameters would **decrease** the loss (the gradient points in the direction in which the loss rises, and here that direction is toward smaller parameter values). To reduce the loss, we **adjust parameters in the opposite direction of the gradient**, which in this case means making each parameter slightly larger.
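A common way to sanity-check hand-derived gradients is to compare them with finite-difference estimates. The sketch below does this for the miniature network, using the same parameter values, input, and target as the example. The helper names `loss_fn`, `analytic_grads`, and `numeric_grads`, and the step `h = 1e-6`, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(params, x, y_true):
    """Forward pass plus squared-error loss for the miniature network."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    y_pred = w2 * a1 + b2
    return (y_true - y_pred) ** 2

def analytic_grads(params, x, y_true):
    """Chain-rule gradients, returned in the order w1, b1, w2, b2."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    y_pred = w2 * a1 + b2
    dL_dy = -2 * (y_true - y_pred)
    dL_da1 = dL_dy * w2
    ds = a1 * (1 - a1)  # sigmoid derivative at z1
    return np.array([dL_da1 * ds * x, dL_da1 * ds, dL_dy * a1, dL_dy])

def numeric_grads(params, x, y_true, h=1e-6):
    """Central-difference approximation of each partial derivative."""
    params = np.asarray(params, dtype=float)
    grads = np.zeros_like(params)
    for i in range(params.size):
        up, down = params.copy(), params.copy()
        up[i] += h
        down[i] -= h
        grads[i] = (loss_fn(up, x, y_true) - loss_fn(down, x, y_true)) / (2 * h)
    return grads

params = [2.0, -1.0, 1.5, 0.5]  # w1, b1, w2, b2 from the example
ga = analytic_grads(params, x=3.0, y_true=2.5)
gn = numeric_grads(params, x=3.0, y_true=2.5)
for name, a, n in zip(["w1", "b1", "w2", "b2"], ga, gn):
    print(f"dL/d{name}: analytic = {a:.6f}, numeric = {n:.6f}")
```

If the two columns agree to several decimal places, the chain-rule derivation is very likely implemented correctly. A large mismatch usually points to a sign error or a missing factor.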
* * *
## Parameter Update: Gradient Descent
After obtaining the gradients, we use the **gradient descent** algorithm to update parameters:

    parameter = parameter - learning_rate * gradient of this parameter

Here, the **learning rate** is an important hyperparameter that controls the step size of each parameter update. If the step size is too small, learning is slow; if it is too large, training may oscillate, fail to converge, or even diverge.
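The effect of the step size can be seen even on a toy problem. The sketch below minimizes the hypothetical loss `p**2` (not the house-price network) with three assumed learning rates. The function name `run_gradient_descent` and all the values are illustrative.

```python
def run_gradient_descent(learning_rate, steps=20, p0=1.0):
    """Minimize the toy loss p**2 and return the distance |p| from the minimum."""
    p = p0
    for _ in range(steps):
        grad = 2.0 * p                # derivative of p**2
        p = p - learning_rate * grad  # gradient descent update
    return abs(p)

for lr in (0.01, 0.1, 1.1):
    distance = run_gradient_descent(lr)
    print(f"learning_rate = {lr}: |p| after 20 steps = {distance:.4f}")
```

With the smallest rate, `p` creeps toward 0 slowly. With the middle rate it gets close quickly. With the largest rate each step overshoots the minimum by more than the previous one, so `|p|` grows instead of shrinking.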
**Code Example: Applying Gradient Descent to Update Parameters**

    learning_rate = 0.1

    # Update parameters
    w1_new = w1 - learning_rate * dL_dw1
    b1_new = b1 - learning_rate * dL_db1
    w2_new = w2 - learning_rate * dL_dw2
    b2_new = b2 - learning_rate * dL_db2
    print("--- Updated parameters ---")
    print(f"w1: {w1:.4f} -> {w1_new:.4f}")
    print(f"b1: {b1:.4f} -> {b1_new:.4f}")
    print(f"w2: {w2:.4f} -> {w2_new:.4f}")
    print(f"b2: {b2:.4f} -> {b2_new:.4f}")

    # Do one forward propagation with the new parameters to verify that the loss decreases
    def forward_pass_with_params(x, w1, b1, w2, b2):
        z1 = w1 * x + b1
        a1 = sigmoid(z1)
        y_pred = w2 * a1 + b2
        return y_pred

    y_pred_new = forward_pass_with_params(x_input, w1_new, b1_new, w2_new, b2_new)
    loss_new = (y_true - y_pred_new) ** 2
    print(f"\nPrediction with new parameters: y_pred_new = {y_pred_new:.4f}")
    print(f"Updated loss New Loss = {loss_new:.4f}")
    print(f"Loss change: {loss_new - (y_true - y_pred)**2:.4f} (negative value indicates loss decrease)")

**Output Example:**

    --- Updated parameters ---
    w1: 2.0000 -> 2.0031
    b1: -1.0000 -> -0.9990
    w2: 1.5000 -> 1.6013
    b2: 0.5000 -> 0.6020

    Prediction with new parameters: y_pred_new = 2.1927
    Updated loss New Loss = 0.0944
    Loss change: -0.1657 (negative value indicates loss decrease)

Great! After one complete cycle of **forward propagation -> loss calculation -> backward propagation -> gradient descent update**, our predicted value `y_pred` moved from `1.99` to `2.19`, closer to the true value `2.5`, and the loss decreased from `0.260` to `0.094`. By repeating this cycle many thousands of times over large amounts of data, a neural network can learn effective parameters and make accurate predictions.
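To see what repeating the cycle looks like, the sketch below runs the same four steps 50 times on the single example sample. The function name `train_single_sample` and the step count are assumptions for illustration. Real training would iterate over a dataset rather than one point.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_single_sample(x, y_true, w1, b1, w2, b2, learning_rate=0.1, steps=50):
    """Repeat forward -> loss -> backward -> update; return the loss history."""
    losses = []
    for _ in range(steps):
        # Forward propagation
        a1 = sigmoid(w1 * x + b1)
        y_pred = w2 * a1 + b2
        losses.append((y_true - y_pred) ** 2)
        # Backward propagation (chain rule)
        dL_dy = -2 * (y_true - y_pred)
        dL_da1 = dL_dy * w2
        ds = a1 * (1 - a1)
        dL_dw1, dL_db1 = dL_da1 * ds * x, dL_da1 * ds
        dL_dw2, dL_db2 = dL_dy * a1, dL_dy
        # Gradient descent update
        w1 -= learning_rate * dL_dw1
        b1 -= learning_rate * dL_db1
        w2 -= learning_rate * dL_dw2
        b2 -= learning_rate * dL_db2
    return losses

losses = train_single_sample(3.0, 2.5, w1=2.0, b1=-1.0, w2=1.5, b2=0.5)
print(f"loss before any update: {losses[0]:.6f}")
print(f"loss after 1 update:    {losses[1]:.6f}")
print(f"loss after {len(losses) - 1} updates:  {losses[-1]:.6f}")
```

Fitting one sample perfectly is easy and says nothing about generalization. The point is only that the loss keeps shrinking as the loop repeats.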
* * *
## Practice Exercises: Consolidate Your Understanding
Now, it's time to consolidate your knowledge through hands-on practice.
**Exercise 1: Expand the Network** Modify the above code to increase the hidden layer to 2 neurons. You need to initialize `w1` and `b1` as arrays with shape `(2,)` (one weight and one bias per hidden neuron); `w2` also becomes an array with shape `(2,)`, since the output neuron now receives two hidden activations. Adjust the forward and backward propagation calculations accordingly, and observe how the network's capability changes.
**Exercise 2: Change Activation Function** Replace the Sigmoid activation function with the ReLU function (`f(x) = max(0, x)`). You need to re-derive and implement the derivative of ReLU (`f'(x) = 1 if x>0 else 0`). Compare how the training process differs when using different activation functions.
**Exercise 3: Implement a Training Loop** Write a complete training loop to train a miniature network on a simple dataset (for example, construct data for `y = 2x + 1 + noise` yourself) to fit it. Set the number of iterations (epoch), print the loss after each iteration, and observe whether the loss continues to decrease as training progresses.
**Exercise 4: Understand the Effect of Learning Rate** Based on Exercise 3, try different `learning_rate` values (such as 0.01, 0.1, 0.5, 1.0). Observe how the loss curve changes when the learning rate is too large or too small (whether it oscillates, diverges, or converges slowly), and deeply understand the importance of learning rate as step size.
* * *
## Summary
Forward propagation and backward propagation are the core dynamics of neural network learning:
1. **Forward propagation** is the **inference path**, which uses current parameters to map inputs to outputs, and calculates the score of current performance (loss).
2. **Backward propagation** is the **learning algorithm**, which uses the chain rule to efficiently calculate the gradients of the loss function with respect to every parameter in the network, indicating the direction for parameter optimization.
3. **Gradient descent** is the **optimization strategy**, which actually updates parameters based on the gradients provided by backward propagation, with learning rate as step size, gradually improving the network's performance.