
Ml Forward And Backward Propagation

## Forward and Backward Propagation

In deep learning, forward propagation and backward propagation are the two core processes behind every neural network. They are two sides of the same coin: together they close the loop from learning to application. Understanding both is the first step into deep learning.

This article deconstructs these seemingly complex concepts step by step, with clear logic and concrete analogies, so that you understand not only what they are but also why they work.

* * *

## What are Forward and Backward Propagation?

Before diving into the details, let's establish a high-level picture.

Imagine you are teaching a child to recognize cats and dogs. You show him a picture (**input**), he makes a judgment based on the knowledge already in his brain (**network parameters**, i.e., weights and biases), and then tells you it's a cat (**output**). This process of looking at the picture -> brain processing -> giving an answer is **forward propagation**.

But the child's judgment may be wrong. You tell him: no, this is a dog. The difference between the correct answer and the child's answer is the **error**. The child needs to reflect on this error: which pieces of knowledge (parameters) in my brain caused this misjudgment? How should I adjust them so I can get it right next time? This process of adjusting knowledge from back to front based on the error is **backward propagation**.

In neural networks:

* **Forward propagation**: The process where data flows from the input layer, through the hidden layers, and finally reaches the output layer, producing a prediction. This is an **inference** process.
* **Backward propagation**: Starting from the output layer and working backwards layer by layer, it uses the error between the forward-pass prediction and the true value to compute the "contribution" of each parameter (weight and bias) to the total error (i.e., the gradient). The parameters are then updated accordingly. This is a **learning** process.

Together they form a simple learning loop: forward propagation -> loss calculation -> backward propagation -> parameter update.

* * *

## Forward Propagation: The Inference Path of Neural Networks

Forward propagation is the path along which a neural network makes predictions. Let's walk through it with the simplest possible three-layer network (input layer, one hidden layer, output layer).

### Core Concepts and Calculations

Suppose we want to predict house prices, with house area `x` as input. Our miniature network is structured as follows:

* **Input layer**: One neuron, receiving `x`.
* **Hidden layer**: One neuron, with weight `w1` and bias `b1`.
* **Output layer**: One neuron, with weight `w2` and bias `b2`, outputting the predicted house price `y_pred`.

Forward propagation is computed in two steps:

**1. Hidden layer calculation**: Input `x` is combined with weight `w1` and bias `b1`, then passed through an activation function (e.g., Sigmoid, denoted `σ`), producing the hidden layer output `a1`.

```
z1 = w1 * x + b1
a1 = σ(z1) = 1 / (1 + exp(-z1))
```

* `z1` is the result of the linear transformation.
* `a1` is the output after non-linear activation, which gives the network the ability to learn complex patterns.

**2. Output layer calculation**: The hidden layer output `a1` serves as input and is combined with the output layer weight `w2` and bias `b2`, producing the final prediction `y_pred`. For simplicity, we assume the output layer has no activation function (i.e., a linear output).
```
y_pred = w2 * a1 + b2
```

**Code Example: Manual Implementation of Forward Propagation**

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-x))

# Initialize network parameters (usually randomly initialized, here specified for demonstration)
w1, b1 = 2.0, -1.0  # Hidden layer parameters
w2, b2 = 1.5, 0.5   # Output layer parameters

def forward_pass(x):
    """Execute one forward propagation"""
    # Hidden layer calculation
    z1 = w1 * x + b1
    a1 = sigmoid(z1)  # Apply activation function

    # Output layer calculation
    y_pred = w2 * a1 + b2  # Linear output

    # Return intermediate results and final prediction for subsequent understanding
    return {'z1': z1, 'a1': a1, 'y_pred': y_pred}

# Assume house area is 3 (unit: hundred square meters)
x_input = 3.0
result = forward_pass(x_input)

print(f"Input x = {x_input}")
print(f"Hidden layer linear output z1 = w1*x + b1 = {result['z1']:.4f}")
print(f"Hidden layer activation output a1 = sigmoid(z1) = {result['a1']:.4f}")
print(f"Final predicted house price y_pred = w2*a1 + b2 = {result['y_pred']:.4f}")
```

**Output Example:**

```
Input x = 3.0
Hidden layer linear output z1 = w1*x + b1 = 5.0000
Hidden layer activation output a1 = sigmoid(z1) = 0.9933
Final predicted house price y_pred = w2*a1 + b2 = 1.9900
```

This `y_pred` is the network's predicted price for a house with area 3. Since the parameters were set arbitrarily, the prediction is likely far from the true house price. How do we measure this gap and reduce it? This requires the **loss function** and the **backward propagation** that follows.

* * *

## Loss Function: The Yardstick for Prediction Quality

Before backward propagation begins, we must first quantify the gap between the predicted value `y_pred` and the true value `y_true`. This is the role of the loss function.

**Common Loss Functions**:

* **Mean Squared Error**: Suitable for regression problems (e.g., predicting house prices, temperature).
  `Loss = (1/N) * Σ (y_true - y_pred)^2`
* **Cross-Entropy Loss**: Suitable for classification problems (e.g., image classification, spam detection).

Taking mean squared error as an example, for a single sample:

```
Loss = (y_true - y_pred)^2
```

Our goal is to adjust `w1, b1, w2, b2` to make this `Loss` value as small as possible.

* * *

## Backward Propagation: The Learning Engine of Neural Networks

Backward propagation is the core learning algorithm of deep learning. In essence, it is an efficient application of the **chain rule** to neural networks. The goal is to calculate the **partial derivatives (gradients)** of the loss function `L` with respect to each parameter (`w1, b1, w2, b2`), i.e., `∂L/∂w1`, `∂L/∂b1`, etc. These gradients indicate in which direction, and how strongly, each parameter should be adjusted to reduce the loss.

### Understanding Gradients: Direction and Step Size for Going Downhill

Imagine you are standing blindfolded on a mountain (the **loss surface**), trying to reach the lowest point of the valley (the **minimum loss**). Before each step, you feel with your feet for the steepest downhill direction around you. The **gradient** points in the steepest uphill direction, so the steepest downhill direction is its negative. Backward propagation is what lets you compute the gradient precisely at the point under your feet (i.e., for the current set of parameters).

### Backward Propagation Calculation Steps (Chain Rule Derivation)

We continue with our previous miniature network and assume the true house price is `y_true = 2.5`, with the single-sample squared error as the loss function: `L = (y_true - y_pred)^2`.
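Before deriving anything analytically, the "feel the slope with your feet" idea can be tried out numerically: nudge one parameter slightly up and down and see how the loss responds. The sketch below is only an illustration under stated assumptions. The helper names `loss_fn` and `numerical_slopes` and the step `eps` are made up for this example; the parameter values, `x = 3.0`, and `y_true = 2.5` are the ones used in this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(params, x=3.0, y_true=2.5):
    """Single-sample squared error of the miniature network (hypothetical helper)."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    y_pred = w2 * a1 + b2
    return (y_true - y_pred) ** 2

def numerical_slopes(params, eps=1e-6):
    """Central-difference estimate of the slope of the loss along each parameter."""
    slopes = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps      # nudge parameter i slightly up
        down[i] -= eps    # and slightly down
        slopes.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return slopes

params = [2.0, -1.0, 1.5, 0.5]  # w1, b1, w2, b2 from the forward-pass example
for name, slope in zip(["w1", "b1", "w2", "b2"], numerical_slopes(params)):
    print(f"slope of L along {name}: {slope:+.4f}")
```

All four slopes come out negative here, meaning a small increase in any of these parameters lowers the loss. This brute-force approach needs two forward passes per parameter, which is why the chain rule is used in practice: it yields all gradients from a single backward sweep.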
Backward propagation starts from the output layer and calculates gradients **backwards**, layer by layer:

**Calculate Gradients for Output Layer Parameters**

* Gradient of loss `L` with respect to the predicted value `y_pred`:

  `∂L/∂y_pred = -2 * (y_true - y_pred)`

* Because `y_pred = w2 * a1 + b2`:

  `∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2) = (∂L/∂y_pred) * a1`

  `∂L/∂b2 = (∂L/∂y_pred) * (∂y_pred/∂b2) = (∂L/∂y_pred) * 1`

**Calculate Gradients for Hidden Layer Parameters**

* First, we need the gradient of loss `L` with respect to the hidden layer output `a1`. `a1` affects `L` through `y_pred`:

  `∂L/∂a1 = (∂L/∂y_pred) * (∂y_pred/∂a1) = (∂L/∂y_pred) * w2`

* Then, since `a1 = σ(z1)`, we need the derivative of the Sigmoid function: `σ'(z) = σ(z)*(1-σ(z))`.
* Finally, calculate the gradients of `L` with respect to the hidden layer parameters `w1` and `b1`:

  `∂L/∂w1 = (∂L/∂a1) * (∂a1/∂z1) * (∂z1/∂w1) = (∂L/∂a1) * σ'(z1) * x`

  `∂L/∂b1 = (∂L/∂a1) * (∂a1/∂z1) * (∂z1/∂b1) = (∂L/∂a1) * σ'(z1) * 1`

**Code Example: Manual Implementation of Backward Propagation**

```python
# Continue from forward propagation code and results
y_true = 2.5
y_pred = result['y_pred']
a1 = result['a1']
z1 = result['z1']
x = x_input

print(f"True value y_true = {y_true}")
print(f"Predicted value y_pred = {y_pred:.4f}")
print(f"Initial loss Loss = {(y_true - y_pred)**2:.4f}")
print("\n--- Starting backward propagation gradient calculation ---")

# 1. Calculate gradient of loss with respect to y_pred
dL_dy_pred = -2 * (y_true - y_pred)
print(f"Gradient ∂L/∂y_pred = -2*(y_true - y_pred) = {dL_dy_pred:.4f}")

# 2. Calculate gradients for output layer parameters w2, b2
dL_dw2 = dL_dy_pred * a1
dL_db2 = dL_dy_pred * 1
print(f"Gradient ∂L/∂w2 = (∂L/∂y_pred) * a1 = {dL_dw2:.4f}")
print(f"Gradient ∂L/∂b2 = (∂L/∂y_pred) * 1 = {dL_db2:.4f}")

# 3. Calculate gradient of loss with respect to hidden layer output a1
dL_da1 = dL_dy_pred * w2
print(f"Gradient ∂L/∂a1 = (∂L/∂y_pred) * w2 = {dL_da1:.4f}")

# 4. Calculate derivative of Sigmoid function at z1
def sigmoid_derivative(x):
    """Derivative of Sigmoid function"""
    s = sigmoid(x)
    return s * (1 - s)

sigma_prime_z1 = sigmoid_derivative(z1)
print(f"Sigmoid derivative σ'(z1) = σ(z1)*(1-σ(z1)) = {sigma_prime_z1:.4f}")

# 5. Calculate gradients for hidden layer parameters w1, b1
dL_dw1 = dL_da1 * sigma_prime_z1 * x
dL_db1 = dL_da1 * sigma_prime_z1 * 1
print(f"Gradient ∂L/∂w1 = (∂L/∂a1) * σ'(z1) * x = {dL_dw1:.4f}")
print(f"Gradient ∂L/∂b1 = (∂L/∂a1) * σ'(z1) * 1 = {dL_db1:.4f}")
```

**Output Example:**

```
True value y_true = 2.5
Predicted value y_pred = 1.9900
Initial loss Loss = 0.2601

--- Starting backward propagation gradient calculation ---
Gradient ∂L/∂y_pred = -2*(y_true - y_pred) = -1.0201
Gradient ∂L/∂w2 = (∂L/∂y_pred) * a1 = -1.0133
Gradient ∂L/∂b2 = (∂L/∂y_pred) * 1 = -1.0201
Gradient ∂L/∂a1 = (∂L/∂y_pred) * w2 = -1.5301
Sigmoid derivative σ'(z1) = σ(z1)*(1-σ(z1)) = 0.0066
Gradient ∂L/∂w1 = (∂L/∂a1) * σ'(z1) * x = -0.0305
Gradient ∂L/∂b1 = (∂L/∂a1) * σ'(z1) * 1 = -0.0102
```

Now we have gradients for all parameters. Every one of them is negative, which means that **increasing** the corresponding parameter would **decrease** the loss: the gradient points in the direction in which the loss rises, and here that direction is towards smaller parameter values. To reduce the loss, we **adjust parameters in the opposite direction of the gradient**, which in this case means increasing them.

* * *

## Parameter Update: Gradient Descent

After obtaining the gradients, we use the **gradient descent** algorithm to update parameters:

```
parameter = parameter - learning_rate * gradient of this parameter
```

Here, the **learning rate** is an important hyperparameter that controls the step size of each parameter update.
If the step size is too small, learning is slow; if it is too large, training may fail to converge or even diverge.

**Code Example: Applying Gradient Descent to Update Parameters**

```python
learning_rate = 0.1

# Update parameters
w1_new = w1 - learning_rate * dL_dw1
b1_new = b1 - learning_rate * dL_db1
w2_new = w2 - learning_rate * dL_dw2
b2_new = b2 - learning_rate * dL_db2

print("--- Updated parameters ---")
print(f"w1: {w1:.4f} -> {w1_new:.4f}")
print(f"b1: {b1:.4f} -> {b1_new:.4f}")
print(f"w2: {w2:.4f} -> {w2_new:.4f}")
print(f"b2: {b2:.4f} -> {b2_new:.4f}")

# Do one forward propagation with new parameters to verify if loss decreases
def forward_pass_with_params(x, w1, b1, w2, b2):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    y_pred = w2 * a1 + b2
    return y_pred

y_pred_new = forward_pass_with_params(x_input, w1_new, b1_new, w2_new, b2_new)
loss_new = (y_true - y_pred_new)**2

print(f"\nPrediction with new parameters: y_pred_new = {y_pred_new:.4f}")
print(f"Updated loss New Loss = {loss_new:.4f}")
print(f"Loss change: {loss_new - (y_true-y_pred)**2:.4f} (negative value indicates loss decrease)")
```

**Output Example:**

```
--- Updated parameters ---
w1: 2.0000 -> 2.0031
b1: -1.0000 -> -0.9990
w2: 1.5000 -> 1.6013
b2: 0.5000 -> 0.6020

Prediction with new parameters: y_pred_new = 2.1927
Updated loss New Loss = 0.0944
Loss change: -0.1657 (negative value indicates loss decrease)
```

Great! After one complete cycle of **forward propagation -> loss calculation -> backward propagation -> gradient descent update**, our predicted value `y_pred` moved from `1.99` to `2.19`, closer to the true value `2.5`, and the loss decreased from `0.260` to `0.094`. By repeating this cycle tens of thousands of times on large amounts of data, a neural network can learn effective parameters and make accurate predictions.

* * *

## Practice Exercises: Consolidate Your Understanding

Now it's time to consolidate your knowledge through hands-on practice.
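As one possible starting point for the exercises, the full cycle from the previous sections can be wrapped into a single function and repeated. This is a minimal sketch, not a definitive implementation: the name `train_step`, the tuple layout of `params`, and the choice of five repetitions on the same single sample are assumptions made for illustration. The formulas are the ones derived in this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(params, x, y_true, learning_rate=0.1):
    """One forward pass, backward pass, and gradient descent update.

    Returns the updated parameters and the loss measured before the update.
    """
    w1, b1, w2, b2 = params

    # Forward propagation
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    y_pred = w2 * a1 + b2
    loss = (y_true - y_pred) ** 2

    # Backward propagation (chain rule, output layer first)
    dL_dy = -2.0 * (y_true - y_pred)
    dL_dw2 = dL_dy * a1
    dL_db2 = dL_dy
    dL_dz1 = dL_dy * w2 * a1 * (1.0 - a1)  # uses sigmoid'(z1) = a1 * (1 - a1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # Gradient descent update
    new_params = (
        w1 - learning_rate * dL_dw1,
        b1 - learning_rate * dL_db1,
        w2 - learning_rate * dL_dw2,
        b2 - learning_rate * dL_db2,
    )
    return new_params, loss

params = (2.0, -1.0, 1.5, 0.5)  # w1, b1, w2, b2 from the article's example
losses = []
for step in range(5):
    params, loss = train_step(params, x=3.0, y_true=2.5)
    losses.append(loss)
    print(f"step {step}: loss before update = {loss:.4f}")
```

Returning new parameters instead of modifying globals keeps each step easy to test and makes it simple to swap in a different activation function or learning rate when working through the exercises.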
**Exercise 1: Expand the Network**

Modify the above code to increase the number of hidden layer neurons to 2. You need to initialize `w1` as an array with shape `(2,)` (two weights) and `b1` as an array with shape `(2,)`. Note that `w2` then also needs shape `(2,)`, one weight per hidden neuron. Adjust the forward and backward propagation calculations accordingly, and observe how the network's capability changes.

**Exercise 2: Change Activation Function**

Replace the Sigmoid activation function with the ReLU function (`f(x) = max(0, x)`). You need to re-derive and implement the derivative of ReLU (`f'(x) = 1 if x>0 else 0`). Compare how the training process differs between the two activation functions.

**Exercise 3: Implement a Training Loop**

Write a complete training loop that fits the miniature network to a simple dataset (for example, generate data for `y = 2x + 1 + noise` yourself). Set the number of iterations (epochs), print the loss after each iteration, and observe whether the loss keeps decreasing as training progresses.

**Exercise 4: Understand the Effect of Learning Rate**

Based on Exercise 3, try different `learning_rate` values (such as 0.01, 0.1, 0.5, 1.0). Observe how the loss curve behaves when the learning rate is too large or too small (whether it oscillates, diverges, or converges slowly), and build an intuition for why the learning rate matters as a step size.

* * *

## Summary

Forward propagation and backward propagation are the core dynamics of neural network learning:

1. **Forward propagation** is the **inference path**: it uses the current parameters to map inputs to outputs, from which the loss (a score of current performance) is calculated.
2. **Backward propagation** is the **learning algorithm**: it uses the chain rule to efficiently calculate the gradient of the loss function with respect to every parameter in the network, indicating the direction for parameter optimization.
3. **Gradient descent** is the **optimization strategy**: it actually updates the parameters based on the gradients provided by backward propagation, with the learning rate as step size, gradually improving the network's performance.