AI Deep Learning

## Deep Learning Basics

Deep learning is not mysterious black magic. At its core it is a mathematical function approximator: give it a large amount of input-output data, and it learns the complex relationships between them.

For example, you give it one million cat images, each labeled "cat", and it works out which combinations of pixels look like a cat.

Another example: you give it one billion sentences of human language, with the next word of each sentence as the target, and it learns which word is most likely to come next given the preceding context.

The core advantage of deep learning is that its performance tends to keep improving as the amount of data grows, whereas traditional machine learning methods typically plateau earlier.

In this chapter, we will start from scratch and break down every component of deep learning step by step:

* Mathematical foundations of neural networks
* The forward propagation computation process
* The role of activation functions
* Design of loss functions
* The principle of backpropagation
* Choice of optimizers
* Regularization techniques
* Normalization methods
* Learning rate scheduling

Finally, we will implement a complete Multilayer Perceptron (MLP) using PyTorch, tying all of these pieces together.

> This chapter involves some mathematics, but don't worry: we explain every formula intuitively rather than asking you to memorize it. The focus is on understanding why it is designed this way, not on how to calculate it by hand.

* * *

## Mathematical Foundations of Neural Networks

Deep learning is built on three branches of mathematics: linear algebra, calculus, and probability theory. You don't need to become a math expert, but you do need to understand a few core concepts.

### Linear Algebra: Matrix Multiplication

The most basic operation in neural networks is matrix multiplication. Why use matrices?
Because they can concisely represent the process of "multiple inputs being transformed by multiple neurons". First, let's look at an intuitive example:

## Example

```python
# ============================================
# Intuitive Understanding of Matrix Multiplication
# ============================================
import numpy as np

# Assume the input is a 3-dimensional vector: [height, weight, age]
# Units are: centimeters, kilograms, years
x = np.array([175, 70, 25])  # Input vector, shape (3,)
print(f"Input x: {x}")
print(f"Input shape: {x.shape}")

# Weight matrix W: 2 neurons, each receiving 3 inputs
# Shape is (output dimension, input dimension) = (2, 3)
W = np.array([
    [0.1, 0.2, 0.3],  # Weights for neuron 1
    [0.4, 0.5, 0.6],  # Weights for neuron 2
])
print(f"\nWeight matrix W:\n{W}")
print(f"Weight shape: {W.shape}")

# Bias b: one bias per neuron
b = np.array([0.1, 0.2])  # Shape (2,)
print(f"\nBias b: {b}")
print(f"Bias shape: {b.shape}")

# Matrix multiplication: y = W · x + b
# Note: numpy's @ operator represents matrix multiplication
y = W @ x + b
print(f"\nOutput y: {y}")
print(f"Output shape: {y.shape}")

# Manual calculation verification
y1 = 0.1 * 175 + 0.2 * 70 + 0.3 * 25 + 0.1  # Neuron 1
y2 = 0.4 * 175 + 0.5 * 70 + 0.6 * 25 + 0.2  # Neuron 2
print(f"\nManual calculation verification: [{y1}, {y2}]")
```

The geometric meaning of matrix multiplication is a "linear transformation": it can rotate, scale, and stretch vector space.

But linear transformations alone are not enough. If each layer were just a matrix multiplication, then no matter how deep the network is, the entire network would be equivalent to a single-layer network, because the composition of multiple linear transformations is still a linear transformation.

This is why we need activation functions: to introduce nonlinearity.

### Calculus: Derivatives and the Chain Rule

The core of training neural networks is "gradient descent": repeatedly adjusting the parameters in the direction that reduces the loss.
The gradient is the multivariate generalization of the derivative: it tells us how much the loss changes when each parameter changes by a small amount. First, let's look at a simple example:

## Example

```python
# ============================================
# Intuitive Understanding of Derivatives
# ============================================

def f(x):
    """A simple function: f(x) = x²"""
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    """Numerical derivative: approximate the derivative using a tiny increment h.

    Derivative definition: f'(x) = lim(h→0) [f(x+h) - f(x)] / h
    """
    return (f(x + h) - f(x)) / h

# Calculate derivative at x=3
x = 3.0
df_dx = numerical_derivative(f, x)
print(f"f({x}) = {f(x)}")
print(f"f'({x}) ≈ {df_dx}")
print(f"Analytical solution (exact value): 2 * {x} = {2 * x}")

# Understanding the meaning of the derivative:
# A derivative of 6 means: at x=3, a small increase dx in x
# raises f(x) by approximately 6 * dx
x_new = x + 0.01
f_new = f(x_new)
print(f"\nx increases from {x} to {x_new}")
print(f"f(x) changes from {f(x)} to {f_new}")
print(f"Actual increase: {f_new - f(x)}")
print(f"Derivative prediction: {df_dx * 0.01}")
```

For multivariate functions, we compute the partial derivative with respect to each variable and combine them into a vector: this is the gradient.

The chain rule is the mathematical foundation of backpropagation. It lets us break down the derivative of a composite function layer by layer.
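As a minimal sketch of how a gradient drives gradient descent, the block below approximates the gradient of a two-variable function numerically and then takes repeated small steps against it. The function `loss`, the starting point, and the learning rate are hypothetical choices made only for illustration, and central differences are used here instead of the one-sided difference shown earlier.

```python
import numpy as np

def loss(p):
    """Hypothetical bowl-shaped loss with its minimum at (1, -2)."""
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def numerical_gradient(func, p, h=1e-6):
    """Approximate the gradient with central differences, one coordinate at a time."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

start = np.array([4.0, 3.0])
grad_at_start = numerical_gradient(loss, start)
print(f"Gradient at {start}: {grad_at_start}")  # analytically (2*(x-1), 2*(y+2))

# Gradient descent: repeatedly move against the gradient
params = start.copy()
learning_rate = 0.1  # assumed value, chosen for illustration
for _ in range(100):
    params = params - learning_rate * numerical_gradient(loss, params)

print(f"Parameters after 100 steps: {params}")
print(f"Loss after 100 steps: {loss(params)}")
```

Each entry of the gradient is one partial derivative, so the update moves every parameter at once. Real frameworks compute these derivatives analytically with the chain rule rather than by finite differences, which is far cheaper when there are many parameters.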
## Example

```python
# ============================================
# Intuitive Understanding of the Chain Rule
# ============================================

# Consider the composite function: y = f(g(x))
# Where: g(x) = x², f(z) = z³
# Then: y = (x²)³ = x⁶

def g(x):
    return x ** 2

def f(z):
    return z ** 3

def y(x):
    return f(g(x))

# Manual derivative calculation (chain rule):
# dy/dx = df/dz * dz/dx = 3*z² * 2*x = 3*(x²)² * 2*x = 6*x⁵
x = 2.0
z = g(x)  # z = 4
dy_dx_chain = 3 * (z ** 2) * 2 * x  # Chain rule calculation
print(f"Chain rule calculation: dy/dx = {dy_dx_chain}")

# Numerical derivative verification
def numerical_derivative_y(x, h=1e-6):
    return (y(x + h) - y(x)) / h

dy_dx_numerical = numerical_derivative_y(x)
print(f"Numerical derivative verification: dy/dx ≈ {dy_dx_numerical}")

# Analytical solution: 6*x^5 = 6*32 = 192
print(f"Analytical solution: 6 * {x}^5 = {6 * (x ** 5)}")
```

The core idea of the chain rule is to break a complex function down into simple functions, differentiate each one separately, and then multiply the results.

Backpropagation is the chain rule applied to neural networks: starting from the output, gradients are passed back layer by layer.

### Probability Theory: Conditional Probability

In classification problems, neural networks often output a "probability distribution": given the input, what is the probability of each class.

The conditional probability P(Y|X) is "the probability of Y occurring given that X has occurred".

For example, P(rain | dark clouds) is "the probability of rain given that dark clouds are seen".
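To make the definition concrete, here is a small sketch that estimates P(rain | dark clouds) by counting. The ten-day weather log is invented purely for illustration and is not data from this chapter.

```python
# Hypothetical weather log: one (dark_clouds, rain) pair per day
observations = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, False), (False, True), (False, False),
    (True, True), (False, False),
]

# P(rain): fraction of all days on which it rained
p_rain = sum(rain for _, rain in observations) / len(observations)

# P(rain | dark clouds): look only at the cloudy days,
# then take the fraction of those days on which it rained
rain_on_cloudy_days = [rain for clouds, rain in observations if clouds]
p_rain_given_clouds = sum(rain_on_cloudy_days) / len(rain_on_cloudy_days)

print(f"P(rain) = {p_rain:.2f}")
print(f"P(rain | dark clouds) = {p_rain_given_clouds:.2f}")
```

Conditioning on dark clouds restricts attention to the cloudy days, which is why the two numbers differ. A classifier does something analogous: it conditions on the input and reports how likely each class is.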
In deep learning, we use neural networks to estimate conditional probabilities such as P(class | input):

## Example

```python
# ============================================
# Using Neural Networks to Output Probability Distributions
# ============================================
import numpy as np

def softmax(x):
    """Softmax function: convert arbitrary real numbers into a probability distribution.

    The outputs sum to 1, and each value is in [0, 1].
    """
    # Subtract the maximum to prevent numerical overflow
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Assume the last layer of the neural network outputs 3 "scores" (logits)
# Corresponding to "cat", "dog", "bird" respectively
logits = np.array([2.0, 1.0, 0.5])
print(f"Neural network output (logits): {logits}")

# Convert to probabilities using Softmax
probabilities = softmax(logits)
print(f"Probability distribution: {probabilities}")
print(f"Sum of probabilities: {np.sum(probabilities)}")

# Interpretation:
# P(cat|input image) ≈ 0.63
# P(dog|input image) ≈ 0.23
# P(bird|input image) ≈ 0.14
print("\nClass probabilities:")
print(f"  Cat: {probabilities[0]:.2%}")
print(f"  Dog: {probabilities[1]:.2%}")
print(f"  Bird: {probabilities[2]:.2%}")
```

Key points of the three mathematical foundations:

| Branch of Mathematics | Core Concepts | Use in Deep Learning |
| --- | --- | --- |
| Linear Algebra | Matrix multiplication, vectors, tensors | Representing network structure, efficient forward propagation computation |
| Calculus | Derivatives, chain rule, gradients | Backpropagation, parameter updates |
| Probability Theory | Conditional probability, probability distributions | Modeling uncertainty, designing loss functions |

* * *

## Forward Propagation

Forward propagation is the process of input data "flowing" through the neural network, from the input layer through the hidden layers to the output layer.

### Computation Graph

Representing a neural network as a computation graph makes the logic of forward propagation and backpropagation clearer.
A computation graph consists of nodes (operations) and edges (data flow). Here is the computation graph of a simple two-layer neural network:

## Example

```python
# ============================================
# Understanding Forward Propagation with Computation Graphs
# ============================================
import numpy as np

def relu(x):
    """ReLU activation function: max(0, x)"""
    return np.maximum(0, x)

def softmax(x):
    """Softmax, as defined earlier, repeated so this block runs on its own"""
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# A simple two-layer neural network
# Input layer (2) → Hidden layer (3) → Output layer (2)

# Input: x = [x1, x2]
x = np.array([1.0, 2.0])
print(f"Input x: {x}")

# Layer 1: Input → Hidden layer
W1 = np.array([
    [0.1, 0.2],  # Hidden neuron 1
    [0.3, 0.4],  # Hidden neuron 2
    [0.5, 0.6],  # Hidden neuron 3
])
b1 = np.array([0.1, 0.2, 0.3])

# Layer 2: Hidden layer → Output layer
W2 = np.array([
    [0.1, 0.2, 0.3],  # Output neuron 1
    [0.4, 0.5, 0.6],  # Output neuron 2
])
b2 = np.array([0.1, 0.2])

# Forward propagation computation graph
# Step 1: Hidden layer linear transformation
z1 = W1 @ x + b1
print(f"\nHidden layer linear transformation z1: {z1}")

# Step 2: Hidden layer activation function
a1 = relu(z1)
print(f"Hidden layer after activation a1: {a1}")

# Step 3: Output layer linear transformation
z2 = W2 @ a1 + b2
print(f"Output layer linear transformation z2: {z2}")

# Step 4: Output layer activation (for classification problems, Softmax is usually used)
a2 = softmax(z2)
print(f"Output layer after activation a2: {a2}")
```

The benefit of a computation graph is that every step is explicit, so during backpropagation we only need to compute gradients in the reverse order.

### Neural Networks from the Matrix Multiplication Perspective

From the perspective of matrix multiplication, a neural network is a series of "linear transformations + nonlinear activations".

Each layer can be abstracted as:

aᵢ = activation(Wᵢ · aᵢ₋₁ + bᵢ)

where a₀ is the input x.
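The layer recurrence above can be sketched as a short loop. The helper name `forward`, the layer sizes, and the random weights are assumptions made for illustration. For simplicity the same activation is applied to every layer, including the last one, which a real classifier would usually replace with Softmax.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z) element-wise."""
    return np.maximum(0, z)

def forward(x, layers, activation=relu):
    """Apply a_i = activation(W_i @ a_{i-1} + b_i) layer by layer, with a_0 = x."""
    a = x
    for W, b in layers:
        a = activation(W @ a + b)
    return a

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs -> 5 hidden -> 3 hidden -> 2 outputs
layer_sizes = [4, 5, 3, 2]
layers = [
    (rng.normal(size=(m, n)), np.zeros(m))  # W has shape (m, n): m neurons, n inputs
    for n, m in zip(layer_sizes[:-1], layer_sizes[1:])
]

x = rng.normal(size=4)
out = forward(x, layers)
print(f"Weight shapes: {[W.shape for W, _ in layers]}")
print(f"Output shape: {out.shape}")
```

Because each weight matrix has shape (outputs, inputs), the output of one layer always has the right size to feed the next.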
This layer formula is concise yet powerful: it describes every fully connected layer, from simple logistic regression to the feed-forward blocks inside complex Transformers.

> Note the dimension matching of matrix multiplication: if the input vector is n-dimensional and the next layer has m neurons, then the weight matrix W must have shape m×n. That way, W·x yields an m-dimensional vector.

### The Role of Activation Functions

As mentioned earlier, without activation functions a deep network is equivalent to a single-layer network. Let's verify this with a simple example:

## Example

```python
# ============================================
# Why Activation Functions Are Needed
# ============================================
import numpy as np

# Assume a two-layer network, but neither layer has an activation function
# Input
x = np.array([1.0, 2.0])

# Layer 1 weights and biases
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([0.1, 0.2])

# Layer 2 weights and biases
W2 = np.array([[0.5, 0.6], [0.7, 0.8]])
b2 = np.array([0.3, 0.4])

# Two layers calculated separately
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2
print(f"Two layers calculated separately: {z2}")

# Merge into one layer calculation
# Expanding W2·(W1·x + b1) + b2 gives: z2 = (W2·W1)·x + (W2·b1 + b2)
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
z_combined = W_combined @ x + b_combined
print(f"Merged into one layer calculation: {z_combined}")

# The results match (up to floating-point rounding)
print(f"\nTwo results are equal: {np.allclose(z2, z_combined)}")
print("Conclusion: Without activation functions, deep network = single-layer network")
```

Activation functions are the key component that makes deep networks meaningful: they introduce nonlinearity, allowing the network to learn complex patterns.

* * *

## Detailed Explanation of Activation Functions

Activation functions determine how a neuron's output responds to its input.
Good activation functions should meet several conditions:

* Nonlinear: this is essential
* Differentiable (at least almost everywhere): so that gradients can be calculated
* Computationally simple: both forward and backward propagation should be fast
* Non-saturating: so that gradients do not shrink toward zero for large inputs

Let's look at the most commonly used activation functions.

### Sigmoid: Saturation and Vanishing Gradients

Sigmoid is one of the earliest activation functions. It compresses any real number into the interval (0, 1).

Formula: σ(x) = 1 / (1 + e⁻ˣ)

## Example

```python
# ============================================
# Sigmoid Activation Function
# ============================================
import numpy as np

def sigmoid(x):
    """Sigmoid function"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of Sigmoid: σ'(x) = σ(x) * (1 - σ(x))"""
    s = sigmoid(x)
    return s * (1 - s)

# Test some values
x_values = [-10, -5, -2, 0, 2, 5, 10]
print("x | sigmoid(x) | sigmoid'(x)")
print("-" * 40)
for x in x_values:
    s = sigmoid(x)
    d = sigmoid_derivative(x)
    print(f"{x:4} | {s:10.6f} | {d:12.6f}")

# Observe: when x is very large or very small, the derivative approaches 0
# This is the "vanishing gradient" problem
x_big = 10.0
print(f"\nWhen x = {x_big}:")
print(f"  sigmoid(x) = {sigmoid(x_big):.10f} (almost 1)")
print(f"  sigmoid'(x) = {sigmoid_derivative(x_big):.10f} (almost 0)")
print("Gradient vanished! During backpropagation, gradients can barely propagate back.")
```

The problems with Sigmoid are clear:

* When |x| > 6, the gradient is almost 0, causing vanishing gradients
* The output is not zero-centered, which affects optimization
* Computing exponentials is relatively slow

Because of these issues, Sigmoid is rarely used in hidden layers in modern deep learning.

### ReLU and Variants

ReLU (Rectified Linear Unit) is one of the most commonly used activation functions today.

Formula: ReLU(x) = max(0, x)

Simple but very effective.
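One rough way to see why this matters is to multiply the activation derivatives that backpropagation accumulates across layers. The sketch below ignores the weight matrices entirely, which is a simplifying assumption. It compares the best case for Sigmoid (derivative 0.25, reached at x = 0) with an active ReLU unit (derivative 1) over a hypothetical depth of 10 layers.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid function: 1 / (1 + e^-x)."""
    return 1 / (1 + np.exp(-x))

depth = 10  # hypothetical number of layers

# Largest possible Sigmoid derivative: sigma(0) * (1 - sigma(0)) = 0.25
sigmoid_factor = sigmoid(0.0) * (1 - sigmoid(0.0))

# Derivative of ReLU for any positive input
relu_factor = 1.0

# By the chain rule, the per-layer factors multiply together
sigmoid_product = sigmoid_factor ** depth
relu_product = relu_factor ** depth

print(f"Sigmoid, best case over {depth} layers: {sigmoid_product:.2e}")
print(f"ReLU, active units over {depth} layers: {relu_product:.2e}")
```

Even in its best case the Sigmoid factor shrinks the gradient by three quarters per layer, while an active ReLU unit passes it through unchanged. ReLU units with negative input have derivative 0 instead, so this comparison covers only the active path.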
## Example

```python
# ============================================
# ReLU and Its Variants
# ============================================
import numpy as np

def relu(x):
    """ReLU: max(0, x)"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 when x > 0, otherwise 0"""
    # Assumed completion: the source text is cut off after "(x >",
    # so this body follows the docstring above
    return (np.asarray(x) > 0).astype(float)
```
