PyTorch Autograd Automatic Differentiation | Rookie Tutorial
Training deep learning models essentially involves repeatedly computing gradients and updating parameters.
Manually deriving gradients for each layer is tedious and error-prone. PyTorch's Autograd (automatic differentiation) engine was designed to solve this problem: it can automatically compute gradients for any computational graph, allowing you to focus on model design rather than calculus derivations.
Core Concepts
1. What is Automatic Differentiation
Automatic Differentiation (AD) is neither numerical differentiation (finite differences) nor symbolic differentiation (algebraic derivation). Instead, it records the operations performed during the computation and applies the chain rule to them in reverse order, step by step, which yields derivatives that are exact up to floating-point rounding.
PyTorch's Autograd uses a dynamic computational graph (Define-by-Run) approach: during each forward pass, a directed acyclic graph (DAG) is built in real time, recording every operation and its inputs/outputs; during backpropagation, the graph is traversed backward to compute gradients at each node.
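To make the contrast with numerical differentiation concrete, here is a minimal sketch that compares Autograd's result with a central finite difference and a hand-derived formula. The function `f`, the evaluation point, and the step size `h` are illustrative assumptions, not part of the tutorial.

```python
import torch

def f(x):
    # Any differentiable scalar function; chosen only for illustration
    return torch.sin(x) * x ** 2

# float64 keeps the finite-difference estimate reasonably accurate
x = torch.tensor(1.5, dtype=torch.float64, requires_grad=True)

# Reverse-mode automatic differentiation
f(x).backward()
autograd_grad = x.grad.item()

# Numerical differentiation: central finite difference with step h
h = 1e-5
with torch.no_grad():
    numeric_grad = ((f(x + h) - f(x - h)) / (2 * h)).item()

# Hand-derived derivative: cos(x) * x^2 + 2x * sin(x)
x0 = x.detach()
analytic_grad = (torch.cos(x0) * x0 ** 2 + 2 * x0 * torch.sin(x0)).item()

print(autograd_grad, numeric_grad, analytic_grad)
```

The Autograd value should agree with the analytic one to within rounding, while the finite-difference estimate carries a small truncation error that depends on `h`.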
2. requires_grad Attribute
The requires_grad attribute of a Tensor controls whether gradients should be tracked for that tensor:
Examples
import torch
# Create a tensor that requires gradient tracking (default requires_grad=False)
x = torch.tensor(3.0, requires_grad=True)
print(x) # tensor(3., requires_grad=True)
print(x.requires_grad) # True
# Can also modify after creation
y = torch.tensor(2.0)
print(y.requires_grad) # False
y.requires_grad_(True) # In-place modification (note the underscore)
print(y.requires_grad) # True
# The result of operations involving tensors with requires_grad=True automatically inherits requires_grad=True
z = x * y
print(z.requires_grad) # True
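The same attribute applies to module parameters. As a small sketch (the layer sizes and the choice to freeze only the weight are assumptions made for illustration), turning off `requires_grad` on a parameter means no gradient is computed for it:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)

# Parameters of nn.Module layers are created with requires_grad=True
print(layer.weight.requires_grad)  # True

# Freeze the weight so Autograd stops tracking it; the bias stays trainable
layer.weight.requires_grad_(False)

out = layer(torch.randn(3, 4)).sum()
out.backward()

print(layer.weight.grad)       # None: no gradient was computed for the frozen weight
print(layer.bias.grad.shape)   # torch.Size([2])
```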
3. grad_fn and Computational Graph
Every tensor produced by an operation records a grad_fn, pointing to the operation node that created it. This forms the "skeleton" of the computational graph:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + y * 3 # z = x² + 3y
print(z) # tensor(13., grad_fn=<AddBackward0>)
print(z.grad_fn) # <AddBackward0 object>
# Trace the chain of operations that created z
print(z.grad_fn.next_functions)
# ((<PowBackward0 object>, 0), (<MulBackward0 object>, 0))
# We can see z is composed of a power operation and a multiplication operation
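Following `next_functions` recursively exposes the whole graph, not just the first level. The helper `graph_nodes` below is a hypothetical name introduced for this sketch; it only relies on `grad_fn` and `next_functions` as shown above.

```python
import torch

def graph_nodes(fn, depth=0, out=None):
    """Collect (depth, node name) pairs by following next_functions."""
    if out is None:
        out = []
    if fn is None:  # Inputs that do not require gradients have no node
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        graph_nodes(next_fn, depth + 1, out)
    return out

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + y * 3

# Print the graph as an indented tree, starting from the node that produced z
for depth, name in graph_nodes(z.grad_fn):
    print("  " * depth + name)
```

Leaf tensors such as `x` and `y` appear at the bottom of the tree as `AccumulateGrad` nodes, which are the nodes that write into `.grad`.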
backward() Backward Propagation
1. Calling backward() on Scalar Output
Calling .backward() on a final scalar (loss value) causes Autograd to automatically compute gradients for all leaf nodes by traversing the computational graph backward, storing results in each tensor's .grad attribute:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Forward pass: z = x² + 3y
z = x ** 2 + y * 3
# Backward pass: automatically compute dz/dx and dz/dy
z.backward()
# Check gradients
print(x.grad) # tensor(4.) ← dz/dx = 2x = 2×2 = 4
print(y.grad) # tensor(3.) ← dz/dy = 3
# Mathematical verification:
# z = x² + 3y
# dz/dx = 2x = 2×2 = 4 ✓
# dz/dy = 3 ✓
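When the gradients are wanted as return values rather than stored in `.grad`, `torch.autograd.grad` can be used instead of `backward()`. A minimal sketch with the same function:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + y * 3

# Returns one gradient per input instead of writing to .grad
dz_dx, dz_dy = torch.autograd.grad(z, (x, y))

print(dz_dx, dz_dy)    # tensor(4.) tensor(3.)
print(x.grad, y.grad)  # None None: .grad is left untouched
```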
2. Accumulation Issue When Calling backward() Multiple Times
Autograd gradients are accumulated, not overwritten. Each call to backward() adds to the existing value in .grad. This is the most common pitfall in training loops:
import torch
x = torch.tensor(2.0, requires_grad=True)
# First backward pass
loss = x ** 2
loss.backward()
print(x.grad) # tensor(4.) ← dL/dx = 2x = 4
# Second backward pass (no zeroing!)
loss = x ** 2
loss.backward()
print(x.grad) # tensor(8.) ← Accumulated! Not 4, but 4+4=8
# Correct approach: zero gradients before each backward pass
x.grad.zero_() # In-place zeroing (note the underscore)
loss = x ** 2
loss.backward()
print(x.grad) # tensor(4.) ← Correct
In a training loop, call optimizer.zero_grad() before each backward() to clear gradients; otherwise, gradients will accumulate continuously, leading to incorrect parameter updates.
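A minimal sketch of that pattern, assuming a single scalar parameter `w`, the toy loss `w ** 2`, and `torch.optim.SGD` with a learning rate picked for illustration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

grads = []
for step in range(3):
    optimizer.zero_grad()   # Clear gradients left over from the previous step
    loss = w ** 2
    loss.backward()         # w.grad now holds only this step's gradient, 2w
    grads.append(w.grad.item())
    optimizer.step()        # w <- w - lr * w.grad

print(grads)
```

Each recorded gradient equals `2w` for the current `w`, so the values shrink as `w` moves toward 0 instead of growing through accumulation.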
3. Calling backward(gradient) on Non-Scalar Output
If the output is a vector or matrix instead of a scalar, backward() requires a gradient argument matching the output shape (i.e., "upstream gradient"). This essentially computes a vector-Jacobian product (VJP):
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Forward pass: y is a vector
y = x ** 2 # y = [1, 4, 9]
# Non-scalar outputs require gradient argument (same shape as y)
# Gradient can be understood as "gradient of loss w.r.t. y"
y.backward(gradient=torch.ones_like(y)) # Assume upstream gradient is all 1
print(x.grad)
# tensor([2., 4., 6.]) ← dy/dx = 2x, computed element-wise
# If upstream gradient is not all 1 (e.g., weighted)
x.grad.zero_()
y = x ** 2 # Rebuild the graph: the first backward() released it
y.backward(gradient=torch.tensor([1.0, 0.5, 2.0])) # Different weights
# Actual computation: x.grad = 2x * gradient = [2×1, 4×0.5, 6×2]
print(x.grad)
# tensor([2., 2., 12.])
# More common practice: First sum/mean to scalar, then backward()
x.grad.zero_()
loss = (x ** 2).sum() # Aggregate vector into scalar
loss.backward()
print(x.grad)
# tensor([2., 4., 6.]) ← Equivalent to first method
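The vector-Jacobian product can be checked against an explicitly built Jacobian. The sketch below uses `torch.autograd.functional.jacobian` for the comparison; the wrapper function `f` and the vector `v` are names chosen for this example.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return x ** 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])   # Upstream gradient

# Full Jacobian J[i, j] = dy_i / dx_j; diagonal here because f is element-wise
J = jacobian(f, x.detach())

# backward(gradient=v) accumulates v^T J into x.grad without forming J
f(x).backward(gradient=v)

print(J)
print(v @ J)     # Values 2, 2, 12
print(x.grad)    # Same values as v @ J
```

Building the full Jacobian is only practical for small inputs; `backward(gradient=v)` computes the same product without materializing the matrix.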
torch.no_grad() Disable Gradient Tracking
During model inference (prediction), gradients are not needed. Using torch.no_grad() skips computational graph construction, significantly saving memory and computation:
import torch
x = torch.tensor(3.0, requires_grad=True)
# Within no_grad context, no operations are tracked for gradients
with torch.no_grad():
    y = x ** 2
    print(y.requires_grad) # False ← No longer tracking gradients
    print(y.grad_fn) # None ← No computational graph node
# After exiting no_grad context, normal tracking resumes
z = x ** 2
print(z.requires_grad) # True
# Common use case: Wrap entire inference process during model evaluation
model = torch.nn.Linear(10, 1)
inputs = torch.randn(32, 10)
with torch.no_grad():
    outputs = model(inputs) # No computational graph built, faster and more memory-efficient
@torch.no_grad() Decorator Syntax
You can also use the decorator form, which is convenient for marking an entire inference function as gradient-free:
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
@torch.no_grad()
def predict(model, x):
    """Inference function, no gradient computation needed"""
    return model(x)
x = torch.randn(5, 10)
output = predict(model, x)
print(output.requires_grad) # False
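The decorator only changes the gradient mode while the function runs. A short sketch using `torch.is_grad_enabled()` to observe this (the function name `grad_mode_inside` is made up for the example):

```python
import torch

@torch.no_grad()
def grad_mode_inside():
    # Reports whether gradient tracking is active while the function runs
    return torch.is_grad_enabled()

print(torch.is_grad_enabled())  # True (the default mode)
print(grad_mode_inside())       # False
print(torch.is_grad_enabled())  # True: restored after the call
```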
detach() Detach from Computational Graph
.detach() returns a new tensor sharing the same data but not tracking gradients. It's commonly used in the following scenarios:
| Scenario | Description |
|---|---|
| Convert intermediate results to numpy array | NumPy does not support tensors with gradients; must detach first |
| Record training loss (logging) | Prevent saving the entire computational graph and avoid memory leaks |
| Freeze gradient propagation in part of the network | Used in GAN training, transfer learning, etc. |
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 + x * 3 # y has grad_fn
# Detached tensor shares data with y but is detached from the computational graph
y_detached = y.detach()
print(y_detached.requires_grad) # False
print(y_detached.grad_fn) # None
# Convert to NumPy (a tensor that requires grad cannot be converted directly)
# y.numpy() # Error: RuntimeError
y_detached.numpy() # Works fine
# When logging loss values, always detach (avoid keeping the computational graph and consuming memory)
losses = []
for i in range(3):
    loss = (x ** 2).sum()
    losses.append(loss.detach().item()) # .item() converts scalar tensor to Python float
    loss.backward()
    x.grad.zero_()
print(losses) # [14.0, 14.0, 14.0]
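The third scenario in the table, blocking gradient flow through part of a computation, can be sketched as follows. The expression is an arbitrary illustration: the detached factor is treated as a constant during the backward pass.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

a = x ** 2
# a.detach() is treated as a constant, so no gradient flows back through x ** 2
b = a.detach() * x
b.sum().backward()

print(x.grad)   # tensor([1., 4., 9.]), i.e. x ** 2 rather than 3 * x ** 2
```

Without the `detach()` call, `b` would be `x ** 3` from Autograd's point of view and the gradient would be `3 * x ** 2`.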
retain_graph Retain Computational Graph
By default, the computational graph is automatically released after backward() (to save memory). If you need multiple backward passes on the same graph (e.g., certain GAN training), pass retain_graph=True:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 # y = x³
# First backward (retain graph)
y.backward(retain_graph=True)
print(x.grad) # tensor(12.) ← dy/dx = 3x² = 3×4 = 12
# Second backward (graph still exists)
x.grad.zero_()
y.backward(retain_graph=True)
print(x.grad) # tensor(12.) ← Same result
# Last backward: no need to retain
x.grad.zero_()
y.backward() # Graph is now released
print(x.grad) # tensor(12.)
# Trying backward again will raise an error (graph already released)
# y.backward() # RuntimeError: Trying to backward through the graph a second time
retain_graph=True keeps the graph's intermediate buffers in memory after backward() instead of freeing them, so memory usage stays higher for as long as the graph is referenced. Only use it when you truly need multiple backward passes.
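A typical case is two losses that share part of the graph. The sketch below (with a made-up `shared_forward` helper) shows the two-pass version next to an alternative that often avoids retain_graph entirely: summing the losses and calling backward() once.

```python
import torch

def shared_forward(x):
    h = x ** 2            # Shared intermediate result
    return h * 3, h + 1   # Two losses built on the same subgraph

# Option A: two backward passes; the first must keep the shared graph alive
x_a = torch.tensor(2.0, requires_grad=True)
loss1, loss2 = shared_forward(x_a)
loss1.backward(retain_graph=True)
loss2.backward()

# Option B: sum the losses and run a single backward pass
x_b = torch.tensor(2.0, requires_grad=True)
loss1, loss2 = shared_forward(x_b)
(loss1 + loss2).backward()

print(x_a.grad, x_b.grad)   # tensor(16.) tensor(16.)
```

Both options give 6x + 2x = 16 at x = 2, because gradients from the two losses accumulate in `.grad` either way.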
Application of Gradients in Neural Network Training
Below is a complete example of manually implementing gradient descent using Autograd, demonstrating the full workflow of Autograd in actual training:
import torch
# Construct training data: y = 2x + 1 with noise
torch.manual_seed(42)
X = torch.randn(100, 1)
y_true = 2 * X + 1 + 0.1 * torch.randn(100, 1)
# Initialize model parameters (need gradient tracking)
w = torch.zeros(1, requires_grad=True) # Weight
b = torch.zeros(1, requires_grad=True) # Bias
lr = 0.1 # Learning rate
epochs = 50 # Number of training epochs
for epoch in range(epochs):
    # 1. Forward pass: compute predictions
    y_pred = X * w + b
    # 2. Compute loss (MSE)
    loss = (y_pred - y_true).pow(2).mean()
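A self-contained sketch of how the remaining steps of such a loop might look, reusing the data and hyperparameters above. Steps 3 to 5 (backward pass, in-place update under `torch.no_grad()`, gradient reset) are an assumed continuation using plain gradient descent, not text recovered from the tutorial.

```python
import torch

torch.manual_seed(42)
X = torch.randn(100, 1)
y_true = 2 * X + 1 + 0.1 * torch.randn(100, 1)

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for epoch in range(50):
    # 1-2. Forward pass and MSE loss, as above
    y_pred = X * w + b
    loss = (y_pred - y_true).pow(2).mean()

    # 3. Backward pass: fills w.grad and b.grad
    loss.backward()

    # 4. Update parameters without recording the update in the graph
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    # 5. Reset gradients so the next epoch does not accumulate
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())  # Expected to approach 2 and 1
```

The update is wrapped in `torch.no_grad()` because modifying a leaf tensor that requires gradients in place is otherwise rejected by Autograd.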