PyTorch Autograd: Automatic Differentiation

Training deep learning models essentially involves repeatedly computing gradients and updating parameters.

Manually deriving gradients for each layer is tedious and error-prone. PyTorch's Autograd (automatic differentiation) engine was designed to solve this problem: it automatically computes gradients for any computation built from differentiable tensor operations, so you can focus on model design rather than calculus derivations.

Core Concepts

1. What is Automatic Differentiation

Automatic Differentiation (AD) is neither numerical differentiation (finite differences) nor symbolic differentiation (algebraic derivation). Instead, it records the elementary operations of a computation and applies the chain rule to them step by step in reverse order, which gives derivatives that are exact up to floating-point rounding.
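To make the contrast with numerical differentiation concrete, here is a minimal sketch that differentiates the same function both ways. The function f(x) = x³ + 2x, the step size h, and the use of float64 are choices made for this illustration only.

```python
import torch

def f(x):
    # Illustrative function: f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2
    return x ** 3 + 2 * x

x0 = 2.0

# Automatic differentiation: record the operations, then apply the chain rule in reverse
x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.item()

# Numerical differentiation: central finite difference with step h
h = 1e-4
f_plus = f(torch.tensor(x0 + h, dtype=torch.float64))
f_minus = f(torch.tensor(x0 - h, dtype=torch.float64))
numeric_grad = ((f_plus - f_minus) / (2 * h)).item()

# Analytical value for reference: f'(2) = 3 * 4 + 2 = 14
analytic_grad = 3 * x0 ** 2 + 2

print("autograd:", autograd_grad)
print("finite difference:", numeric_grad)
print("analytic:", analytic_grad)
```

The finite-difference result depends on h and carries a small truncation error, while the Autograd result matches the analytical derivative up to rounding.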

PyTorch’s Autograd uses a dynamic computational graph (Define-by-Run) approach: during each forward pass, a directed acyclic graph (DAG) is built in real time, recording every operation and its inputs/outputs; during backpropagation, the graph is traversed backward to compute gradients at each node.
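A small sketch of what Define-by-Run means in practice: because the graph is rebuilt on every forward pass, ordinary Python control flow can decide which operations get recorded. The function name forward and its two branches are hypothetical, chosen only for this example.

```python
import torch

def forward(x):
    # Plain Python control flow decides which operation is recorded
    if x.item() > 0:
        return x ** 2      # positive input: a power operation is recorded
    return -3 * x          # otherwise: a multiplication is recorded

x_pos = torch.tensor(2.0, requires_grad=True)
y_pos = forward(x_pos)
y_pos.backward()
print(type(y_pos.grad_fn).__name__, x_pos.grad)  # gradient is 2x = 4

x_neg = torch.tensor(-2.0, requires_grad=True)
y_neg = forward(x_neg)
y_neg.backward()
print(type(y_neg.grad_fn).__name__, x_neg.grad)  # gradient is -3
```

Each call builds a different graph, and backward() differentiates whichever branch actually ran.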

2. requires_grad Attribute

The requires_grad attribute of a Tensor controls whether gradients should be tracked for that tensor:

Examples

import torch

# Create a tensor that requires gradient tracking (default requires_grad=False)

x = torch.tensor(3.0, requires_grad=True)

print(x)  # tensor(3., requires_grad=True)
print(x.requires_grad)  # True

# Can also modify after creation

y = torch.tensor(2.0)
print(y.requires_grad)  # False
y.requires_grad_(True)  # In-place modification (note the underscore)

print(y.requires_grad)  # True

# The result of operations involving tensors with requires_grad=True automatically inherits requires_grad=True

z = x * y
print(z.requires_grad)  # True

3. grad_fn and Computational Graph

Every tensor produced by an operation on gradient-tracking tensors records a grad_fn, pointing to the operation node that created it (tensors you create directly have grad_fn=None). These nodes form the "skeleton" of the computational graph:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + y * 3  # z = x² + 3y

print(z)  # tensor(13., grad_fn=<AddBackward0>)
print(z.grad_fn)  # <AddBackward0 object>

# Trace the chain of operations that created z

print(z.grad_fn.next_functions)

# ((<PowBackward0 object>, 0), (<MulBackward0 object>, 0))

# We can see z is composed of a power operation and a multiplication operation
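Following next_functions recursively walks the whole graph. The helper below is a hypothetical sketch (collect_nodes is not a PyTorch function) that lists every node reachable from z.grad_fn together with its depth.

```python
import torch

def collect_nodes(fn, depth=0, out=None):
    """Walk the backward graph from fn and return (depth, node name) pairs."""
    if out is None:
        out = []
    if fn is None:  # inputs that do not require gradients have no node
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, out)
    return out

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + y * 3

nodes = collect_nodes(z.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The AccumulateGrad nodes at the bottom of the printed tree correspond to the leaf tensors x and y; they are where gradients get written into .grad.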

backward() Backward Propagation

1. Calling backward() on Scalar Output

Calling .backward() on a final scalar (loss value) causes Autograd to traverse the computational graph backward and compute gradients for all leaf nodes (tensors you created directly with requires_grad=True), storing the results in each tensor's .grad attribute:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass: z = x² + 3y
z = x ** 2 + y * 3

# Backward pass: automatically compute dz/dx and dz/dy
z.backward()

# Check gradients
print(x.grad)  # tensor(4.) ← dz/dx = 2x = 2×2 = 4
print(y.grad)  # tensor(3.) ← dz/dy = 3

# Mathematical verification:
# z = x² + 3y
# dz/dx = 2x = 2×2 = 4 ✓
# dz/dy = 3 ✓
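Only leaf tensors get their .grad filled in by default. The sketch below illustrates the difference between a leaf and an intermediate tensor; it uses retain_grad() to opt in to keeping the gradient of the intermediate, and the variable names are chosen for this example only.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # created directly: a leaf
h = x ** 2                                  # produced by an operation: not a leaf
h.retain_grad()                             # ask Autograd to keep dz/dh as well
z = 3 * h

z.backward()

print(x.is_leaf, h.is_leaf)  # True False
print(x.grad)                # dz/dx = 3 * 2x = 12
print(h.grad)                # dz/dh = 3
```

Without the retain_grad() call, h.grad would stay None after backward().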

2. Accumulation Issue When Calling backward() Multiple Times

Autograd gradients are accumulated, not overwritten. Each call to backward() adds to the existing value in .grad. This is the most common pitfall in training loops:

import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass
loss = x ** 2
loss.backward()
print(x.grad)  # tensor(4.) ← dL/dx = 2x = 4

# Second backward pass (no zeroing!)
loss = x ** 2
loss.backward()
print(x.grad)  # tensor(8.) ← Accumulated! Not 4, but 4+4=8

# ✅ Correct approach: Zero gradients before each backward pass
x.grad.zero_()  # In-place zeroing (note the underscore)
loss = x ** 2
loss.backward()
print(x.grad)  # tensor(4.) ← Correct

In neural network training, call optimizer.zero_grad() before each backward() to clear the gradients from the previous step. Otherwise they keep accumulating and the parameter updates will be wrong (unless you are deliberately accumulating gradients over several mini-batches).
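As a minimal sketch of where zero_grad() sits in a training step, assuming a small nn.Linear model, random data, and an SGD optimizer picked purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Random data, used only to make the loop runnable
inputs = torch.randn(16, 3)
targets = torch.randn(16, 1)

losses = []
for step in range(5):
    optimizer.zero_grad()                    # clear gradients left over from the previous step
    loss = loss_fn(model(inputs), targets)   # forward pass builds a fresh graph
    loss.backward()                          # .grad now holds this step's gradients only
    optimizer.step()                         # update parameters from .grad
    losses.append(loss.item())

print(losses)
```

The order zero_grad(), forward, backward(), step() is the usual pattern; calling zero_grad() right after step() instead works equally well.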

3. Calling backward(gradient) on Non-Scalar Output

If the output is a vector or matrix instead of a scalar, backward() requires a gradient argument matching the output shape (i.e., "upstream gradient"). This essentially computes a vector-Jacobian product (VJP):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: y is a vector
y = x ** 2  # y = [1, 4, 9]

# Non-scalar outputs require gradient argument (same shape as y)
# Gradient can be understood as "gradient of loss w.r.t. y"

y.backward(gradient=torch.ones_like(y))  # Assume upstream gradient is all 1
print(x.grad)
# tensor([2., 4., 6.]) ← dy/dx = 2x, computed element-wise

# If upstream gradient is not all 1 (e.g., weighted)
x.grad.zero_()
y.backward(gradient=torch.tensor([1.0, 0.5, 2.0]))  # Different weights
# Actual computation: x.grad = 2x * gradient = [2×1, 4×0.5, 6×2]
print(x.grad)
# tensor([2., 2., 12.])

# More common practice: First sum/mean to scalar, then backward()
x.grad.zero_()
loss = (x ** 2).sum()  # Aggregate vector into scalar
loss.backward()
print(x.grad)
# tensor([2., 4., 6.]) ← Equivalent to first method
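To see that backward(gradient=v) really is a vector-Jacobian product, the sketch below builds the full Jacobian with torch.autograd.functional.jacobian and compares v @ J with x.grad. The names f, v, and J are chosen for this illustration.

```python
import torch

def f(x):
    return x ** 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])  # upstream gradient

# Full Jacobian J[i, j] = d f_i / d x_j (diagonal here because f is element-wise)
J = torch.autograd.functional.jacobian(f, x.detach())

# backward(gradient=v) computes v^T J directly, without building J
f(x).backward(gradient=v)

print(J)
print(v @ J)   # values: [2, 2, 12]
print(x.grad)  # same values
```

For large models the full Jacobian is far too big to store, which is why Autograd only ever computes the product.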

torch.no_grad() Disable Gradient Tracking

During model inference (prediction), gradients are not needed. Inside torch.no_grad(), operations are not recorded in a computational graph, so the intermediate values that backward() would need are not kept, which saves memory and some computation:

import torch

x = torch.tensor(3.0, requires_grad=True)

# Within no_grad context, no operations are tracked for gradients
with torch.no_grad():
    y = x ** 2

print(y.requires_grad)  # False ← No longer tracking gradients
print(y.grad_fn)        # None ← No computational graph node

# After exiting no_grad context, normal tracking resumes
z = x ** 2
print(z.requires_grad)  # True

# Common use case: Wrap entire inference process during model evaluation
model = torch.nn.Linear(10, 1)
inputs = torch.randn(32, 10)
with torch.no_grad():
    outputs = model(inputs)  # No computational graph built, faster and more memory-efficient
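A short sketch of the consequence: a result computed under no_grad has no graph behind it, so calling backward() on it fails. torch.is_grad_enabled() is used here only to show the tracking state inside and outside the context.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

print(torch.is_grad_enabled())          # True: tracking is on by default
with torch.no_grad():
    inside = torch.is_grad_enabled()    # False inside the context
    y = x ** 2

# y has no grad_fn, so there is nothing for backward() to traverse
try:
    y.backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print("backward() failed:", err)
```

This is why no_grad belongs around inference code only, never around the forward pass of a training step.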

@torch.no_grad() Decorator Syntax

You can also use the decorator form, which is convenient for marking an entire inference function as gradient-free:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)

@torch.no_grad()
def predict(model, x):
    """Inference function, no gradient computation needed"""
    return model(x)

x = torch.randn(5, 10)
output = predict(model, x)
print(output.requires_grad)  # False

detach() Detach from Computational Graph

.detach() returns a new tensor that shares the same underlying data but is not tracked by Autograd, so an in-place change to either tensor is visible in the other. It's commonly used in the following scenarios:

Scenario	Description
Convert intermediate results to a NumPy array	Calling .numpy() on a tensor that requires gradients raises a RuntimeError; detach first
Record training loss (logging)	Prevent saving the entire computational graph and avoid memory leaks
Freeze gradient propagation in part of the network	Used in GAN training, transfer learning, etc.

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 + x * 3  # y has grad_fn

# Detached tensor shares data with y but is detached from the computational graph
y_detached = y.detach()

print(y_detached.requires_grad)  # False
print(y_detached.grad_fn)        # None

# ✅ Convert to numpy (tensors with gradients cannot be directly converted)
# y.numpy()  # ❌ Error: RuntimeError

y_detached.numpy()  # ✅ Works fine

# ✅ When logging loss values, store plain numbers rather than graph-attached tensors (which would keep the graph in memory)
losses = []
for i in range(3):
    loss = (x ** 2).sum()
    losses.append(loss.detach().item())  # .item() already returns a plain Python float; use .detach() alone to keep a tensor
    loss.backward()
    x.grad.zero_()

print(losses)  # [14.0, 14.0, 14.0]
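The third scenario in the table, stopping gradient propagation through part of a computation, can be sketched as follows. The two branches a and b are hypothetical and exist only to show which path contributes to x.grad.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

a = x ** 2              # tracked branch: da/dx = 2x
b = (x ** 3).detach()   # detached branch: treated as the constant 8

out = a * b             # only the path through a reaches x
out.backward()

# With b detached: b * 2x = 8 * 4 = 32
# Without detach the result would be d(x^5)/dx = 5x^4 = 80
print(x.grad)
```

In GAN training the same idea is applied to the generator's output when updating the discriminator, so that the discriminator loss does not send gradients into the generator.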

retain_graph Retain Computational Graph

By default, the buffers saved in the computational graph are freed as soon as backward() finishes (to save memory). If you need multiple backward passes over the same graph (e.g., in some GAN training setups), pass retain_graph=True:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3  # y = x³

# First backward (retain graph)
y.backward(retain_graph=True)
print(x.grad)  # tensor(12.) ← dy/dx = 3x² = 3×4 = 12

# Second backward (graph still exists)
x.grad.zero_()
y.backward(retain_graph=True)
print(x.grad)  # tensor(12.) ← Same result

# Last backward — no need to retain
x.grad.zero_()
y.backward()  # Graph is now released
print(x.grad)  # tensor(12.)

# Trying backward again will raise an error (graph already released)
# y.backward()  # ❌ RuntimeError: Trying to backward through the graph a second time

Using retain_graph=True unnecessarily keeps the graph's saved buffers alive for as long as the output tensor is referenced, which increases memory use. Only use it when you truly need multiple backward passes.
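When two losses share part of a graph, summing them and calling backward() once often removes the need for retain_graph=True. A minimal sketch comparing both options, with loss_a and loss_b as made-up losses:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Option 1: two backward passes over a shared intermediate
shared = x ** 3
loss_a = shared * 2     # d/dx = 6x^2 = 24
loss_b = shared + x     # d/dx = 3x^2 + 1 = 13
loss_a.backward(retain_graph=True)  # shared part must survive for the second pass
loss_b.backward()
grad_two_passes = x.grad.clone()

# Option 2: rebuild the graph, sum the losses, call backward() once
x.grad.zero_()
shared = x ** 3
total = shared * 2 + (shared + x)
total.backward()
grad_one_pass = x.grad.clone()

print(grad_two_passes, grad_one_pass)  # both equal 24 + 13 = 37
```

Both options give the same gradient because gradients accumulate additively; the second needs only one traversal and frees the graph immediately.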

Application of Gradients in Neural Network Training

Below is a complete example of manually implementing gradient descent using Autograd, demonstrating the full workflow of Autograd in actual training:

import torch

# Construct training data: y = 2x + 1 with noise
torch.manual_seed(42)
X = torch.randn(100, 1)
y_true = 2 * X + 1 + 0.1 * torch.randn(100, 1)

# Initialize model parameters (need gradient tracking)
w = torch.zeros(1, requires_grad=True)  # Weight
b = torch.zeros(1, requires_grad=True)  # Bias

lr = 0.1  # Learning rate
epochs = 50  # Number of training epochs

for epoch in range(epochs):
    # 1. Forward pass: compute predictions
    y_pred = X * w + b

    # 2. Compute loss (MSE)
    loss = (y_pred - y_true).pow(2).mean()
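A minimal, self-contained sketch of how the remaining steps of such a loop could look, assuming plain gradient descent with in-place updates under torch.no_grad(). Steps 3 to 5 and the final print are assumptions added for illustration.

```python
import torch

torch.manual_seed(42)
X = torch.randn(100, 1)
y_true = 2 * X + 1 + 0.1 * torch.randn(100, 1)

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for epoch in range(50):
    y_pred = X * w + b                       # 1. forward pass
    loss = (y_pred - y_true).pow(2).mean()   # 2. MSE loss

    loss.backward()                          # 3. backward pass fills w.grad and b.grad

    with torch.no_grad():                    # 4. update parameters without recording the update
        w -= lr * w.grad
        b -= lr * b.grad

    w.grad.zero_()                           # 5. clear gradients for the next iteration
    b.grad.zero_()

print(w.item(), b.item(), loss.item())       # w should end up near 2, b near 1
```

The update is wrapped in torch.no_grad() because the parameter change itself is not part of the model's computation, and the gradients are zeroed because backward() accumulates into .grad.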
