
PyTorch torch.optim Optimizer Module | Rookie Tutorial

Why Are Optimizers Needed?

Optimizers are core components in deep learning, responsible for adjusting model parameters based on the gradients of the loss function, enabling the model to gradually approach the optimal solution.
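To make the idea concrete, here is a minimal sketch of what an optimizer does, written by hand as plain gradient descent. The one-parameter problem, the learning rate, and the variable names are illustrative assumptions, not part of torch.optim.

```python
import torch

# Hypothetical one-parameter problem: minimize (w - 3)^2
w = torch.tensor([0.0], requires_grad=True)
lr = 0.1  # illustrative learning rate

for _ in range(100):
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()           # compute d(loss)/dw into w.grad
    with torch.no_grad():
        w -= lr * w.grad      # the kind of update an optimizer applies for you
    w.grad = None             # clear the gradient before the next step

final_w = w.item()            # ends up close to the minimizer, 3.0
```

The classes in torch.optim wrap this loop body (plus momentum, adaptive scaling, and so on) behind zero_grad() and step().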

Common Optimizer Types

Optimizer Name	Main Features	Applicable Scenarios
SGD	Simple and basic, can incorporate momentum	Basic teaching, simple models, CNNs
Adam	Adaptive learning rate	Most deep learning tasks
AdamW	Adam with decoupled weight decay	Tasks that use weight decay for regularization
RMSprop	Adaptive learning rate	RNNs, speech recognition
Adagrad	Per-parameter learning rate	Sparse data, text processing
Adadelta	Adaptive learning rate	Long-term training tasks

Core API of Optimizers

Mastering the basic usage flow of optimizers is the first step in training models with PyTorch.

Basic Usage Flow

Create an instance
Clear gradients
Perform backpropagation
Update parameters

Example


import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

model = SimpleNet()

# 2. Create an optimizer instance
optimizer = optim.Adam(model.parameters(), lr=0.001)

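# 3. Training loop (placeholder data so the example runs; substitute a real DataLoader)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(64, 784)         # Dummy batch of flattened 28x28 inputs
labels = torch.randint(0, 10, (64,))  # Dummy class labels
epochs = 5

for epoch in range(epochs):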
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass
    optimizer.zero_grad()  # Clear gradient cache to avoid gradient accumulation
    loss.backward()         # Compute gradients
    optimizer.step()        # Update parameters

Key Method Explanation

zero_grad(set_to_none=True): Clears the gradients of all optimized parameters. With set_to_none=True (the default since PyTorch 2.0), gradients are set to None instead of being filled with zeros, which uses less memory.
step(): Performs a single parameter update based on gradients and learning rate.
state_dict(): Retrieves the optimizer's state dictionary, useful for saving checkpoints.
load_state_dict(state_dict): Loads the optimizer's state, used for resuming training.
add_param_group(param_group): Dynamically adds a parameter group.

Note: Call zero_grad() before each backward pass unless you are accumulating gradients on purpose; otherwise gradients from earlier batches are added to the new ones and the update no longer reflects the current batch. Using zero_grad(set_to_none=True) is recommended to save GPU memory.
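The accumulation behaviour described in the note can be observed directly. The following is a small sketch on a throwaway nn.Linear layer (the layer size and input are arbitrary assumptions) that runs backward twice without clearing and then compares both zero_grad modes.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
optimizer = optim.SGD(layer.parameters(), lr=0.1)
x = torch.ones(2, 3)

layer(x).sum().backward()
first = layer.weight.grad.clone()

layer(x).sum().backward()               # no zero_grad() in between
accumulated = layer.weight.grad.clone() # second gradient was added to the first

optimizer.zero_grad(set_to_none=False)  # keeps the tensors, fills them with zeros
zeroed = layer.weight.grad.clone()

optimizer.zero_grad(set_to_none=True)   # drops the gradient tensors entirely
is_none = layer.weight.grad is None
```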

Save and Load Optimizer State

To resume training, restore both the model state and the optimizer state (and the scheduler state, if you use a scheduler).

Example


# Save checkpoint (including model, optimizer, and scheduler)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss.item(),  # Store a plain float rather than a tensor attached to the graph
}

torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

start_epoch = checkpoint['epoch'] + 1
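As a quick check that the optimizer state really survives a save and load, here is a sketch of a round trip. It uses an in-memory io.BytesIO buffer in place of 'checkpoint.pth' and a small nn.Linear model; both are assumptions made so the example has no side effects.

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Take one step so Adam has moment estimates worth saving
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

buffer = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save({'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()}, buffer)
buffer.seek(0)

# Rebuild fresh objects, then load the saved state into them
restored_model = nn.Linear(4, 2)
restored_optimizer = optim.Adam(restored_model.parameters(), lr=1e-3)
checkpoint = torch.load(buffer)
restored_model.load_state_dict(checkpoint['model_state_dict'])
restored_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

saved_state = optimizer.state_dict()['state'][0]
loaded_state = restored_optimizer.state_dict()['state'][0]
```

Without the optimizer state, Adam would restart its moment estimates from zero and the first updates after resuming would differ from an uninterrupted run.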

Common Optimizer Details

SGD (Stochastic Gradient Descent)

SGD is the most fundamental optimization algorithm, updating parameters by computing gradients from individual samples or small batches. It serves as the cornerstone of deep learning optimization, with many advanced optimizers built upon it.


# SGD optimizer parameter explanation
# params: Parameters to be optimized (usually from model.parameters())
# lr: Learning rate, controls the step size of parameter updates (default 1e-3 in recent PyTorch releases; older releases required it to be passed)
# momentum: Momentum factor, used to accelerate convergence and reduce oscillations, default 0
# weight_decay: L2 regularization coefficient, prevents overfitting, default 0
# dampening: Dampening applied to the gradient when it is added to the momentum buffer, default 0
# nesterov: Whether to use Nesterov momentum (requires momentum > 0 and dampening = 0), default False

optimizer = optim.SGD(
    params=model.parameters(),
    lr=0.01,  # Learning rate
    momentum=0.9,  # Momentum factor
    weight_decay=1e-4,  # L2 regularization
    nesterov=True  # Enable Nesterov momentum
)

Key Parameter Explanation:

lr (float): Controls the step size of parameter updates.
momentum (float): Used to accelerate convergence and reduce oscillations; common value is 0.9.
weight_decay (float): L2 regularization coefficient to prevent overfitting; commonly set to 1e-4.
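nesterov (bool): Whether to enable Nesterov momentum, which applies a look-ahead correction to the momentum update and often reduces oscillations; requires momentum > 0 and dampening = 0.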

Features:

Simple implementation, foundational algorithm for deep learning optimization.
Adding momentum accelerates convergence and improves training stability.
Convergence is often slower than with adaptive optimizers, but a well-tuned SGD may reach higher final accuracy.
Often used as a baseline for comparing other optimizers.

Although SGD is simple, it often performs well under appropriate hyperparameters, making it an excellent starting point for learning optimization algorithms. In image classification tasks, SGD combined with momentum remains a mainstream choice.
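The momentum update itself is short enough to write out. The sketch below reproduces SGD with momentum by hand (no weight decay, no dampening, no Nesterov) on a toy quadratic and compares it with optim.SGD; the toy loss and the names manual_p and velocity are assumptions for illustration.

```python
import torch
import torch.optim as optim

lr, momentum = 0.1, 0.9
p = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([p], lr=lr, momentum=momentum)

manual_p = p.detach().clone()
velocity = torch.zeros_like(manual_p)

for _ in range(3):
    optimizer.zero_grad()
    (p ** 2).sum().backward()
    optimizer.step()

    grad = 2 * manual_p                    # analytic gradient of sum(p^2)
    velocity = momentum * velocity + grad  # momentum buffer (dampening = 0)
    manual_p = manual_p - lr * velocity    # parameter update
```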

Adam (Adaptive Moment Estimation)

Adam is one of the most widely used optimizers today, combining the advantages of momentum and adaptive learning rates. It adaptively adjusts the learning rate for each parameter by calculating first- and second-order moment estimates of gradients.


# Adam optimizer parameter explanation
# params: Parameters to be optimized
# lr: Learning rate, default 0.001 (recommended value)
# betas: Coefficients for calculating moving averages of gradients and squared gradients (beta1, beta2)
# beta1 controls first-order moment estimation (momentum), default 0.9
# beta2 controls second-order moment estimation (variance), default 0.999
# eps: Numerical stability term to prevent division by zero, default 1e-8
# weight_decay: L2 regularization coefficient, default 0
# amsgrad: Whether to use AMSGrad variant, default False

optimizer = optim.Adam(
    params=model.parameters(),
    lr=0.001,  # Recommended smaller learning rate
    betas=(0.9, 0.999),  # Common momentum parameters
    eps=1e-8,  # Numerical stability term
    weight_decay=1e-4,  # L2 regularization
    amsgrad=False  # Whether to use AMSGrad
)

Key Parameter Explanation:

betas (Tuple[float, float]): Controls exponential moving averages of gradients and squared gradients.
eps (float): Small constant added to the denominator to prevent division by zero.
amsgrad (bool): Whether to use the AMSGrad variant, which was proposed to address a convergence issue in Adam; it does not guarantee better results in practice.

Features:

Adaptive learning rate: Automatically adjusts learning rate based on historical gradient trends.
Combines momentum concept: Uses first-order moment estimation to accelerate convergence.
Robustness: Relatively insensitive to hyperparameter choices.
Fast convergence, suitable for rapid prototyping.

Adam is the default choice for most deep learning tasks, but in certain specific scenarios (such as GANs or reinforcement learning), other optimizers may need to be tried.
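For readers who want to see the moment estimates in action, here is a sketch that applies the Adam update rule by hand (no weight decay, no AMSGrad) and checks it against optim.Adam on a toy quadratic. The toy loss and the names m, v, and manual_p are illustrative assumptions.

```python
import torch
import torch.optim as optim

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
p = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.Adam([p], lr=lr, betas=(beta1, beta2), eps=eps)

manual_p = p.detach().clone()
m = torch.zeros_like(manual_p)  # running mean of gradients
v = torch.zeros_like(manual_p)  # running mean of squared gradients

for t in range(1, 4):
    optimizer.zero_grad()
    (p ** 2).sum().backward()
    optimizer.step()

    grad = 2 * manual_p                       # analytic gradient of sum(p^2)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    manual_p = manual_p - lr * m_hat / (v_hat.sqrt() + eps)
```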

AdamW (Adam with Weight Decay)

AdamW is a variant of Adam that decouples weight decay from the gradient-based update: the decay is applied directly to the weights instead of being added to the gradient. When weight decay is used, AdamW often generalizes better than Adam.


# AdamW optimizer
# Main difference from Adam: how weight_decay is applied
# AdamW shrinks the weights directly, so the decay is not rescaled by the adaptive learning rate

optimizer = optim.AdamW(
    params=model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    weight_decay=0.01,  # Weight decay coefficient (0.01 is AdamW's default)
    amsgrad=False
)

# Note: AdamW's default weight_decay is 0.01, whereas Adam's default is 0
# The two values are not interchangeable because the decay is applied differently

If you want weight decay with an Adam-style optimizer, AdamW is generally preferred over Adam, whose weight_decay argument instead adds an L2 penalty to the gradient.
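The difference between the two weight decay implementations shows up clearly when the loss gradient is zero. The sketch below sets the gradient to zero by hand and uses exaggerated lr and weight_decay values; both choices are assumptions made only to keep the arithmetic visible.

```python
import torch
import torch.optim as optim

lr, wd = 0.1, 0.1  # exaggerated values so the effect is easy to see

p_adam = torch.tensor([2.0], requires_grad=True)
p_adamw = torch.tensor([2.0], requires_grad=True)
adam = optim.Adam([p_adam], lr=lr, weight_decay=wd)
adamw = optim.AdamW([p_adamw], lr=lr, weight_decay=wd)

# Pretend the loss gradient is exactly zero, so only weight decay acts
p_adam.grad = torch.zeros_like(p_adam)
p_adamw.grad = torch.zeros_like(p_adamw)
adam.step()
adamw.step()

adam_value = p_adam.item()    # decay goes through the adaptive update: moves by about lr
adamw_value = p_adamw.item()  # decay applied directly: 2.0 * (1 - lr * wd)
```

In Adam the decay term is normalized like any other gradient, so its size barely depends on weight_decay; in AdamW the shrinkage is proportional to lr * weight_decay.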

RMSprop

RMSprop is an adaptive learning rate optimizer particularly suited for handling non-stationary objectives and recurrent neural networks.


# RMSprop optimizer
# Scales each update by dividing the gradient by the square root of an exponentially weighted average of squared gradients

optimizer = optim.RMSprop(
    params=model.parameters(),
    lr=0.01,  # Learning rate
    alpha=0.99,  # Exponential decay rate for squared gradients
    eps=1e-8,  # Numerical stability term
    weight_decay=0,  # L2 regularization
    momentum=0,  # Momentum factor
    centered=False  # If True, normalize by an estimate of the gradient variance instead
)
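As a rough illustration of the update, the following sketch applies the basic RMSprop rule by hand (momentum=0, centered=False, no weight decay) and compares it with optim.RMSprop. The toy quadratic loss and the names square_avg and manual_p are assumptions.

```python
import torch
import torch.optim as optim

lr, alpha, eps = 0.01, 0.99, 1e-8
p = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.RMSprop([p], lr=lr, alpha=alpha, eps=eps)

manual_p = p.detach().clone()
square_avg = torch.zeros_like(manual_p)  # running average of squared gradients

for _ in range(3):
    optimizer.zero_grad()
    (p ** 2).sum().backward()
    optimizer.step()

    grad = 2 * manual_p                  # analytic gradient of sum(p^2)
    square_avg = alpha * square_avg + (1 - alpha) * grad ** 2
    manual_p = manual_p - lr * grad / (square_avg.sqrt() + eps)
```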

Adagrad

Adagrad is ideal for sparse data, automatically adjusting the learning rate for each parameter.


# Adagrad optimizer
# Suitable for optimizing sparse data, applying smaller learning rates to frequently updated parameters

optimizer = optim.Adagrad(
    params=model.parameters(),
    lr=0.01,  # Learning rate
    lr_decay=0,  # No learning rate decay
    weight_decay=0,  # L2 regularization
    initial_accumulator_value=0  # Initial accumulation value
)
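To see why Adagrad suits sparse data, consider a sketch with a hand-made gradient stream in which one element receives a gradient at every step and the other only occasionally. The gradient values are invented for illustration, and the manual update assumes lr_decay=0 and no weight decay.

```python
import torch
import torch.optim as optim

lr, eps = 0.1, 1e-10
p = torch.zeros(2, requires_grad=True)
optimizer = optim.Adagrad([p], lr=lr, eps=eps)

# Hypothetical gradients: element 0 is updated every step, element 1 only twice
grads = [torch.tensor([1.0, 1.0]), torch.tensor([1.0, 0.0]),
         torch.tensor([1.0, 0.0]), torch.tensor([1.0, 1.0])]

manual_p = torch.zeros(2)
state_sum = torch.zeros(2)  # accumulated squared gradients per element
for g in grads:
    p.grad = g.clone()      # stand-in for loss.backward()
    optimizer.step()

    state_sum = state_sum + g ** 2
    last_step = lr * g / (state_sum.sqrt() + eps)
    manual_p = manual_p - last_step
```

After the final step, the rarely updated element still takes a larger step than the frequently updated one, because its accumulator has grown less.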

Advanced Optimizer Techniques

Learning Rate Scheduling

Learning rate scheduling allows dynamic adjustment of the learning rate during training, often significantly improving model convergence.


from torch.optim.lr_scheduler import (
    StepLR,  # Step decay
    MultiStepLR,  # Multi-milestone decay
    ExponentialLR,  # Exponential decay
    CosineAnnealingLR,  # Cosine annealing
    ReduceLROnPlateau,  # Automatic adjustment based on metrics
)

# Method 1: StepLR - Decays once every 30 epochs
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Method 2: MultiStepLR - Decays at specified epochs
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)

# Method 3: CosineAnnealingLR - Cosine curve annealing
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

# Method 4: ReduceLROnPlateau - Automatic adjustment based on metrics
optimizer = optim.Adam(model.parameters(), lr=0.001)
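scheduler = ReduceLROnPlateau(
    optimizer, mode='min',  # Monitor a metric that should decrease, such as loss
    factor=0.5,  # Multiply the learning rate by this factor on a plateau
    patience=5,  # Number of epochs without improvement to wait
)
# The verbose argument is deprecated in recent PyTorch releases;
# read optimizer.param_groups[0]['lr'] to inspect the current learning rate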

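# Training loop (use the step() call that matches the scheduler you created)
for epoch in range(100):
    train_loss = train(...)
    val_loss = validate(...)

    # ReduceLROnPlateau requires the monitored metric
    scheduler.step(val_loss)

    # StepLR, MultiStepLR, CosineAnnealingLR and similar schedulers take no argument:
    # scheduler.step()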

Learning rate schedulers wrap an optimizer and change its learning rate in place. Call the scheduler's step() after optimizer.step(), once per epoch in the examples above; calling it first makes PyTorch skip the first value of the schedule.
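A short sketch can show what a schedule actually does to the learning rate. It uses StepLR with a deliberately small step_size and a single dummy parameter whose gradient is set by hand; these are assumptions chosen so the decay is visible within a few epochs.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

p = torch.zeros(1, requires_grad=True)
optimizer = optim.SGD([p], lr=0.1)
scheduler = StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(5):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate in effect this epoch
    p.grad = torch.ones_like(p)             # stand-in for loss.backward()
    optimizer.step()
    scheduler.step()                        # called after optimizer.step()

# lrs is approximately [0.1, 0.1, 0.01, 0.01, 0.001]
```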

Parameter Group Optimization

Parameter group optimization allows setting different learning rates for different layers, especially useful in transfer learning.


# Parameter group optimization example
# Set different learning rates for different layers
# Typically: Base network uses smaller learning rate, classification head uses larger learning rate
# (model.base and model.classifier are placeholder attribute names)

optimizer = optim.SGD([
    {'params': model.base.parameters(), 'lr': 1e-3},  # Base layers: smaller learning rate
    {'params': model.classifier.parameters(), 'lr': 1e-2}  # Classification layers: larger learning rate
], lr=1e-4)  # Default learning rate, used by any group that does not set its own 'lr'

# More common writing style in practical applications
optimizer = optim.Adam([
    {'params': model.fc.parameters(), 'lr': 1e-3},  # Classification head
    {'params': [p for n, p in model.named_parameters() if not n.startswith('fc')], 
     'lr': 1e-5},  # Base network
])
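Parameter groups can also be added after the optimizer is created, which is one way to unfreeze a backbone partway through fine-tuning. The sketch below assumes a hypothetical two-part model (the names base and head are illustrative) and inspects optimizer.param_groups afterwards.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical two-part model: "base" and "head" are illustrative names
base = nn.Linear(8, 4)
head = nn.Linear(4, 2)

# Start by training only the head
optimizer = optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)

# Later, add the base with its own, smaller learning rate
optimizer.add_param_group({'params': list(base.parameters()), 'lr': 1e-4})

group_lrs = [g['lr'] for g in optimizer.param_groups]            # [0.01, 0.0001]
group_momenta = [g['momentum'] for g in optimizer.param_groups]  # [0.9, 0.9]
```

Options that the new group does not set, such as momentum here, are inherited from the optimizer's defaults.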

Gradient Clipping

Gradient clipping can prevent gradient explosion and improve training stability, especially useful in deep networks like RNNs and LSTMs.


import torch.nn as nn

# Gradient clipping example
# max_norm: Maximum allowed total gradient norm; if the combined norm of all gradients exceeds it, every gradient is scaled down by the same factor
# norm_type: Norm type, default is 2 (Euclidean norm)

nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Usage location in the training loop
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()

    # Clip gradients after loss.backward() but before optimizer.step()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

Gradient clipping is a common technique for training deep neural networks (especially RNNs and LSTMs), effectively preventing training crashes caused by gradient explosions.
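To confirm what clipping does, the sketch below produces deliberately large gradients on a small nn.Linear model and measures the total gradient norm before and after clip_grad_norm_. The model size and the scaling factor of 100 are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# Scale the output up so the gradients are far larger than max_norm
loss = (model(torch.randn(4, 10)) * 100).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the total norm from the (now clipped) gradients
norm_after = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters()]), 2
)
```

The returned pre-clipping norm is also a convenient value to log when diagnosing unstable training.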

Gradient Accumulation

When GPU memory is insufficient, gradient accumulation can simulate the effect of a large batch size.


# Gradient accumulation example
# Effective batch size = batch_size * accumulation_steps

accumulation_steps = 4  # Accumulate 4 small batches

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss = loss / accumulation_steps  # Average the loss across accumulated steps

    loss.backward()  # Gradients add up across iterations until zero_grad() is called

    # Update parameters only after accumulating a specified number of steps
    if (i + 1) % accumulation_steps == 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()

# Handle remaining gradients when the number of batches is not a multiple of accumulation_steps
if (i + 1) % accumulation_steps != 0:
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()

Gradient accumulation trades extra forward and backward passes for memory: dividing each loss by accumulation_steps makes the summed gradients match those of one large batch with a mean-reduced loss.
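The equivalence with a large batch can be checked numerically. This sketch compares the gradient of one batch of 8 samples with the gradient accumulated over 4 micro-batches of 2; the model, the random data, and the use of nn.MSELoss with mean reduction are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
criterion = nn.MSELoss()  # mean reduction
inputs = torch.randn(8, 5)
targets = torch.randn(8, 1)
accumulation_steps = 4

# Gradient from one full batch of 8
model.zero_grad()
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over 4 micro-batches of 2
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    (criterion(model(x), y) / accumulation_steps).backward()
accumulated_grad = model.weight.grad.clone()
```

The match holds for layers that treat samples independently; layers that compute batch statistics, such as batch normalization, still see the smaller micro-batches.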

Complete Training Example

The following is a complete training process demonstrating the best practices for using optimizers.


# Complete training example
# Includes all key steps: creating an optimizer, clearing gradients, performing backpropagation, and updating parameters
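A minimal sketch of such a training loop is given here, under stated assumptions: the data is synthetic, the two-layer model is a placeholder, and the choice of AdamW, StepLR, and gradient clipping is one reasonable combination of the techniques covered above rather than a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 256 samples, 20 features, 3 classes (all assumed)
features = torch.randn(256, 20)
labels = torch.randint(0, 3, (256,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.01)
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)

epoch_losses = []
for epoch in range(10):
    model.train()
    running_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()                     # 1. clear old gradients
        loss = criterion(model(inputs), targets)  # 2. forward pass
        loss.backward()                           # 3. backpropagation
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                          # 4. update parameters
        running_loss += loss.item() * inputs.size(0)
    scheduler.step()                              # once per epoch, after optimizer.step()
    epoch_losses.append(running_loss / len(features))
```

In a real project, the synthetic tensors would be replaced by a dataset, and a validation pass plus checkpointing would normally follow each epoch.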
