
# PyTorch LR Scheduler

Learning rate (LR) is one of the most important hyperparameters in neural network training. If the learning rate is too large, training oscillates or even diverges; if it is too small, convergence is extremely slow and training can stall in poor local optima.

A **Learning Rate Scheduler (LR Scheduler)** dynamically adjusts the learning rate during training, balancing fast convergence in the early stages with fine-tuning in the later stages.

* * *

## 1. Basic Concepts and Usage Patterns

### Standard Usage Flow

## Example

    import torch
    import torch.nn as nn
    import torch.optim as optim

    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    # 1. Create the scheduler and pass in the optimizer
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(100):
        # 2. Train for one epoch (train() is a placeholder for your own training function)
        train(model, optimizer)

        # 3. Call scheduler.step() after each epoch
        scheduler.step()

        # 4. Check the current learning rate (get_last_lr() returns one value per parameter group)
        current_lr = scheduler.get_last_lr()[0]
        print(f"Epoch {epoch+1}, LR: {current_lr:.6f}")

### When to Call step()

## Example

    # Correct: optimizer.step() first, then scheduler.step()
    optimizer.step()
    scheduler.step()

    # Wrong: scheduler.step() before optimizer.step()
    # PyTorch 1.1.0+ emits a warning, and the first value of the schedule is skipped
    scheduler.step()
    optimizer.step()

> Note: `optimizer.step()` must be called before `scheduler.step()`.

### Viewing and Saving Learning Rate State

## Example

    # Get the current learning rate
    current_lr = optimizer.param_groups[0]['lr']  # Read from the first parameter group
    current_lr = scheduler.get_last_lr()[0]       # LR after the last step()

    # Save a checkpoint (the scheduler state must be saved as well)
    torch.save({
        'epoch': epoch,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),  # Don't forget this
    }, 'checkpoint.pth')

    # Resume from the checkpoint
    ckpt = torch.load('checkpoint.pth')
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])

* * *

## 2. Fixed Decay Schedulers

### 2.1 StepLR Step Decay

Every `step_size` epochs, the learning rate is multiplied by `gamma`. This is one of the simplest and most commonly used schedulers.

**Formula:** `lr = lr_base * gamma^floor(epoch / step_size)`

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.1)

    scheduler = optim.lr_scheduler.StepLR(
        optimizer,
        step_size=30,  # Decay every 30 epochs
        gamma=0.1      # Multiply by 0.1 each time (i.e., shrink to 1/10 of the previous value)
    )

    # LR changes:
    # Epoch 0-29:  0.1
    # Epoch 30-59: 0.01
    # Epoch 60-89: 0.001
    # Epoch 90+:   0.0001

> **Applicable scenarios:** Tasks with a fixed training rhythm and clear stages, such as ResNet training on ImageNet (90 epochs, decay at the 30th and 60th epoch).

* * *

### 2.2 MultiStepLR Multi-Milestone Decay

Decays the learning rate at the specified epochs (milestones), which is more flexible than StepLR.

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.1)

    scheduler = optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[30, 60, 80],  # Decay at epochs 30, 60, and 80
        gamma=0.1
    )

    # LR changes:
    # Epoch 0-29:  0.1
    # Epoch 30-59: 0.01
    # Epoch 60-79: 0.001
    # Epoch 80+:   0.0001

> **Applicable scenarios:** When you already know at which epochs the model needs a smaller learning rate, such as the late fine-convergence stage of classification models.

* * *

### 2.3 ExponentialLR Exponential Decay

Decays the learning rate every epoch, so it decreases continuously along an exponential curve and the decay is smoother than step decay.

**Formula:** `lr = lr_base * gamma^epoch`

## Example

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    scheduler = optim.lr_scheduler.ExponentialLR(
        optimizer,
        gamma=0.95  # Multiply by 0.95 each epoch
    )

    # LR changes (first 5 epochs):
    # Epoch 0: 0.0100
    # Epoch 1: 0.0095
    # Epoch 2: 0.0090
    # Epoch 3: 0.0086
    # Epoch 4: 0.0081

> Setting `gamma` too small (e.g., 0.5) makes the learning rate approach zero within a few epochs; it is typically set between 0.9 and 0.99.

* * *

## 3. Adaptive Schedulers

### 3.1 ReduceLROnPlateau Monitor Metric Decay

**One of the most intelligent schedulers**: it monitors a metric (e.g., validation loss) and automatically reduces the learning rate when the metric stops improving. There is no need to know in advance at which epoch to decay.

## Example

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',      # 'min': smaller is better (loss); 'max': larger is better (accuracy)
        factor=0.1,      # When triggered, lr = lr × factor
        patience=10,     # Number of epochs without improvement to tolerate before decaying
        threshold=1e-4,  # Improvements smaller than this (relative by default) count as no improvement
        min_lr=1e-6,     # Lower bound of the learning rate
        # verbose=True is deprecated in recent PyTorch releases;
        # read optimizer.param_groups[0]['lr'] to log decay events instead
    )

    for epoch in range(100):
        train_loss = train(model, optimizer)  # placeholder training function
        val_loss = evaluate(model)            # placeholder evaluation function

        # Unlike other schedulers, the monitored metric is passed to step()
        scheduler.step(val_loss)

**Monitoring Accuracy:**

## Example

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='max',  # For accuracy, larger is better
        factor=0.5,
        patience=5,
    )

    scheduler.step(val_accuracy)

> **Applicable scenarios:** A solid default for many tasks, especially when the total number of epochs is uncertain or training is unstable.

* * *

### 3.2 CosineAnnealingLR Cosine Annealing

The learning rate decreases smoothly from its initial value to a minimum (`eta_min`) following a **cosine curve**, avoiding the abrupt changes of step decay.

**Formula:** `lr_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max))`, where `eta_max` is the initial learning rate.

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.1)

    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=100,    # Half-period length (usually set to the total epoch count)
        eta_min=1e-6  # Minimum learning rate (default 0)
    )

    # LR trajectory (illustration with T_max=10, eta_min=0, sampled every 2 epochs):
    # 0.1000 -> 0.0905 -> 0.0655 -> 0.0345 -> 0.0095 -> 0.0000
    # (Smooth descent along a cosine curve)

> **Applicable scenarios:** Training with a fixed number of epochs. Widely used in Vision Transformer, ResNet, and other papers, with good convergence quality.

* * *

### 3.3 CosineAnnealingWarmRestarts

An upgraded version of cosine annealing that supports **periodic restarts** (warm restarts): when a period ends, the learning rate resets to its initial value and a new round of cosine decay begins. This can help the model escape local optima.

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.1)

    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer,
        T_0=10,    # Length of the first period (number of epochs)
        T_mult=2,  # Period length multiplier after each restart (1 = equal length, 2 = doubling)
        eta_min=1e-6
    )

    # Period lengths when T_mult=2: 10 -> 20 -> 40 -> 80 ...
    # LR changes (T_0=10, T_mult=1):
    # 0.1 -> ... -> eta_min -> 0.1 -> ... -> eta_min -> 0.1 (restart every 10 epochs)

> **Applicable scenarios:** Training large models, or when you want to pick the best of several convergence points. Works well with Snapshot Ensembles.

* * *

## 4. Warmup Schedulers

**Warmup** is the phase at the start of training in which, over a number of steps, the learning rate is gradually increased from a very small value to its target value.

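
To make the ramp concrete, here is a minimal plain-Python sketch of a linear warmup, written independently of PyTorch. The function name `linear_warmup_lr` and its parameters are hypothetical, chosen only for illustration, and the rule it applies (a straight line from `base_lr * start_factor` up to `base_lr`) is an assumption about what "gradually increases" means here.

```python
def linear_warmup_lr(step, base_lr=0.01, start_factor=0.1, warmup_steps=5):
    """Return the learning rate at `step` under a linear warmup (illustrative helper)."""
    if step >= warmup_steps:
        return base_lr  # Warmup finished: hold the target value
    # Interpolate the multiplier linearly from start_factor up to 1.0
    factor = start_factor + (1.0 - start_factor) * step / warmup_steps
    return base_lr * factor


for step in range(7):
    print(f"step {step}: lr = {linear_warmup_lr(step):.4f}")
```

With these defaults the learning rate runs 0.0010, 0.0028, 0.0046, 0.0064, 0.0082 and then holds at 0.0100.
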
Large-batch training and Transformer models usually need warmup; otherwise the early gradient updates are too aggressive and the model fails to stabilize.

### 4.1 LinearLR Linear Scheduling

Changes the learning rate linearly over a specified number of epochs (usable for linear warmup or linear decay).

## Example

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    # Linear warmup: over the first 5 epochs, lr grows linearly from 0.01 × 0.1 = 0.001 to 0.01
    warmup_scheduler = optim.lr_scheduler.LinearLR(
        optimizer,
        start_factor=0.1,  # Initial lr = base_lr × start_factor
        end_factor=1.0,    # Final lr = base_lr × end_factor
        total_iters=5      # Complete over 5 epochs
    )

    # LR changes: 0.0010 -> 0.0028 -> 0.0046 -> 0.0064 -> 0.0082 -> 0.0100

* * *

### 4.2 ConstantLR Constant Phase

For a specified number of epochs, holds the learning rate at `base_lr × factor`, then restores the original value.

## Example

    # Use base_lr × 0.5 for the first 5 epochs, then return to normal
    scheduler = optim.lr_scheduler.ConstantLR(
        optimizer,
        factor=0.5,
        total_iters=5
    )

* * *

### 4.3 SequentialLR Combined Scheduling

Chains multiple schedulers **in sequence**. This is the standard way to implement "warmup + decay" strategies.

## Example

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    # Phase 1: Warmup (first 5 epochs, LR grows linearly from 0.001 to 0.01)
    warmup = optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, end_factor=1.0, total_iters=5
    )

    # Phase 2: Cosine annealing (remaining 95 epochs)
    cosine = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=95, eta_min=1e-6
    )

    # Combine: run warmup first, switch to cosine after epoch 5
    scheduler = optim.lr_scheduler.SequentialLR(
        optimizer,
        schedulers=[warmup, cosine],
        milestones=[5]  # Switch at epoch 5
    )

    # Usage is identical to ordinary schedulers
    for epoch in range(100):
        train(...)
        scheduler.step()

> **Applicable scenarios:** Warmup followed by decay is standard for Transformer training. Pre-training recipes for BERT, GPT, and ViT all start with a warmup phase and then decay the learning rate (cosine or linear, depending on the recipe).

* * *

## 5. Cyclic Schedulers

### 5.1 CyclicLR Cyclic Learning Rate

The learning rate **cycles periodically** between `base_lr` and `max_lr`, which can help the model explore a wider region of parameter space and escape saddle points.

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.01)

    scheduler = optim.lr_scheduler.CyclicLR(
        optimizer,
        base_lr=0.001,        # Lower bound of the learning rate
        max_lr=0.01,          # Upper bound of the learning rate
        step_size_up=2000,    # Steps to rise from base_lr to max_lr
        step_size_down=2000,  # Steps to fall from max_lr to base_lr (defaults to step_size_up)
        mode='triangular',    # Triangular cycle (constant amplitude)
        # mode='triangular2'  # Amplitude halves each cycle
        # mode='exp_range'    # Amplitude scaled by gamma**iteration (set gamma < 1 to decay)
    )

    # CyclicLR is stepped per batch, not per epoch
    # (num_epochs, train_loader, and criterion are assumed to be defined elsewhere)
    for epoch in range(num_epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()  # Call after each batch

**Three Modes Comparison:**

| mode | Amplitude Change | Characteristics |
| --- | --- | --- |
| `triangular` | Unchanged | Stable exploration, good for the early stage |
| `triangular2` | Halves each cycle | Explore first, then converge |
| `exp_range` | Exponential decay (with `gamma` < 1) | Eventually stable convergence |

* * *

### 5.2 OneCycleLR Single Cycle Policy

**Often one of the strongest schedulers in practice.** It implements the 1cycle policy proposed by Leslie N. Smith and popularized by fastai.
The entire training run consists of a single cycle: the learning rate first rises and then falls, while momentum moves in the opposite direction.

Training can be considerably faster: in favorable cases it is reported to reach the same accuracy in only **1/5~1/10** of the epochs of a traditional schedule.

## Example

    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    steps_per_epoch = len(train_loader)

    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.1,                       # Peak learning rate
        steps_per_epoch=steps_per_epoch,  # Steps per epoch
        epochs=10,                        # Total epochs
        pct_start=0.3,                    # First 30% of steps used for the warmup rise
        anneal_strategy='cos',            # Decay strategy ('cos' or 'linear')
        div_factor=25,                    # Initial lr = max_lr / div_factor
        final_div_factor=1e4              # Final lr = initial lr / final_div_factor
    )

    # Initial lr = 0.1 / 25 = 0.004
    # Peak lr    = 0.1 (at the 30% point)
    # Final lr   = 0.004 / 10000 = 0.0000004

    # Also stepped per batch (same loop as in the CyclicLR example)
    for epoch in range(10):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
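
The shape of the single cycle can be sketched in plain Python without PyTorch. The helper names `cosine_interp` and `one_cycle_lr` below are hypothetical, and the step bookkeeping is simplified, so treat this as an approximation of the two-phase cosine policy rather than PyTorch's exact implementation. It assumes PyTorch's documented rules `initial_lr = max_lr / div_factor` and `min_lr = initial_lr / final_div_factor`.

```python
import math


def cosine_interp(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate two-phase 1cycle learning rate at `step` (illustrative helper)."""
    initial_lr = max_lr / div_factor        # 0.1 / 25 = 0.004
    min_lr = initial_lr / final_div_factor  # 0.004 / 1e4 = 4e-7
    peak_step = pct_start * total_steps     # Step at which max_lr is reached
    if step <= peak_step:
        # Phase 1: rise from initial_lr to max_lr
        return cosine_interp(initial_lr, max_lr, step / peak_step)
    # Phase 2: fall from max_lr to min_lr
    pct = (step - peak_step) / (total_steps - peak_step)
    return cosine_interp(max_lr, min_lr, pct)


total_steps = 1000
for step in (0, 150, 300, 650, 1000):
    print(f"step {step:4d}: lr = {one_cycle_lr(step, total_steps):.7f}")
```

With `total_steps=1000`, the sketch starts at 0.004, peaks at 0.1 at step 300, and ends at 4e-7 at the last step.
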