# PyTorch LR Scheduler
Learning rate (LR) is one of the most important hyperparameters in neural network training. If the learning rate is too large, training oscillates or even diverges; if it is too small, convergence is extremely slow and training can get stuck in poor local optima.

A **learning rate scheduler (LR scheduler)** adjusts the learning rate dynamically during training, balancing fast convergence in the early stages with fine-tuning in the later stages.

* * *

## 1. Basic Concepts and Usage Patterns

### Standard Usage Flow

Example:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    # 1. Create the scheduler and pass in the optimizer
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(100):
        # 2. Train for one epoch (train() is a placeholder that calls optimizer.step())
        train(model, optimizer)

        # 3. Call scheduler.step() after each epoch
        scheduler.step()

        # 4. Check the current learning rate
        #    get_last_lr() returns a list with one value per parameter group
        current_lr = scheduler.get_last_lr()[0]
        print(f"Epoch {epoch+1}, LR: {current_lr:.6f}")

### When to Call step()

Example:

    # Correct: optimizer.step() first, then scheduler.step()
    optimizer.step()
    scheduler.step()

    # Wrong: scheduler.step() before optimizer.step()
    # PyTorch 1.1.0+ emits a warning, and the first value of the schedule is skipped
    scheduler.step()
    optimizer.step()

> Note: `optimizer.step()` must be called before `scheduler.step()`.

### Viewing and Saving Learning Rate State

Example:

    # Get the current learning rate
    current_lr = optimizer.param_groups[0]['lr']  # param_groups is a list of dicts
    current_lr = scheduler.get_last_lr()[0]       # LR after the last step()

    # Save a checkpoint (save the scheduler state along with everything else)
    torch.save({
        'epoch': epoch,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),  # Don't forget this
    }, 'checkpoint.pth')

    # Resume from the checkpoint
    ckpt = torch.load('checkpoint.pth')
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])

* * *

## 2. Fixed Decay Schedulers

### 2.1 StepLR Step Decay

Every `step_size` epochs, the learning rate is multiplied by `gamma`. This is one of the simplest and most commonly used schedulers.

**Formula:** `lr = lr_base * gamma^floor(epoch / step_size)`

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = optim.lr_scheduler.StepLR(
        optimizer,
        step_size=30,  # Decay every 30 epochs
        gamma=0.1      # Multiply by 0.1 each time (i.e., shrink to 1/10 of the previous value)
    )

    # LR changes:
    # Epoch 0-29:  0.1
    # Epoch 30-59: 0.01
    # Epoch 60-89: 0.001
    # Epoch 90+:   0.0001

> **Applicable scenarios:** Tasks with a fixed training rhythm and clear stages, such as ResNet training on ImageNet (90 epochs, decay at the 30th and 60th epochs).

* * *

### 2.2 MultiStepLR Multi-Milestone Decay

Decays the learning rate at specified epochs (milestones), which makes it more flexible than StepLR.

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[30, 60, 80],  # Decay at epochs 30, 60, and 80
        gamma=0.1
    )

    # LR changes:
    # Epoch 0-29:  0.1
    # Epoch 30-59: 0.01
    # Epoch 60-79: 0.001
    # Epoch 80+:   0.0001

> **Applicable scenarios:** When you already know at which epochs the model needs a smaller learning rate, such as the late fine-convergence stage of classification models.

* * *

### 2.3 ExponentialLR Exponential Decay

Decays the learning rate every epoch. The learning rate decreases continuously in exponential form, giving a smoother decay than step schedules.

**Formula:** `lr = lr_base * gamma^epoch`

Example:

    optimizer = optim.Adam(model.parameters(), lr=0.01)
    scheduler = optim.lr_scheduler.ExponentialLR(
        optimizer,
        gamma=0.95  # Multiply by 0.95 each epoch
    )

    # LR changes (first 5 epochs):
    # Epoch 1: 0.0100
    # Epoch 2: 0.0095
    # Epoch 3: 0.0090
    # Epoch 4: 0.0086
    # Epoch 5: 0.0081

> Setting `gamma` too small (e.g., 0.5) drives the learning rate toward zero within a few epochs; values between 0.9 and 0.99 are typical.

* * *

## 3. Adaptive Schedulers

### 3.1 ReduceLROnPlateau Monitor Metric Decay

**One of the most adaptive schedulers**: it monitors a metric (e.g., validation loss) and automatically reduces the learning rate when the metric stops improving. You do not need to know in advance at which epoch to decay.

Example:

    optimizer = optim.Adam(model.parameters(), lr=0.01)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',      # 'min': smaller is better (loss); 'max': larger is better (accuracy)
        factor=0.1,      # When triggered, lr = lr × factor
        patience=10,     # Number of epochs without improvement to tolerate before decaying
        threshold=1e-4,  # Minimum change that counts as an improvement (relative by default)
        min_lr=1e-6,     # Lower bound of the learning rate
        # verbose=True is deprecated since PyTorch 2.2; read the LR with get_last_lr() instead
    )

    for epoch in range(100):
        train_loss = train(model, optimizer)  # placeholder training function
        val_loss = evaluate(model)            # placeholder validation function

        # Unlike other schedulers, pass in the monitored metric here
        scheduler.step(val_loss)

**Monitoring Accuracy:**

Example:

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='max',  # Higher accuracy is better
        factor=0.5,
        patience=5,
    )

    scheduler.step(val_accuracy)

> **Applicable scenarios:** A reasonable default for many tasks, especially when the total number of epochs is uncertain or training is unstable.

* * *

### 3.2 CosineAnnealingLR Cosine Annealing

The learning rate decreases smoothly from its initial value to a minimum (`eta_min`) following a **cosine curve**, avoiding the abrupt changes of step decay.

**Formula:** `lr_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(t * pi / T_max))`, where `eta_max` is the initial learning rate.

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=100,    # Half-period length (usually set to the total number of epochs)
        eta_min=1e-6  # Minimum learning rate (default 0)
    )

    # LR trajectory (illustration with T_max=10, sampled every 2 epochs):
    # 0.100 -> 0.090 -> 0.065 -> 0.035 -> 0.010 -> 0.000001
    # (Smooth descent along a cosine curve)

> **Applicable scenarios:** Training with a fixed number of epochs. Widely used in papers on Vision Transformers, ResNets, and similar models, and generally converges well.

* * *

### 3.3 CosineAnnealingWarmRestarts

An extension of cosine annealing that supports **periodic restarts** (warm restarts): when each period ends, the learning rate resets to its initial value and a new round of cosine decay begins. The restarts can help the model escape poor local optima.

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer,
        T_0=10,    # Length of the first period (number of epochs)
        T_mult=2,  # Multiplier for the period length after each restart (1 = equal length, 2 = doubling)
        eta_min=1e-6
    )

    # Period lengths when T_mult=2: 10 -> 20 -> 40 -> 80 ...
    # LR changes (T_0=10, T_mult=1):
    # 0.1 -> ... -> eta_min -> 0.1 -> ... -> eta_min -> 0.1 (restart every 10 epochs)

> **Applicable scenarios:** Training large models, or when you want to pick the best of several convergence points. Works well with Snapshot Ensemble.

* * *

## 4. Warmup Schedulers

**Warmup** means that for the first several steps of training, the learning rate increases gradually from a very small value to the target value.
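
The warmup idea can be sketched in plain Python, independent of any PyTorch class. This is only an illustration that assumes a linear ramp; the function name `warmup_lr` and its parameters `warmup_steps`, `target_lr`, and `start_factor` are hypothetical names chosen for this example.

```python
def warmup_lr(step, warmup_steps, target_lr, start_factor=0.1):
    """Linearly ramp the LR from start_factor * target_lr up to target_lr.

    Illustrative helper, not part of PyTorch.
    """
    if step >= warmup_steps:
        # Warmup is over: stay at the target learning rate
        return target_lr
    # Interpolate the multiplier from start_factor to 1.0
    factor = start_factor + (1.0 - start_factor) * step / warmup_steps
    return target_lr * factor


schedule = [warmup_lr(s, warmup_steps=5, target_lr=0.01) for s in range(8)]
print([round(lr, 4) for lr in schedule])
# [0.001, 0.0028, 0.0046, 0.0064, 0.0082, 0.01, 0.01, 0.01]
```

The ramp starts at one tenth of the target value and reaches the target after `warmup_steps` steps; whatever decay schedule follows would take over from that point.
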
Large-batch training and Transformer models usually need warmup; otherwise the early gradient updates are too aggressive and training fails to stabilize.

### 4.1 LinearLR Linear Scheduling

Changes the learning rate linearly over a specified number of epochs (can be used for linear warmup or linear decay).

Example:

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    # Linear warmup: over the first 5 epochs, lr grows linearly from 0.01 × 0.1 = 0.001 to 0.01
    warmup_scheduler = optim.lr_scheduler.LinearLR(
        optimizer,
        start_factor=0.1,  # Initial lr = base_lr × start_factor
        end_factor=1.0,    # Final lr = base_lr × end_factor
        total_iters=5      # Complete over 5 epochs
    )

    # LR changes: 0.001 -> 0.0028 -> 0.0046 -> 0.0064 -> 0.0082 -> 0.01

* * *

### 4.2 ConstantLR Constant Phase

For a specified number of epochs, fixes the learning rate at `base_lr × factor`, then restores the original value.

Example:

    # Use base_lr × 0.5 for the first 5 epochs, then return to normal
    scheduler = optim.lr_scheduler.ConstantLR(
        optimizer,
        factor=0.5,
        total_iters=5
    )

* * *

### 4.3 SequentialLR Combined Scheduling

Chains multiple schedulers **in sequence**. This is the standard way to implement combined "warmup + decay" strategies.

Example:

    optimizer = optim.Adam(model.parameters(), lr=0.01)

    # Phase 1: Warmup (first 5 epochs, LR grows linearly from 0.001 to 0.01)
    warmup = optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, end_factor=1.0, total_iters=5
    )

    # Phase 2: Cosine annealing (remaining 95 epochs)
    cosine = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=95, eta_min=1e-6
    )

    # Combine: run warmup first, switch to cosine after epoch 5
    scheduler = optim.lr_scheduler.SequentialLR(
        optimizer,
        schedulers=[warmup, cosine],
        milestones=[5]  # Switch at epoch 5
    )

    # Usage is identical to ordinary schedulers
    for epoch in range(100):
        train(...)
        scheduler.step()

> **Applicable scenarios:** The warmup-then-decay pattern is standard for Transformer training. BERT, GPT, and ViT pre-training all use warmup followed by a decay schedule, and cosine annealing is a common choice for the decay phase.

* * *

## 5. Cyclic Schedulers

### 5.1 CyclicLR Cyclic Learning Rate

The learning rate **cycles periodically** between `base_lr` and `max_lr`, which can help the model explore a wider region of parameter space and escape saddle points.

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.01)
    scheduler = optim.lr_scheduler.CyclicLR(
        optimizer,
        base_lr=0.001,        # Lower bound of the learning rate
        max_lr=0.01,          # Upper bound of the learning rate
        step_size_up=2000,    # Steps to rise from base_lr to max_lr
        step_size_down=2000,  # Steps to fall from max_lr to base_lr (defaults to step_size_up)
        mode='triangular',    # Triangular cycle (constant amplitude)
        # mode='triangular2'  # Amplitude halves each cycle
        # mode='exp_range'    # Amplitude decays exponentially
    )

    # CyclicLR is stepped once per batch, not once per epoch
    # (num_epochs, train_loader, and criterion are defined elsewhere)
    for epoch in range(num_epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()  # Call after each batch

**Three Modes Comparison:**

| mode | Amplitude Change | Characteristics |
| --- | --- | --- |
| `triangular` | Unchanged | Stable exploration, good for the early stage |
| `triangular2` | Halves each cycle | Explore first, then converge |
| `exp_range` | Exponential decay | Eventually stable convergence |

* * *

### 5.2 OneCycleLR Single Cycle Policy

**One of the best-performing schedulers** in practice. It implements the 1cycle policy proposed by Leslie Smith and popularized by fastai.
The entire training run consists of a single cycle: the learning rate first rises and then falls, while momentum changes inversely.

Training can be much faster: in favorable cases it reaches the same accuracy in roughly **1/5 to 1/10** of the epochs used by traditional schedules.

Example:

    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    steps_per_epoch = len(train_loader)

    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.1,                       # Peak learning rate
        steps_per_epoch=steps_per_epoch,  # Steps per epoch
        epochs=10,                        # Total epochs
        pct_start=0.3,                    # First 30% of steps are the warmup rise
        anneal_strategy='cos',            # Decay strategy ('cos' or 'linear')
        div_factor=25,                    # Initial lr = max_lr / div_factor
        final_div_factor=1e4              # Final lr = initial lr / final_div_factor
    )

    # Initial lr = 0.1 / 25 = 0.004
    # Peak lr    = 0.1 (at the 30% point)
    # Final lr   = 0.004 / 10000 = 0.0000004

    # Also stepped once per batch
    for epoch in range(10):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(mod
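
The three characteristic learning rates of the 1cycle policy can be checked with a small standalone sketch. It assumes the two-phase cosine variant and follows the definitions in the PyTorch documentation (`initial_lr = max_lr / div_factor`, `min_lr = initial_lr / final_div_factor`). The helper `one_cycle_lr` and the value `total_steps = 1000` are hypothetical and chosen only for illustration; this is not PyTorch's own implementation.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Two-phase cosine 1cycle schedule (illustrative sketch)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = round(pct_start * total_steps) - 1  # last step of the rising phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1)
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= peak_step:
        # Phase 1: rise from initial_lr to max_lr
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    # Phase 2: fall from max_lr to min_lr
    return cos_anneal(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))


total_steps = 1000  # hypothetical: 10 epochs x 100 batches per epoch
lrs = [one_cycle_lr(s, total_steps, max_lr=0.1) for s in range(total_steps)]

print(f"initial: {lrs[0]:.6g}")
print(f"peak:    {max(lrs):.6g} at step {lrs.index(max(lrs))}")
print(f"final:   {lrs[-1]:.6g}")
```

With `max_lr=0.1`, `div_factor=25`, and `final_div_factor=1e4`, the schedule starts at 0.004, peaks at 0.1 at the end of the first 30% of steps, and ends at 4e-7. Note that the final value is derived from the initial learning rate, not from `max_lr`.
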