PyTorch Mixed Precision Training (AMP) | Runoob Tutorial


Mixed precision training is one of the most important performance optimization techniques in deep learning. By simultaneously using FP32 (single precision) and FP16 (half precision) floating-point numbers for computation, it can significantly improve training speed and reduce memory usage with almost no loss in model accuracy. This section provides a detailed introduction to Automatic Mixed Precision (AMP) technology in PyTorch.

Applicable Version: The code in this article is written based on the torch.cuda.amp API in PyTorch 1.6+. PyTorch 2.4+ recommends using torch.amp.autocast and torch.amp.GradScaler, which have essentially the same usage. Differences will be noted in the article.

1. Mixed Precision Training Basics


1.1 Why Mixed Precision is Needed


Training a deep learning model involves a large number of matrix operations. Traditional FP32 (32-bit floating-point) computation is precise but uses more memory and runs more slowly. FP16 (16-bit floating-point) computation is faster and needs half the memory per value, but its representable range is much smaller (the largest finite value is 65504 and the smallest positive normal value is about 6.1e-5), so small gradients are prone to underflow.
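To make the range limitation concrete, the sketch below rounds a few values to FP16 using Python's standard `struct` module, whose `e` format code is IEEE 754 half precision. The helper name `to_fp16` and the sample values are illustrative assumptions and do not come from the original tutorial.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# A small gradient value vanishes: it is below half of the smallest
# FP16 subnormal (2**-24, about 5.96e-8), so it rounds to zero.
tiny_grad = 1e-8
print(to_fp16(tiny_grad))     # 0.0

# Precision is limited as well: FP16 has 10 mantissa bits, so above 2048
# only every second integer can be represented.
print(to_fp16(2049.0))        # 2048.0

# The largest finite FP16 value is still exactly representable.
print(to_fp16(65504.0))       # 65504.0
```

The first result is the gradient underflow problem in miniature: a value that is perfectly ordinary in FP32 becomes exactly zero once it is stored in FP16.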


The core concept of mixed precision training is: use FP32 for operations that require high precision, and use FP16 for operations that don't require high precision. This way, you can enjoy the speed advantage of FP16 while avoiding precision issues.


Figure: bit layouts of the three floating-point formats. Exponent bits determine the numerical range, and mantissa bits determine precision.
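The relationship between bit layout, range, and precision can also be checked numerically. The sketch below assumes the three formats are FP32, FP16, and BF16 (the text does not name them) and derives their limits from the exponent and mantissa bit counts. The names `FORMATS` and `describe` are hypothetical.

```python
# (exponent bits, mantissa bits) per format; each format also has 1 sign bit.
# Assumption: these are the three formats the figure compared.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def describe(exp_bits: int, man_bits: int) -> dict:
    """Derive range and precision of an IEEE-754-style format from its bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),              # smallest positive normal value
        "epsilon": 2.0 ** -man_bits,                  # gap between 1.0 and the next value
    }

for name, (e, m) in FORMATS.items():
    info = describe(e, m)
    print(f"{name}: max={info['max']:.4g}  "
          f"min_normal={info['min_normal']:.4g}  eps={info['epsilon']:.4g}")
```

Under these assumptions, BF16 shares the 8 exponent bits of FP32 and therefore its range, while FP16 trades range for three extra mantissa bits of precision.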


1.2 Advantages of Mixed Precision


Metric	Improvement	Description
Training Speed	Up to 2-3x	Depends on GPU Tensor Core support and on how much of the workload is matrix math
Memory Usage	Up to ~50% for activations	Activation values and intermediate results are stored in FP16; model weights stay in FP32
Memory Bandwidth	Reduced by up to ~50%	Smaller data volume means less transfer
Communication Overhead	Reduced by up to ~50%	Applies when gradients are transferred in FP16 during distributed training
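The memory figure follows from simple arithmetic: a tensor needs its element count times the bytes per element. The sketch below works this out for a hypothetical activation tensor. The shape and the helper name `tensor_bytes` are assumptions chosen for illustration.

```python
import math

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

def tensor_bytes(shape, dtype: str) -> int:
    """Memory needed to store one dense tensor of the given shape."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

# Hypothetical (batch, tokens, hidden) activation tensor.
activation_shape = (64, 1024, 4096)
fp32 = tensor_bytes(activation_shape, "fp32")
fp16 = tensor_bytes(activation_shape, "fp16")
print(f"FP32: {fp32 / 2**20:.0f} MiB, FP16: {fp16 / 2**20:.0f} MiB")
# FP32: 1024 MiB, FP16: 512 MiB
```

The halving applies to whatever is actually stored in 16 bits, so the overall saving for a training run depends on how much of its memory is activations.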


1.3 Tensor Core Acceleration Principle


NVIDIA's Tensor Core is a hardware unit designed specifically for matrix operations. In the first-generation (Volta) design, each Tensor Core completes a 4×4 matrix multiply-accumulate operation (D = A × B + C) in a single clock cycle, which is the main source of the FP16 training speedup. Ordinary CUDA cores need many separate instructions for the same work, whereas a Tensor Core performs it as a single operation.
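The operation D = A × B + C can be written out in plain Python to show how much work one such step covers. This is only a functional sketch, not a model of the hardware: `mma_4x4` and `to_fp16` are hypothetical helpers, and Python's double-precision floats stand in for the higher-precision accumulator.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mma_4x4(a, b, c):
    """Compute D = A x B + C for 4x4 matrices given as nested lists.

    A and B are first rounded to FP16 (the low-precision input format);
    the products are accumulated at higher precision.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]

d = mma_4x4(a, identity, c)
print(d[1])   # [4.5, 5.5, 6.5, 7.5]
```

The nested loops perform 4 × 4 × 4 = 64 multiply-adds, which is the amount of work the text describes as a single Tensor Core operation.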


GPUs supporting Tensor Core include:


Volta architecture (V100) — First-generation Tensor Core, FP16 only
Turing architecture (RTX 20 series) — Supports FP16 / INT8 / INT4
Ampere architecture (RTX 30 series, A100) — Added BF16 / TF32 support
Ada Lovelace architecture (RTX 40 series) — Added FP8 support
Hopper architecture (H100) — Added FP8 Transformer Engine

Consumer-grade RTX GPUs also have Tensor Cores, so models such as the RTX 3060 and above can benefit from AMP acceleration.
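One way to find out which of the architectures listed above a given GPU belongs to is its CUDA compute capability. The helper below is a hypothetical sketch that maps a (major, minor) pair to an architecture name, assuming the usual numbering (7.0 Volta, 7.5 Turing, 8.0 and 8.6 Ampere, 8.9 Ada Lovelace, 9.0 Hopper).

```python
def tensor_core_generation(major: int, minor: int) -> str:
    """Map a CUDA compute capability to an architecture name from the list above."""
    cc = (major, minor)
    if cc >= (9, 0):
        return "Hopper or newer"
    if cc == (8, 9):
        return "Ada Lovelace"
    if cc >= (8, 0):
        return "Ampere"
    if cc == (7, 5):
        return "Turing"
    if cc >= (7, 0):
        return "Volta"
    return "no Tensor Cores"

# With PyTorch installed and a GPU present, the pair can be obtained from
# torch.cuda.get_device_capability(0), which returns a (major, minor) tuple.
print(tensor_core_generation(8, 6))   # Ampere (for example, the RTX 30 series)
print(tensor_core_generation(6, 1))   # no Tensor Cores
```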

2. PyTorch AMP Basic Usage


2.1 autocast and GradScaler


PyTorch's AMP API consists of two core components:


autocast: Context manager that automatically switches operations within the region to FP16 (operations sensitive to precision will automatically fall back to FP32)
GradScaler: Dynamically adjusts the gradient scaling factor to amplify FP16 gradients to prevent underflow (only needed for FP16, BF16 usually doesn't need it)
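The loss-scaling idea behind GradScaler can be sketched in plain Python. `ToyGradScaler` below is a hypothetical, simplified model and not PyTorch's implementation. Its default values mirror the documented GradScaler defaults (init_scale=65536, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000), and it folds the Inf/NaN check and the scale update into a single `update` call for brevity.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

class ToyGradScaler:
    """Simplified illustration of dynamic loss scaling (hypothetical class)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale *= self.backoff_factor   # overflow: shrink the scale, skip the step
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor    # long stable run: try a larger scale
            self._good_steps = 0
        return True

true_grad = 1e-8
scaler = ToyGradScaler()

print(to_fp16(true_grad))                     # 0.0: the gradient is lost without scaling
scaled = to_fp16(true_grad * scaler.scale)    # representable in FP16 after scaling
print(scaled / scaler.scale)                  # close to 1e-8 once unscaled in full precision
```

BF16 has the same exponent width as FP32, so values this small do not underflow there, which is why the scaler is usually unnecessary with BF16.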


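Figure: data flow of a complete AMP training step (forward pass under autocast, loss scaling, backward pass, gradient unscaling with Inf/NaN check, optimizer step, scale update).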


2.2 Basic Usage Example


Example


import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
# PyTorch 2.4+ recommended usage:
# from torch.amp import autocast, GradScaler

# Check whether CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # torch.cuda.amp only takes effect on CUDA devices
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"BF16 Support: {torch.cuda.is_bf16_supported()}")

# ── Model definition ──────────────────────────────────────
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

model = SimpleModel().to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# ── Key component of mixed precision training ─────────────
# GradScaler: scales the loss to avoid FP16 gradient underflow
scaler = GradScaler(enabled=use_amp)

# Training loop
def train_epoch_amp(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad()

        # ── Core: the autocast context manager ──────
        # Operations inside the autocast context automatically use FP16.
        # Precision-sensitive operations (such as softmax and loss) automatically fall back to FP32.
        # torch.cuda.amp.autocast takes no device_type argument;
        # with the PyTorch 2.4+ import, write autocast('cuda') instead.
        with autocast(enabled=(device.type == "cuda")):
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # ── Backward propagation with the scaler ─────────────
        # 1. Scale the loss (multiply by the scale factor)
        # 2. Backward propagation (computed on the scaled loss)
        # 3. scaler.step internally unscales the gradients and checks for Inf/NaN
        scaler.scale(loss).backward()
        # Update parameters
        scaler.step(optimizer)
        # Update the scaler's scale factor
        scaler.update()

        # Statistics
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)
    return total_loss / total, correct / total

# Mock data
train_loader = [
    (torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)
]

# Start training
for epoch in range(3):
    loss, acc = train_epoch_amp(
        model, train_loader, criterion, optimizer, scaler, device
    )
    print(f"Epoch {epoch+1}: Loss={loss:.4f}, Acc={acc:.4f}")
print("Mixed precision training done!")

API Migration Tip: PyTorch 2.4 marks torch.cuda.amp.autocast as deprecated and recommends torch.amp.autocast('cuda') instead. Apart from the import path, the only difference is that the new API takes the device type as its first argument; the rest of the usage is the same.
