PyTorch Mixed Precision Training (AMP)

Mixed precision training is one of the most important performance optimization techniques in deep learning. By combining FP32 (single precision) and FP16 (half precision) floating-point arithmetic, it can significantly improve training speed and reduce memory usage with almost no loss in model accuracy. This section introduces Automatic Mixed Precision (AMP) in PyTorch.

Applicable Version: The code in this article is written based on the `torch.cuda.amp` API in PyTorch 1.6+. PyTorch 2.4+ recommends using `torch.amp.autocast` and `torch.amp.GradScaler`, which have essentially the same usage. Differences will be noted in the article.

1. Mixed Precision Training Basics

1.1 Why Mixed Precision is Needed

Deep learning model training involves a large number of matrix operations. Traditional FP32 (32-bit floating-point) computation has high precision but uses more memory and runs more slowly. FP16 (16-bit floating-point) computation is faster and uses less memory, but its representable range is much narrower (largest finite value 65504, smallest positive value about 6e-8), so small gradients are prone to underflowing to zero.

The core idea of mixed precision training is to use FP32 for operations that require high precision and FP16 for operations that do not. This way, you get the speed advantage of FP16 while avoiding its precision issues.
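
The underflow problem described above can be reproduced without a GPU. The following is a minimal sketch that uses Python's standard `struct` module, whose `'e'` format code is IEEE 754 half precision, to round-trip values through FP16. The helper name `to_fp16` is only illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 keeps only about 3 decimal digits of precision.
print(to_fp16(0.1))        # close to 0.1, but not exactly 0.1

# A gradient-sized value below FP16's smallest positive value (~6e-8) becomes zero.
tiny_grad = 1e-8
print(to_fp16(tiny_grad))  # 0.0

# FP32 ('f' format) still represents the same value.
as_fp32 = struct.unpack("f", struct.pack("f", tiny_grad))[0]
print(as_fp32 > 0.0)       # True
```

A gradient that rounds to zero contributes nothing to the parameter update, which is why such values need to stay in FP32 or be scaled up before they are stored in FP16.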

[Figure: bit layouts of the three floating-point formats. Exponent bits determine the numerical range; mantissa bits determine precision.]
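
To make the exponent/mantissa trade-off concrete, the sketch below derives the range and precision of each format from its bit layout. It assumes the three formats are FP32, FP16, and BF16, using the standard bit counts for each; the function names are illustrative.

```python
# (exponent bits, mantissa bits) per format; each format also has 1 sign bit.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value: (2 - 2^-m) * 2^bias, with bias = 2^(e-1) - 1."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** bias

def epsilon(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name}: max={max_finite(e, m):.4g}, eps={epsilon(m):.3g}")
```

BF16 keeps the 8 exponent bits of FP32, so its range is nearly the same as FP32, but its 7 mantissa bits make it coarser than FP16. FP16 has finer precision but tops out at 65504.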

1.2 Advantages of Mixed Precision

| Metric | Improvement | Description |
|---|---|---|
| Training Speed | Up to 2-3x faster | Depends on GPU Tensor Core support |
| Memory Usage | Up to ~50% lower | Activations and intermediate results are stored in FP16 |
| Memory Bandwidth | Up to ~50% lower | Half-width data means less to transfer |
| Communication Overhead | Up to ~50% lower | Gradient traffic halves in distributed training when gradients are sent in FP16 |

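
The roughly 50% figures follow directly from element size. The sketch below compares the storage of one tensor in FP32 and FP16; the tensor shape is an arbitrary choice for illustration and the helper `tensor_bytes` is hypothetical.

```python
import torch

def tensor_bytes(t: torch.Tensor) -> int:
    """Bytes used by the tensor's data (number of elements x bytes per element)."""
    return t.nelement() * t.element_size()

# Hypothetical activation tensor, chosen only for illustration.
act_fp32 = torch.zeros(1024, 1024, dtype=torch.float32)
act_fp16 = act_fp32.to(torch.float16)

print(f"FP32: {tensor_bytes(act_fp32) / 2**20:.0f} MiB")  # 4 MiB
print(f"FP16: {tensor_bytes(act_fp16) / 2**20:.0f} MiB")  # 2 MiB
```

Under autocast, model parameters and optimizer state remain in FP32, so the saving over a whole training run is usually smaller than the per-tensor 50%.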
1.3 Tensor Core Acceleration Principle

NVIDIA's Tensor Core is a hardware unit specifically designed for matrix operations. It can complete a 4×4 matrix multiply-accumulate operation (D = A × B + C) in a single clock cycle, which is the main source of FP16 training acceleration. Compared to ordinary CUDA cores that require multiple instructions to complete the same operation, a Tensor Core compresses it into a single instruction.
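
The multiply-accumulate D = A × B + C can be emulated numerically on any device. The sketch below illustrates the arithmetic only, not how the hardware executes it. It assumes FP16 inputs with FP32 accumulation, and the function name `mma_fp16_inputs` is hypothetical.

```python
import torch

def mma_fp16_inputs(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Emulate D = A x B + C with FP16 inputs and FP32 accumulation."""
    # Round A and B to FP16, then widen back to FP32 so that the products
    # and sums are accumulated in FP32.
    a16 = a.to(torch.float16).to(torch.float32)
    b16 = b.to(torch.float16).to(torch.float32)
    return a16 @ b16 + c.to(torch.float32)

torch.manual_seed(0)
A, B, C = (torch.randn(4, 4) for _ in range(3))

D_mixed = mma_fp16_inputs(A, B, C)
D_exact = A @ B + C  # plain FP32 reference

print(D_mixed.dtype)                           # torch.float32
print((D_mixed - D_exact).abs().max().item())  # small error from rounding the inputs
```

The only error relative to the FP32 reference comes from rounding the inputs to FP16, which is why accumulating in higher precision keeps the result usable.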

GPUs with Tensor Cores include:

- Volta architecture (V100): first-generation Tensor Core, FP16 only
- Turing architecture (RTX 20 series): supports FP16 / INT8 / INT4
- Ampere architecture (RTX 30 series, A100): added BF16 / TF32 support
- Ada Lovelace architecture (RTX 40 series): added FP8 support
- Hopper architecture (H100): added FP8 Transformer Engine

Consumer-grade RTX GPUs also include Tensor Cores, so cards such as the RTX 3060 and above can benefit from AMP acceleration.

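
To find out where the local GPU falls in the list above, one option is to query its compute capability. The sketch below is a heuristic rather than an official check. It assumes that compute capability 7.0 (Volta) or higher suggests Tensor Cores, which is not true for every card (the GTX 16 series reports 7.5 but has none). The helper name is hypothetical.

```python
import torch

def likely_has_tensor_cores(capability: tuple) -> bool:
    """Heuristic: Volta is compute capability 7.0, so treat 7.0+ as a candidate."""
    major, _minor = capability
    return major >= 7

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {cap[0]}.{cap[1]}, "
          f"Tensor Cores likely: {likely_has_tensor_cores(cap)}")
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device found; the AMP speedups described here need a CUDA GPU.")
```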
2. PyTorch AMP Basic Usage

2.1 autocast and GradScaler

PyTorch's AMP API consists of two core components:

- `autocast`: a context manager that automatically runs eligible operations within its region in FP16 (precision-sensitive operations automatically fall back to FP32)
- `GradScaler`: dynamically adjusts a scaling factor applied to the loss so that FP16 gradients are amplified and do not underflow (only needed for FP16; BF16 usually doesn't need it)

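
The effect of `autocast` on dtypes can be inspected directly. The following minimal sketch uses `torch.autocast` and, as an assumption made only so that it also runs without a GPU, falls back to BF16 on CPU when CUDA is unavailable.

```python
import torch
import torch.nn as nn

# Assumption for portability: FP16 on CUDA, BF16 on CPU when no GPU is present.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

layer = nn.Linear(8, 4).to(device_type)
x = torch.randn(2, 8, device=device_type)

with torch.autocast(device_type=device_type, dtype=low_dtype):
    y = layer(x)  # the matrix multiply runs in the lower-precision dtype

print(layer.weight.dtype)  # torch.float32: parameters are not converted
print(y.dtype)             # the lower-precision dtype chosen above
```

Note that `autocast` changes the dtype used by individual operations, not the dtype of the stored parameters, so the model itself stays in FP32.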

[Figure: data flow of a complete AMP training step.]
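
The scaling part of an AMP step (unscale the gradients, check for Inf/NaN, step or skip, then adjust the scale) can be modeled in a few lines of plain Python. This is a simplified sketch, not PyTorch's implementation. The class `ToyLossScaler` is hypothetical, and its default values mirror the documented defaults of `GradScaler`.

```python
import math

class ToyLossScaler:
    """Simplified model of dynamic loss scaling (not PyTorch's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Unscale gradients; return them, or None if the step must be skipped."""
        grads = [g / self.scale for g in scaled_grads]
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: skip the optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return grads

scaler = ToyLossScaler(growth_interval=2)
print(scaler.step([65536.0 * 1e-3]))  # unscaled gradient, close to 1e-3
print(scaler.step([float("inf")]))    # None: step skipped
print(scaler.scale)                   # 32768.0 after backoff
```

Multiplying the loss by the scale before backpropagation multiplies every gradient by the same factor, which moves small gradients into FP16's representable range. Dividing by the scale before the optimizer step restores their true magnitude.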

2.2 Basic Usage Example

Example

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
# PyTorch 2.4+ recommended usage:
# from torch.amp import autocast, GradScaler

# Check whether CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"BF16 Support: {torch.cuda.is_bf16_supported()}")

# --- Model Definition ---
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

model = SimpleModel().to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# --- Mixed Precision Training Key Components ---
# GradScaler: scale the loss to avoid FP16 gradient underflow
scaler = GradScaler()  # with torch.amp: GradScaler('cuda')

# Training loop
def train_epoch_amp(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad()

        # --- Core: autocast context manager ---
        # Operations inside the autocast context automatically use FP16.
        # Precision-sensitive operations (such as softmax, loss) automatically fall back to FP32.
        # torch.cuda.amp.autocast takes no device_type argument;
        # with torch.amp, write autocast(device_type='cuda') instead.
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # --- Backward propagation with the scaler ---
        # 1. Scale the loss (multiply by the scale factor)
        # 2. Backward propagation (computes scaled gradients)
        # 3. scaler.step internally unscales the gradients and checks for Inf/NaN
        scaler.scale(loss).backward()
        # Update parameters (skipped if Inf/NaN gradients were found)
        scaler.step(optimizer)
        # Update the scaler's scale factor
        scaler.update()

        # Statistics
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)
    return total_loss / total, correct / total

# Mock data
train_loader = [
    (torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)
]

# Start training
for epoch in range(3):
    loss, acc = train_epoch_amp(
        model, train_loader, criterion, optimizer, scaler, device
    )
    print(f"Epoch {epoch+1}: Loss={loss:.4f}, Acc={acc:.4f}")
print("Mixed Precision Training Done!")
```

API Migration Tip: PyTorch 2.4 has marked `torch.cuda.amp.autocast` as deprecated and recommends `torch.amp.autocast('cuda')` instead. Usage is otherwise the same: apart from the import path, the only difference is that the `torch.amp` versions take the device type as an explicit argument.
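
Since BF16 usually does not need a `GradScaler`, a BF16 training step is shorter than the FP16 one. The following is a minimal sketch of a single step, not a full training loop. The model, data shapes, and the CPU fallback when no BF16-capable GPU is present are assumptions made for illustration.

```python
import torch
import torch.nn as nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: fall back to CPU when the GPU does not support BF16.
if device_type == "cuda" and not torch.cuda.is_bf16_supported():
    device_type = "cpu"

torch.manual_seed(0)
model = nn.Linear(128, 10).to(device_type)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 128, device=device_type)
labels = torch.randint(0, 10, (32,), device=device_type)
weight_before = model.weight.detach().clone()

optimizer.zero_grad()
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    loss = criterion(model(inputs), labels)

# No GradScaler: BF16 shares FP32's exponent range, so gradients rarely underflow.
loss.backward()
optimizer.step()

print(f"loss={loss.item():.4f}")
```

The backward pass and optimizer step sit outside the `autocast` region, just as in the FP16 example. Only the scaler calls are dropped.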