PyTorch GPU CUDA
The core operations in deep learning are large-scale matrix multiplications and element-wise computations. CPUs are designed to handle complex sequential logic and typically have 8 to 64 cores; GPUs have thousands of simple parallel cores, which makes them naturally suited to such highly parallel numerical work. PyTorch uses NVIDIA's CUDA (Compute Unified Device Architecture) platform to access GPU computing power, which can speed up training by a factor of ten to a hundred, depending on the model and hardware.
* * *
## 1. Differences Between CPU and GPU
The speedup from GPU training comes mainly from two sources: first, the ability to execute a massive number of identical computations in parallel; second, high-bandwidth video memory, which feeds data to the GPU cores much faster than main memory can feed a CPU.
For compute-intensive operations like matrix multiplication, the acceleration effect is particularly significant.
| Comparison Item | CPU | GPU (NVIDIA) |
| --- | --- | --- |
| Number of Cores | 8 to 64 large cores | Thousands of small cores (CUDA Cores) |
| Design Goal | Low-latency serial processing | High-throughput parallel computing |
| Memory Bandwidth | About 50 to 100 GB/s | About 500 to 3000 GB/s |
| Matrix Multiplication Speed | Baseline | 10x to 100x faster |
| PyTorch Interface | `"cpu"` | `"cuda"` |
On Apple Silicon (M1/M2/M3), PyTorch supports Metal GPU acceleration via the `mps` backend. The usage is almost identical to CUDA: just change the device to `"mps"`.
* * *
## 2. Checking the CUDA Environment
Before using a GPU, you need to confirm whether the current environment has a version of PyTorch installed that supports CUDA.
You also need to check if there is an available GPU device on the system.
## Example
import torch
# Check if CUDA is available
print("CUDA available:", torch.cuda.is_available())
# Number of GPU devices
print("Number of GPUs:", torch.cuda.device_count())
# Index of the current default GPU
print("Current GPU:", torch.cuda.current_device())
# GPU model name
print("GPU model:", torch.cuda.get_device_name(0))
# PyTorch version and compiled-in CUDA version
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
Sample output:
CUDA available: True
Number of GPUs: 1
Current GPU: 0
GPU model: NVIDIA GeForce RTX 4090
PyTorch version: 2.3.0+cu121
CUDA version: 12.1
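The per-device calls above assume that a CUDA device is present; `torch.cuda.current_device()` and `torch.cuda.get_device_name(0)` raise an error on a CPU-only machine. A minimal sketch of a guarded check follows. The helper name `describe_cuda_environment` and the dictionary keys are hypothetical, chosen for illustration, and are not part of PyTorch.

```python
import torch


def describe_cuda_environment():
    """Collect basic CUDA facts without failing on CPU-only machines.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    info = {
        "torch_version": torch.__version__,
        # None when PyTorch was built without CUDA support
        "compiled_cuda_version": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        "gpu_names": [],
    }
    # Only query per-device details when CUDA is actually usable
    if info["cuda_available"]:
        info["gpu_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["gpu_count"])
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_environment().items():
        print(f"{key}: {value}")
```

Because every device-specific query sits behind the `is_available()` check, the same script can run unchanged on a laptop and on a GPU server.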
### 2.1 Dynamically Selecting Device (Recommended Approach)
Hardcoding `"cuda"` in code will cause errors on machines without a GPU.
## Example
import torch
# Method 1: Classic approach, best compatibility
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Method 2: Recommended for PyTorch 2.0+, picks CUDA, then MPS, then CPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using device: {device}")
* * *
## 3. Moving Tensors Between Devices
In PyTorch, tensors are created on the CPU by default.
To perform GPU computation, you must explicitly move tensors to the GPU or create them directly on the GPU.
### 3.1 Basic Methods for Moving Tensors
## Example
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create a CPU tensor
cpu_tensor = torch.tensor([1.0, 2.0, 3.0])
print(cpu_tensor.device) # cpu
# Method 1: .to(device), recommended and most versatile
gpu_tensor = cpu_tensor.to(device)
# Method 2: .cuda(), only works in CUDA environments (raises an error otherwise)
gpu_tensor = cpu_tensor.cuda()
# Method 3: Specify device during creation
gpu_tensor = torch.tensor([1.0, 2.0, 3.0], device=device)
gpu_tensor = torch.randn(3, 4, device=device)
# Move back to CPU (for printing, numpy conversion, saving, etc.)
back_to_cpu = gpu_tensor.cpu()
print(gpu_tensor.device) # cuda:0
print(back_to_cpu.device) # cpu
When converting GPU tensors to NumPy arrays, they must first be moved back to the CPU. If the tensor requires gradients (`requires_grad=True`), you must also call `detach()` first:
## Example
# Convert regular GPU tensor to numpy
arr = gpu_tensor.cpu().numpy()
# Convert GPU tensor with gradients to numpy
arr = gpu_tensor.detach().cpu().numpy()
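The two conversions above can be checked on any machine. The sketch below assumes NumPy is installed and falls back to the CPU when no GPU is present; the variable names are illustrative only.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tensor that tracks gradients, as model parameters and activations do
w = torch.ones(3, device=device, requires_grad=True)
y = w * 2

# Calling .numpy() directly fails while the tensor still requires grad
try:
    y.cpu().numpy()
    direct_conversion_failed = False
except RuntimeError:
    direct_conversion_failed = True

# detach() drops the autograd history; cpu() moves the data to host memory
# (a no-op when the tensor is already on the CPU)
arr = y.detach().cpu().numpy()

print(direct_conversion_failed)  # True
print(type(arr), arr.dtype)
```

The order `detach().cpu().numpy()` is the one used in the example above; the important point is that both `detach()` and `cpu()` happen before `numpy()`.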
### 3.2 Device Consistency Constraints
Tensors on different devices cannot take part in the same operation.
Attempting to combine them raises a `RuntimeError`:
## Example
a = torch.randn(3).to("cuda")
b = torch.randn(3) # On CPU
# c = a + b # RuntimeError: Expected all tensors to be on the same device
# Correct way: align devices first
c = a + b.to("cuda")
Check which device a tensor resides on:
## Example
x = torch.randn(3, 4).to(device)
print(x.device) # cuda:0
print(x.is_cuda) # True
print(x.get_device()) # 0 (GPU index)
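One way to avoid device mismatches is to move every operand to the device of a reference tensor before combining them. A small sketch under that assumption; `to_same_device` is a hypothetical helper name, not a PyTorch function.

```python
import torch


def to_same_device(reference, *tensors):
    """Return the given tensors moved to the device of `reference`.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    return tuple(t.to(reference.device) for t in tensors)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, device=device)  # on the GPU when one is available
b = torch.randn(3)                 # created on the CPU by default

(b_aligned,) = to_same_device(a, b)
c = a + b_aligned                  # both operands now share a device

print(c.device == a.device)  # True
```

Reading the target from `reference.device` rather than from a string such as `"cuda"` keeps the code working on machines without a GPU.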
### 3.3 Speed Comparison Verification
## Example
import torch
import time
device = torch.device("cuda")
n = 5000
# CPU matrix multiplication
a_cpu = torch.randn(n, n)
b_cpu = torch.randn(n, n)
start = time.time()
c_cpu = torch.matmul(a_cpu, b_cpu)
print(f"CPU time: {time.time() - start:.3f}s")
# GPU matrix multiplication
a_gpu = a_cpu.to(device)
b_gpu = b_cpu.to(device)
torch.cuda.synchronize() # Ensure data transfer completes before timing starts
start = time.time()
c_gpu = torch.matmul(a_gpu, b_gpu)
torch.cuda.synchronize() # Wait for GPU execution to finish before stopping timer
print(f"GPU time: {time.time() - start:.3f}s")
Sample output:
CPU time: 1.847s
GPU time: 0.021s
> GPU computations are executed asynchronously: the Python call returns before the GPU finishes its task. When measuring execution time, you must call `torch.cuda.synchronize()` to wait until the GPU actually completes; otherwise the measurement will be inaccurate.
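Two further details tend to affect such measurements: the first call to an operation on the GPU typically pays one-time setup costs, and a single run is noisy. The sketch below adds a warm-up run and averages several repetitions. `time_matmul` and its default values are hypothetical choices made for illustration, and the function also runs on the CPU so it can be tried without a GPU.

```python
import time

import torch


def time_matmul(device, n=1024, repeats=5):
    """Return the average seconds per n x n matmul on `device`.

    Hypothetical helper for illustration; n and repeats are arbitrary.
    """
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    # Warm-up run, excluded from the measurement
    torch.matmul(a, b)
    if device.type == "cuda":
        torch.cuda.synchronize()  # drain pending GPU work before timing

    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the queued kernels to finish
    return (time.perf_counter() - start) / repeats


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"{device}: {time_matmul(device):.4f}s per matmul")
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short intervals.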
* * *
## 4. Moving Models to GPU
All parameters of a model (`weight`, `bias`) are essentially tensors.
These parameters must also be moved to the GPU to enable forward and backward propagation on the GPU.
Calling `.to(device)` on the entire model suffices: PyTorch automatically traverses the model and moves all of its parameters and buffers.
## Example
import torch
import torch.nn as nn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

# Move model to GPU, just one call needed
model = SimpleNet().to(device)
# Verify that all parameters are on GPU
for name, param in model.named_parameters():
    print(f"{name}: {param.device}")
# net.0.weight: cuda:0
# net.0.bias: cuda:0
# ...
# Input data must also be on the same device
x = torch.randn(32, 784).to(device)
output = model(x)
print(output.shape) # torch.Size([32, 10])
> If the model is on the GPU but input data remains on the CPU, forward propagation will raise an error. Always call `.to(device)` on both `inputs` and `labels` after each batch is loaded from the DataLoader.
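A related detail is that `nn.Module.to()` moves the parameters in place and returns the same module, whereas `Tensor.to()` returns a tensor on the target device and leaves the original where it was, so its result has to be assigned. A short check, using a small stand-in model rather than `SimpleNet`:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)          # small stand-in model for illustration
returned = model.to(device)      # moves parameters in place
print(returned is model)         # True: same module object

x = torch.randn(8, 4)            # created on the CPU
x.to(device)                     # result discarded: x itself is unchanged
print(x.device)                  # cpu

x = x.to(device)                 # reassign to actually use the moved tensor
output = model(x)
print(output.shape)              # torch.Size([8, 2])
```

Forgetting the reassignment for inputs is a common cause of the device mismatch error described in the note above.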
* * *
## 5. Complete Training Pipeline
Below is a standard GPU training template including data loading, model training, and validation evaluation.
## Example
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# 1. Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training device: {device}")
# 2. Data loading
# pin_memory=True: keeps batches in page-locked (pinned) host memory to speed up CPU -> GPU transfer
# num_workers: multi-process prefetching to reduce data waiting time
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
train_dataset = datasets.MNIST(root="./data", train=True,
                               download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False,
                              download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False,
                         num_workers=4, pin_memory=True)
# 3. Define model
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CNN().to(device) # Move model to GPU
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# 4. Training function
def train_epoch(model, loader, optimizer, criterion):
    model.train()
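The listing breaks off inside `train_epoch`. As a hedged sketch only, and not the original author's code, one common shape for the per-epoch training and evaluation loops is shown below. To stay self-contained it trains a tiny linear model on random tensors instead of MNIST. The function bodies, the extra `device` argument, the `evaluate` function, the data shapes, and the hyperparameters are all assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def train_epoch(model, loader, optimizer, criterion, device):
    """Run one pass over `loader`; return the mean training loss."""
    model.train()
    total_loss, total = 0.0, 0
    for inputs, labels in loader:
        # Each batch must be moved to the model's device
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()

        # .item() copies the scalar loss back to the CPU
        total_loss += loss.item() * inputs.size(0)
        total += inputs.size(0)
    return total_loss / total


@torch.no_grad()
def evaluate(model, loader, device):
    """Return classification accuracy over `loader`."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        preds = model(inputs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total


# Synthetic stand-in for MNIST: 512 random 1x28x28 "images", 10 classes
images = torch.randn(512, 1, 28, 28)
targets = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(images, targets), batch_size=64,
                    shuffle=True, pin_memory=(device.type == "cuda"))

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    loss = train_epoch(model, loader, optimizer, criterion, device)
    acc = evaluate(model, loader, device)
    print(f"epoch {epoch + 1}: loss={loss:.4f}, acc={acc:.2%}")
```

With random labels the accuracy is not meaningful; the point of the sketch is the device handling: the model is moved once, every batch is moved inside the loop, and evaluation runs under `torch.no_grad()` with the model in `eval()` mode.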