
# PyTorch GPU and CUDA

The core operations in deep learning are large matrix multiplications and element-wise computations. CPUs are designed for complex sequential logic and typically have 8–64 cores; GPUs have thousands of simple parallel cores, which makes them a natural fit for this kind of highly parallel numerical work. PyTorch uses NVIDIA's CUDA (Compute Unified Device Architecture) platform to access GPU compute, which can speed up training by tens or even hundreds of times, depending on the model and the hardware.

* * *

## 1. Differences Between CPU and GPU

The speedup from GPU training comes mainly from two sources: first, a GPU executes a massive number of identical computations in parallel; second, GPU memory (VRAM) has much higher bandwidth than CPU main memory, so data reaches the compute units faster. The effect is most pronounced for compute-intensive operations such as matrix multiplication.

| Comparison Item | CPU | GPU (NVIDIA) |
| --- | --- | --- |
| Number of Cores | 8–64 large cores | Thousands of small cores (CUDA cores) |
| Design Goal | Low-latency serial processing | High-throughput parallel computing |
| Memory Bandwidth | ~50–100 GB/s | ~500–3000 GB/s |
| Matrix Multiplication Speed | Baseline | Roughly 10x–100x faster |
| PyTorch Interface | `"cpu"` | `"cuda"` |

If you are using Apple Silicon (M1/M2/M3), PyTorch supports Metal GPU acceleration through the `mps` backend. Usage is almost identical to CUDA: just change the device to `"mps"`.

* * *

## 2. Checking the CUDA Environment

Before using a GPU, confirm that the installed PyTorch build supports CUDA and that the system has an available GPU device.
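Both checks can be wrapped in one helper that never raises, even on a CPU-only machine or where PyTorch is missing. The sketch below is one possible way to do that; `cuda_report` and its dictionary keys are hypothetical names chosen for illustration, not part of PyTorch.

```python
import importlib.util


def cuda_report():
    """Summarize PyTorch/CUDA availability without raising on CPU-only machines."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu_count": 0,
        "gpu_names": [],
    }
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the helper still works without PyTorch

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Only query device details once CUDA is known to be usable
        report["gpu_count"] = torch.cuda.device_count()
        report["gpu_names"] = [
            torch.cuda.get_device_name(i) for i in range(report["gpu_count"])
        ]
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

A `cuda_build` of `None` together with `cuda_available: False` usually points to a CPU-only PyTorch install rather than a missing GPU.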
## Example

```python
import torch

# Check if CUDA is available
print("CUDA available:", torch.cuda.is_available())

# Number of GPU devices
print("Number of GPUs:", torch.cuda.device_count())

# The next two calls raise an error on machines without CUDA, so guard them
if torch.cuda.is_available():
    # Index of the current default GPU
    print("Current GPU:", torch.cuda.current_device())
    # GPU model name
    print("GPU model:", torch.cuda.get_device_name(0))

# PyTorch version and compiled-in CUDA version
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
```

Sample output:

```
CUDA available: True
Number of GPUs: 1
Current GPU: 0
GPU model: NVIDIA GeForce RTX 4090
PyTorch version: 2.3.0+cu121
CUDA version: 12.1
```

### 2.1 Dynamically Selecting Device (Recommended Approach)

Hardcoding `"cuda"` in code will cause errors on machines without a GPU. Select the device at runtime instead.

## Example

```python
import torch

# Method 1: Classic approach, best compatibility
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Method 2: Recommended for PyTorch 2.0+, falls back CUDA -> MPS -> CPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Using device: {device}")
```

* * *

## 3. Moving Tensors Between Devices

In PyTorch, tensors are created on the CPU by default. To compute on the GPU, you must explicitly move tensors to the GPU or create them there directly.

### 3.1 Basic Methods for Moving Tensors

## Example

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a CPU tensor
cpu_tensor = torch.tensor([1.0, 2.0, 3.0])
print(cpu_tensor.device)  # cpu

# Method 1: .to(device), recommended and most versatile
gpu_tensor = cpu_tensor.to(device)

# Method 2: .cuda(), only valid in CUDA environments (raises an error otherwise)
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.cuda()

# Method 3: Specify device during creation
gpu_tensor = torch.tensor([1.0, 2.0, 3.0], device=device)
gpu_tensor = torch.randn(3, 4, device=device)

# Move back to CPU (for printing, numpy conversion, saving, etc.)
back_to_cpu = gpu_tensor.cpu()

print(gpu_tensor.device)   # cuda:0 (on a CUDA machine)
print(back_to_cpu.device)  # cpu
```

A GPU tensor must be moved back to the CPU before it can be converted to numpy. If the tensor requires gradients (`requires_grad=True`), you must also call `detach()` first:

## Example

```python
# Convert regular GPU tensor to numpy
arr = gpu_tensor.cpu().numpy()

# Convert GPU tensor with gradients to numpy
arr = gpu_tensor.detach().cpu().numpy()
```

### 3.2 Device Consistency Constraints

Tensors on different devices cannot take part in the same operation directly. Attempting it raises a `RuntimeError`:

## Example

```python
a = torch.randn(3).to("cuda")
b = torch.randn(3)  # On CPU

# c = a + b  # RuntimeError: Expected all tensors to be on the same device

# Correct way: align devices first
c = a + b.to("cuda")
```

Check which device a tensor resides on:

## Example

```python
x = torch.randn(3, 4).to(device)
print(x.device)        # cuda:0
print(x.is_cuda)       # True
print(x.get_device())  # 0 (GPU index)
```

### 3.3 Speed Comparison Verification

## Example

```python
import torch
import time

device = torch.device("cuda")
n = 5000

# CPU matrix multiplication
a_cpu = torch.randn(n, n)
b_cpu = torch.randn(n, n)
start = time.perf_counter()
c_cpu = torch.matmul(a_cpu, b_cpu)
print(f"CPU time: {time.perf_counter() - start:.3f}s")

# GPU matrix multiplication
a_gpu = a_cpu.to(device)
b_gpu = b_cpu.to(device)

# Warm-up run so one-time CUDA/cuBLAS initialization is not included in the timing
torch.matmul(a_gpu, b_gpu)
torch.cuda.synchronize()  # Ensure transfer and warm-up complete before timing starts

start = time.perf_counter()
c_gpu = torch.matmul(a_gpu, b_gpu)
torch.cuda.synchronize()  # Wait for GPU execution to finish before stopping timer
print(f"GPU time: {time.perf_counter() - start:.3f}s")
```

Sample output:

```
CPU time: 1.847s
GPU time: 0.021s
```

> GPU computations are executed asynchronously: the Python call returns before the GPU finishes its task. When measuring execution time, you must call `torch.cuda.synchronize()` to wait until the GPU actually completes; otherwise the measurement will be inaccurate.

* * *

## 4. Moving Models to GPU

All parameters of a model (`weight`, `bias`) are essentially tensors.
These parameters must also be moved to the GPU so that forward and backward propagation run there. Calling `.to(device)` on the whole model is enough: PyTorch traverses the module tree and moves every parameter and buffer.

## Example

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

# Move model to GPU, just one call needed
model = SimpleNet().to(device)

# Verify that all parameters are on GPU
for name, param in model.named_parameters():
    print(f"{name}: {param.device}")
# net.0.weight: cuda:0
# net.0.bias: cuda:0
# ...

# Input data must also be on the same device
x = torch.randn(32, 784).to(device)
output = model(x)
print(output.shape)  # torch.Size([32, 10])
```

> If the model is on the GPU but the input data remains on the CPU, forward propagation will raise an error. Always call `.to(device)` on both `inputs` and `labels` after each batch is loaded from the DataLoader.

* * *

## 5. Complete Training Pipeline

Below is a standard GPU training template that covers data loading, model training, and validation.

## Example

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training device: {device}")

# 2. Data loading
# pin_memory=True: keeps batches in page-locked (pinned) host memory to speed up CPU -> GPU transfer
# num_workers: multi-process prefetching to reduce data waiting time
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=4, pin_memory=True)

# 3. Define model
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CNN().to(device)  # Move model to GPU
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# 4. Training function
def train_epoch(model, loader, optimizer, criterion):
    model.train()
```
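One way the body of `train_epoch` could continue is sketched here. It keeps the signature from the template and relies on a module-level `device`. The returned `(mean loss, accuracy)` pair, the `non_blocking=True` transfers, the tiny linear model, and the random stand-in data are all assumptions made for illustration, so the block runs without downloading MNIST.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def train_epoch(model, loader, optimizer, criterion):
    """Run one pass over `loader`; return (mean loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, labels in loader:
        # Move each batch to the same device as the model
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # .item() copies the scalar back to the CPU, which waits for the GPU
        total_loss += loss.item() * inputs.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        seen += inputs.size(0)
    return total_loss / seen, correct / seen


# Hypothetical stand-in data: 64 random 1x28x28 "images" with random labels
fake_images = torch.randn(64, 1, 28, 28)
fake_labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16)

# Small placeholder model; the CNN from the template could be used instead
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

loss, acc = train_epoch(model, loader, optimizer, criterion)
print(f"loss={loss:.4f} acc={acc:.2%}")
```

`non_blocking=True` only overlaps the copy with computation when the source batch sits in pinned memory, which is what `pin_memory=True` on the `DataLoader` provides; on a CPU-only machine it has no effect.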