PyTorch Loss Functions
Loss functions measure the gap between model predictions and ground-truth values and provide the central signal that guides neural network training: optimizers update model parameters by minimizing the loss.
PyTorch includes over a dozen common loss functions in the `torch.nn` module, covering major task types such as classification, regression, and ranking.
* * *
## 1. Loss Function Basics
### Basic Usage
All PyTorch loss functions are subclasses of `nn.Module` and share a unified usage pattern:
## Instance
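import torch
import torch.nn as nn

# Stand-in tensors so the snippet runs on its own; in practice
# `predictions` comes from your model and `targets` from your dataset.
predictions = torch.randn(4, 3, requires_grad=True)  # (N, C) raw logits
targets = torch.tensor([0, 2, 1, 0])                  # (N,) class indices

# 1. Instantiate the loss function
criterion = nn.CrossEntropyLoss()
# 2. Compute the loss (predictions first, targets second)
loss = criterion(predictions, targets)
# 3. Backpropagation
loss.backward()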
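To show where the loss sits in a complete training step, here is a minimal sketch. The linear model, the random data, the SGD optimizer, and the learning rate are all illustrative assumptions rather than part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 8 samples, 4 features, 3 classes
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

losses = []
for step in range(20):
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = criterion(model(inputs), targets)  # forward pass + loss
    loss.backward()                           # gradients of the loss w.r.t. parameters
    optimizer.step()                          # parameter update that reduces the loss
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss value itself is only a scalar; it is the `backward()` / `step()` pair that turns it into parameter updates.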
### Input Shape Conventions
Different loss functions have different input shape requirements, which is where beginners most often make mistakes:
| Loss Function | Prediction (input) Shape | Label (target) Shape |
| --- | --- | --- |
| `CrossEntropyLoss` | `(N, C)` raw logits | `(N,)` integer class indices |
| `BCELoss` | `(N,)` probabilities after Sigmoid | `(N,)` 0/1 floats |
| `BCEWithLogitsLoss` | `(N,)` raw logits | `(N,)` 0/1 floats |
| `MSELoss` | `(N,)` any real number | `(N,)` any real number |
| `NLLLoss` | `(N, C)` log-probabilities from `log_softmax` | `(N,)` integer class indices |
> **N** = batch size, **C** = number of classes
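The shape and dtype conventions in the table can be exercised in one place. The sketch below uses random tensors with assumed sizes (`N = 4`, `C = 3`) purely to illustrate which input goes with which loss.

```python
import torch
import torch.nn as nn

N, C = 4, 3  # batch size, number of classes (illustrative values)

logits = torch.randn(N, C)                          # multi-class raw logits
class_idx = torch.randint(0, C, (N,))               # integer class indices (int64)
binary_logits = torch.randn(N)                      # one raw logit per sample
binary_labels = torch.randint(0, 2, (N,)).float()   # 0/1 labels stored as floats
values = torch.randn(N)                             # regression predictions
value_targets = torch.randn(N)                      # regression targets

results = {
    "CrossEntropyLoss": nn.CrossEntropyLoss()(logits, class_idx),
    "NLLLoss": nn.NLLLoss()(torch.log_softmax(logits, dim=1), class_idx),
    "BCELoss": nn.BCELoss()(torch.sigmoid(binary_logits), binary_labels),
    "BCEWithLogitsLoss": nn.BCEWithLogitsLoss()(binary_logits, binary_labels),
    "MSELoss": nn.MSELoss()(values, value_targets),
}

for name, value in results.items():
    # With the default reduction="mean", each loss is a 0-dim scalar tensor
    print(f"{name:18s} -> shape {tuple(value.shape)}, value {value.item():.4f}")
```

Note that the classification targets for `CrossEntropyLoss` and `NLLLoss` are integers, while the `BCE` variants expect float targets.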
* * *
## 2. Classification Task Loss Functions
### 2.1 CrossEntropyLoss
The most commonly used multi-class classification loss function. **It automatically applies Softmax + Log + Negation internally**, so there is no need to manually apply Softmax to the model output.
**Mathematical Formula:**
Loss = -sum(y_c * log(p_c))
Where p_c = exp(x_c) / sum_j exp(x_j) is the Softmax output.
## Instance
import torch
import torch.nn as nn
criterion = nn.CrossEntropyLoss()
# Model output: raw logits, shape (batch_size, num_classes)
# No need to apply Softmax beforehand!
predictions = torch.tensor([
    [2.0, 0.5, 0.3],  # Sample 1, most likely class 0
    [0.1, 3.0, 0.2],  # Sample 2, most likely class 1
    [0.2, 0.1, 4.0],  # Sample 3, most likely class 2
])
# Labels: integer class indices, shape (batch_size,)
targets = torch.tensor([0, 1, 2])
loss = criterion(predictions, targets)
print(f"Loss: {loss.item():.4f}")  # Loss: 0.1640
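To connect the formula to the built-in loss, the following sketch recomputes the same example by hand: Softmax, pick the true-class probability, take the negative log, and average over the batch. It is a verification aid under the default `reduction="mean"`, not a replacement for `nn.CrossEntropyLoss`.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([
    [2.0, 0.5, 0.3],
    [0.1, 3.0, 0.2],
    [0.2, 0.1, 4.0],
])
targets = torch.tensor([0, 1, 2])

# Step 1: Softmax turns logits into probabilities p_c
probs = torch.softmax(predictions, dim=1)
# Step 2: select the probability assigned to each sample's true class
true_class_probs = probs[torch.arange(len(targets)), targets]
# Step 3: negative log, averaged over the batch
manual_loss = -torch.log(true_class_probs).mean()

builtin_loss = nn.CrossEntropyLoss()(predictions, targets)
print(f"manual:   {manual_loss.item():.4f}")
print(f"built-in: {builtin_loss.item():.4f}")
```

The two printed values should agree up to floating-point rounding.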
**Supports label smoothing and soft labels (class probabilities):**
## Instance
# Label smoothing mitigates overfitting; commonly used in image classification
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Targets can also be passed directly as soft labels (probability distributions)
# with the same shape as the predictions: (batch_size, num_classes)
soft_targets = torch.tensor([
    [0.9, 0.05, 0.05],
    [0.05, 0.9, 0.05],
])
predictions = torch.randn(2, 3)
loss = criterion(predictions, soft_targets)
> **Applicable Scenarios:** Any multi-class classification task (e.g., cat/dog/bird), including image classification and text classification.
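Label smoothing and soft labels are closely related. As a sketch, and assuming a PyTorch version that accepts both `label_smoothing` and class-probability targets (1.10 or later), smoothing with factor `eps` on hard labels should match hand-built soft labels that put `1 - eps` on the true class plus `eps / C` on every class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, eps = 3, 0.1

logits = torch.randn(4, num_classes)
hard_targets = torch.tensor([0, 2, 1, 0])

# Built-in label smoothing applied to hard labels
smoothed_loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, hard_targets)

# Hand-built soft labels: (1 - eps) on the true class, eps / C spread over all classes
one_hot = F.one_hot(hard_targets, num_classes).float()
soft_targets = one_hot * (1 - eps) + eps / num_classes
soft_loss = nn.CrossEntropyLoss()(logits, soft_targets)

print(f"label_smoothing: {smoothed_loss.item():.4f}")
print(f"soft targets:    {soft_loss.item():.4f}")
```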
* * *
### 2.2 BCELoss Binary Cross-Entropy Loss
Specifically for **binary classification** or **multi-label classification** tasks. The input must be probabilities in the range 0 to 1, obtained by passing the model output through `Sigmoid`.
**Mathematical Formula:**
Loss = -[y * log(p) + (1-y) * log(1-p)]
## Instance
criterion = nn.BCELoss()
# Model output must be passed through Sigmoid first, value range (0, 1)
raw_output = torch.tensor([2.0, -1.0, 0.5, -3.0])
predictions = torch.sigmoid(raw_output)  # [0.88, 0.27, 0.62, 0.05]
# Labels: float type 0.0 or 1.0
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = criterion(predictions, targets)
print(f"Loss: {loss.item():.4f}")  # Loss: 0.2407
# Multi-label classification (each sample can belong to multiple classes)
# predictions shape: (batch_size, num_labels)
predictions_ml = torch.sigmoid(torch.randn(4, 5))
targets_ml = torch.randint(0, 2, (4, 5)).float()
loss_ml = criterion(predictions_ml, targets_ml)
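> `BCELoss` requires every input value to lie in the [0, 1] range. Raw logits outside that range typically trigger an error instead of a meaningful loss, and applying Sigmoid as a separate step loses precision when logits are large in magnitude. Prefer `BCEWithLogitsLoss`, described below.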
* * *
### 2.3 BCEWithLogitsLoss
An improved version of `BCELoss`. **It applies Sigmoid internally**, is more numerically stable, and is the recommended default for binary and multi-label tasks.
## Instance
criterion = nn.BCEWithLogitsLoss()
# Pass raw logits directly, no need to manually apply Sigmoid
predictions = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = criterion(predictions, targets)
print(f"Loss: {loss.item():.4f}")  # Loss: 0.2407
# Equivalent to (but with better numerical stability):
# loss = nn.BCELoss()(torch.sigmoid(predictions), targets)
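The stability claim can be illustrated with a small sketch. For moderate logits the fused and two-step versions agree. For an extreme logit (the value 50 is an arbitrary choice for illustration), Sigmoid rounds to exactly 1.0 in float32, so the two-step version can no longer recover the true loss of roughly 50.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Moderate logits: both routes give the same result
fused = nn.BCEWithLogitsLoss()(logits, targets)
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print(f"fused: {fused.item():.4f}, two-step: {two_step.item():.4f}")

# Extreme case: a confident but wrong prediction (logit 50, label 0).
# The exact loss is log(1 + exp(50)), which is approximately 50.
big_logit = torch.tensor([50.0])
zero_label = torch.tensor([0.0])
fused_extreme = nn.BCEWithLogitsLoss()(big_logit, zero_label)
two_step_extreme = nn.BCELoss()(torch.sigmoid(big_logit), zero_label)
print(f"fused: {fused_extreme.item():.2f}, two-step: {two_step_extreme.item():.2f}")
```

The fused loss works directly in log space, which is why it stays accurate where the separately computed probability has already saturated.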
**With positive sample weights (handling class imbalance):**
## Instance
# pos_weight: weight on the positive class; larger values penalize missed positives more
# A common heuristic is num_negatives / num_positives, e.g. pos_weight=10 when
# negatives outnumber positives 10 to 1
pos_weight = torch.tensor([10.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
> **Applicable Scenarios:** Binary classification (spam detection), multi-label classification (article multi-label tagging), object detection (foreground/background judgment).
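As a sketch of how `pos_weight` might be derived and what it does, the example below computes the negative-to-positive ratio from a hypothetical label tensor and checks the result against a hand-written version of the loss in which only the positive term is scaled. The labels and logits are made-up values.

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced batch: 2 positives, 8 negatives
labels = torch.tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 0.])
logits = torch.tensor([0.3, -1.2, 0.8, -0.5, -2.0, -0.7, 0.1, -1.5, -0.3, -0.9])

num_pos = labels.sum()
num_neg = labels.numel() - num_pos
pos_weight = (num_neg / num_pos).reshape(1)  # tensor([4.])

weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, labels)

# Manual check: only the positive-class term is multiplied by pos_weight
p = torch.sigmoid(logits)
manual = -(pos_weight * labels * torch.log(p)
           + (1 - labels) * torch.log(1 - p)).mean()

print(f"weighted: {weighted.item():.4f}, manual: {manual.item():.4f}")
```

In a real project the ratio would usually be computed once over the training set rather than per batch.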
* * *
### 2.4 NLLLoss Negative Log-Likelihood Loss
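Expects log-probabilities as input, so you must apply `log_softmax` to the model output yourself. This adds a step but offers more flexibility, because the log-probabilities remain available for other uses. In short, `CrossEntropyLoss = LogSoftmax + NLLLoss`.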
## Instance
criterion = nn.NLLLoss()
# Must manually apply log_softmax first
raw_output = torch.randn(4, 3)  # (batch, num_classes)
log_probs = torch.log_softmax(raw_output, dim=1)
targets = torch.tensor([0, 2, 1, 0])
loss = criterion(log_probs, targets)
> **Use Cases:** When you need to use log probabilities in intermediate steps (e.g., CTC, Beam Search); for other cases, prioritize `CrossEntropyLoss`.
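The identity `CrossEntropyLoss = LogSoftmax + NLLLoss` can be checked directly. This is a small verification sketch on random logits with a fixed seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
raw_output = torch.randn(4, 3)        # (batch, num_classes)
targets = torch.tensor([0, 2, 1, 0])

# Two-step route: explicit log_softmax, then NLLLoss
log_probs = torch.log_softmax(raw_output, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# One-step route: CrossEntropyLoss on raw logits
ce = nn.CrossEntropyLoss()(raw_output, targets)

# The intermediate log-probabilities stay available, e.g. to recover probabilities
probs = log_probs.exp()

print(f"NLLLoss: {nll.item():.4f}, CrossEntropyLoss: {ce.item():.4f}")
```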
* * *
## 3. Regression Task Loss Functions
### 3.1 MSELoss Mean Squared Error
The most classic regression loss, **highly sensitive to large errors** (because squaring amplifies the impact of large errors).
**Mathematical Formula:**
MSELoss = (1/N) * sum((y_i - y_hat_i)^2)
## Instance
criterion = nn.MSELoss()
predictions = torch.tensor([2.5, 0.5, 2.0, 8.0])
targets = torch.tensor([3.0, -0.5, 2.0, 7.0])
loss = criterion(predictions, targets)
print(f"MSE Loss: {loss.item():.4f}")  # MSE Loss: 0.5625
# Manual verification
manual = ((predictions - targets) ** 2).mean()
print(f"Manual calculation: {manual.item():.4f}")  # 0.5625
> **Applicable Scenarios:** Continuous value regression like house price prediction, temperature prediction, etc. Works well when there are no obvious outliers in the data.
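A common shape pitfall in regression is worth a sketch: a model head often outputs shape `(N, 1)` while targets have shape `(N,)`. In that situation `MSELoss` warns and broadcasts both tensors to `(N, N)`, which silently produces a different number. The column-shaped predictions below are an assumed scenario for illustration.

```python
import warnings
import torch
import torch.nn as nn

criterion = nn.MSELoss()
targets = torch.tensor([3.0, -0.5, 2.0, 7.0])              # shape (4,)
column_preds = torch.tensor([[2.5], [0.5], [2.0], [8.0]])  # shape (4, 1)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # PyTorch warns about the size mismatch
    # (4, 1) vs (4,) broadcasts to (4, 4): every prediction is compared
    # with every target, which is not the intended element-wise loss
    broadcast_loss = criterion(column_preds, targets)

# Fix: make the shapes match before computing the loss
correct_loss = criterion(column_preds.squeeze(1), targets)

print(f"broadcast (wrong): {broadcast_loss.item():.4f}")
print(f"matched shapes:    {correct_loss.item():.4f}")
```

Treating that size-mismatch warning as an error during development is one way to catch this early.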
* * *
### 3.2 L1Loss Mean Absolute Error
**More robust to outliers**, because it takes the absolute value instead of squaring, so large errors are not overly amplified.
**Mathematical Formula:**
L1Loss = (1/N) * sum(|y_i - y_hat_i|)
## Instance
criterion = nn.L1Loss()
predictions = torch.tensor([2.5, 0.5, 2.0, 8.0])
targets = torch.tensor([3.0, -0.5, 2.0, 7.0])
loss = criterion(predictions, targets)
print(f"L1 Loss: {loss.item():.4f}")  # L1 Loss: 0.6250
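The difference in outlier sensitivity between MSE and L1 can be made concrete. In this sketch one prediction is deliberately pushed far from its target (an invented outlier), and the growth of each loss is compared.

```python
import torch
import torch.nn as nn

targets = torch.tensor([3.0, -0.5, 2.0, 7.0])
clean_preds = torch.tensor([2.5, 0.5, 2.0, 8.0])
outlier_preds = torch.tensor([2.5, 0.5, 2.0, 17.0])  # last error grows from 1 to 10

mse, l1 = nn.MSELoss(), nn.L1Loss()

# How much does each loss grow when a single error becomes 10x larger?
mse_ratio = mse(outlier_preds, targets) / mse(clean_preds, targets)
l1_ratio = l1(outlier_preds, targets) / l1(clean_preds, targets)

print(f"MSE grows by a factor of {mse_ratio.item():.1f}")  # 45.0
print(f"L1  grows by a factor of {l1_ratio.item():.1f}")   # 4.6
```

A single bad sample dominates the MSE, while the L1 loss grows only in proportion to the error.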
* * *
### 3.3 SmoothL1Loss Huber Loss
**Combines the advantages of MSE and L1**: it behaves like MSE for small errors (smooth, stable gradients) and like L1 for large errors (robust to outliers). It is the standard bounding-box regression loss in object detectors such as Faster R-CNN.
**Mathematical Formula:**
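SmoothL1(x) = 0.5 * x^2 if |x| < 1, else |x| - 0.5
Where x = y_i - y_hat_i is the per-element error; the threshold of 1 corresponds to the default `beta=1.0`.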
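The piecewise formula can be checked by hand. The sketch below implements both branches with `torch.where` and compares the result with `nn.SmoothL1Loss` at its default `beta=1.0`. The comparison with `nn.HuberLoss` assumes a PyTorch version that provides it (1.9 or later), and the sample values, including the large error, are made up for illustration.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.5, 2.0, 17.0])
targets = torch.tensor([3.0, -0.5, 2.0, 7.0])

x = (predictions - targets).abs()  # absolute errors: 0.5, 1.0, 0.0, 10.0
# Quadratic branch for |x| < 1, linear branch otherwise
per_element = torch.where(x < 1, 0.5 * x ** 2, x - 0.5)
manual = per_element.mean()

builtin = nn.SmoothL1Loss()(predictions, targets)  # default beta=1.0
huber = nn.HuberLoss()(predictions, targets)       # default delta=1.0

print(f"manual: {manual.item():.4f}")
print(f"SmoothL1Loss: {builtin.item():.4f}, HuberLoss: {huber.item():.4f}")
```

With these defaults the two built-in losses coincide, which is why the names are often used interchangeably; for other `beta` / `delta` values they differ by a scaling factor.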
## Instance
criterion = nn.SmoothL1Loss()
predictions = torch.tensor([2.5,0.5,2.0,8.0])
targets = torch.tensor([3.0, -0.5,2.