PyTorch Embedding
Word embedding is one of the foundational techniques in natural language processing (NLP).
Word embedding maps discrete word symbols to continuous dense vectors, enabling machines to understand and process text data.
PyTorch provides the `nn.Embedding` module to implement this functionality, which is the foundation for building various NLP models.
* * *
## 1. Basic Concepts of Word Embedding
To a computer, text is a sequence of integers: each word is assigned a unique index ID. This discrete representation has a drawback: the IDs carry no semantic information, so two words with similar meanings can have completely unrelated IDs.
Word embedding solves this problem by learning an embedding matrix:
$$
E \in \mathbb{R}^{V \times D}
$$
where $V$ is the vocabulary size and $D$ is the embedding dimension. Each word ID selects one row of the embedding matrix, and the word's vector representation is obtained through a lookup operation:
$$
\text{embedding} = E[\text{word\_id}]
$$
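The lookup above can be checked against the one-hot view of a word. The following is a minimal sketch, assuming a toy random matrix `E` in place of a learned one (the sizes `V = 5`, `D = 3` and the ID `2` are arbitrary illustrative choices): indexing row `word_id` gives the same vector as multiplying the one-hot vector by `E`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

V, D = 5, 3                # toy vocabulary size and embedding dimension (assumed values)
E = torch.randn(V, D)      # stand-in for a learned embedding matrix

word_id = torch.tensor(2)  # arbitrary word ID

# Row lookup: select row `word_id` of E
by_lookup = E[word_id]

# One-hot formulation: a (V,) vector times the (V, D) matrix gives a (D,) vector
one_hot = F.one_hot(word_id, num_classes=V).float()
by_matmul = one_hot @ E

lookup_matches_matmul = torch.allclose(by_lookup, by_matmul)
print(lookup_matches_matmul)  # True
```

The lookup avoids materializing the one-hot vector and the full matrix product, which is why embedding layers are implemented as indexing rather than as a dense multiplication.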
The advantages of word embedding include:
* Converting high-dimensional sparse one-hot vectors into low-dimensional dense vectors, significantly reducing computational overhead
* After training, semantically similar words tend to lie close together in the vector space, so word similarity can be measured with cosine similarity
* Embedding vectors are learnable parameters that can be automatically adjusted through backpropagation
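As a rough illustration of the last two points, the sketch below computes the cosine similarity between two embedding vectors and confirms that the embedding matrix is a trainable parameter. It assumes a small untrained `nn.Embedding` layer with arbitrary sizes and IDs, so the similarity value itself is meaningless here; only a trained layer would place related words close together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Untrained toy layer; the sizes are arbitrary illustrative choices
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

ids = torch.tensor([1, 2])       # two arbitrary word IDs
vec_a, vec_b = embedding(ids)    # two vectors of shape (4,)

# Cosine similarity lies in [-1, 1]; a vector compared with itself gives 1
similarity = F.cosine_similarity(vec_a, vec_b, dim=0)
self_similarity = F.cosine_similarity(vec_a, vec_a, dim=0)
print(similarity.item())
print(self_similarity.item())

# The embedding matrix is a learnable parameter updated by backpropagation
is_learnable = embedding.weight.requires_grad
print(is_learnable)  # True
```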
* * *
## 2. nn.Embedding Details
`nn.Embedding` is the word embedding layer provided by PyTorch, encapsulating the creation of the embedding matrix and the lookup operation.
### 2.1 Basic Usage
```python
import torch
import torch.nn as nn

# Create word embedding layer
# num_embeddings: vocabulary size (vocab_size)
# embedding_dim: embedding dimension (embedding_dim)
vocab_size = 10000
embedding_dim = 256
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Check the shape of the embedding matrix
print(embedding.weight.shape)  # torch.Size([10000, 256])

# Input word indices (LongTensor) to get embedding vectors
word_ids = torch.tensor([0, 1, 2, 9999])  # arbitrary word indices
embedded = embedding(word_ids)
print(embedded.shape)  # torch.Size([4, 256])
# Each word ID corresponds to a 256-dimensional vector
```
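In practice the input is usually a batch of sequences rather than a flat list of IDs. The sketch below, using arbitrary toy sizes, shows that `nn.Embedding` keeps the input shape and appends the embedding dimension, and that an index outside `[0, num_embeddings - 1]` is rejected (an `IndexError` when running on CPU, as assumed here).

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)  # toy sizes

# A batch of 2 sequences with 3 token IDs each (values are arbitrary)
batch_ids = torch.tensor([[1, 5, 9],
                          [2, 0, 99]])
batch_embedded = embedding(batch_ids)
print(batch_embedded.shape)  # torch.Size([2, 3, 8])

# Valid indices are 0..99, so index 100 is out of range
try:
    embedding(torch.tensor([100]))
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True
print(out_of_range_raised)
```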
### 2.2 nn.Embedding Parameter Details
```python
import torch.nn as nn

embedding = nn.Embedding(
    num_embeddings=10000,      # Vocabulary size; must be strictly greater than the largest input index
    embedding_dim=256,         # Embedding vector dimension, typically 50, 100, 200, 300, etc.
    padding_idx=None,          # Padding token index; that row is initialized to zeros and receives no gradient updates
    max_norm=None,             # If set, vectors whose norm exceeds this value are renormalized to it during lookup
    norm_type=2.0,             # p of the p-norm used with max_norm (2.0 = L2 norm)
    scale_grad_by_freq=False,  # Scale gradients by the inverse frequency of each word in the mini-batch
    sparse=False,              # Use sparse gradients for the weight (supported only by some optimizers, e.g. SGD, SparseAdam)
    _weight=None,              # Initial weight tensor (internal argument); nn.Embedding.from_pretrained is the public way to load pretrained embeddings
)

# Check the number of parameters
total_params = embedding.num_embeddings * embedding.embedding_dim
print(f"Embedding layer parameters: {total_params:,}")
# 10000 * 256 = 2,560,000
```
> The number of parameters in the embedding layer = vocabulary size × embedding dimension, so the matrix grows quickly with the vocabulary. In many NLP models, the embedding layer accounts for a large share of the model's total parameters.
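To give a feel for that proportion, here is a minimal sketch that counts parameters in a small model. `TinyClassifier` and its layer sizes are hypothetical, chosen only for illustration; the share will differ for other architectures.

```python
import torch.nn as nn

# Hypothetical toy model: embedding -> LSTM -> linear head (illustrative sizes only)
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embedding_dim=256, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

model = TinyClassifier()

# Count trainable values in the embedding layer and in the whole model
embedding_params = sum(p.numel() for p in model.embedding.parameters())
total_params = sum(p.numel() for p in model.parameters())
embedding_share = embedding_params / total_params

print(f"{embedding_params:,} of {total_params:,} parameters ({embedding_share:.1%})")
```

In this toy setup the embedding matrix holds more than 90% of the parameters, because the LSTM and the linear head are small compared with a 10,000-word vocabulary.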
### 2.3 Padding Index padding_idx
When processing variable-length seq