
PyTorch Embedding

Word embedding is one of the most fundamental techniques in natural language processing. It maps discrete word symbols to continuous dense vectors, enabling machines to process text data numerically. PyTorch provides the `nn.Embedding` module to implement this functionality, which is the foundation for building various NLP models.

* * *

## 1. Basic Concepts of Word Embedding

In computers, text is essentially a sequence of integers. Each word is assigned a unique index ID, but this discrete representation has a problem: two words may be semantically close while their IDs are completely unrelated.

Word embedding solves this problem by learning an embedding matrix:

$$
E \in \mathbb{R}^{V \times D}
$$

where $V$ is the vocabulary size and $D$ is the embedding dimension. Each word ID corresponds to a row in the embedding matrix, and its vector representation is obtained through a lookup operation:

$$
\text{embedding} = E[\text{word}_{\text{id}}]
$$

The advantages of word embedding include:

* High-dimensional sparse one-hot vectors are converted into low-dimensional dense vectors, significantly reducing computational overhead
* After training, semantically similar words tend to lie closer together in the vector space, and word similarity can be calculated using cosine similarity
* Embedding vectors are learnable parameters that are adjusted automatically through backpropagation

* * *

## 2. nn.Embedding Details

`nn.Embedding` is the word embedding layer provided by PyTorch, encapsulating the creation of the embedding matrix and the lookup operation.
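To make the lookup operation concrete, the following is a minimal sketch rather than a definitive implementation. It assumes a toy vocabulary of 6 words and 4 dimensions (sizes picked only for illustration) and checks that calling the layer, indexing rows of `embedding.weight`, and multiplying one-hot vectors by the matrix all give the same vectors. It also computes a cosine similarity between two of the looked-up vectors; since the weights here are untrained, that value carries no semantic meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the random initial weights are reproducible

# Toy sizes chosen for this illustration only
vocab_size, embedding_dim = 6, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

word_ids = torch.tensor([1, 3, 5])

# 1) Lookup through the layer
looked_up = embedding(word_ids)

# 2) Direct row indexing of the embedding matrix E
indexed = embedding.weight[word_ids]

# 3) One-hot vectors multiplied by E (what the lookup replaces)
one_hot = F.one_hot(word_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

same_as_indexing = torch.equal(looked_up, indexed)
same_as_matmul = torch.allclose(looked_up, via_matmul)
print(same_as_indexing, same_as_matmul)

# Cosine similarity between the vectors of word 1 and word 3
cos = F.cosine_similarity(looked_up[0], looked_up[1], dim=0)
print(looked_up.shape, cos.item())
```

The one-hot product needs a `V`-sized vector per word, while the lookup only reads one row, which is why the layer is implemented as an index operation.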
### 2.1 Basic Usage

Example:

```python
import torch
import torch.nn as nn

# Create word embedding layer
# num_embeddings: vocabulary size (vocab_size)
# embedding_dim: embedding dimension (embedding_dim)
vocab_size = 10000
embedding_dim = 256
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Check the shape of the embedding matrix
print(embedding.weight.shape)  # torch.Size([10000, 256])

# Input word indices (LongTensor) to get embedding vectors
word_ids = torch.tensor([0, 1, 2, 9999])  # arbitrary word indices in [0, vocab_size - 1]
embedded = embedding(word_ids)
print(embedded.shape)  # torch.Size([4, 256])
# Each word ID corresponds to a 256-dimensional vector
```

### 2.2 nn.Embedding Parameter Details

Example:

```python
import torch.nn as nn

embedding = nn.Embedding(
    num_embeddings=10000,      # Vocabulary size; must be strictly greater than the maximum input index
    embedding_dim=256,         # Embedding vector dimension, typically 50, 100, 200, 300, etc.
    padding_idx=None,          # Padding token index; its vector is initialised to zeros and receives no gradient updates
    max_norm=None,             # If set, vectors with a larger norm are renormalised to this norm
    norm_type=2.0,             # Which p-norm max_norm uses, typically the L2 norm
    scale_grad_by_freq=False,  # Scale gradients by the inverse frequency of the words in the mini-batch
    sparse=False,              # Whether to use sparse gradients for the weight (only some optimizers support them)
    _weight=None,              # Private argument for initial weights; nn.Embedding.from_pretrained is the public way to load pretrained embeddings
)

# Check the number of parameters
total_params = embedding.num_embeddings * embedding.embedding_dim
print(f"Embedding layer parameters: {total_params:,}")  # 10000 * 256 = 2,560,000
```

> The number of parameters in the embedding layer = vocabulary size × embedding dimension, which can be a very large matrix. The embedding layer of an NLP model often accounts for a large proportion of the model's total parameters.

### 2.3 Padding Index padding_idx

When processing variable-length seq