PyTorch Transformer Model

The Transformer is a deep learning architecture built on the self-attention mechanism. It reshaped the field of Natural Language Processing (NLP) and is the foundation of modern models such as BERT and GPT.

Because it models long-range dependencies well and processes all positions of a sequence in parallel, the Transformer has surpassed traditional Long Short-Term Memory (LSTM) networks on tasks such as language translation and text summarization.

## Building a Transformer Model with PyTorch

**The steps to build a Transformer model are as follows:**

### Import Necessary Libraries and Modules

Import the PyTorch core library, the neural network module, the optimizer module, the data utilities, and the standard `math` and `copy` modules. Together they support defining the model architecture, managing data, and running the training process.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
```

Explanation:

* `torch`: The core library of PyTorch, used for tensor operations and automatic differentiation.

* `torch.nn`: PyTorch's neural network module, containing layers and loss functions.

* `torch.optim`: Optimization algorithms such as Adam and SGD.

* `torch.utils.data`: Data loading utilities such as `Dataset` and `DataLoader`.

* `math`: Standard math functions, used here for calculations like square roots.

* `copy`: Used for deep copying objects.

### Define Basic Building Blocks: Multi-Head Attention, Position-wise Feed-Forward Network, Positional Encoding

**Multi-Head Attention** computes the relationship between every pair of positions in a sequence through multiple "attention heads", enabling it to capture
different features and patterns of the input sequence.

The `MultiHeadAttention` class encapsulates the multi-head attention mechanism used throughout the Transformer. It splits the input into multiple attention heads, applies attention within each head, and then combines the results. This lets the model capture several kinds of relationships in the input at once, improving its expressive power.

**Example**

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model            # Model dimension (e.g., 512)
        self.num_heads = num_heads        # Number of attention heads (e.g., 8)
        self.d_k = d_model // num_heads   # Dimension per head (e.g., 64)

        # Linear projection layers (nn.Linear includes a bias term by default)
        self.W_q = nn.Linear(d_model, d_model)  # Query transformation
        self.W_k = nn.Linear(d_model, d_model)  # Key transformation
        self.W_v = nn.Linear(d_model, d_model)  # Value transformation
        self.W_o = nn.Linear(d_model, d_model)  # Output transformation

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Calculate scaled dot-product attention.
        Input shapes:
            Q: (batch_size, num_heads, seq_length, d_k)
            K, V: same as Q
        Output shape: (batch_size, num_heads, seq_length, d_k)
        """
        # Attention scores: dot product of Q and K, scaled by sqrt(d_k)
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask (e.g., padding mask or future-information mask)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        # Attention weights (softmax over the key dimension)
        attn_probs = torch.softmax(attn_scores, dim=-1)

        # Weighted sum of the value vectors
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        """
        Split the input tensor into multiple heads.
        Input shape: (batch_size, seq_length, d_model)
        Output shape: (batch_size, num_heads, seq_length, d_k)
        """
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        """
        Combine the outputs of all heads back into the original shape.
        Input shape: (batch_size, num_heads, seq_length, d_k)
        Output shape: (batch_size, seq_length, d_model)
        """
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        """
        Forward propagation.
        Input shapes: Q/K/V: (batch_size, seq_length, d_model)
        Output shape: (batch_size, seq_length, d_model)
        """
        # Linear projections, then split into heads
        Q = self.split_heads(self.W_q(Q))  # (batch, heads, seq_len, d_k)
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Calculate attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

        # Combine heads and apply the output projection
        output = self.W_o(self.combine_heads(attn_output))
        return output
```

Explanation:

* **Multi-Head Attention Mechanism**: Splits the input into multiple heads, each calculating attention independently, and finally combines the results.

* **Scaled Dot-Product Attention**: Calculates the dot product of queries and keys, scales it, applies softmax to obtain attention weights, and finally computes a weighted sum of the values.

* **Mask**: Used to mask invalid positions (e.g., padding).

### Position-wise Feed-Forward Network

**Example**

```python
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # First fully connected layer
        self.fc2 = nn.Linear(d_ff, d_model)   # Second fully connected layer
        self.relu = nn.ReLU()                 # Activation function

    def forward(self, x):
        # Feed-forward network calculation
        return self.fc2(self.relu(self.fc1(x)))
```

**Feed-Forward Network:** Consists of two fully connected layers with a ReLU activation between them. It further processes the output of the attention mechanism.

### Positional Encoding

Positional encoding injects positional information for each token in the input sequence.

It uses sine and cosine functions of different frequencies to generate the encoding.

**Example**

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)  # Initialize positional encoding matrix
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimension indices use sine
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimension indices use cosine

        # Register as a buffer: stored with the module, but not a trainable parameter
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add positional encoding to the input
        return x + self.pe[:, :x.size(1)]
```

### Build Encoder Block (Encoder Layer)

![Figure 2: The encoder part of the Transformer network (image from the original paper)](https://static.jyshare.com/images/mix/Figure_2_The_Encoder_part_of_the_transformer_network_Source_image_from_the_original_paper_b0e3ac40fa.avif)

**Encoder Layer:** Contains a self-attention mechanism and a feed-forward network, each followed by a residual connection and layer normalization.

**Example**

```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)      # Self-attention mechanism
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)   # Feed-forward network
        self.norm1 = nn.LayerNorm(d_model)                           # Layer normalization
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)                           # Dropout

    def forward(self, x, mask):
        # Self-attention mechanism
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # Residual connection and layer normalization

        # Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))    # Residual connection and layer normalization
        return x
```

### Build Decoder Module

![Figure 3: The decoder part of the Transformer network (image from the original paper)](https://static.jyshare.com/images/mix/Figure_3_The_Decoder_part_of_the_Transformer_network_Souce_Image_from_the_original_paper_b90d9e7f66.avif)

**Decoder Layer:** Contains a self-attention mechanism, a cross-attention mechanism, and a feed-forward network, each followed by a residual connection and layer normalization.

**Example**

```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)      # Self-attention mechanism
        self.cross_attn = MultiHeadAttention(d_model, num_heads)     # Cross-attention mechanism
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)   # Feed-forward network
        self.norm1 = nn.LayerNorm(d_model)                           # Layer normalization
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)                           # Dropout

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Self-attention mechanism
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))  # Residual connection and layer normalization

        # Cross-attention mechanism: queries from the decoder, keys and values from the encoder
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))  # Residual connection and layer normalization

        # Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))    # Residual connection and layer normalization
        return x
```

### Build the Complete Transformer Model

![Figure 4: The Transformer network (image from the original paper)](https://static.jyshare.com/images/mix/Figure_4_The_Transformer_Network_Source_Image_from_the_original_paper_120e177956.avif)

**Example**

```python
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super().__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)           # Encoder word embedding
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)           # Decoder word embedding
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)   # Positional encoding

        # Encoder and decoder layers
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Final fully connected layer
        self.dropout = nn.Dropout(dropout)            # Dropout

    def generate_mask(self, src, tgt):
        # Source mask: masks padding tokens (assumes padding token index is 0)
        # Shape: (batch_size, 1, 1, seq_length)
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)

        # Target mask: masks padding tokens and future information
        # Shape: (batch_size, 1, seq_length, 1)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)

        # Lower-triangular mask that prevents the decoder from seeing future positions
        # (created on the same device as tgt so the & below also works on GPU)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask  # Combine padding mask and future-information mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        # Generate masks
        src_mask, tgt_mask = self.generate_mask(src, tgt)

        # Encoder part
        src_embedded = self.
```
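The masking logic in `generate_mask` can be easier to follow on a single short sequence. The following is a minimal sketch in plain Python (deliberately without PyTorch, so it runs anywhere) of the target mask for one sequence. It assumes, as the tutorial does, that token ID 0 is padding. The function name `build_target_mask`, the `pad_id` parameter, and the sample token IDs are illustrative and not part of the tutorial's code.

```python
def build_target_mask(tgt_tokens, pad_id=0):
    """Build the decoder self-attention mask for a single sequence.

    mask[i][j] is True when query position i may attend to key position j.
    """
    seq_len = len(tgt_tokens)
    mask = []
    for i in range(seq_len):
        # Padding part: applied per query row, like (tgt != 0).unsqueeze(3)
        query_is_real = tgt_tokens[i] != pad_id
        # No-peek part: position i may only look at positions j <= i
        row = [query_is_real and j <= i for j in range(seq_len)]
        mask.append(row)
    return mask


tokens = [5, 7, 9, 0]  # made-up token IDs; the trailing 0 is padding
for row in build_target_mask(tokens):
    print([int(allowed) for allowed in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 0, 0, 0]
```

Each row is one query position. The first three rows form the lower-triangular "no future information" pattern, and the last row is all zeros because that query position is padding. In `scaled_dot_product_attention`, every position marked 0 has its score replaced with `-1e9` before the softmax.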
