Bert Encoder

BERT Series Models

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model introduced by Google in 2018. Its recipe of pretraining a bidirectional Transformer encoder and then fine-tuning it on downstream tasks reshaped both research and applications in NLP.

This article systematically introduces the core principles, training methods, fine-tuning techniques, and mainstream variant models of BERT.


BERT Architecture and Training

The figure below illustrates the core architecture of BERT and the Masked Language Modeling (MLM) task used during pretraining.

Image 1


1. Input Layer (Embedding)

  • Input Sequence: Text composed of tokens (or subwords), e.g., [W₁, W₂, W₃, [MASK], W₅, W₆, W₇, W₂, W₃, W₄, W₅].
    • [MASK] is a placeholder for a token that BERT randomly masks during pretraining (e.g., W₄ in the original text is replaced by [MASK]).
  • Embedding Layer: Converts each token into a fixed-dimensional vector (e.g., 768 dimensions) by summing three embeddings:
    • Token Embeddings: Semantic information of the vocabulary.
    • Position Embeddings: Positional information of tokens in the sequence.
    • Segment Embeddings: Distinguish between sentences (useful for sentence-pair tasks; not explicitly shown in the figure).
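
The embedding sum described above can be sketched in a few lines of plain Python. This is a toy illustration rather than BERT's implementation: the 4-dimensional vectors, the five-entry vocabulary, the random stand-in tables, and the helper names `random_table` and `embed` are all assumptions chosen for readability.

```python
import random

random.seed(0)
DIM = 4  # toy size; the text above mentions 768 dimensions

def random_table(rows, dim=DIM):
    """Stand-in for a learned embedding table: one vector per row."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# Hypothetical five-entry vocabulary
vocab = {"[CLS]": 0, "[SEP]": 1, "[MASK]": 2, "hello": 3, "world": 4}
token_table = random_table(len(vocab))   # one vector per vocabulary entry
position_table = random_table(8)         # one vector per position (max length 8 here)
segment_table = random_table(2)          # sentence A = 0, sentence B = 1

def embed(tokens, segment_ids):
    """Return one vector per token: token + position + segment embedding."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, p, s = token_table[vocab[tok]], position_table[pos], segment_table[seg]
        vectors.append([t[i] + p[i] + s[i] for i in range(DIM)])
    return vectors

tokens = ["[CLS]", "hello", "[MASK]", "[SEP]"]
embedded = embed(tokens, segment_ids=[0, 0, 0, 0])
print(len(embedded), len(embedded[0]))
```

Because the position vector is added to every token, the same word at two different positions ends up with two different input vectors.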

2. Transformer Encoder

  • Multiple Transformer Blocks: Details are not expanded in the figure, but each block contains:
    • Self-Attention Mechanism: Captures bidirectional contextual dependencies (the core feature of BERT).
    • Feed-Forward Network: Applies a nonlinear transformation.
    • Residual Connections & Layer Normalization: Stabilize the training process.
  • Output: Context-dependent vector representations corresponding to each input token (e.g., O₁, O₂, ..., O₅).

3. Masked Language Modeling (MLM) Task

  • Objective: Predict the original token at each masked position (e.g., W₄ in the figure).
  • Classification Layer:
    • Fully-Connected Layer: Maps the Transformer output vector (e.g., O₄) to the vocabulary size dimension.
    • Activation Function GELU: Gaussian Error Linear Unit (the nonlinear function used by BERT).
    • Layer Normalization (Norm): Normalizes the output.
    • Softmax: Computes a probability for each word in the vocabulary; the word with the highest probability is selected as the prediction (e.g., W'₁, W'₂, ..., W'₅ are candidate words).
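
The final softmax step can be illustrated with a small, self-contained sketch. The five-word vocabulary and the scores are made up for this example; in a real model the scores come from the classification layer applied to the output vector at the masked position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "sat", "mat", "the"]    # hypothetical toy vocabulary
logits_at_mask = [1.2, 0.3, 4.1, 0.5, -0.7]    # made-up scores for the [MASK] position

probs = softmax(logits_at_mask)
predicted = vocab[probs.index(max(probs))]
print(predicted)  # prints "sat", the word with the highest score
```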

Transformer Encoder Structure

BERT is built on the encoder part of the Transformer: a stack of identical layers, each combining a self-attention mechanism with a feed-forward network.

Example

    # Simplified Transformer encoder layer
    import torch.nn as nn
    import torch.nn.functional as F

    class TransformerEncoderLayer(nn.Module):
        def __init__(self, d_model, nhead, dim_feedforward=2048):
            super().__init__()
            # Expects input of shape (seq_len, batch, d_model)
            self.self_attn = nn.MultiheadAttention(d_model, nhead)
            self.linear1 = nn.Linear(d_model, dim_feedforward)
            self.linear2 = nn.Linear(dim_feedforward, d_model)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, src):
            # Self-attention; nn.MultiheadAttention returns (output, attention weights)
            src2, _ = self.self_attn(src, src, src)
            # Residual connection followed by layer normalization
            src = self.norm1(src + src2)
            # Feed-forward network (GELU is the activation BERT uses)
            src2 = self.linear2(F.gelu(self.linear1(src)))
            src = self.norm2(src + src2)
            return src

Key Innovation: Bidirectional Context Modeling

Unlike traditional language models, which read text in a single direction, BERT conditions on both the left and the right context of every token. It is pretrained with two tasks:

  1. Masked Language Model (MLM): Randomly masks 15% of the input tokens and trains the model to predict the original tokens from the context on both sides.
  2. Next Sentence Prediction (NSP): Determines whether the second of two input sentences actually follows the first in the source text.
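
The MLM corruption step can be sketched as follows. The 15% selection rate comes from the list above; the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follows the original BERT paper and is not stated in this article. The function name `mask_tokens` and the toy vocabulary are hypothetical, and the sketch ignores details such as skipping special tokens.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
selected = sum(label is not None for label in labels)
print(f"{selected} of {len(tokens)} tokens selected for prediction")
```

Only the selected positions contribute to the loss, which is why the labels list is `None` everywhere else.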

Training Parameters and Configuration

| Parameter | BERT-base | BERT-large |
|---|---|---|
| Layers | 12 | 24 |
| Hidden Size | 768 | 1024 |
| Attention Heads | 12 | 16 |
| Total Parameters | 110M | 340M |
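
As a rough check on the totals in the table, the parameter count can be derived from the layer count and hidden size. The sketch below assumes values that the table does not list: a 30,522-entry vocabulary, 512 positions, 2 segment types, a feed-forward width of four times the hidden size, and a pooler layer on top. The function name `bert_encoder_params` is hypothetical.

```python
def bert_encoder_params(hidden, layers, vocab=30522, max_pos=512, segments=2):
    """Count weights and biases of a BERT-style encoder (embeddings, layers, pooler)."""
    ffn = 4 * hidden                      # assumed feed-forward width
    layer_norm = 2 * hidden               # scale and bias
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_encoder_params(hidden=768, layers=12)
large = bert_encoder_params(hidden=1024, layers=24)
print(f"BERT-base:  {base / 1e6:.1f}M")   # prints "BERT-base:  109.5M"
print(f"BERT-large: {large / 1e6:.1f}M")  # prints "BERT-large: 335.1M"
```

Under these assumptions the counts come out close to the rounded 110M and 340M figures. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding parameters.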

BERT Fine-Tuning Methods


Standard Fine-Tuning Pipeline

  1. Task-Specific Layer Addition: Add a classification or regression layer on top of BERT according to the downstream task.
  2. Learning Rate: Typically set to a small value (2e-5 to 5e-5).
  3. Batch Size: 16 and 32 are common choices.
  4. Training Epochs: 2–4 epochs are usually sufficient.
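
To get a feel for how small a typical fine-tuning budget is, the batch sizes and epoch counts above can be turned into a number of optimizer updates. The dataset size of 10,000 examples is a hypothetical value, and the sketch assumes one update per batch (no gradient accumulation).

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer updates when every epoch sees the full dataset once."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset of 10,000 labeled examples
for batch_size in (16, 32):
    for epochs in (2, 3, 4):
        steps = total_training_steps(10_000, batch_size, epochs)
        print(f"batch_size={batch_size} epochs={epochs} steps={steps}")
```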

Efficient Fine-Tuning Techniques


Example

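    # Fine-tuning example using HuggingFace Transformers
    from transformers import BertForSequenceClassification, Trainer, TrainingArguments

    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

    training_args = TrainingArguments(
        output_dir='./results',
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    )

    # train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()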

Comparison of Common Fine-Tuning Strategies

| Method | Advantages | Disadvantages |
|---|---|---|
| Full-parameter fine-tuning | Best performance | High computational cost |
| Feature extraction (freeze BERT) | Computationally efficient | Suboptimal performance |
| Adapter | Parameter-efficient | Requires architecture modification |
| Prompt learning | Strong few-shot performance | Requires prompt template design |

Mainstream BERT Variant Models


RoBERTa (Robustly Optimized BERT)

  • Improvements:
    • Larger batch size (8k vs. 256)
    • Longer training duration
    • Removal of the NSP task
    • Dynamic masking strategy
  • Performance: Average improvement of 2–3% on the GLUE benchmark.
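
The difference between static and dynamic masking can be shown with a small sketch. It only tracks which positions are masked; the helper name `sample_mask_positions`, the sequence length, and the epoch count are assumptions made for illustration.

```python
import random

def sample_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick the positions to mask for one pass over a sequence."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]

num_tokens, epochs = 40, 3

# Static masking: positions are chosen once during preprocessing and reused.
static_positions = sample_mask_positions(num_tokens, random.Random(0))
static_per_epoch = [static_positions for _ in range(epochs)]

# Dynamic masking: positions are re-sampled every time the sequence is served.
rng = random.Random(0)
dynamic_per_epoch = [sample_mask_positions(num_tokens, rng) for _ in range(epochs)]

for epoch in range(epochs):
    print(epoch, static_per_epoch[epoch], dynamic_per_epoch[epoch])
```

With dynamic masking the model sees different prediction targets for the same sentence across epochs, which matters more as training runs get longer.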

ALBERT (A Lite BERT)

  • Core Innovations:
    • Cross-layer parameter sharing (by default, both attention and feed-forward parameters are shared across layers)
    • Embedding factorization (decomposing the token embedding matrix into two smaller matrices)
  • Effect: 89% reduction in parameter count and 1.7× speedup.
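
The saving from embedding factorization follows from simple arithmetic: a single V × H matrix is replaced by a V × E matrix followed by an E × H projection, with E much smaller than H. The sizes below are illustrative assumptions, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameters in the token embedding, optionally factorized as V x E plus E x H."""
    if embedding_size is None:
        return vocab_size * hidden_size                      # single V x H matrix
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128   # illustrative sizes
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full, factorized)                          # prints "23040000 3938304"
print(f"{1 - factorized / full:.0%} fewer")      # prints "83% fewer"
```

This covers only the token embedding. The overall reduction quoted above also depends on sharing parameters across layers.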

Other Important Variants

  1. DistilBERT: Model compression via knowledge distillation.
  2. ELECTRA: Replaces MLM with replaced-token detection, trained with a generator-discriminator architecture.
  3. SpanBERT: Masks and predicts contiguous text spans rather than individual tokens.
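
Knowledge distillation, the idea behind DistilBERT, trains a small student model to match the output distribution of a large teacher. The sketch below shows a generic soft-target loss with a temperature. It is not DistilBERT's exact training objective, which combines more than one loss term, and the logits and the temperature of 2.0 are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer distribution)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [3.0, 1.0, 0.2]
close_student = [2.8, 1.1, 0.1]   # nearly agrees with the teacher
far_student = [0.1, 2.5, 1.0]     # prefers a different class

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The student that agrees with the teacher gets the lower loss, so minimizing this quantity pulls the student's predictions toward the teacher's.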

Chinese BERT Models


Overview of Chinese Pretrained Models

| Model | Organization | Features |
|---|---|---|
| BERT-wwm | HIT (Harbin Institute of Technology) | Whole Word Masking (wwm) |
| RoBERTa-wwm-ext | HIT | Extended training data |
| ERNIE (Baidu) | Baidu | Knowledge graph integration |
| NEZHA | Huawei | Relative positional encoding |

Chinese BERT Usage Example


Example

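    from transformers import BertTokenizer, BertModel

    # Load the Chinese BERT tokenizer and encoder
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = BertModel.from_pretrained('bert-base-chinese')

    # Tokenize one sentence and return PyTorch tensors
    inputs = tokenizer("Natural language processing is very interesting.", return_tensors="pt")
    outputs = model(**inputs)

    # Contextual vectors, shape (batch_size, sequence_length, 768)
    last_hidden_state = outputs.last_hidden_state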

Recommendations for Fine-Tuning Chinese Tasks

  1. Use the whole-word masking (wwm) version for better performance.
  2. Pay attention to Chinese word segmentation boundaries, since the model's tokens do not necessarily align with words.
  3. For specialized domains, consider domain-adaptive pretraining.
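
Whole-word masking, mentioned in the first recommendation, can be sketched as follows. The sketch assumes character-level tokens and a sentence that has already been segmented into words; the segmentation shown, the function name `whole_word_mask`, and the masking rate of 0.5 are illustrative assumptions.

```python
import random

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask every character of a selected word; `words` is a pre-segmented sentence."""
    rng = rng or random.Random()
    output = []
    for word in words:
        if rng.random() < mask_prob:
            output.extend(["[MASK]"] * len(word))   # all characters of the word together
        else:
            output.extend(list(word))               # one token per character
    return output

# Hypothetical segmentation of a sentence into words
words = ["自然", "语言", "处理", "很", "有趣"]
print(whole_word_mask(words, mask_prob=0.5, rng=random.Random(1)))
```

Masking whole words prevents the model from guessing a masked character from the other characters of the same word, which makes the prediction task harder and more semantic.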

Practical Suggestions and Resources


Learning Roadmap


Image 2


Recommended Resources

  1. Papers:
    • Original BERT paper (arXiv:1810.04805)
    • Papers for variants such as RoBERTa and ALBERT
  2. Codebases:
    • HuggingFace Transformers
    • GitHub implementations of Chinese BERT
  3. Online Courses:
    • Coursera Natural Language Processing Specialization
    • Hung-yi Lee's Deep Learning Course

Through systematic learning and practice, BERT series models can become powerful tools for solving NLP problems. It is recommended to start with the base version and gradually explore more advanced variants and optimization techniques.
