# AI NLP Advanced
We interact with NLP every day. The autocomplete suggestions on your phone's keyboard, the spam automatically filtered by your email app, and foreign-language websites rendered by translation software are all applications of Natural Language Processing (NLP).
Before ChatGPT appeared, NLP had been developing for decades, but the technical threshold was very high. You needed to understand a long list of concepts such as word segmentation, part-of-speech tagging, syntactic analysis, and semantic role labeling, and also manually design features to make the machine "understand" a little bit of language.
Today, large language models have made all of this simple. However, understanding the technical lineage of NLP can help you more thoroughly understand why large models can do these things and where their limitations lie.
This module will take you from word vectors all the way to today's large language models, building a complete NLP knowledge system.
> Learning Path: Word Vectors → Pre-trained Models → Three Paradigms: BERT/GPT/T5 → Downstream Tasks (Classification, NER, Translation, Summarization). Each step has corresponding code examples to help you put the theory into practice.
* * *
## The Evolution of Pre-trained Language Models
The development of NLP can be clearly divided into several stages, each with landmark technological breakthroughs.
### Word2Vec: The Revolution of Word Vectors
Before 2013, computers processed text in a very primitive way.
Typically, "One-Hot Encoding" was used: each word corresponds to a very long vector, with only one position being 1 and all others being 0. For example, if the vocabulary has 100,000 words, each word is a 100,000-dimensional vector.
The problem with this approach is obvious: the vector contains no semantic information. The distance between "cat" and "dog" is the same as the distance between "cat" and "table."
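To make the problem concrete, here is a minimal sketch that builds one-hot vectors for a tiny, hypothetical three-word vocabulary (the names `vocab` and `one_hot` are illustrative, not from any library) and measures the Euclidean distance between them.

```python
import math

# Hypothetical toy vocabulary; a real one might hold 100,000 words.
vocab = ["cat", "dog", "table"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


# Any two different one-hot vectors differ in exactly two positions,
# so every pair of distinct words ends up equally far apart.
dist_cat_dog = math.dist(one_hot("cat"), one_hot("dog"))
dist_cat_table = math.dist(one_hot("cat"), one_hot("table"))

print(f"cat-dog:   {dist_cat_dog:.4f}")
print(f"cat-table: {dist_cat_table:.4f}")
```

Both distances come out identical, which is exactly why one-hot vectors cannot express that "cat" is closer in meaning to "dog" than to "table."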
The core idea of Word2Vec is: the meaning of a word is defined by the words around it.
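One way to see what "the words around it" means in practice is to list the (center word, context word) pairs that a sliding window produces. The sketch below assumes the skip-gram style of pairing, one of Word2Vec's two training variants. The helper name `context_pairs` and the sample sentence are made up for illustration.

```python
def context_pairs(tokens, window=2):
    """Collect (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "cat", "eats", "fish"]
pairs = context_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that keep appearing with the same context words receive similar training signals, which is what pulls their vectors together.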
## Example
```python
# ============================================
# Word2Vec Basic Concept Demonstration
# Use the gensim library to train a simple word vector model
# ============================================
# First install gensim: pip install gensim
from gensim.models import Word2Vec

# Prepare training data: some simple, pre-tokenized sentences
sentences = [
    ["I", "like", "eat", "apple"],
    ["I", "like", "eat", "banana"],
    ["cat", "like", "eat", "fish"],
    ["dog", "like", "eat", "meat"],
    ["apple", "is", "a type of", "fruit"],
    ["banana", "is", "a type of", "fruit"],
    ["cat", "is", "a type of", "animal"],
    ["dog", "is", "a type of", "animal"],
    ["TUTORIAL", "is", "a", "programming", "website"],
    ["learn", "programming", "go", "TUTORIAL"],
]

# Train the Word2Vec model
# vector_size: dimension of the word vectors
# window: context window size (how many words to look at before and after)
# min_count: ignore words that appear fewer times than this value
# workers: number of threads for parallel training
model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # Each word is represented by a 50-dimensional vector
    window=3,        # Look at 3 words before and after
    min_count=1,     # Keep all words
    workers=4,
    epochs=100,      # Train for 100 epochs
)

# Get the vector for a single word
apple_vector = model.wv["apple"]
print(f"'apple' word vector (first 10 dimensions): {apple_vector[:10]}")
print(f"Word vector dimensionality: {len(apple_vector)}")

# Calculate similarity between words
similarity = model.wv.similarity("apple", "banana")
print(f"Similarity between 'apple' and 'banana': {similarity:.4f}")
similarity = model.wv.similarity("apple", "cat")
print(f"Similarity between 'apple' and 'cat': {similarity:.4f}")

# Find the most similar words
print("\nWords most similar to 'cat':")
for word, score in model.wv.most_similar("cat", topn=3):
    print(f"  {word}: {score:.4f}")

print("\nWords most similar to 'TUTORIAL':")
for word, score in model.wv.most_similar("TUTORIAL", topn=3):
    print(f"  {word}: {score:.4f}")

# Classic word vector arithmetic: king - man + woman ≈ queen
# Try it on our small corpus: fruit - apple + fish ≈ ?
if "apple" in model.wv and "fish" in model.wv and "fruit" in model.wv:
    result = model.wv.most_similar(
        positive=["fruit", "fish"], negative=["apple"], topn=3
    )
    print("\n'fruit' - 'apple' + 'fish' ≈")
    for word, score in result:
        print(f"  {word}: {score:.4f}")
```
Word2Vec proved one thing: semantics can be represented in vector space.
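Similarity in vector space is usually measured by cosine similarity, which is also what gensim's `similarity` method reports. The sketch below computes it by hand. The three-dimensional vectors are invented for illustration and do not come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: the two fruits point in a similar direction, the table does not.
apple = [0.9, 0.8, 0.1]
banana = [0.8, 0.9, 0.2]
table = [0.1, 0.2, 0.9]

sim_fruit = cosine_similarity(apple, banana)
sim_mixed = cosine_similarity(apple, table)
print(f"apple vs banana: {sim_fruit:.4f}")
print(f"apple vs table:  {sim_mixed:.4f}")
```

Because the measure depends only on direction, not on vector length, words with related meanings score close to 1 while unrelated words score much lower.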
But it has a limitation: each word has only one fixed vector, regardless of context. For example, a single Chinese verb whose literal meaning is "hit" translates as "make" in "make a phone call" and as "play" in "play games", but Word2Vec gives it the same vector in both.
### ELMo: Context-dependent Word Vectors
ELMo (Embeddings from Language Models), which appeared in 2018, solved this problem.
ELMo's approach is: instead of pre-assigning a fixed vector to each word, it looks at the entire sentence and then generates a vector for that word.
The same word (the Chinese verb rendered as either "make" or "play") gets one vector in "I make a phone call" and a different vector in "I play games".
ELMo uses a bidirectional LSTM (Long Short-Term Memory) network to model context. It was one of the first widely adopted examples of pre-training a language model on large amounts of text and reusing it in downstream tasks.
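The difference between a static lookup and a context-dependent vector can be illustrated with a deliberately crude toy. The sketch below is not ELMo: it replaces the bidirectional LSTM with plain averaging over the sentence, and the vectors, the word "bank", and the helper `contextual_vector` are all made up for illustration.

```python
# Made-up static vectors: one fixed entry per word, as in Word2Vec.
static_vectors = {
    "bank": [1.0, 1.0],
    "river": [0.0, 2.0],
    "money": [2.0, 0.0],
}


def contextual_vector(tokens, index):
    """Blend the target word's static vector with the mean of its sentence.

    Averaging is only a stand-in for a real context encoder such as a biLSTM.
    """
    target = static_vectors[tokens[index]]
    dims = len(target)
    sentence_mean = [
        sum(static_vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)
    ]
    return [(target[d] + sentence_mean[d]) / 2 for d in range(dims)]


river_bank = contextual_vector(["river", "bank"], index=1)
money_bank = contextual_vector(["money", "bank"], index=1)

print("static 'bank':        ", static_vectors["bank"])
print("'bank' in river bank: ", river_bank)
print("'bank' in money bank: ", money_bank)
```

The static table always returns the same entry for "bank", while the context-aware version shifts it toward the surrounding words. A real contextual model learns that shift instead of averaging.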
### GPT-1: Unidirectional Pre-training
Also in 2018, OpenAI released GPT-1 (Generative Pre-trained Transformer).
Its characteristics are:
1. Uses Transformer decoder instead of LSTM
2. Unidirectional: only looks at previous words to predict the next word
3. Generative: can continue writing text
GPT-1 proved the huge potential of Transformer in NLP tasks.
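The "unidirectional" property in the list above is typically enforced with a causal attention mask: position i may look at positions 0 through i and nothing later. Here is a minimal sketch of such a mask in plain Python. The function name `causal_mask` and the sample tokens are illustrative assumptions, not part of any GPT codebase.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["I", "like", "to", "eat"]
mask = causal_mask(len(tokens))

# Show which tokens each position can see when predicting the next word.
for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can see {visible}")
```

The mask is lower-triangular: the first token sees only itself, and the last token sees the whole prefix, which is what lets the model be trained to predict the next word without peeking ahead.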
### BERT: Bidirectional Pre-training
At the end of 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), which completely changed the NLP field.
BERT's core innovations are:
1. Bidirectional: looks at both preceding and following context
2. MLM (Masked Language Model): randomly masks some words and lets the model predict them
3. NSP (Next Sentence Prediction): judges whether two sentences are consecutive
BERT achieved the best results at the time on 11 NLP tasks, marking NLP's entry into the "pre-trained model era."
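The NSP task from the list above needs labeled sentence pairs. A common description of the setup is that half of the pairs use the true next sentence and the other half use a randomly chosen one. The sketch below follows that description. The helper `make_nsp_example` and the four-sentence document are hypothetical and only meant to show the shape of the data.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one (sentence_a, sentence_b, is_next) example starting at `index`."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: B is any sentence other than A or A's true successor.
    candidates = [s for k, s in enumerate(sentences) if k not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False


document = [
    "I like to eat apples.",
    "Apples are a type of fruit.",
    "Cats like to eat fish.",
    "Fish live in water.",
]

rng = random.Random(0)  # fixed seed so repeated runs give the same pairs
examples = [make_nsp_example(document, i, rng) for i in range(len(document) - 1)]
for a, b, is_next in examples:
    print(f"A: {a} | B: {b} | is_next={is_next}")
```

The model then receives both sentences together and is trained to output the `is_next` label.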
### The Leap from GPT-3 to ChatGPT
In 2020, GPT-3 was released with 175 billion parameters.
People discovered that when the model is large enough and the data abundant enough, "emergent" behavior appears: the model suddenly gains capabilities that small models lack, such as few-shot learning and complex reasoning.
At the end of 2022, ChatGPT was released. Through RLHF (Reinforcement Learning from Human Feedback), the model's outputs better align with human preferences, and AI truly went mainstream.
Let's summarize this evolution history with a table:
| Year | Model | Core Idea | Historical Significance |
| --- | --- | --- | --- |
| 2013 | Word2Vec | Define a word's meaning by the words around it | Word vector revolution, vectorized representation of semantics |
| 2018 | ELMo | Context-dependent word vectors | First to implement vector representations that handle polysemy |
| 2018 | GPT-1 | Transformer decoder, unidirectional pre-training | Proved Transformer's potential |
| 2018 | BERT | Transformer encoder, bidirectional pre-training | NLP entered the pre-trained model era |
| 2020 | GPT-3 | 175 billion parameters, emergent abilities | Demonstrated the unlimited possibilities of large models |
| 2022 | ChatGPT | RLHF + dialogue capability | AI truly went mainstream |
* * *
## Deep Dive into BERT
BERT is a milestone in the history of NLP development and deserves in-depth understanding.
### MLM (Masked Language Model) Task
BERT's core pre-training task is MLM: randomly replace 15% of the words in a sentence with a `[MASK]` token, and let the model predict what the original word was.
For example, the sentence "I like to eat apples" might become "I [MASK] to eat apples", and the model needs to predict that `[MASK]` is "like".
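The masking step can be sketched in a few lines. This is a simplified version under stated assumptions: every selected token is replaced with `[MASK]`, whereas BERT's actual recipe replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged. The function name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original word at masked positions."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)   # this position is not scored
    return masked, labels


tokens = "I like to eat apples".split()
# A higher probability than 15% is used here so a short sentence shows some masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(42))
print("input: ", masked)
print("labels:", labels)
```

Only the masked positions contribute to the training loss, so the model is rewarded for reconstructing the hidden words from everything else in the sentence.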
Why do this? Because it forces the model to use information from both left and right context, not just