AI Fine-Tuning

Model Fine-tuning

When you need a model specifically for generating e-commerce product descriptions, there are three main options.

The first is pre-training from scratch. This requires collecting massive amounts of text data (potentially trillions of tokens), renting dozens of A100 GPUs for months of training, and spending millions of dollars. This is what giants like OpenAI and Google do, and it's unrealistic for most people.
The second is prompt engineering. Give a general-purpose large model a carefully designed prompt, such as "You are a professional e-commerce copywriter. Please generate an attractive description based on the following product information..." This approach has almost no upfront cost and is immediately usable, but it has drawbacks: you have to send a long prompt every time, the results are unstable, the model is easily distracted by irrelevant input, and token consumption is high.
The third is fine-tuning. Take an already trained general-purpose model and continue training it with your proprietary data for a short while, letting it "learn" your specific task. The resulting model retains general capabilities while excelling at your task, and the cost is controllable.

Comparison of the three methods:

Method	Cost	Effect	Applicable Scenario
Pre-training from scratch	Extremely high (millions of dollars)	Fully customized	Creating new foundation models (giants only)
Prompt engineering	Almost zero	Unstable, prompt-dependent	Simple tasks, quick validation
Fine-tuning	Moderate (hundreds to thousands of dollars)	Stable and professional	Domain-specific tasks, production

A simple analogy:

Pre-training from scratch is like cultivating a doctor from scratch, taking more than a decade.

Prompt engineering is like giving a general doctor a detailed operating manual.

Fine-tuning is like sending a doctor who already has a medical license to train in a specialty for a few months and become a specialist.

Why Fine-tuning is Needed

Not all tasks require fine-tuning. First, understand when to use it and when not to.

Limitations of Prompt Engineering

Prompt engineering is powerful, but it has several ceilings that are difficult to break through.

The first limitation is context window constraints. Your prompts and examples must fit within the model's context window (e.g., 4K, 8K, 128K, 1024K tokens). If your task requires hundreds of examples to explain, the prompt won't fit.
The second limitation is unstable results. A slight change in wording can yield very different results, and even an identical prompt can vary from run to run because of sampling randomness: sometimes the output is great, and sometimes it goes completely off-topic.
The third limitation is high inference cost. Every inference requires sending the long prompt, consuming many tokens and resulting in slow response speed. In high-concurrency scenarios, costs rise sharply.
The fourth limitation is that pre-training habits can override your instructions. Although the model sees your prompt, patterns learned from the massive pre-training data may still pull it away from what you asked. For example, if you ask it to output JSON, it may still add natural-language explanations around it.
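To make the inference-cost point concrete, here is a rough sketch of how prompt length turns into daily token spend. Every number in it (prompt sizes, request volume, price per million tokens) is a hypothetical placeholder chosen for illustration, not a figure from any real provider.

```python
def daily_prompt_cost(prompt_tokens: int, requests_per_day: int,
                      usd_per_million_tokens: float) -> float:
    """Estimate the daily cost of input tokens alone (output tokens ignored)."""
    total_tokens = prompt_tokens * requests_per_day
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical values, for illustration only.
LONG_PROMPT = 2_000   # instructions + few-shot examples + product info
SHORT_PROMPT = 100    # product info only, as a fine-tuned model might need
REQUESTS = 100_000    # requests per day
PRICE = 1.0           # assumed USD per million input tokens

long_cost = daily_prompt_cost(LONG_PROMPT, REQUESTS, PRICE)
short_cost = daily_prompt_cost(SHORT_PROMPT, REQUESTS, PRICE)

print(f"Long prompt:  ${long_cost:,.2f} per day")
print(f"Short prompt: ${short_cost:,.2f} per day")
print(f"Ratio: {long_cost / short_cost:.0f}x")
```

The ratio depends only on the prompt lengths, so the gap grows linearly with traffic regardless of the actual price.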

Applicable Scenarios for Fine-tuning

When your task has the following characteristics, fine-tuning is worth considering.

Characteristic one: The task is clearly defined with a fixed output format. For example, "translate technical documents into easy-to-understand blog posts," "generate ticket summaries based on customer service records," or "convert natural language into SQL queries."
Characteristic two: You have hundreds to thousands of high-quality labeled examples. With too little data, fine-tuning won't produce reliable results, and beyond a certain point more data brings diminishing returns. As a rule of thumb, 500 to 5,000 examples is a good range.
Characteristic three: Professional performance is needed in a specific domain. For example, medical report interpretation, legal document summarization, or financial news analysis. General-purpose models may not understand industry terminology and conventions; fine-tuning can make them "enter the industry."
Characteristic four: Sensitivity to response speed and cost. On a narrow task, a fine-tuned small model may match or outperform a general-purpose large model, with inference that can be on the order of 10x faster and 100x cheaper, depending on the models being compared.
Characteristic five: Strict adherence to an output format is required. For example, the model must output JSON, use a specific tone, or include certain fields. Fine-tuning lets the model "remember" these requirements without repeating them in the prompt every time.

Cost-Benefit Analysis of Fine-tuning

Before fine-tuning, do the math.

Costs mainly come from three sources: data preparation, computing resources, and manual debugging. Data preparation usually takes the most work, since you need to collect, clean, and label data. Computing resources are now quite cheap: with LoRA or QLoRA, a consumer-grade GPU (RTX 3090/4090) can fine-tune 7B/13B models.

Benefits are reflected in several aspects: better results, lower inference costs, faster response speeds, and more stable outputs. If your model is to serve external users, these benefits will accumulate continuously.

When should you not fine-tune? If the task changes quickly and data formats shift weekly, prompt engineering is more flexible. If it's just a one-time exploration, or if you only have dozens of data points, fine-tuning isn't worth it.

Factor	Better for Prompt	Better for Fine-tuning
Data volume	< 100 entries	> 500 entries
Task stability	Frequently changes	Relatively fixed
Usage frequency	Occasional	High-frequency
Cost sensitivity	Low	High
Effect requirements	Good enough	Best possible
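The table can be read as a rough decision rule. The sketch below encodes it as a small function. The function name, its parameters, and the exact thresholds are illustrative assumptions taken from the rule-of-thumb numbers above, not a formal procedure.

```python
def suggest_approach(num_examples: int, task_is_stable: bool,
                     high_frequency: bool) -> str:
    """Rough heuristic mirroring the table; thresholds are rules of thumb."""
    # Too little data or a moving target: prompts are more flexible.
    if num_examples < 100 or not task_is_stable:
        return "prompt engineering"
    # Enough data, a stable task, and heavy usage: fine-tuning pays off.
    if num_examples > 500 and high_frequency:
        return "fine-tuning"
    # Everything in between is a judgment call.
    return "start with prompts, fine-tune if quality or cost demands it"

print(suggest_approach(50, task_is_stable=True, high_frequency=True))
print(suggest_approach(2000, task_is_stable=True, high_frequency=True))
print(suggest_approach(300, task_is_stable=True, high_frequency=False))
```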

Full Fine-tuning

First understand the most traditional fine-tuning method, then see why we rarely use it today.

Principle and Process

The idea of full fine-tuning is simple: take all parameters of the pre-trained model and continue training on your dataset, updating all weights.

The process is roughly as follows:

Step one: Prepare data. Organize your task data into "input-output" pairs.
Step two: Load the pre-trained model. For example, download open-source models like LLaMA-2-7B or Mistral-7B.
Step three: Set training parameters. The learning rate should be very small (e.g., 1e-5 to 5e-5), because you don't want to completely overwrite the knowledge learned during pre-training.
Step four: Start training. Let the model run for a few epochs (complete passes through the dataset) on your data.
Step five: Save the model. After training, you get a completely new weight file, the same size as the original model.
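Step one is often the most hands-on part, so here is a minimal sketch of it: writing "input-output" pairs to a JSONL file (one JSON object per line) and reading them back with a basic sanity check. JSONL is a common convention rather than a requirement, the field names "input" and "output" are assumptions for this sketch, and a real training framework may expect different keys.

```python
import json
import os
import tempfile

# Two hand-written examples; a real dataset would have hundreds or thousands.
pairs = [
    {"input": "Product: Thermos\nGenerate e-commerce description",
     "output": "This high-quality thermos uses a 304 stainless steel inner lining."},
    {"input": "Product: Wireless Earbuds\nGenerate e-commerce description",
     "output": "True wireless earbuds with Bluetooth 5.3 and 24-hour battery life."},
]

def write_jsonl(records, path):
    """Write one JSON object per line; newlines inside strings are escaped."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read records back and reject any with a missing or empty field."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            rec = json.loads(line)
            if not rec.get("input") or not rec.get("output"):
                raise ValueError(f"Line {line_no}: missing input or output")
            records.append(rec)
    return records

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(pairs, path)
loaded = read_jsonl(path)
print(f"Wrote and re-read {len(loaded)} examples")
```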

Example

# ============================================
# Full Fine-tuning Concept Demo (pseudocode, no framework dependency)
# Shows core idea; actual code would use HuggingFace
# ============================================

def full_finetuning_concept():
    """Core concept demonstration of full fine-tuning"""
    
    # 1. Load pre-trained model (all parameters are trainable)
    model = load_pretrained_model("llama-2-7b")
    # Model has 7 billion parameters, each will be updated
    print(f"Total model parameters: {count_parameters(model)}")
    # Illustrative output: Total model parameters: 7000000000 (the real model has roughly 6.7 billion)
    
    # 2. Prepare your specific task data
    train_data = [
        {"input": "Product: Thermos\nGenerate e-commerce description",
         "output": "This high-quality thermos uses 304 stainless steel inner lining, vacuum insulation technology, and keeps warm for up to 12 hours."},
        {"input": "Product: Wireless Earbuds\nGenerate e-commerce description",
         "output": "True wireless Bluetooth earbuds, Bluetooth 5.3 stable connection, active noise cancellation, 24-hour battery life, comfortable to wear."},
        # ... hundreds or thousands of similar data entries
    ]
    
    # 3. Set optimizer with very small learning rate
    optimizer = create_optimizer(model, learning_rate=2e-5)
    
    # 4. Training loop
    for epoch in range(3):  # Usually train for 2-5 epochs
        for batch in create_batches(train_data):
            # Forward propagation
            loss = model.compute_loss(batch)
            # Backward propagation, update ALL parameters!
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch {epoch} completed")
    
    # 5. Save complete model (7 billion parameters, ~13GB)
    model.save("llama-2-7b-ecommerce-finetuned")
    print("Full fine-tuning completed, new model saved")

# This is just a concept demo; actual training requires substantial VRAM

Catastrophic Forgetting Problem

Full fine-tuning has a serious problem: catastrophic forgetting.

What does this mean? When you train on your data, the model gradually "forgets" the general knowledge it learned during pre-training. It's like a person who, in order to prepare for a math exam, frantically does math problems and ends up forgetting the Chinese and English they learned before.

Why does this happen? Because full fine-tuning updates all parameters. The knowledge learned during pre-training is encoded in these parameters, and large changes destroy the original knowledge.

The result: the fine-tuned model performs well on your task, but declines on other general tasks. Worse, it may lose some basic abilities, such as following instructions or understanding complex problems.

How to mitigate? You can mix pre-training data during training, but this returns to the old problem of "needing massive amounts of data." Or use regularization to constrain parameters from changing too much, but the effect is limited.
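The regularization idea can be sketched on a toy problem. The code below adds a penalty that pulls the weights back toward their pre-trained values. The two-dimensional "weights", the quadratic task loss, and the penalty strength `lam` are all assumptions made for illustration; real fine-tuning would apply the same idea to a neural network's parameters.

```python
import numpy as np

def finetune(w_pretrained, w_task_optimum, lam, lr=0.1, steps=500):
    """Gradient descent on 0.5*||w - w_task||^2 + 0.5*lam*||w - w_pre||^2."""
    w = w_pretrained.copy()
    for _ in range(steps):
        grad_task = w - w_task_optimum        # pulls toward the new task
        grad_reg = lam * (w - w_pretrained)   # pulls back toward pre-training
        w -= lr * (grad_task + grad_reg)
    return w

w_pre = np.array([1.0, 0.0])    # stands in for pre-trained weights
w_task = np.array([0.0, 2.0])   # where the task loss alone is minimized

w_free = finetune(w_pre, w_task, lam=0.0)   # no regularization
w_reg = finetune(w_pre, w_task, lam=1.0)    # with regularization

drift_free = np.linalg.norm(w_free - w_pre)
drift_reg = np.linalg.norm(w_reg - w_pre)
print(f"Drift without penalty: {drift_free:.3f}")
print(f"Drift with penalty:    {drift_reg:.3f}")
```

With the penalty the weights settle between the two optima, which limits drift but also limits how well the task is fit. That trade-off is one reason the effect of regularization alone is limited.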

VRAM Requirement Calculation

Another problem with full fine-tuning is the enormous VRAM requirement. Let's calculate:

VRAM needed to train a model = model parameters + gradients + optimizer states + activations.

For the LLaMA-7B model in full precision (FP32), the weights take 7 billion parameters × 4 bytes = 28 GB. Gradients take another 28 GB, and the Adam optimizer keeps two extra values per parameter (first and second moment estimates), which adds 56 GB (2× the gradients). Activations vary with batch size and sequence length and require at least a few GB more.

Adding these up gives 28 + 28 + 56 = 112 GB before activations, so full fine-tuning a 7B model requires well over 100 GB of VRAM, meaning at least 4× A100 (40GB) or 2× A100 (80GB).

If everything is kept in half precision (FP16/BF16), these figures are roughly halved, but training still requires about 50-60 GB of VRAM. That remains out of reach for consumer GPUs.

Model Size	Full Fine-tuning VRAM (FP16)	Required GPU
7B	50-60 GB	2×A100 (40GB) or 1×A100 (80GB)
13B	90-100 GB	3×A100 (40GB) or 2×A100 (80GB)
70B	About 560 GB	16×A100 (40GB) or 8×A100 (80GB)
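The arithmetic above can be wrapped in a small estimator. This is a back-of-the-envelope sketch that follows the same accounting as the text (weights, gradients, and two Adam values per parameter, all at the same precision) and deliberately ignores activations and framework overhead. The function name and the decimal-gigabyte convention are assumptions of this sketch.

```python
def training_vram_gb(num_params: float, bytes_per_value: int) -> dict:
    """Estimate VRAM for weights, gradients, and Adam states (no activations)."""
    gb = 1e9  # decimal gigabytes, matching the arithmetic in the text
    weights = num_params * bytes_per_value / gb
    gradients = weights        # one gradient value per parameter
    adam_states = 2 * weights  # first and second moment per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "total": weights + gradients + adam_states,
    }

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp32 = training_vram_gb(n, bytes_per_value=4)["total"]
    fp16 = training_vram_gb(n, bytes_per_value=2)["total"]
    print(f"{name}: FP32 ~{fp32:.0f} GB, FP16 ~{fp16:.0f} GB (before activations)")
```

Because activations are excluded, these totals are best read as lower bounds.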

Full fine-tuning isn't unusable, but it's costly and risky. Unless you have a compelling reason (e.g., very large data volume, need to fundamentally change model behavior), you should prioritize Parameter-Efficient Fine-Tuning (PEFT) methods.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT (Parameter-Efficient Fine-Tuning) is a general term for methods that only train a small number of parameters. It retains the effects of fine-tuning while dramatically reducing costs.

Core Idea of PEFT

The core insight of PEFT is: the pre-trained model already contains sufficient knowledge; we don't need to modify all parameters, only "adjust" a small portion to adapt to new tasks.

It's like a piano that has already been built—you don't need to rebuild the entire piano, just tune a few strings to play the tone you want.

The benefits of PEFT are clear:

First, VRAM requirements are greatly reduced. What originally needed 60 GB may now only need 6 GB.
Second, catastrophic forgetting is greatly reduced. The original model parameters stay frozen and only small new modules are trained, so the original knowledge is largely preserved.
Third, storage costs are low. PEFT typically only saves the small number of newly added parameters (a few MB to hundreds of MB), without saving the entire model.
Fourth, "task combination" is possible. Train multiple small adapters and load them on demand during inference—one model can serve as multiple models.

Overview of Mainstream PEFT Methods

PEFT is not a single method but a general term for a class of methods. Let's look at several mainstream methods:

The first is Adapter. Insert small neural network modules between layers of the Transformer. During training, only these Adapters are trained; original model parameters are frozen. This was an early PEFT method and is less used now.
The second is Prefix Tuning. Add learnable "prefix" vectors before each layer of the model. These prefix vectors occupy only a small portion but can guide the model's output.
The third is Prompt Tuning. Only add learnable soft prompts at the input layer. This is the simplest method, but the effect is relatively limited.
The fourth is LoRA (Low-Rank Adaptation). Add small trainable low-rank matrices alongside existing weight matrices, typically those in the attention layers. This is currently the most popular PEFT method.
The fifth is QLoRA (Quantized LoRA). Quantize the frozen base model to low precision (typically 4-bit) and train LoRA adapters on top of it, which further reduces VRAM requirements. It is the usual choice when GPU memory is tight.
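The quantization idea can be illustrated with a toy example. The sketch below uses simple symmetric absmax rounding to 4-bit integers. That scheme is an assumption chosen for clarity and is not the NF4 data type that QLoRA itself uses. It also stores each 4-bit value in a full byte, whereas real implementations pack two values per byte.

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray):
    """Map floats to integers in [-7, 7] using one scale for the whole tensor."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_absmax_4bit(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())

print(f"Distinct codes used: {len(np.unique(q))} (at most 15)")
print(f"Max rounding error: {max_err:.6f} (bounded by scale/2 = {scale / 2:.6f})")
print(f"Ideal storage: {w.size * 4 // 8} bytes at 4 bits vs {w.size * 4} bytes at FP32")
```

The frozen weights lose a little precision, and the trainable LoRA matrices, which stay in higher precision, are what adapt the model to the task.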

Method	Principle	Trainable Parameters	Effect
Full Fine-tuning	Update all parameters	100%	Good, but may forget
Adapter	Insert small networks	0.1%-1%	Moderate
Prefix Tuning	Add prefix vectors	0.1%-1%	Moderate
Prompt Tuning	Add soft prompts	< 0.1%	Limited
LoRA	Low-rank matrix adaptation	0.1%-1%	Near full fine-tuning
QLoRA	Quantization + LoRA	0.1%-1%	Near full fine-tuning

Today's best practice is clear: prefer LoRA; if VRAM is insufficient, use QLoRA. These two methods offer good results, low cost, and mature ecosystems, making them the optimal choice in most cases.
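To see where a trainable-parameter share in the 0.1%-1% range comes from, here is a rough count for a 7B-class model. The configuration is an assumption for illustration: 32 Transformer layers, hidden size 4096, LoRA with rank 16 applied to four square attention projection matrices per layer, and a round 7 billion base parameters.

```python
def lora_param_count(num_layers: int, d_model: int, rank: int,
                     matrices_per_layer: int) -> int:
    """Count LoRA parameters when each adapted matrix is d_model x d_model."""
    per_matrix = 2 * d_model * rank  # one (d x r) factor plus one (r x d) factor
    return num_layers * matrices_per_layer * per_matrix

BASE_PARAMS = 7_000_000_000  # rounded size of the frozen base model

trainable = lora_param_count(num_layers=32, d_model=4096, rank=16,
                             matrices_per_layer=4)
share = trainable / BASE_PARAMS * 100
size_mb = trainable * 2 / 1e6  # 2 bytes per value if saved in FP16

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of base model: {share:.2f}%")
print(f"Adapter file size in FP16: about {size_mb:.1f} MB")
```

Under these assumptions the adapter holds about 16.8 million parameters, which is consistent with the "a few MB to hundreds of MB" storage figure mentioned earlier.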

LoRA (Low-Rank Adaptation)

LoRA is currently the most mainstream fine-tuning method, and understanding its principle is important.

Low-Rank Decomposition Principle (Intuitive Understanding)

First, understand what "rank" and "low-rank" mean.

Suppose you have a 100 × 100 matrix. The data inside may not be completely random. If this data can be composed of very few "patterns," we say this matrix has a low rank.

For example: if every row is a multiple of the first row, then the matrix rank is 1. If every row is a linear combination of the first two rows, then the rank is 2. And so on.
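These two cases are easy to check numerically. The following sketch builds one matrix whose rows are all multiples of a single row and another whose rows are combinations of two rows, then asks NumPy for their rank. The specific numbers are arbitrary.

```python
import numpy as np

first_row = np.arange(1.0, 6.0)  # [1, 2, 3, 4, 5]

# Every row is a multiple of the first row, so the rank is 1.
multiples = np.array([1.0, 2.0, -3.0, 0.5])
rank_one = np.outer(multiples, first_row)
print(rank_one.shape, np.linalg.matrix_rank(rank_one))   # (4, 5) 1

# Every row is a linear combination of two base rows, so the rank is 2.
second_row = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
coeffs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, -1.0],
                   [0.5, 3.0]])
rank_two = coeffs @ np.vstack([first_row, second_row])
print(rank_two.shape, np.linalg.matrix_rank(rank_two))   # (4, 5) 2
```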

The core insight of LoRA is: when large models adapt to new tasks, the change in weights is usually "low-rank"—that is, this change can be represented with very few parameters.

This is like: although there are 1000 different e-commerce product descriptions, their writing style may be determined by just a few core factors: tone, length, emphasis points. You don't need 1000 different directions to describe the change; 8, 16, or 32 are enough.
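The "few directions are enough" claim can be made concrete with a synthetic weight change. The sketch below constructs a 100 × 100 update that is secretly built from only 8 directions, then uses the singular value decomposition to confirm that just 8 singular values are non-negligible. The update is random data generated for illustration, not a change measured on a real model, and the names `left` and `right` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 100, 8

# Build a d x d update from two thin factors, so its rank is at most r.
left = rng.normal(size=(d, r))
right = rng.normal(size=(r, d))
delta_W = left @ right

# Singular values measure how much each independent direction contributes.
singular_values = np.linalg.svd(delta_W, compute_uv=False)
significant = int(np.sum(singular_values > 1e-8 * singular_values[0]))

print(f"Matrix shape: {delta_W.shape}")
print(f"Significant singular values: {significant}")
print(f"Numbers to store densely: {d * d}")
print(f"Numbers to store as two factors: {2 * d * r}")
```

Storing the two thin factors takes 1,600 numbers instead of 10,000, and the saving grows as the matrix gets larger while the rank stays small.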

So LoRA's approach is: freeze all parameters of the original model, and add two small matrices A and B beside the attention layer's weight matrix. A is r × d and B is d × r, where r is the rank, usually a small number such as 8, 16, or 32.

Original model output = original model forward propagation.

LoRA output = original model forward propagation + (B × A × input) × scaling factor.

During training, only A and B are trained; original model parameters remain completely untouched.

LoRA Mathematical Derivation

Don't worry—the mathematical derivation is simple, just a few lines.

Suppose a layer of the original model is a linear transformation: h = W₀x, where W₀ is the pre-trained weight matrix with shape d × d.

During fine-tuning, we want to update W₀ but not too much. LoRA's approach is to represent the change in W₀ as the product of two small matrices: ΔW = BA.

So the output after fine-tuning is: h = W₀x + (BA)x.

Here B has shape d × r, A has shape r × d, and r is the rank.

More precisely, the formula in the LoRA paper is:

h = W₀x + α/r × (BA)x

Where α is a scaling coefficient, usually set to some multiple of r, so that when r is changed, hyperparameters don't need to be re-tuned.

During training, W₀ is completely frozen; only A and B are trainable.

At initialization, A is initialized with random Gaussian distribution, and B is initialized to zero, so that at the beginning of training, the LoRA output is zero and doesn't affect the original model's behavior.
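Collecting the shapes, the update rule, and the resulting parameter count in one place may help. The numeric values d = 4096 and r = 16 are example choices, and the rank bound is a standard linear-algebra fact added here for completeness.

```latex
\begin{aligned}
& W_0 \in \mathbb{R}^{d \times d}, \quad
  B \in \mathbb{R}^{d \times r}, \quad
  A \in \mathbb{R}^{r \times d}, \quad
  \Delta W = BA \in \mathbb{R}^{d \times d}, \quad
  \operatorname{rank}(\Delta W) \le r \\
& h = W_0 x + \frac{\alpha}{r} (BA) x \\
& \text{full update: } d^2 = 4096^2 = 16{,}777{,}216 \text{ parameters} \\
& \text{LoRA update: } 2dr = 2 \cdot 4096 \cdot 16 = 131{,}072 \text{ parameters} \\
& \frac{2dr}{d^2} = \frac{2r}{d} = \frac{32}{4096} \approx 0.78\%
\end{aligned}
```

Since B starts at zero, ΔW = BA is the zero matrix at the first step, so h = W₀x exactly and training begins from the pre-trained model's behavior.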

Example

# ============================================
# Minimal implementation of LoRA core principle
# No framework dependency, pure NumPy demonstration
# ============================================

import numpy as np

class LoRALayer:
    """Minimal implementation of LoRA layer"""
    
    def __init__(self, d: int, r: int, alpha: float = 1.0):
        """
        d: input/output dimension (e.g., 4096)
        r: rank (e.g., 8, 16, 32)
        alpha: scaling coefficient
        """
        self.d = d
        self.r = r
        self.alpha = alpha
        
        # Frozen original weights (simulating pre-trained model)
        self.W_0 = np.random.randn(d, d) * 0.01
        self.W_0.flags.writeable = False  # Mark as non-trainable
        
        # LoRA's A and B matrices, trainable
        # A: r × d, initialized with random Gaussian
        self.A = np.random.randn(r, d) * 0.01
        # B: d × r, initialized to zero (doesn't affect original model at early training)
        self.B = np.zeros((d, r))
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """Forward propagation: original model output + LoRA output"""
        # Original model output
        original_output = x @ self.W_0.T
        
        # LoRA output: (x @ A.T) @ B.T = x @ (B @ A).T
        lora_output = (x @ self.A.T) @ self.B.T
        
        # Combined output, scaled by alpha / r as in the formula above
        return original_output + (self.alpha / self.r) * lora_output
    
    def get_trainable_params(self) -> dict:
        """Get trainable parameters (only A and B)"""
        return {"A": self.A, "B": self.B}
    
    def count_params(self) -> dict:
        """Count parameter numbers"""
        total_original = self.d * self.d
        total_lora = self.r * self.d + self.d * self.r
        
        return {
            "original_params": total_original,
            "lora_params": total_lora,
            "ratio": total_lora / total_original * 100,
        }

# ============================================
# Demonstrate LoRA parameter savings effect
# ============================================

# Create a LoRA layer with d=4096, r=16
lora = LoRALayer(d=4096, r=16, alpha=32)

# Check parameter statistics
stats = lora.count_params()
print(f"Original parameters: {stats['original_params']:,}")
print(f"LoRA parameters: {stats['lora_params']:,}")
print(f"Parameter ratio: {stats['ratio']:.4f}%")

# Simulate an input (batch_size=2, d=4096)
x = np.random.randn(2, 4096)

# Forward propagation
output = lora.forward(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

# Verify: B is initialized to zero, so the initial LoRA contribution is zero
# and the output should equal the frozen layer's output
assert np.allclose(output, x @ lora.W_0.T)
print("Initial output matches the original model output")
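The example above covers only the forward pass. As a complement, here is a minimal sketch of training: plain gradient descent on a synthetic regression task, with hand-derived gradients for A and B while W_0 is never updated. The data, dimensions, learning rate, and step count are all assumptions chosen so the toy problem runs quickly. A real setup would use an autograd framework and an optimizer such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 64
alpha = 4.0
scaling = alpha / r

W_0 = rng.normal(size=(d, d))        # frozen "pre-trained" weights
W_0_before = W_0.copy()
A = rng.normal(size=(r, d)) * 0.1    # trainable, random init
B = np.zeros((d, r))                 # trainable, zero init

# Synthetic target: the frozen layer plus a hidden low-rank shift.
delta_true = (rng.normal(size=(d, r)) @ rng.normal(size=(r, d))) * 0.1
X = rng.normal(size=(n, d))
Y = X @ (W_0 + delta_true).T

def loss_and_grads(A, B):
    """Mean squared error and its gradients with respect to A and B only."""
    Z = X @ A.T                              # (n, r)
    out = X @ W_0.T + scaling * (Z @ B.T)    # (n, d)
    G = (out - Y) / n                        # d(loss)/d(out)
    loss = 0.5 * np.sum((out - Y) ** 2) / n
    grad_B = scaling * (G.T @ Z)             # (d, r)
    grad_A = scaling * ((G @ B).T @ X)       # (r, d)
    return loss, grad_A, grad_B

lr = 0.02
initial_loss, _, _ = loss_and_grads(A, B)
for step in range(2000):
    _, grad_A, grad_B = loss_and_grads(A, B)
    A -= lr * grad_A
    B -= lr * grad_B
final_loss, _, _ = loss_and_grads(A, B)

print(f"Loss before training: {initial_loss:.4f}")
print(f"Loss after training:  {final_loss:.4f}")
print(f"W_0 unchanged: {np.array_equal(W_0, W_0_before)}")
```

Note that at the first step the gradient for A is zero because B is zero, so B moves first and A starts to learn once B is non-zero.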
