# AI Deployment
You have a model running on your laptop, and it generates good responses quickly. Deploy it on a server with a hundred users calling it at once, and things change: **some users wait ten seconds for a response, some requests time out, GPU memory fills up almost immediately, and the bill climbs painfully fast.**
This is the problem AI deployment aims to solve: turning a runnable model into a usable service.
Traditional web service bottlenecks are usually in the CPU and the database.
AI service bottlenecks are mainly on the GPU: memory (VRAM) capacity, compute speed, and how concurrent requests are queued.
The core challenges of AI deployment can be summarized in three words: latency, throughput, and cost.
| Metric | Meaning | User Experience |
| --- | --- | --- |
| Latency | Time from the user sending a request to receiving the first character | How fast responses start |
| Throughput | Number of requests processed per second | How many users can be served at once |
| Cost | GPU/server fees for running the service | How expensive the service is |
> Good AI deployment means finding the balance between these three: low enough latency, high enough throughput, and controllable cost.
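
To make the three metrics concrete, here is a minimal sketch that derives them from a handful of request timings. The names `RequestTiming` and `summarize`, the timestamps, and the GPU hourly price are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps for one request, in seconds (hypothetical record format)."""
    sent_at: float         # when the client sent the request
    first_token_at: float  # when the first character arrived
    finished_at: float     # when the full response arrived


def summarize(timings: list[RequestTiming], gpu_price_per_hour: float) -> dict[str, float]:
    """Compute average latency, throughput, and cost per request."""
    # Latency: time from sending the request to receiving the first character.
    latencies = [t.first_token_at - t.sent_at for t in timings]
    avg_latency = sum(latencies) / len(latencies)

    # Throughput: completed requests divided by the wall-clock window they span.
    window = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    throughput = len(timings) / window

    # Cost: GPU rental for that window, spread over the requests served.
    cost_per_request = gpu_price_per_hour * (window / 3600) / len(timings)

    return {
        "avg_latency_s": avg_latency,
        "throughput_rps": throughput,
        "cost_per_request": cost_per_request,
    }


# Three made-up requests that overlap in time.
timings = [
    RequestTiming(sent_at=0.0, first_token_at=0.5, finished_at=4.0),
    RequestTiming(sent_at=1.0, first_token_at=2.0, finished_at=6.0),
    RequestTiming(sent_at=2.0, first_token_at=4.0, finished_at=10.0),
]
stats = summarize(timings, gpu_price_per_hour=3.6)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Serving more requests in the same window raises throughput and lowers cost per request, but queued requests wait longer for their first character, which is the trade-off described above.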
* * *
## Model Serving Frameworks
Packaging trained models into API services requires specialized frameworks. This section introduces three mainstream options: vLLM, TGI, and Ollama.
### vLLM: High-Performance Inference Driven by PagedAttention
vLLM is an inference engine developed by UC Berkeley, known for its speed.
vLLM's core innovation is PagedAttention, an efficient GPU memory management technique inspired by virtual memory paging in operating systems.
In traditional inference frameworks, each request's KV cache (key-value cache) must occupy a contiguous region of GPU memory.
When request lengths vary, that memory becomes fragmented and utilization drops.
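
A rough calculation shows how much memory contiguous allocation can waste. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte fp16 values) and assumes the framework reserves room for the maximum length up front for every request. Real models and frameworks differ, so treat the numbers as illustrative only.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache needed for a single token."""
    # Each layer stores one key vector and one value vector per token (the factor 2).
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, not taken from any specific model card.
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

MAX_LEN = 8192                           # tokens reserved per request up front
actual_lengths = [300, 1200, 4500, 800]  # tokens each request really used

reserved = len(actual_lengths) * MAX_LEN * PER_TOKEN
used = sum(actual_lengths) * PER_TOKEN
utilization = used / reserved

GIB = 2**30
print(f"KV cache per token: {PER_TOKEN / 1024:.0f} KiB")
print(f"Reserved: {reserved / GIB:.2f} GiB, used: {used / GIB:.2f} GiB")
print(f"Utilization: {utilization:.1%}")
```

Under these assumptions most of the reserved memory is never touched, which is the gap that page-based allocation tries to close.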
PagedAttention divides GPU memory into fixed-size pages, and each request's KV cache can be spread across different pages, with a page table tracking where each piece is stored.
This significantly improves memory utilization, so more requests can be served at the same time.
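
The page-table idea can be sketched with plain Python bookkeeping. This is a toy model and not vLLM's actual implementation: the class `PagedKVCache` and its methods are hypothetical names, and no real tensors are involved, only the mapping from requests to pages.

```python
class PagedKVCache:
    """Toy bookkeeping for paged KV-cache allocation (no real tensors)."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size                   # tokens that fit in one page
        self.free_pages = list(range(num_pages))     # physical pages not in use
        self.page_tables: dict[str, list[int]] = {}  # request id -> physical page ids
        self.lengths: dict[str, int] = {}            # request id -> tokens stored

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical page, offset)."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new page only when every page the request owns is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV-cache pages")
            table.append(self.free_pages.pop(0))
        self.lengths[request_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


# Two requests generate tokens in interleaved order.
cache = PagedKVCache(num_pages=4, page_size=2)
cache.append_token("a")
cache.append_token("b")
cache.append_token("a")
last_slot = cache.append_token("a")  # request "a" now needs a second page

print(cache.page_tables)  # pages owned by each request
print(last_slot)          # where the newest token of "a" was placed
```

Because pages are handed out on demand, the pages of request `a` do not need to be adjacent, and at most one partially filled page per request is left unused.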
## Example
```bash
# Install vLLM
pip install vllm

# Start an OpenAI-compatible API service with vLLM
# --model specifies the model; --host and --port specify the listening address
# --tensor-parallel-size specifies how many GPUs to use for parallel inference
# --gpu-memory-utilization sets the fraction of GPU memory vLLM may use
# --max-model-len sets the maximum context length
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```
After the service starts, you can call it directly using the OpenAI SDK:
## Example
```python
# File path: test_vllm_client.py
from openai import OpenAI

# Connect to the local vLLM service
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tutorial-demo-key",  # vLLM doesn't require a real API key by default
)

# Call chat completion
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Introduce this tutorial in one sentence"},
    ],
    temperature=0.7,
    max_tokens=500,
    stream=True,  # Streaming output
)

print("AI Response: ", end="", flush=True)
for chunk in response:
    # Each streamed chunk carries a list of choices; new text is in choices[0].delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
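
Because the response is streamed, the latency from the metrics table (time to the first character) can be measured on the client side. The helper below is a sketch: `consume_stream` and `fake_chunk` are hypothetical names, and the code only assumes that each chunk exposes its text as `choices[0].delta.content`, as in the OpenAI SDK. The fake chunks built with `SimpleNamespace` stand in for a running server so the snippet works on its own.

```python
import time
from types import SimpleNamespace
from typing import Iterable, Optional


def consume_stream(chunks: Iterable, start: Optional[float] = None) -> tuple[str, Optional[float]]:
    """Join streamed text and record the seconds until the first non-empty piece.

    `start` is a time.perf_counter() reading; if omitted, timing begins
    when this function is called.
    """
    if start is None:
        start = time.perf_counter()
    first_token_latency = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        text = chunk.choices[0].delta.content
        if not text:           # role-only or final chunks carry no text
            continue
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start
        parts.append(text)
    return "".join(parts), first_token_latency


def fake_chunk(text):
    """Build an object shaped like a streamed chat-completion chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk(None), fake_chunk("Hello"), fake_chunk(", world"), SimpleNamespace(choices=[])]
full_text, ttft = consume_stream(fake_stream)
print(full_text)
print(f"time to first token: {ttft:.6f}s")
```

With the real service, record `start = time.perf_counter()` just before calling `client.chat.completions.create(..., stream=True)` and pass both the returned stream and `start` to `consume_stream`, so the measurement also covers the time spent sending the request.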
vLLM can also be used directly from Python code, for example to run batch inference inside a custom service:
## Example
```python
# File path: custom_vllm_server.py
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.9,  # GPU memory usage ratio
    tensor_parallel_size=1,      # Tensor parallelism (number of GPUs)
    max_model_len=8192,          # Maximum context length
)

# Configure sampling parameters
# No custom stop strings: generation ends at the end-of-sequence token or at max_tokens
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)

# Batch inference
prompts = [
    "Introduce this tutorial",
    "What is AI?",
    "How to learn programming?",
]

# Generate responses
outputs = llm.generate(prompts, sampling_params)

# Print results: each output holds the original prompt and a list of completions
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Response: {output.outputs[0].text!r}")
```