# Vector Database
A vector database is a database system specifically designed for storing, indexing, and retrieving high-dimensional vector data.
You can think of it as a store that keeps similar items close together, so that it can quickly find the items most similar to a given one.
Unlike traditional databases, which query by exact matching (`WHERE name = 'Alice'`), vector databases query by similarity (finding the 10 images most similar to a given image).
### An Intuitive Analogy
Imagine a library scenario:
| Database Type | Retrieval Method | Analogy |
| --- | --- | --- |
| Traditional Database | Exact search by book number or title | Find a book with a specific ID |
| Vector Database | Search by content relevance | Find "all science fiction novels similar to 'The Three-Body Problem'" |
Finding items by semantic similarity is exactly the core problem that vector databases solve.
* * *
## Why Vector Databases Are Needed
Before diving into technical details, let's understand what problems vector databases solve.
### Limitations of Traditional Databases
Traditional relational databases (MySQL, PostgreSQL) are excellent at handling structured data, but they struggle with the following requirements:
* Image search (finding visually similar images)
* Semantic search (a user who searches for "Apple phone" should also find content about "iPhone")
* Recommendation systems (finding "songs similar to the style of songs you like")
* Anomaly detection (finding "logs with the greatest difference from normal behavior")
What these problems have in common is that they require understanding the meaning of content rather than matching it literally.
### Problems with Traditional Solutions
* Using `LIKE '%apple%'` → cannot find "iPhone" (or "Apple", when the match is case-sensitive)
* Using a full-text index → cannot find content that is semantically related but uses different words
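The gap between literal matching and similarity matching can be sketched in a few lines. This is a toy illustration only: the three-dimensional vectors in `docs` are hand-written assumptions standing in for real embeddings, and `keyword_search` and `vector_search` are hypothetical helper names.

```python
import numpy as np

# Hypothetical documents with hand-written toy "embeddings".
# A real system would obtain these vectors from an embedding model.
docs = {
    "Apple phone review": np.array([0.9, 0.1, 0.0]),
    "iPhone battery tips": np.array([0.8, 0.2, 0.1]),
    "Banana bread recipe": np.array([0.0, 0.1, 0.9]),
}

def keyword_search(term):
    """Literal substring match, similar in spirit to SQL LIKE '%term%'."""
    return [title for title in docs if term.lower() in title.lower()]

def vector_search(query_vec, k=2):
    """Rank documents by cosine similarity to the query vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(docs, key=lambda t: cos(docs[t], query_vec), reverse=True)
    return ranked[:k]

keyword_hits = keyword_search("apple")
vector_hits = vector_search(docs["Apple phone review"])
print(keyword_hits)  # only the title that literally contains "apple"
print(vector_hits)   # also surfaces the iPhone document
```

The keyword search misses the iPhone document because no characters match, while the vector search ranks it second because its toy vector points in nearly the same direction.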
### Comparison Diagram
The following chart intuitively shows the fundamental difference in query methods between traditional databases and vector databases.
* * *
## Core Concepts: Vectors and Embeddings
Understanding vectors and embeddings is the first step to mastering vector databases.
### What Is a Vector
In mathematics, a vector is an ordered list of numbers.
[0.12, -0.54, 0.87, 0.03, ..., 0.61] ← This is a vector
In machine learning, this list of numbers represents the semantic features of an object, and its dimensionality is typically between 128 and 4096.
### What Is an Embedding
An embedding is both the process and the result of converting real-world objects (text, images, audio, etc.) into vectors.
This conversion is done by an embedding model, whose core idea is: objects with similar semantics have vectors that are closer in space.
### Semantic Proximity Means Vector Proximity
In a simplified 2D picture (real embeddings have hundreds to thousands of dimensions), semantically related items sit close together and unrelated items sit far apart.
> Key Understanding: Two vectors that are close in vector space also have more semantically similar original content. This is the foundation of all vector database capabilities.
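As a rough illustration of this idea, the sketch below places three words on a 2D plane and measures the distances between them. The coordinates in `points` are hand-picked assumptions, not the output of any embedding model.

```python
import numpy as np

# Hand-picked 2D points standing in for embeddings (illustrative only).
points = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([0.8, 0.9]),
    "car": np.array([-0.7, 0.2]),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two named points."""
    return float(np.linalg.norm(points[a] - points[b]))

cat_dog = distance("cat", "dog")
cat_car = distance("cat", "car")
print(f"cat-dog: {cat_dog:.3f}")  # small distance: related concepts
print(f"cat-car: {cat_car:.3f}")  # large distance: unrelated concepts
```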
* * *
## Similarity Calculation Methods
The core of finding "the most similar vector" is calculating the distance or similarity between two vectors. Here are three of the most commonly used methods.
### Cosine Similarity
Cosine similarity measures the directional angle between two vectors, ignoring length. This is the most commonly used method, especially suitable for text scenarios.
Formula:
$$
\text{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$
* Result range: -1 to 1, higher values mean more similar
* Applicable scenarios: Text semantic search, document similarity
### Euclidean Distance
Euclidean distance measures the straight-line distance between two points; the smaller the distance, the more similar.
Formula:
$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$
* Result range: 0 to ∞, smaller values mean more similar
* Applicable scenarios: Image retrieval, location-related applications
### Dot Product
The dot product is the sum of the element-wise products of two vectors, so it combines direction and length information.
Formula:
$$
A \cdot B = \sum_{i=1}^{n} A_i B_i
$$
* Applicable scenarios: Recommendation systems (equivalent to cosine similarity when vectors are normalized)
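The equivalence mentioned above can be checked numerically. The sketch below uses two arbitrary example vectors and a hypothetical `normalize` helper. It also shows a related identity: for unit vectors, the squared Euclidean distance equals 2 − 2 × cosine similarity, so all three measures produce the same ranking once vectors are normalized.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

# Arbitrary example vectors, chosen so the arithmetic is easy to follow.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_of_unit = float(np.dot(normalize(a), normalize(b)))
# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine.
sq_dist_of_unit = float(np.sum((normalize(a) - normalize(b)) ** 2))

print(cosine, dot_of_unit, sq_dist_of_unit)
```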
### Comparison of Three Methods
### Python Code Example
The following example implements the three similarity calculation methods in Python:

```python
import numpy as np

# Cosine similarity: measures directional similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: measures absolute position difference
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# Dot product: combines direction and length
def dot_product(a, b):
    return np.dot(a, b)

# Example vectors
v1 = np.array([0.12, -0.54, 0.87, 0.03])
v2 = np.array([0.10, -0.50, 0.90, 0.05])
v3 = np.array([-0.80, 0.20, -0.30, 0.70])

print(f"v1 vs v2 Cosine Similarity: {cosine_similarity(v1, v2):.4f}")  # Approximately 0.9985 (very similar)
print(f"v1 vs v3 Cosine Similarity: {cosine_similarity(v1, v3):.4f}")  # Approximately -0.38 (not similar)
```

Output:

```
v1 vs v2 Cosine Similarity: 0.9985
v1 vs v3 Cosine Similarity: -0.3835
```
* * *
## Vector Indexing Algorithms
When the data volume is large (millions or billions of vectors), calculating similarity against every stored vector (brute-force search) is too slow. Vector databases use specialized indexing algorithms to accelerate queries.
### Brute-force Search (Flat)
Brute-force search traverses all vectors and calculates similarity one by one.
| Dimension | Description |
| --- | --- |
| Principle | Traverse all vectors, calculate similarity one by one |
| Advantages | 100% accurate results |
| Disadvantages | Extremely slow with large data volumes: O(n) comparisons per query |
| Applicable | Data volume less than 100,000, extremely high accuracy requirements |
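A minimal sketch of brute-force search with NumPy follows. It assumes cosine similarity as the metric and uses synthetic random data; `brute_force_search` is a hypothetical helper name, not part of any vector database API.

```python
import numpy as np

def brute_force_search(vectors, query, k=3):
    """Return indices and scores of the k vectors most similar to query."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = unit @ q                  # one comparison per stored vector: O(n)
    top = np.argsort(-scores)[:k]      # highest similarity first
    return top, scores[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))  # synthetic data, not real embeddings
# Query: a slightly perturbed copy of stored vector 42.
query = vectors[42] + 0.01 * rng.normal(size=64)

indices, scores = brute_force_search(vectors, query)
print(indices, scores)
```

Every stored vector is scored for every query, which is why the result is exact and why the cost grows linearly with the number of vectors.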
### IVF (Inverted File Index)
IVF execution steps:
1. Training phase: Use K-Means to cluster all vectors into N clusters, record the center of each cluster
2. Query phase: First find the centers of the closest clusters, then do precise search only within those clusters
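The two phases above can be sketched with plain NumPy. This is a simplified illustration, not how any particular database implements IVF: the K-Means here is a basic Lloyd's loop, the data is synthetic, and names such as `build_ivf`, `ivf_search`, and `nprobe` (the number of clusters to scan) are assumptions chosen for this example.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    """Index of the closest centroid for every vector."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(vectors, n_clusters, n_iters=10, seed=0):
    """Basic Lloyd's K-Means; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=n_clusters, replace=False)
    centroids = vectors[start].copy()
    for _ in range(n_iters):
        assign = nearest_centroid(vectors, centroids)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):           # keep the old center if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(vectors, n_clusters):
    """Training phase: cluster the data and record which ids fall in each cluster."""
    centroids = kmeans(vectors, n_clusters)
    assign = nearest_centroid(vectors, centroids)
    inverted_lists = [np.where(assign == c)[0] for c in range(n_clusters)]
    return centroids, inverted_lists

def ivf_search(vectors, centroids, inverted_lists, query, k=3, nprobe=3):
    """Query phase: scan only the nprobe clusters whose centers are closest."""
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(centroid_dists)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    best = np.argsort(dists)[:k]
    return candidates[best], len(candidates)

rng = np.random.default_rng(1)
vectors = rng.normal(size=(2000, 16))  # synthetic data, not real embeddings
centroids, inverted_lists = build_ivf(vectors, n_clusters=20)
ids, n_scanned = ivf_search(vectors, centroids, inverted_lists, vectors[7])
print(ids, n_scanned)                  # n_scanned is well below 2000
```

Raising `nprobe` scans more clusters, which improves recall at the cost of speed. A true neighbor that falls in an unscanned cluster is missed, which is why IVF results are approximate.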
### HNSW (Hierarchical Navigable Small World)
HNSW is currently one of the most widely used vector indexing algorithms, balancing speed and accuracy.
HNSW core idea:
* Build a multi-layer graph structure, sparse at the top, dense at the bottom
* During query, start from the top-level entry, play "hopscotch": each layer greedily jumps to closer nodes, then dives to the next layer
* Greatly reduces the number of nodes that need to be compared, time complexity approximately O(log n)
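The layered greedy descent can be illustrated with a heavily simplified sketch. Several assumptions are made for brevity and do not match real HNSW: neighbors are found by brute force at build time (real HNSW inserts nodes incrementally), each higher layer keeps a random quarter of the nodes below it, and the search tracks a single candidate, whereas real implementations keep a list of several candidates to avoid getting stuck in local minima. `build_layers` and `greedy_search` are hypothetical names.

```python
import numpy as np

def build_layers(vectors, n_layers=3, m=8, seed=0):
    """Toy layered graph: layer 0 holds every node, each higher layer
    keeps about a quarter of the nodes of the layer below it."""
    rng = np.random.default_rng(seed)
    layers = []
    nodes = np.arange(len(vectors))
    for _ in range(n_layers):
        graph = {}
        for i in nodes:
            d = np.linalg.norm(vectors[nodes] - vectors[i], axis=1)
            nearest = nodes[np.argsort(d)[1:m + 1]]  # skip the node itself
            graph[int(i)] = [int(j) for j in nearest]
        layers.append(graph)
        keep = max(1, len(nodes) // 4)
        nodes = rng.choice(nodes, size=keep, replace=False)
    return layers[::-1]                # sparse top layer first, dense layer last

def greedy_search(vectors, layers, query):
    """Greedily move to closer neighbors on each layer, then descend."""
    current = next(iter(layers[0]))    # entry point: any node on the top layer
    comparisons = 0
    for graph in layers:
        improved = True
        while improved:
            improved = False
            best_d = np.linalg.norm(vectors[current] - query)
            for nb in graph[current]:
                comparisons += 1
                d = np.linalg.norm(vectors[nb] - query)
                if d < best_d:
                    current, best_d, improved = nb, d, True
    return current, comparisons

rng = np.random.default_rng(2)
vectors = rng.normal(size=(500, 16))   # synthetic data, not real embeddings
query = rng.normal(size=16)
layers = build_layers(vectors)
result, comparisons = greedy_search(vectors, layers, query)
print(result, comparisons)
```

The returned node is a local optimum on the bottom layer: none of its neighbors is closer to the query. It is not guaranteed to be the true nearest neighbor, which is the accuracy trade-off that graph indexes make for speed.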
### Other Common Indexes
| Index Type | Characteristics | Applicable Scenarios |
| --- | --- | --- |
| Flat (Brute-force) | Accurate but slow | Small datasets, accuracy priority |
| IVF_Flat | Clustered then precise search, fast | Medium to large scale, sufficient memory |
| IVF_PQ | Quantization compression, memory saving | Ultra-large scale, limited memory |
| HNSW | Fast speed, high accuracy, high memory usage | Most commonly used, recommended first choice |
| ScaNN | Developed by Google, optimized for throughput | High-concurrency production environments |
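To give a rough sense of why quantization saves memory, the sketch below compares raw float32 storage with product-quantization codes. All sizes are assumptions picked for illustration (one million 768-dimensional vectors, 96 sub-vectors, 256 centroids per codebook). A real IVF_PQ index also stores ids and cluster structures, which are ignored here.

```python
# Assumed sizes, for illustration only.
n_vectors = 1_000_000
dim = 768
bytes_per_float32 = 4

# Raw storage, as used by Flat or IVF_Flat.
flat_bytes = n_vectors * dim * bytes_per_float32

# Product quantization: split each vector into sub-vectors and store,
# for each one, the 1-byte id of its nearest codebook centroid.
n_subvectors = 96                      # 768 / 96 = 8 dimensions per sub-vector
pq_bytes = n_vectors * n_subvectors * 1

# Codebooks: 256 centroids per sub-vector, each with dim / n_subvectors floats.
codebook_bytes = n_subvectors * 256 * (dim // n_subvectors) * bytes_per_float32

print(f"flat:  {flat_bytes / 1e9:.2f} GB")
print(f"pq:    {(pq_bytes + codebook_bytes) / 1e6:.1f} MB")
print(f"ratio: {flat_bytes / pq_bytes:.0f}x")
```

Under these assumptions the codes take roughly 32 times less space than the raw vectors, at the cost of some accuracy because distances are computed on compressed approximations.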
* * *
## Comparison of Mainstream Vector Databases
Below is a side-by-side comparison of the most widely used vector databases to help you choose among them in different scenarios.
> Beginner suggestion: Start with Chroma or pgvector; the former is suitable for AI application prototyping, the latter for projects already using PostgreSQL.
* * *
## Quick Start: Python Example
The example below uses Chroma, the easiest of these to get started with, to demonstrate the complete CRUD process.
### Installation
```bash
pip install chromadb openai
```
### Complete Example: Building a Document Semantic Search System
The following code demonstrates from start to finish how to use Chroma to build a semantic document search system.
```python
import chromadb
from chromadb.utils import embedding_functions

# ─── 1. Initialize Client ───────────────────────────────────────────
# Persist to local (recommended)
client = chromadb.PersistentClient(path="./
```