
# Vector Database

A vector database is a database system specifically designed for storing, indexing, and retrieving high-dimensional vector data. You can think of it this way: it stores similar things close together and can quickly find the items most similar to a given one.

Unlike traditional databases, which query by exact matching (`WHERE name = 'Alice'`), vector databases query by similarity (for example, finding the 10 images most similar to a given image).

### An Intuitive Analogy

Imagine a library:

| Database Type | Retrieval Method | Analogy |
| --- | --- | --- |
| Traditional Database | Exact search by book number or title | Find the book with a specific ID |
| Vector Database | Search by content relevance | Find "all science fiction novels similar to 'The Three-Body Problem'" |

Searching by semantic similarity is exactly the core problem that vector databases solve.

* * *

## Why Vector Databases Are Needed

Before diving into technical details, let's look at the problems vector databases solve.

### Limitations of Traditional Databases

Traditional relational databases (MySQL, PostgreSQL) are excellent at handling structured data, but they struggle with the following requirements:

* Image search (finding visually similar images)
* Semantic search (a search for "Apple phone" should also find content about "iPhone")
* Recommendation systems (finding "songs similar in style to the songs you like")
* Anomaly detection (finding "the logs that differ most from normal behavior")

What these problems have in common is that they require understanding the "meaning" of content rather than matching it literally.

### Problems with Traditional Solutions

* Searching with `LIKE '%apple%'` → cannot find "iPhone", nor "Apple" where matching is case-sensitive
* Searching with a full-text index → cannot find content that is semantically related but uses different words

### Comparison Diagram

[Figure: query methods of traditional databases versus vector databases]
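The same contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional "embeddings" are hand-written, hypothetical values (a real system would get them from an embedding model), and the helper names `exact_match` and `similarity_search` are made up for this example.

```python
import numpy as np

# Hypothetical toy "embeddings", hand-written for illustration only.
# A real system would obtain these from an embedding model.
documents = {
    "iPhone 15 review": np.array([0.9, 0.1, 0.0]),
    "Apple phone deals": np.array([0.8, 0.2, 0.1]),
    "Banana bread recipe": np.array([0.0, 0.1, 0.9]),
}
query_text = "apple phone"
query_vector = np.array([0.85, 0.15, 0.05])  # assumed embedding of the query


def exact_match(query, titles):
    """Literal substring matching, similar in spirit to SQL LIKE '%query%'."""
    return [t for t in titles if query.lower() in t.lower()]


def similarity_search(query_vec, docs, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(docs, key=lambda t: cosine(query_vec, docs[t]), reverse=True)
    return ranked[:top_k]


literal_hits = exact_match(query_text, documents)
semantic_hits = similarity_search(query_vector, documents)
print("Literal match:   ", literal_hits)
print("Similarity match:", semantic_hits)
```

With these toy values, the literal search returns only the title that contains the words "apple phone", while the similarity search returns both phone-related titles, including the one that never mentions "apple".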
* * *

## Core Concepts: Vectors and Embeddings

Understanding vectors and embeddings is the first step to mastering vector databases.

### What Is a Vector

In mathematics, a vector is an ordered list of numbers.

    [0.12, -0.54, 0.87, 0.03, ..., 0.61]   ← This is a vector

In machine learning, these numbers represent the semantic features of an object. The number of dimensions is typically between 128 and 4096.

### What Is an Embedding

Embedding is both the process and the result of converting real-world objects (text, images, audio, etc.) into vectors. The conversion is done by an embedding model, whose core idea is that objects with similar semantics map to vectors that are close together in space.

### Semantic Proximity Means Vector Proximity

[Figure: simplified 2D example of an embedding space; real embeddings have hundreds to thousands of dimensions]

> Key Understanding: The closer two vectors are in vector space, the more semantically similar their original content. This is the foundation of all vector database capabilities.

* * *

## Similarity Calculation Methods

Finding "the most similar vector" comes down to calculating the distance or similarity between two vectors. Here are the three most commonly used methods.

### Cosine Similarity

Cosine similarity measures the angle between two vectors, ignoring their lengths. It is the most commonly used method and is especially suitable for text.

Formula:

$$
\text{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

* Result range: -1 to 1; higher values mean more similar
* Applicable scenarios: Text semantic search, document similarity

### Euclidean Distance

Euclidean distance measures the straight-line distance between two points; the smaller the distance, the more similar.

Formula:

$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$

* Result range: 0 to ∞; smaller values mean more similar
* Applicable scenarios: Image retrieval, location-related applications

### Dot Product

The dot product is the sum of the element-wise products of two vectors, so it combines direction and length information.

Formula:

$$
A \cdot B = \sum_{i=1}^{n} A_i B_i
$$

* Applicable scenarios: Recommendation systems (equivalent to cosine similarity when the vectors are normalized)

### Comparison of Three Methods

| Method | Result Range | Interpretation | Applicable Scenarios |
| --- | --- | --- | --- |
| Cosine similarity | -1 to 1 | Higher means more similar | Text semantic search, document similarity |
| Euclidean distance | 0 to ∞ | Smaller means more similar | Image retrieval, location-related applications |
| Dot product | -∞ to ∞ | Higher means more similar | Recommendation systems |

### Python Code Example

The following example implements the three similarity calculations in Python:

    import numpy as np

    # Cosine similarity: measures directional similarity
    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Euclidean distance: measures absolute position difference
    def euclidean_distance(a, b):
        return np.linalg.norm(a - b)

    # Dot product: combines direction and length
    def dot_product(a, b):
        return np.dot(a, b)

    # Example vectors
    v1 = np.array([0.12, -0.54, 0.87, 0.03])
    v2 = np.array([0.10, -0.50, 0.90, 0.05])
    v3 = np.array([-0.80, 0.20, -0.30, 0.70])

    print(f"v1 vs v2 Cosine Similarity: {cosine_similarity(v1, v2):.4f}")  # Approximately 0.9985 (very similar)
    print(f"v1 vs v3 Cosine Similarity: {cosine_similarity(v1, v3):.4f}")  # Approximately -0.38 (not similar)

Output:

    v1 vs v2 Cosine Similarity: 0.9985
    v1 vs v3 Cosine Similarity: -0.3835

* * *

## Vector Indexing Algorithms

When the data volume is large (millions or billions of vectors), calculating similarity against every stored vector (brute-force search) is too slow. Vector databases use specialized indexing algorithms to accelerate queries.

### Brute-force Search (Flat)

Brute-force search traverses all vectors and calculates similarity one by one.
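A minimal sketch of brute-force search with NumPy, assuming cosine similarity as the metric. The function name `brute_force_search`, the random test data, and the seed are illustrative choices and do not come from any particular database's API.

```python
import numpy as np


def brute_force_search(query, vectors, top_k=3):
    """Return (indices, scores) of the top_k stored vectors most similar to query.

    Similarity is cosine similarity; every stored vector is compared once.
    """
    # Normalize so that a plain dot product equals cosine similarity.
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)

    scores = vectors_norm @ query_norm   # one score per stored vector
    top = np.argsort(-scores)[:top_k]    # indices of the highest scores first
    return top, scores[top]


# Illustrative data: 1,000 random 64-dimensional vectors (an assumption for the demo).
rng = np.random.default_rng(seed=0)
database = rng.normal(size=(1000, 64))

# Query with a slightly perturbed copy of vector 42, so it should rank first.
query = database[42] + 0.01 * rng.normal(size=64)

indices, scores = brute_force_search(query, database, top_k=3)
print("Top indices:", indices)
print("Top scores: ", scores)
```

Because every query touches all stored vectors, the cost grows linearly with the size of the collection, which is what the indexing algorithms in this section are designed to avoid.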

| Aspect | Description |
| --- | --- |
| Principle | Traverse all vectors, calculate similarity one by one |
| Advantages | 100% accurate results |
| Disadvantages | Extremely slow with large data volumes, O(n) complexity |
| Applicable | Fewer than about 100,000 vectors, or extremely high accuracy requirements |

### IVF (Inverted File Index)

IVF execution steps:

1. Training phase: Use K-Means to cluster all vectors into N clusters and record the center of each cluster
2. Query phase: First find the centers of the closest clusters, then run a precise search only within those clusters

### HNSW (Hierarchical Navigable Small World)

HNSW is currently the most widely used vector indexing algorithm, balancing speed and accuracy.

HNSW core idea:

* Build a multi-layer graph structure that is sparse at the top and dense at the bottom
* During a query, start from the top-level entry point and play "hopscotch": in each layer, greedily jump to closer nodes, then drop down to the next layer
* This greatly reduces the number of nodes that need to be compared; time complexity is approximately O(log n)

### Other Common Indexes

| Index Type | Characteristics | Applicable Scenarios |
| --- | --- | --- |
| Flat (Brute-force) | Accurate but slow | Small datasets, accuracy priority |
| IVF_Flat | Clusters first, then searches precisely; fast | Medium to large scale, sufficient memory |
| IVF_PQ | Quantization compression, saves memory | Ultra-large scale, limited memory |
| HNSW | Fast, high accuracy, high memory usage | Most commonly used, recommended first choice |
| ScaNN | Developed by Google, optimized for throughput | High-concurrency production environments |

* * *

## Comparison of Mainstream Vector Databases

A side-by-side comparison of the mainstream vector databases helps with choosing one for a given scenario.

> Beginner suggestion: Start with Chroma or pgvector; the former is suitable for AI application prototyping, the latter for projects already using PostgreSQL.

* * *

## Quick Start: Python Example

This section uses Chroma (the easiest to get started with) to demonstrate the complete CRUD process.

### Installation

    pip install chromadb openai

### Complete Example: Building a Document Semantic Search System

The following code demonstrates, from start to finish, how to use Chroma to build a semantic document search system.

    import chromadb
    from chromadb.utils import embedding_functions

    # ─── 1. Initialize Client ───────────────────────────────────────────
    # Persist to local disk (recommended)
    client = chromadb.PersistentClient(path="./