# GraphRAG Usage
\\
GraphRAG is a structured, hierarchical Retrieval-Augmented Generation (RAG) system open-sourced by Microsoft Research. Unlike traditional RAG, which relies purely on text vector retrieval, GraphRAG first uses a large language model to extract entities and relationships from raw documents and build a knowledge graph, then performs retrieval and question answering over the graph structure.
\\
This tutorial is aimed at complete beginners. Starting from installation, it walks through the whole pipeline: documents → knowledge graph → query and answer. The chat model is qwen-plus, and the embedding model is Alibaba Cloud Tongyi's text-embedding-v4.
\\
* * *\\
\\
## What is GraphRAG\\
\\
To understand GraphRAG, you first need to understand how it improves upon traditional RAG.\\
\\
### Limitations of Traditional RAG\\
\\
The traditional RAG (Baseline RAG) workflow is: chunk the documents → generate vectors → when a question arrives, retrieve the most relevant chunks by vector similarity → send them to the model to generate an answer.
\\
This solution works well for "exact retrieval" type questions, but performs poorly in two scenarios:\\
\\
* **Cross-document associative reasoning**: When the answer requires connecting information scattered across different documents, vector similarity cannot express this "relationship." For example, "What is the connection between the CEO of Company A and the founder of Company B?"\\
* **Global summary questions**: When the question requires a high-level summary of the entire corpus, a few text fragments cannot cover the full picture. For example, "What are the key risk points in these financial report documents?"
\\
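The baseline workflow described above can be sketched in a few lines of Python. This is a toy illustration only: it assumes a bag-of-words count vector in place of a real embedding model, and the function names (`chunk`, `embed`, `cosine`, `retrieve`) are hypothetical, not part of any library.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of `size` words (real systems usually chunk by tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the `top_k` chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

Each chunk is scored independently against the question, which is why a relationship that spans two chunks never shows up in the ranking.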
### How GraphRAG Solves These Two Problems\\
\\
GraphRAG's core idea is: **instead of retrieving text chunks directly, first build a knowledge graph, then retrieve based on the graph structure**.
\\
The entire process is divided into two stages:\\
\\
| Stage | What it does | Output |\\
| --- | --- | --- |\\
| Index | Use an LLM to read the documents and extract entities (people, places, organizations, etc.) and relationships to build a knowledge graph; then cluster the graph hierarchically with the Leiden algorithm and generate a summary for each community | Knowledge graph + community summaries + vector index |
| Query | When the user asks a question, select a retrieval strategy based on the question type, then inject graph data, community summaries, and original text fragments into the LLM context to generate an answer | Structured answers with source citations |
\\
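To make the Index stage more concrete, here is a minimal sketch of turning extracted triples into a graph and grouping entities into communities. The triples are hypothetical examples of what an LLM might extract, and plain connected components stand in for Leiden clustering; this is not how GraphRAG itself is implemented.

```python
from collections import defaultdict

# Hypothetical (entity, relation, entity) triples, as an LLM might extract them.
triples = [
    ("Elon Musk", "founded", "SpaceX"),
    ("Gwynne Shotwell", "president_of", "SpaceX"),
    ("Falcon 9", "built_by", "SpaceX"),
    ("Charles Dickens", "wrote", "A Christmas Carol"),
]


def build_graph(triples):
    """Build an undirected adjacency map: entity -> set of neighbouring entities."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph


def communities(graph):
    """Group entities by connected component (a simple stand-in for Leiden)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

In the real pipeline, each community would then be passed to the LLM to produce a summary; that step is omitted here.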
> The indexing phase makes a large number of LLM API calls, running entity extraction and summarization on every text segment. **Token consumption is much higher than with traditional RAG.** For a first run, test the pipeline on a small document (a few thousand characters), and move on to large corpora only after confirming that everything works.
\\
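One way to follow the advice about starting small is to cut a short sample from a larger document before indexing it. The sketch below is a hypothetical helper, not a GraphRAG feature; the 3000-character default is an assumed reading of "a few thousand characters".

```python
def take_sample(text: str, max_chars: int = 3000) -> str:
    """Return the leading paragraphs of `text`, at most `max_chars` characters in total.

    Paragraphs are kept whole; if the first paragraph alone is too long,
    the result is an empty string.
    """
    sample, total = [], 0
    for paragraph in text.split("\n\n"):
        # Account for the blank line that rejoins paragraphs after the first one.
        cost = len(paragraph) + (2 if sample else 0)
        if total + cost > max_chars:
            break
        sample.append(paragraph)
        total += cost
    return "\n\n".join(sample)
```

The returned text can be written to a file under `input/` for a first trial run.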
### Comparison of Four Query Modes\\
\\
| Query Mode | Retrieval Strategy | Suitable Question Types | Resource Consumption |\\
| --- | --- | --- | --- |\\
| Global Search | Map-Reduce over all community summaries | Global, summary-style questions, such as "What is the core theme of these documents" | High (many concurrent LLM calls) |
| Local Search | Start from related entities, then expand to neighbors and associated text | Questions about specific entities, such as "the main relationship network of a certain person" | Medium |
| DRIFT Search | Local Search plus iterative follow-up questions guided by community information | Entity-related questions that need broad context, balancing depth and breadth | Medium-High |
| Basic Search | Standard vector similarity (same as traditional RAG) | Simple exact retrieval; useful as a baseline for comparison | Low |
\\
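The Map-Reduce idea behind Global Search can be sketched as follows. This is only an illustration of the pattern: keyword overlap stands in for the LLM call that would rate each community summary, and all function names are hypothetical.

```python
def map_step(question: str, summary: str) -> tuple[int, str]:
    """Stand-in for an LLM call: score a summary by keyword overlap with the question."""
    q_words = set(question.lower().split())
    score = len(q_words & set(summary.lower().split()))
    return score, summary


def reduce_step(partials: list[tuple[int, str]], top_n: int = 2) -> str:
    """Keep the highest-scoring partial results and join them into one context."""
    useful = sorted((p for p in partials if p[0] > 0), key=lambda p: p[0], reverse=True)
    return " | ".join(text for _score, text in useful[:top_n])


def global_search(question: str, community_summaries: list[str]) -> str:
    """Map over every community summary, then reduce to a single context string."""
    partials = [map_step(question, s) for s in community_summaries]
    return reduce_step(partials)
```

Because the map step touches every community summary, the cost grows with the number of communities, which matches the "High" resource rating in the table.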
* * *\\
\\
## Environment Requirements\\
\\
Before starting, please confirm your environment meets the following conditions.\\
\\
| Dependency | Version Requirement | Description |\\
| --- | --- | --- |\\
| Python | 3.10 ~ 3.12 | Officially supported range; 3.13 is not yet supported |
| pip | Latest version | Run `pip install --upgrade pip` first |
| Alibaba Cloud Bailian API Key | — | Apply at [bailian.console.aliyun.com/](https://bailian.console.aliyun.com/cn-beijing?userCode=i5mn5r7m&tab=model#/api-key) |
| Disk Space | ≥ 2GB | Index artifacts (Parquet files, vector database) take up significant space |
\\
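The Python version and disk space requirements can be checked with a short script. This is a minimal sketch using only the standard library; the thresholds are copied from the table above, and the function names are hypothetical.

```python
import shutil
import sys

# Version range and disk threshold taken from the requirements table.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)
MIN_FREE_GB = 2


def python_supported(version=None) -> bool:
    """True if (major, minor) falls inside the supported range."""
    major_minor = tuple(version or sys.version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def enough_disk(path: str = ".", min_gb: float = MIN_FREE_GB) -> bool:
    """True if the filesystem holding `path` has at least `min_gb` GiB free."""
    return shutil.disk_usage(path).free >= min_gb * 1024 ** 3


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("Disk space OK:", enough_disk())
```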
* * *\\
\\
## Installation\\
\\
GraphRAG is published on PyPI. It is recommended to install in a Python virtual environment to avoid polluting the system environment.\\
\\
### Create Project Directory and Initialize Virtual Environment\\
\\
    mkdir graphrag_demo
    cd graphrag_demo
    python -m venv .venv

> You can also use `uv` to pin the Python version: `uv venv --python 3.12`
\\
### Activate Virtual Environment\\
\\
macOS / Linux:

    source .venv/bin/activate

Windows:

    .venv\Scripts\activate

### Install GraphRAG\\
\\
    pip install graphrag

After installation, verify it:

    graphrag --help

This prints the available `graphrag` commands and options.
\\
> GraphRAG automatically installs dependencies such as LiteLLM, LanceDB, pandas, and numpy. The first installation can take a while (usually 1 to 3 minutes).
\\
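Besides `graphrag --help`, the installation can be checked from Python itself. The sketch below uses only the standard library (`sys` and `importlib.metadata`); the helper names are hypothetical.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != sys.base_prefix


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print("graphrag version:", installed_version("graphrag"))
```

If the version prints as `None`, the package was probably installed into a different environment than the one currently active.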
* * *\\
\\
## Initialize Workspace\\
\\
Run the `graphrag init` command, and GraphRAG will generate configuration files and directory structure in the current directory.\\
\\
    graphrag init

During initialization, it prompts for the default chat model and embedding model. Press Enter to accept the defaults for now; we will edit `settings.yaml` manually later:

    Specify the default chat model to use [gpt-4.1]:
    Specify the default embedding model to use :

\\
After initialization, the directory structure is as follows:\\
\\
    graphrag_demo/
    ├── .env            # Stores the API key environment variable
    ├── settings.yaml   # Core configuration file
    ├── input/          # Text files to be processed
    ├── output/         # Index output directory (auto-generated)
    └── prompts/        # Prompt template directory (auto-generated)

> After each minor-version upgrade of GraphRAG, re-run `graphrag init --force` to refresh the configuration format; otherwise, changes to the configuration structure may cause runtime errors. Note that this command overwrites the existing `settings.yaml` and prompts, so back them up first.
\\
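Backing up before `graphrag init --force` can be automated with a small script. This is a sketch under assumptions: `backup_workspace` and the `backup/` folder name are hypothetical, and only `settings.yaml` and `prompts/` (the items named in the note above) are copied.

```python
import shutil
import time
from pathlib import Path


def backup_workspace(root: str = ".", backup_root: str = "backup") -> Path:
    """Copy settings.yaml and prompts/ into a timestamped folder and return its path."""
    root_path = Path(root)
    target = root_path / backup_root / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)

    settings = root_path / "settings.yaml"
    if settings.is_file():
        shutil.copy2(settings, target / "settings.yaml")

    prompts = root_path / "prompts"
    if prompts.is_dir():
        # dirs_exist_ok keeps a second run within the same second from failing.
        shutil.copytree(prompts, target / "prompts", dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    print("Backup written to", backup_workspace())
```

Run it from the project directory before refreshing the configuration, then merge your old settings back in by hand.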
* * *\\
\\
## Prepare Test Documents\\
\\
GraphRAG supports `.txt`, `.csv`, `.json` format documents. All files to be processed are placed in the `input/` directory.\\
\\
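Before indexing, it can help to confirm that everything in `input/` has one of the supported extensions listed above. The following sketch is a hypothetical pre-flight check, not part of GraphRAG; it matches extensions exactly as written in the list.

```python
from pathlib import Path

# Extensions listed as supported in the text above.
SUPPORTED = {".txt", ".csv", ".json"}


def list_input_files(input_dir: str = "input"):
    """Return (supported, skipped) file names found directly inside `input_dir`."""
    supported, skipped = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        if path.suffix in SUPPORTED:
            supported.append(path.name)
        else:
            skipped.append(path.name)
    return supported, skipped
```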
We use the officially recommended test document, Charles Dickens' "A Christmas Carol" (the Project Gutenberg text), downloaded here from a mirror:

    curl https://static.jyshare.com/download/pg24022.txt -o ./input/book.txt

If the download fails, you can create your own test text instead. The richer the content (many people, places, and relationships between events), the more clearly GraphRAG's knowledge graph shows its value.

Save the file in the `input/` directory as `demo.txt`:
\\
    SpaceX

    SpaceX was founded in 2002 by Elon Musk in Hawthorne, California, United States. The company's goal is to significantly reduce space transportation costs and ultimately achieve human colonization of Mars. Musk co-founded PayPal in his early years; after it was sold to eBay in 2002 for 1.5 billion USD, he invested his personal assets into SpaceX. Among the co-founders, Tom Mueller served as chief designer of propulsion systems and led development of the Merlin and Raptor engines. Gwynne Shotwell joined in 2002 and has served as president and COO since 2008, responsible for the company's daily operations and business development.

    In 2008, Falcon 1 successfully reached orbit for the first time after three failures, becoming the first liquid-fueled orbital rocket developed by a private company. In 2010, Falcon 9 completed a successful first flight, and that same year the Dragon spacecraft completed its first commercial launch and recovery. In 2012, Dragon became the first commercial spacecraft to dock with the International Space Station. In December 2015, a Falcon 9 first stage achieved the first vertical landing recovery on land, ushering in a new era of rocket reusability. Since then, SpaceX has completed more than 200 rocket recoveries, and a single booster has been reused 20 times or more. In 2020