Chapter 3: Four Steps to Build RAG

Having worked through Chapter 1, we now have a basic understanding of RAG and have prepared the virtual environment and API key. Next, we use the LangChain and LlamaIndex frameworks to implement and run our first RAG application. The example shows how to load a local Markdown document, process its text with an embedding model, and combine the result with a large language model (LLM) to answer questions about the document's content.

1. Start Virtual Environment

1.1 Activate Virtual Environment

This section assumes that you have created a Conda virtual environment named all-in-rag by following the guidance in the previous chapter.

If you are using Cloud Studio, first confirm that you are in the user environment. If you are not, run su ubuntu to switch to it.

Before running the script, activate the virtual environment:

conda activate all-in-rag

1.2 Switch to Project Directory

# Assuming currently in the root directory of the all-in-rag project
cd code/C1

The code files for each chapter are stored in the code/Cx directory, where x represents the chapter number.

2. Run RAG Example Code

With the setup above complete, you can run the RAG example.

Open a terminal, make sure the virtual environment is activated, then execute the following command:

python 01_langchain_example.py

If you encounter nltk-related errors, try running fix_nltk.py in the code path.

After the code runs, you can see output similar to the following (formatted):

Downloading Model from https://www.modelscope.cn to directory: Path\to\all-in-rag\models\bge-small-zh-v1.5
2025-06-08 02:36:19,318 - modelscope - INFO - Target directory already exists, skipping creation.
content='
文中举了以下例子：

1. **自然界中的羚羊**：刚出生的羚羊通过试错学习站立和奔跑，适应环境。
2. **股票交易**：通过买卖股票并根据市场反馈调整策略，最大化奖励。
3. **雅达利游戏（如Breakout和Pong）**：通过不断试错学习如何通关或赢得游戏。
4. **选择餐馆**：利用（去已知喜欢的餐馆）与探索（尝试新餐馆）的权衡。
5. **做广告**：利用（采取已知最优广告策略）与探索（尝试新广告策略）。
6. **挖油**：利用（在已知地点挖油）与探索（在新地点挖油，可能发现大油田）。
7. **玩游戏（如《街头霸王》）**：利用（固定策略如蹲角落出脚）与探索（尝试新招式如"大招"）。

这些例子用于说明强化学习中的核心概念（如探索与利用、延迟奖励等）及其在实际场景中的应用。
'
additional_kwargs={'refusal': None}
response_metadata={
    'token_usage': {
        'completion_tokens': 209,
        'prompt_tokens': 5576,
        'total_tokens': 5785,
        'completion_tokens_details': None,
        'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 5568},
        'prompt_cache_hit_tokens': 5568,
        'prompt_cache_miss_tokens': 8
    },
    'model_name': 'deepseek-chat',
    'system_fingerprint': 'fp_8802369eaa_prod0425fp8',
    'id': '67a0580d-78b1-44d6-bccf-f654ae0e9bba',
    'service_tier': None,
    'finish_reason': 'stop',
    'logprobs': None
}
id='run--919cedcd-771e-4aed-8dfd-cf436795792e-0'
usage_metadata={
    'input_tokens': 5576,
    'output_tokens': 209,
    'total_tokens': 5785,
    'input_token_details': {'cache_read': 5568},
    'output_token_details': {}
}

On the first run, the script downloads the BAAI/bge-small-zh-v1.5 embedding model.

Output parameter explanation:

- content: The core of the output: the answer that the large language model (LLM) generated from your question and the provided context.
- additional_kwargs: Additional parameters. In this example it is {'refusal': None}, indicating that the model did not refuse to answer.
- response_metadata: Metadata about the LLM response.
  - token_usage: The number of tokens consumed by this call, including completion_tokens, prompt_tokens, and total_tokens.
  - model_name: The name of the LLM used, deepseek-chat in this example.
  - system_fingerprint, id, service_tier, finish_reason, logprobs: More detailed API response information. For example, finish_reason: 'stop' indicates that the model completed generation normally.
- id: The unique identifier for this run.
- usage_metadata: Similar to token_usage in response_metadata; it provides statistics on input and output tokens.
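
The token counts in the sample output can be cross-checked with a little arithmetic. The sketch below copies the token_usage values from the output above into a plain dictionary (a stand-in for the real response object, not the LangChain API) and derives a cache hit ratio. The names computed_total, prompt_parts, and cache_hit_ratio are introduced here for illustration only.

```python
# Values copied from the sample output above; a plain dict stands in
# for the token_usage entry of the real response metadata.
token_usage = {
    "completion_tokens": 209,
    "prompt_tokens": 5576,
    "total_tokens": 5785,
    "prompt_cache_hit_tokens": 5568,
    "prompt_cache_miss_tokens": 8,
}

# total_tokens should equal prompt tokens plus completion tokens.
computed_total = token_usage["prompt_tokens"] + token_usage["completion_tokens"]

# Cached and uncached prompt tokens should add up to prompt_tokens.
prompt_parts = (
    token_usage["prompt_cache_hit_tokens"] + token_usage["prompt_cache_miss_tokens"]
)

# Share of the prompt that was served from the cache (illustrative value).
cache_hit_ratio = token_usage["prompt_cache_hit_tokens"] / token_usage["prompt_tokens"]

print(computed_total == token_usage["total_tokens"])
print(prompt_parts == token_usage["prompt_tokens"])
print(f"cache hit ratio: {cache_hit_ratio:.3f}")
```

In this sample almost the entire prompt was read from the cache, which matches the cache_read value reported under usage_metadata.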

3. RAG Implementation Based on LangChain Framework

In Chapter 1, we mentioned that building a minimum viable system takes four steps: data preparation, index construction, retrieval optimization, and generation integration. Next, we implement a RAG application with the LangChain framework, following these four steps.

3.1 Initial Setup

First, perform the basic configuration: import the necessary libraries, load the environment variables, and download the embedding model.

import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_deepseek import ChatDeepSeek

# Load environment variables
load_dotenv()
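
Because os.getenv returns None when a variable is missing, a forgotten API key may only surface later, when the LLM call fails. A minimal sketch of an early check follows; require_env is a hypothetical helper that is not part of the project code, and the demo uses an explicit mapping with a placeholder value instead of the real environment.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Demonstration with an explicit mapping so the sketch runs without a real key.
# "sk-demo" is a placeholder, not a valid API key.
demo_env = {"DEEPSEEK_API_KEY": "sk-demo"}
print(require_env("DEEPSEEK_API_KEY", demo_env))
```

In the real script, a call such as require_env("DEEPSEEK_API_KEY") placed after load_dotenv() would read from the actual environment.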

3.2 Data Preparation

Load raw documents: First define the path to the Markdown file, then use TextLoader to load the file as a knowledge source.

markdown_path = "../../data/C1/markdown/easy-rl-chapter1.md"
loader = TextLoader(markdown_path)
docs = loader.load()

Text chunking: To facilitate subsequent embedding and retrieval, the long document is split into smaller, manageable text chunks. Here we use the recursive character splitting strategy with its default parameters. When RecursiveCharacterTextSplitter() is initialized without arguments, its default behavior aims to preserve the semantic structure of the text as far as possible:
- Default separators and semantic preservation: The splitter tries a series of preset separators in order, ["\n\n" (paragraphs), "\n" (lines), " " (spaces), "" (characters)], and splits the text recursively, moving to the next, finer separator only when a piece is still larger than the target size. This keeps paragraphs, sentences, and words intact wherever possible, since they are usually the most semantically coherent units of text.
- Preserve separators: By default (keep_separator=True), the separators themselves are kept in the resulting chunks.
- Default chunk size and overlap: The splitter uses the defaults chunk_size=4000 and chunk_overlap=200 defined in its base class TextSplitter. The size limit keeps each chunk within a predetermined length, and the overlap between adjacent chunks reduces the loss of context at chunk boundaries.
```
text_splitter = RecursiveCharacterTextSplitter()
texts = text_splitter.split_documents(docs)
```
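
To make the separator fallback concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: recursive_split is a hypothetical function, it ignores chunk_overlap, and it drops the separator at chunk boundaries instead of preserving it.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to merge the piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
chunks = recursive_split(sample, chunk_size=10)
print(chunks)
```

The first paragraph fits within the limit and stays whole, while the second is too long and is split further at spaces.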

3.3 Index Construction

With data preparation complete, the next step is to build the vector index:

Initialize Chinese embedding model: Use HuggingFaceEmbeddings to load the Chinese embedding model downloaded in the initial setup. Configure the model to run on CPU and enable embedding normalization (normalize_embeddings: True).
```
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-zh-v1.5",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
```
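
The effect of embedding normalization can be sketched with plain Python. Once vectors are scaled to unit length, their dot product equals their cosine similarity. The helper names and the toy 3-dimensional vectors below are assumptions for illustration; real embeddings from the model have many more dimensions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(dot(a_unit, a_unit))      # squared length of a normalized vector
print(cosine_similarity(a, b))  # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))      # the same value via a plain dot product
```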
Build vector store: Create an InMemoryVectorStore with the initialized embedding model, then add the split text chunks (texts). Each chunk is converted into a vector by the embedding model and stored alongside its original text, which builds a vector index in memory.
```
vectorstore = InMemoryVectorStore(embeddings)
vectorstore.add_documents(texts)
```
After this process is completed, a queryable knowledge index is built.

3.4 Query and Retrieval

After the index is built, you can query and retrieve based on user questions:

Define user query: Set a specific user question string.

question = "What examples are mentioned in the text?"

Query relevant documents in the vector store: Use the vector store's similarity_search method to find the k text chunks in the index that are most relevant to the user question (k=3 in this example).
```
retrieved_docs = vectorstore.similarity_search(question, k=3)
```
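
Conceptually, a top-k similarity search scores every stored chunk against the query and keeps the best k. The sketch below illustrates this with a toy in-memory index. toy_embed (a bag-of-words counter) and top_k_search are hypothetical stand-ins for the embedding model and the vector store, and the sample chunks are made up.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return shared / (norm_a * norm_b)


def top_k_search(store, query, k=3):
    """Return the k stored texts most similar to the query, best first."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


chunks = [
    "an agent learns by trial and error",
    "stock trading adjusts strategy from market feedback",
    "exploration tries new restaurants",
    "exploitation returns to a known restaurant",
]
# The "index" is simply a list of (vector, text) pairs held in memory.
store = [(toy_embed(chunk), chunk) for chunk in chunks]

results = top_k_search(store, "trial and error learning", k=2)
print(results)
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk even when they share no words.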
Prepare context: Merge the page content (doc.page_content) of multiple retrieved text chunks into a single string, separated by double newlines ("\n\n"), forming the final context information (docs_content) for the large language model to reference.
```
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
```
Chunks are joined with double newlines ("\n\n") rather than single newlines ("\n") so that the boundaries between independent text fragments stay visible when the context is passed to the LLM. A double newline conventionally marks the end of one paragraph and the start of the next, so this format helps the LLM treat each chunk as a separate context source and make better use of it when generating an answer.
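
The difference between the two separators is easy to see on a small example. In this sketch, SimpleNamespace objects stand in for the retrieved documents and model only the page_content attribute; the chunk texts are made up.

```python
from types import SimpleNamespace

# Stand-ins for retrieved documents; only page_content is modeled.
retrieved_docs = [
    SimpleNamespace(page_content="Chunk A, line 1\nChunk A, line 2"),
    SimpleNamespace(page_content="Chunk B"),
]

single = "\n".join(doc.page_content for doc in retrieved_docs)
double = "\n\n".join(doc.page_content for doc in retrieved_docs)

# With single newlines, the chunk boundary looks like any other line break.
print(single.split("\n"))
# With double newlines, the blank line marks where one chunk ends.
print(double.split("\n\n"))
```

The boundary is only unambiguous when the chunks themselves contain no blank lines, which is a limitation of this simple convention.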

3.5 Generation Integration

The final step is to combine the retrieved context with the user question and have the large language model (LLM) generate an answer:

Build prompt template: Use ChatPromptTemplate.from_template to create a structured prompt template. The template instructs the LLM to answer the user question based only on the provided context ({context}) and states explicitly how to respond when the context does not contain enough information.

prompt = ChatPromptTemplate.from_template("""Please answer the question based on the context information provided below.
Please ensure your answer is completely based on this context.
If there is not enough information in the context to answer the question, please directly inform: "Sorry, I cannot find relevant information in the provided context to answer this question."

Context:
{context}

Question: {question}

Answer:"""
                                          )
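
The placeholder substitution itself can be previewed with an ordinary Python string, assuming the template uses the same {name} placeholder style. This is a plain-string sketch with a shortened template and made-up values, not the ChatPromptTemplate API.

```python
# A plain-string stand-in for the prompt template; the placeholder names
# match the {context} and {question} variables used above.
template = (
    "Please answer the question based on the context information provided below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

# Made-up values, used only to show where each placeholder ends up.
filled = template.format(
    context="Reinforcement learning balances exploration and exploitation.",
    question="What does reinforcement learning balance?",
)
print(filled)
```

Printing the filled prompt before sending it to the model is a quick way to confirm that the retrieved context actually reached the prompt.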

Configure large language model: Initialize the ChatDeepSeek client, specifying the model (deepseek-chat), the sampling temperature (temperature=0.7), the maximum number of tokens to generate (max_tokens=2048), and the API key (read from the DEEPSEEK_API_KEY environment variable).
```
llm = ChatDeepSeek(
    model="deepseek-chat",
    temperature=0.7,
    max_tokens=2048,
    api_key=os.getenv("DEEPSEEK_API_KEY")
)
```
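
As a conceptual illustration of the temperature parameter, the sketch below applies the standard temperature-scaled softmax to made-up token scores. How the model provider applies temperature internally is not described here, so treat this as a general picture rather than a description of the DeepSeek API; softmax_with_temperature and the logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, giving more deterministic answers, while higher temperatures spread it out and give more varied ones.
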
Call LLM to generate answer and output: Format the user question (question) and previously prepared context (docs_content) into the prompt template, then call ChatDeepSeek's invoke method to get the generated answer.
```
answer = llm.invoke(prompt.format(question=question, context=docs_content))
print(answer)
```

Teacher, teacher, LangChain is powerful, but it still takes a lot of manual work. Can you recommend a simpler, more user-friendly framework?

Yes, brother, yes! There are other user-friendly frameworks like LlamaIndex😉

4. Low-Code (Based on LlamaIndex)

For RAG, LlamaIndex provides a more highly encapsulated API, which lowers the barrier to entry. Here is a simple implementation:

import os
# os.environ['HF_ENDPOINT']='https://hf-mirror.com'
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.deepseek import DeepSeek
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

load_dotenv()

Settings.llm = DeepSeek(model="deepseek-chat", api_key=os.getenv("DEEPSEEK_API_KEY"))
Settings.embed_model = HuggingFaceEmbedding("BAAI/bge-small-zh-v1.5")

documents = SimpleDirectoryReader(input_files=["../../data/C1/markdown/easy-rl-chapter1.md"]).load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

print(query_engine.get_prompts())

print(query_engine.query("What examples are mentioned in the text?"))

Exercises (You can use large models to assist completion)

- Modify the chunk_size and chunk_overlap parameters of RecursiveCharacterTextSplitter() in the LangChain code and observe how the output changes.
- The final output of the LangChain code carries various parameters. Look up the relevant documentation and try to filter them out so that only the answer in content is printed.
- Add code comments to the LlamaIndex code.
