LiHua-World is a dataset specifically designed for local RAG (Retrieval-Augmented Generation) scenarios. It contains one year's worth of chat records from a virtual user named LiHua.
- Includes three types of questions:
- Single-hop
- Multi-hop
- Summary
- Each question is accompanied by manually annotated answers and supporting documents
- The chat records cover various aspects of daily life, including:
- Social interactions
- Fitness training
- Entertainment activities
- Life affairs
- ...
The dataset mainly consists of the following parts:
- Chat messages organized in chronological order
- Each message contains:
- Timestamp
- Sender
- Message content
- Message type
- query_set.csv: Contains questions, standard answers, and evidence
- query_set.json: JSON format version of the CSV file
- User information
- Time range: January 2026 to December 2026
- List of conversation participants
Step 1. Unzip the LiHuaWorld.zip
file in the ./data
directory to obtain the original chat records.
Step 2. Use all the chat records in the ./data
directory as the knowledge base.
Step 3. Use query_set.csv
or query_set.json
in the ./qa
directory as the question set to conduct RAG testing.