- ❌ Original test.json contains many duplicate questions
- ❌ Question design is too simple, not conducive to hallucination detection
- ✅ Need to redesign 250 questions for 26 annual reports
- Total Questions: 250
- Unique Questions: 250 (0 duplicates)
- Report Coverage: 26 reports (2022-2024)
- Field Completeness: 100% (all items contain 5 fields)
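The statistics above could be reproduced with a small check along the following lines. This is a minimal sketch, not the project's actual validation code: it assumes the file is a JSON array of objects, and the key name `question` is a hypothetical placeholder, since the five field names are not listed here.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(text.split()).lower()


def validate(items: list[dict], question_key: str = "question", n_fields: int = 5) -> dict:
    """Return total count, unique count, duplicates, and field completeness.

    `question_key` is an assumed field name; adjust it to the real schema.
    """
    hashes = {
        hashlib.sha256(normalize(item[question_key]).encode("utf-8")).hexdigest()
        for item in items
    }
    # An item counts as complete when it carries exactly `n_fields` keys.
    complete = sum(1 for item in items if len(item) == n_fields)
    return {
        "total": len(items),
        "unique": len(hashes),
        "duplicates": len(items) - len(hashes),
        "field_completeness": complete / len(items) if items else 0.0,
    }
```

Loading `datas/test_advanced_250.json` with `json.load` and passing the result to `validate` should, if the assumptions hold, report 250 total, 250 unique, 0 duplicates, and a completeness of 1.0.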
| Type | Count | Percentage | Status |
|---|---|---|---|
| Fact Extraction | 50 | 20% | ✅ |
| List Enumeration | 50 | 20% | ✅ |
| Comparison & Calculation | 50 | 20% | ✅ |
| Judgment & Verification | 50 | 20% | ✅ |
| Reasoning & Analysis | 50 | 20% | ✅ |
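The distribution in the table can be recomputed from the `type` field of each question. The sketch below assumes the items are already loaded as a list of dicts; the actual label strings stored in `type` are not shown in this document, so the function simply counts whatever values it finds.

```python
from collections import Counter


def type_distribution(items: list[dict], type_key: str = "type") -> dict:
    """Map each type label to (count, share of total)."""
    counts = Counter(item[type_key] for item in items)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def is_balanced(items: list[dict], type_key: str = "type") -> bool:
    """True when every type label occurs the same number of times."""
    counts = Counter(item[type_key] for item in items)
    return len(set(counts.values())) <= 1
```

For a strictly balanced set, `type_distribution` should return five labels, each with a count of 50 and a share of 0.2.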
| Dimension | Original test.json | New test_advanced_250.json | Improvement |
|---|---|---|---|
| Question Duplication | Many duplicates | 0 duplicates | ✅✅✅ |
| Type Diversity | Single (mostly reasoning) | 5 types balanced | ✅✅✅ |
| Complexity Level | Simple | Medium-High (5 gradients) | ✅✅ |
| Hallucination Detection Friendly | Medium | High (with type tags) | ✅✅✅ |
| Format Standardization | Basic | Complete (5 fields) | ✅✅ |
| Documentation Completeness | None | With README + validation report | ✅✅ |
Purpose: Test RAG's precise location and numerical extraction capabilities
What is the total operating revenue of PICC in 2022 in hundred million yuan?
What is the basic earnings per share (EPS) of Sifang Co., Ltd. in 2023?
Hallucination Detection: Easy to detect numerical fabrication and unit errors
Purpose: Test structured information extraction and completeness
List the names and sales revenue percentages of PICC's top five customers in 2022.
What are the main business segments of China Tourism Group Duty Free in 2023? What are the revenue percentages of each segment?
Hallucination Detection: Easy to detect information omission and list fabrication
Purpose: Test multi-period data comparison and calculation capabilities
Calculate the year-on-year growth rate and growth amount of China Shenhua's operating revenue in 2023.
Compare the quarterly operating revenues from Q1 to Q4 of ICBC in 2024.
Hallucination Detection: Easy to detect calculation errors and logical confusion
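Answers to calculation questions of this kind can be re-derived from the two underlying figures. A minimal sketch of the year-on-year arithmetic follows; the function name is hypothetical and the numbers in the usage note are illustrative, not taken from any report.

```python
def yoy_growth(current: float, previous: float) -> tuple[float, float]:
    """Return (growth_amount, growth_rate) between two periods."""
    if previous == 0:
        raise ValueError("previous-period value must be non-zero")
    amount = current - previous
    return amount, amount / previous
```

With made-up inputs, `yoy_growth(110.0, 100.0)` gives a growth amount of 10.0 and a growth rate of about 0.1 (10%). A model answer whose stated rate and amount disagree with each other can then be flagged as a calculation error.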
Purpose: Test conditional branch logic and detail extraction
Did CITIC Securities implement cash dividends in 2023? If so, what are the dividend amount and dividend rate?
Does CCB have goodwill impairment in 2024? If so, what is the impairment amount?
Hallucination Detection: Easy to detect fictitious events and detail fabrication
Purpose: Test deep understanding and causal reasoning capabilities
Attribution analysis: What are the main driving factors (price vs. quantity) for PICC's operating revenue growth in 2022?
ROE decomposition: Analyze the DuPont three-factor contribution to ICBC's ROE change in 2024.
Hallucination Detection: Easy to detect attribution errors and logical leaps
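For the ROE question, the standard DuPont identity is ROE = net margin × asset turnover × equity multiplier. One way to attribute a change in ROE to the three factors is a log decomposition, sketched below. This is only one of several attribution methods and is not necessarily the one expected by the reference answers; the class and function names are hypothetical, and period-end balances are used for simplicity where averages are often preferred.

```python
import math
from dataclasses import dataclass


@dataclass
class Financials:
    net_income: float
    revenue: float
    total_assets: float
    equity: float


def dupont(f: Financials) -> dict:
    """Three DuPont factors; their product equals net_income / equity."""
    return {
        "net_margin": f.net_income / f.revenue,
        "asset_turnover": f.revenue / f.total_assets,
        "equity_multiplier": f.total_assets / f.equity,
    }


def log_contributions(prev: Financials, curr: Financials) -> dict:
    """Log change of each factor; the values sum to ln(ROE_curr / ROE_prev).

    Requires all factors to be positive in both periods.
    """
    p, c = dupont(prev), dupont(curr)
    return {name: math.log(c[name] / p[name]) for name in p}
```

Because the contributions sum exactly to the log change in ROE, an answer that attributes the change to a factor that barely moved can be checked against these values.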
- ✅ `datas/test_advanced_250.json` - 250-question set (with blank answer fields)
- ✅ `datas/test_advanced_250_README.md` - Detailed usage instructions
- ✅ `datas/test_advanced_250_VALIDATION.txt` - Validation report
- ✅ `tools/generate_advanced_questions.py` - Question generation script (reusable)
- ✅ `tools/fill_answers_example.py` - Answer filling example script
Use the RAG system to fill in the answers by running `python tools/fill_answers_example.py`.

Spot-check the answer quality of each question type (checking 10 questions per type is recommended).
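The recommended spot check of 10 questions per type could be drawn as a stratified random sample. The helper below is a sketch with a hypothetical name; it assumes the filled question set is loaded as a list of dicts that each carry a `type` field.

```python
import random
from collections import defaultdict


def sample_per_type(items: list[dict], per_type: int = 10,
                    type_key: str = "type", seed: int = 0) -> dict:
    """Pick up to `per_type` random items from each type group."""
    rng = random.Random(seed)  # fixed seed keeps the review sample reproducible
    by_type = defaultdict(list)
    for item in items:
        by_type[item[type_key]].append(item)
    return {
        label: rng.sample(group, min(per_type, len(group)))
        for label, group in by_type.items()
    }
```

With the balanced 250-question set this yields 50 items for manual review, 10 from each of the 5 types.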
Build positive and negative sample pairs based on annotation results:
- Positive samples: High-quality answers
- Negative samples: Annotate 6 types of hallucinations (numerical fabrication/information omission/fictitious events/logical errors/time confusion/calculation errors)
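One possible in-memory representation of these pairs is sketched below. The six enum members mirror the hallucination types listed above; the class names, field names, and string values are assumptions for illustration, not an existing schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    NUMERICAL_FABRICATION = "numerical_fabrication"
    INFORMATION_OMISSION = "information_omission"
    FICTITIOUS_EVENT = "fictitious_event"
    LOGICAL_ERROR = "logical_error"
    TIME_CONFUSION = "time_confusion"
    CALCULATION_ERROR = "calculation_error"


@dataclass
class Sample:
    question: str
    answer: str
    is_positive: bool
    hallucination: Optional[HallucinationType] = None

    def __post_init__(self) -> None:
        # Negative samples must carry a label; positive samples must not.
        if self.is_positive == (self.hallucination is not None):
            raise ValueError("hallucination label is required only for negative samples")


def build_pair(question: str, good_answer: str, bad_answer: str,
               hallucination: HallucinationType) -> tuple[Sample, Sample]:
    """Return a (positive, negative) pair sharing the same question."""
    return (
        Sample(question, good_answer, True),
        Sample(question, bad_answer, False, hallucination),
    )
```

Keeping the label mandatory on negative samples makes it harder for unannotated answers to slip into the training pairs.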
- Strict Type Balance: For the first time, the 5 question types follow a strictly balanced 50:50:50:50:50 distribution
- Type Labeling: Each question has a `type` field for classification evaluation
- Answer Field Reserved: Unified format, convenient for RAG system filling
- Zero Duplication Guarantee: Global deduplication mechanism, 250 questions 100% unique
- Complete Documentation: Includes README + validation report + example scripts
- Type Diversity: Cover 5 dimensions of fact/list/calculation/judgment/reasoning
- Difficulty Progression: From simple queries to complex analysis
- Verifiability: All answers can be found in the documents
- Deduplication Mechanism: Global hash deduplication to avoid simple copying
- Fact Type: Easy for automatic verification (numerical matching)
- List Type: Easy to check completeness (set comparison)
- Calculation Type: Easy for mathematical verification
- Judgment Type: Easy for binary classification evaluation
- Reasoning Type: Requires human judgment of attribution rationality
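The automatic checks suggested for the fact, list, and judgment types can be sketched as follows. The function names and the default tolerance are assumptions, and the sketch presumes that numeric values, list entries, and yes/no verdicts have already been parsed out of the free-text answers.

```python
import math


def fact_match(predicted: float, expected: float, rel_tol: float = 1e-3) -> bool:
    """Numerical matching with a relative tolerance to absorb rounding."""
    return math.isclose(predicted, expected, rel_tol=rel_tol)


def list_scores(predicted: list[str], expected: list[str]) -> dict:
    """Set comparison: missing items indicate omission, extras indicate fabrication."""
    pred, gold = set(predicted), set(expected)
    hits = len(pred & gold)
    return {
        "precision": hits / len(pred) if pred else 0.0,
        "recall": hits / len(gold) if gold else 0.0,
        "missing": sorted(gold - pred),
        "fabricated": sorted(pred - gold),
    }


def judgment_accuracy(predicted: list[bool], expected: list[bool]) -> float:
    """Share of yes/no verdicts that agree with the reference."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)
```

Unit mismatches (for example, yuan versus hundred million yuan) need to be normalized before `fact_match` is applied; otherwise a correct value in the wrong unit is indistinguishable from a fabricated one.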
Generation Date: 2025-11-05
Version: v2.1
Generation Tool: tools/generate_advanced_questions.py
Validation Status: ✅ All Passed