🔍 A toolkit for automated evaluation and fact-checking of long technical/research reports
Focus on report quality assessment and factuality verification
Reports are the most canonical and representative outputs of DeepResearch. High-quality research reports feature clear structure, rigorous logic, dense information, and trustworthy citations—crucial for knowledge-intensive research scenarios. To this end, we propose the DeepResearch-ReportEval framework: a hybrid evaluation approach combining LLM-as-a-Judge for automated report-quality assessment with expert judgments for reliability. The framework evaluates report quality across multiple dimensions, including comprehensiveness, redundancy, and factual accuracy. We also release a carefully curated dataset: 100 queries spanning diverse categories and 100 corresponding reports generated by Qwen-DeepResearch to support systematic evaluation.
| 🎯 Quality Scoring | 🔍 Fact Checking | 📈 Redundancy Detection |
|---|---|---|
| Five-dimension scoring | Web content verification | Pairwise paragraph analysis |
- ✅ Five dimensions: Comprehensiveness, Coherence, Clarity, Insightfulness, Overall
- 🔄 Redundancy detection: Smart sampling of paragraph pairs to compute average redundancy score
- 📝 Rationales: Provide reasoned explanations for the scores
- 💾 Checkpointing: Resume from checkpoints to avoid recomputation
- 🌐 Web scraping: Support both Firecrawl and Jina Reader
- 📋 Batch verification: Check each provided context against the scraped page
- 🎯 Ternary scoring: -1 (Not supported) / 0 (Uncertain) / 1 (Supported)
- 📄 Explanations: Provide supporting evidence and analysis
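The paragraph-pair sampling behind the redundancy score can be sketched in a few lines. This is an illustrative sketch only, not the actual `judge_score.py` logic: the function name is ours, and `difflib` similarity (0–1) stands in for the LLM-as-a-Judge redundancy rating.

```python
import itertools
import random
from difflib import SequenceMatcher

def redundancy_score(paragraphs, max_pairs=10, seed=0):
    """Sample paragraph pairs and average a pairwise similarity score.

    SequenceMatcher ratio (0..1) is a cheap stand-in for the LLM
    redundancy judgment applied to each sampled pair.
    """
    pairs = list(itertools.combinations(range(len(paragraphs)), 2))
    if not pairs:
        return 0.0
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(max_pairs, len(pairs)))
    scores = [
        SequenceMatcher(None, paragraphs[i], paragraphs[j]).ratio()
        for i, j in sampled
    ]
    return sum(scores) / len(scores)
```

Sampling caps the quadratic number of pairs, which is why long reports stay cheap to score.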
```
DeepResearch-ReportEval/
├── 📊 judge_score.py    # Main script for quality evaluation
├── 🔍 judge_fact.py     # Main script for fact checking
├── 🛠️ Atools.py         # Utilities and model calls
├── 📝 Aprompts.py       # Prompt templates
├── 📂 data/             # Dataset
│   ├── topic/           # High-quality topics
│   └── report/          # Reports from Qwen-DeepResearch, collected in early September 2025
└── 📋 example/          # Example inputs & outputs
    ├── judge_fact_result/    # Fact-checking examples
    └── judge_score_result/   # Quality & redundancy examples
```
- Python: 3.10+
- OS: Windows / macOS / Linux
```bash
# Clone the project
git clone https://github.com/HKUDS/DeepResearch-Eval.git
cd DeepResearch-Eval

# Install dependencies
pip install openai json-repair firecrawl-python python-dotenv tqdm requests dashscope
```

Create a `.env` file or export environment variables:

```bash
# Required
export OPENAI_API_KEY="your-openai-api-key"
export FIRECRAWL_KEY="your-firecrawl-key"   # or
export JINA_API_KEY="your-jina-api-key"

# Optional
export OPENAI_API_BASE="your-api-base"      # Custom API endpoint
```

Input (JSONL):

```json
{"topic": "Applications of AI in Healthcare", "report": "# Report Title\n\n## Introduction\n..."}
```

Output (JSON):
```json
{
  "file_id": "abc123...",
  "topic": "Applications of AI in Healthcare",
  "comprehensiveness_score": 2,
  "coherence_score": 3,
  "clarity_score": 4,
  "insight_score": 3,
  "overall_score": 3,
  "quality_reason": "The report is well-structured with sufficient arguments...",
  "repeat_score": 3.12,
  "repeat_results": [...]
}
```

Input (JSONL):

```json
{"https://example.com/page": {"contexts": ["Sentence A", "Sentence B", ...]}}
```

Output (JSONL):

```json
{"url": "https://example.com/page", "context": "Sentence A", "label": {"is_factual": 1, "sentence_support": "..."}}
```

📌 Scoring: `is_factual` ∈ { -1 (Not supported), 0 (Uncertain), 1 (Supported) }
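Given the ternary labels, a per-report support rate can be aggregated from a fact-checking output file. A minimal sketch assuming the JSONL output schema above (the function name is ours):

```python
import json

def factuality_summary(jsonl_path):
    """Count ternary is_factual labels in a judge_fact output file.

    Assumes one JSON object per line with label.is_factual in
    {-1, 0, 1}, as in the output schema above.
    """
    counts = {-1: 0, 0: 0, 1: 0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            counts[record["label"]["is_factual"]] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty files
    return {
        "supported": counts[1] / total,
        "uncertain": counts[0] / total,
        "not_supported": counts[-1] / total,
    }
```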
- Quality evaluation: use `data/topic/high_quality_topics.jsonl` or your own JSONL
- Fact checking: refer to `example/judge_fact_result/example_fact_judge_input.jsonl`
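Custom inputs follow the one-JSON-object-per-line format shown earlier. A minimal sketch for writing your own quality-evaluation input file (the filename `my_topics.jsonl` is our own example):

```python
import json

# One record per line: a topic plus the full markdown report to evaluate.
records = [
    {
        "topic": "Applications of AI in Healthcare",
        "report": "# Report Title\n\n## Introduction\n...",
    },
]

with open("my_topics.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Pass the resulting file via `--inputpath`.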
```bash
# Basic run
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results

# Resume from checkpoint
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results \
    --resume

# Clear checkpoint and restart
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results \
    --clear_checkpoint
```

```bash
# Judge mode (default)
python judge_fact.py \
    --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
    --outputpath example/judge_fact_result/example_fact_judge_output.jsonl \
    --provider jina \
    --limit 3 \
    --task judge

# Scrape-only mode
python judge_fact.py \
    --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
    --outputpath example/judge_fact_result/example_fact_scrape.out.jsonl \
    --provider jina \
    --limit 3 \
    --task scrape
```

```
exp/score_results/
├── abc123def456.json    # Evaluation result for topic 1
├── def456ghi789.json    # Evaluation result for topic 2
└── ...

exp/
├── judge.txt            # Detailed logs
├── judge.json           # Progress records
└── checkpoint.json      # Checkpoint state

example/judge_fact_result/
├── example_fact_judge_output.jsonl   # Judging results
└── example_fact_scrape.out.jsonl     # Scraped content
```
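Per-topic results can then be averaged across the result directory. A small sketch assuming the output JSON schema shown earlier (the helper name and directory argument are ours):

```python
import glob
import json
import os
import statistics

# The five quality dimensions emitted by judge_score.py.
SCORE_KEYS = [
    "comprehensiveness_score", "coherence_score",
    "clarity_score", "insight_score", "overall_score",
]

def average_scores(result_dir):
    """Average the five quality dimensions over per-topic result files."""
    rows = []
    for path in glob.glob(os.path.join(result_dir, "*.json")):
        with open(path, encoding="utf-8") as f:
            rows.append(json.load(f))
    if not rows:
        return {}
    return {k: statistics.mean(r[k] for r in rows) for k in SCORE_KEYS}
```

For example, `average_scores("exp/score_results")` would report dataset-level means for each dimension.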
🌐 Failing to scrape pages?
- ✅ Ensure `FIRECRAWL_KEY` or `JINA_API_KEY` is set
- 🔄 Try switching `--provider` (firecrawl/jina)
- ⚡ Some sites may throttle/deny access; try lower concurrency

🤖 Model output parsing errors?
- 🛠️ We use `json_repair` for robust parsing
- 🔄 Built-in retries; failures are logged and skipped
- 📝 Check `./exp/judge.txt` for detailed errors

📊 Zero redundancy pairs?
- 📋 Ensure the report contains first-level headings starting with `##`
- 📏 Increase report length (recommended > 200 characters)
Released under the MIT License
```bibtex
@article{fan2025understanding,
  title={Understanding DeepResearch via Reports},
  author={Fan, Tianyu and Niu, Xinyao and Zheng, Yuxiang and Zhang, Fengji and Huang, Chengen and Chen, Bei and Lin, Junyang and Huang, Chao},
  journal={arXiv preprint arXiv:2510.07861},
  year={2025}
}
```

Thank you for your interest in our work!