A toolkit for automated evaluation and fact-checking of long technical/research reports
Focus on report quality assessment and factuality verification

Reports are the most canonical and representative outputs of DeepResearch. High-quality research reports feature clear structure, rigorous logic, dense information, and trustworthy citations, all crucial for knowledge-intensive research scenarios. To this end, we propose the DeepResearch-ReportEval framework: a hybrid evaluation approach that combines LLM-as-a-Judge for automated report-quality assessment with expert judgments for reliability. The framework evaluates report quality across multiple dimensions, including comprehensiveness, redundancy, and factual accuracy. We also release a carefully curated dataset of 100 queries spanning diverse categories, together with 100 corresponding reports generated by Qwen-DeepResearch, to support systematic evaluation.
| Quality Scoring | Fact Checking | Redundancy Detection |
|---|---|---|
| Five-dimension scoring | Web content verification | Pairwise paragraph analysis |
- Five dimensions: Comprehensiveness, Coherence, Clarity, Insightfulness, Overall
- Redundancy detection: Smart sampling of paragraph pairs to compute an average redundancy score
- Rationales: Reasoned explanations for the scores
- Checkpointing: Resume from checkpoints to avoid recomputation
- Web scraping: Supports both Firecrawl and Jina Reader
- Batch verification: Checks each provided context against the scraped page
- Ternary scoring: -1 (Not supported) / 0 (Uncertain) / 1 (Supported)
- Explanations: Supporting evidence and analysis
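The pairwise redundancy idea can be sketched in a few lines. This is an illustrative stand-in, not the toolkit's implementation: the real pipeline asks an LLM judge to rate sampled paragraph pairs, while here `difflib.SequenceMatcher` serves as a cheap similarity proxy.

```python
import itertools
import random
from difflib import SequenceMatcher

def redundancy_score(paragraphs, max_pairs=20, seed=0):
    """Average pairwise similarity over a sample of paragraph pairs.

    Toy stand-in: the toolkit uses an LLM judge to rate each pair;
    SequenceMatcher is just a cheap lexical-similarity proxy here.
    """
    pairs = list(itertools.combinations(paragraphs, 2))
    if not pairs:
        return 0.0
    random.seed(seed)
    sample = random.sample(pairs, min(max_pairs, len(pairs)))
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in sample]
    return sum(sims) / len(sims)
```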
```
DeepResearch-ReportEval/
├── judge_score.py           # Main script for quality evaluation
├── judge_fact.py            # Main script for fact checking
├── Atools.py                # Utilities and model calls
├── Aprompts.py              # Prompt templates
├── data/                    # Dataset
│   ├── topic/               # High-quality topics
│   └── report/              # Reports from Qwen-DeepResearch, collected in early September 2025
└── example/                 # Example inputs & outputs
    ├── judge_fact_result/   # Fact-checking examples
    └── judge_score_result/  # Quality and redundancy examples
```
- Python: 3.10+
- OS: Windows / macOS / Linux
```bash
# Clone the project
git clone https://github.com/HKUDS/DeepResearch-Eval.git
cd DeepResearch-Eval

# Install dependencies
pip install openai json-repair firecrawl-python python-dotenv tqdm requests dashscope
```

Create a `.env` file or export environment variables:

```bash
# Required
export OPENAI_API_KEY="your-openai-api-key"
export FIRECRAWL_KEY="your-firecrawl-key"  # or
export JINA_API_KEY="your-jina-api-key"

# Optional
export OPENAI_API_BASE="your api base"  # Custom API endpoint
```

Input (JSONL):
```json
{"topic": "Applications of AI in Healthcare", "report": "# Report Title\n\n## Introduction\n..."}
```

Output (JSON):
```json
{
  "file_id": "abc123...",
  "topic": "Applications of AI in Healthcare",
  "comprehensiveness_score": 2,
  "coherence_score": 3,
  "clarity_score": 4,
  "insight_score": 3,
  "overall_score": 3,
  "quality_reason": "The report is well-structured with sufficient arguments...",
  "repeat_score": 3.12,
  "repeat_results": [...]
}
```

Input (JSONL):
```json
{"https://example.com/page": {"contexts": ["Sentence A", "Sentence B", ...]}}
```

Output (JSONL):
```json
{"url": "https://example.com/page", "context": "Sentence A", "label": {"is_factual": 1, "sentence_support": "..."}}
```

Scoring: `is_factual` ∈ { -1 (Not supported), 0 (Uncertain), 1 (Supported) }
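Given the ternary labels, one plausible downstream aggregation is a support rate over decided sentences. The field names follow the output example above, but this helper is a sketch and not part of the toolkit:

```python
import json

def support_rate(jsonl_lines):
    """Fraction of checked sentences labeled Supported (1),
    ignoring Uncertain (0) and counting Not supported (-1) against it."""
    labels = [json.loads(line)["label"]["is_factual"] for line in jsonl_lines]
    decided = [lab for lab in labels if lab != 0]
    return sum(1 for lab in decided if lab == 1) / len(decided) if decided else 0.0
```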
- Quality evaluation: Use `data/topic/high_quality_topics.jsonl` or your own JSONL
- Fact checking: Refer to `example/judge_fact_result/example_fact_judge_input.jsonl`
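For custom inputs, a minimal loader that validates the expected `topic`/`report` fields might look like this (a sketch based on the input format shown above, not the toolkit's own loader):

```python
import json

def load_eval_inputs(path):
    """Yield validated {topic, report} records from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rec = json.loads(line)
            missing = {"topic", "report"} - rec.keys()
            if missing:
                raise ValueError(f"line {n}: missing fields {missing}")
            yield rec
```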
```bash
# Basic run
python judge_score.py \
  --inputpath data/topic/high_quality_topics.jsonl \
  --outputpath exp/score_results

# Resume from checkpoint
python judge_score.py \
  --inputpath data/topic/high_quality_topics.jsonl \
  --outputpath exp/score_results \
  --resume

# Clear checkpoint and restart
python judge_score.py \
  --inputpath data/topic/high_quality_topics.jsonl \
  --outputpath exp/score_results \
  --clear_checkpoint
```

```bash
# Judge mode (default)
python judge_fact.py \
  --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
  --outputpath example/judge_fact_result/example_fact_judge_output.jsonl \
  --provider jina \
  --limit 3 \
  --task judge

# Scrape-only mode
python judge_fact.py \
  --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
  --outputpath example/judge_fact_result/example_fact_scrape.out.jsonl \
  --provider jina \
  --limit 3 \
  --task scrape
```

```
exp/score_results/
├── abc123def456.json  # Evaluation result for topic 1
├── def456ghi789.json  # Evaluation result for topic 2
└── ...
```
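Since each topic yields its own result file, corpus-level averages can be computed with a small script such as this sketch (the score field names follow the output example above; `average_scores` is a hypothetical helper, not part of the repo):

```python
import json
from pathlib import Path
from statistics import mean

# Score fields taken from the quality-evaluation output example
FIELDS = ["comprehensiveness_score", "coherence_score", "clarity_score",
          "insight_score", "overall_score", "repeat_score"]

def average_scores(result_dir):
    """Average each score field across all per-topic result JSON files."""
    records = [json.loads(p.read_text(encoding="utf-8"))
               for p in Path(result_dir).glob("*.json")]
    return {f: mean(r[f] for r in records) for f in FIELDS} if records else {}
```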
```
exp/
├── judge.txt        # Detailed logs
├── judge.json       # Progress records
└── checkpoint.json  # Checkpoint state
```

```
example/judge_fact_result/
├── example_fact_judge_output.jsonl  # Judging results
└── example_fact_scrape.out.jsonl    # Scraped content
```
**Failing to scrape pages?**
- Ensure `FIRECRAWL_KEY` or `JINA_API_KEY` is set
- Try switching `--provider` (firecrawl/jina)
- Some sites may throttle or deny access; try lower concurrency

**Model output parsing errors?**
- We use `json_repair` for robust parsing
- Built-in retries; failures are logged and skipped
- Check `./exp/judge.txt` for detailed errors

**Zero redundancy pairs?**
- Ensure the report contains first-level headings starting with `##`
- Increase report length (recommended > 200 characters)
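To diagnose the zero-pairs case, you can check how many `##` sections a report actually contains. The regex split below is an assumption about how sections might be extracted, not the toolkit's code:

```python
import re

def split_sections(report_md):
    """Split a markdown report into its first-level '##' sections."""
    parts = re.split(r"(?m)^##\s+", report_md)
    return [p.strip() for p in parts[1:] if p.strip()]
```

If this returns fewer than two sections, the pairwise sampler has nothing to compare.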
Released under the MIT License
```bibtex
@article{fan2025understanding,
  title={Understanding DeepResearch via Reports},
  author={Fan, Tianyu and Niu, Xinyao and Zheng, Yuxiang and Zhang, Fengji and Huang, Chengen and Chen, Bei and Lin, Junyang and Huang, Chao},
  journal={arXiv preprint arXiv:2510.07861},
  year={2025}
}
```

Thank you for your interest in our work!