🔍 A toolkit for automated evaluation and fact-checking of long technical/research reports
Focus on report quality assessment and factuality verification
Reports are the most canonical and representative outputs of DeepResearch. High-quality research reports feature clear structure, rigorous logic, dense information, and trustworthy citations—crucial for knowledge-intensive research scenarios. To this end, we propose the DeepResearch-ReportEval framework: a hybrid evaluation approach combining LLM-as-a-Judge for automated report-quality assessment with expert judgments for reliability. The framework evaluates report quality across multiple dimensions, including comprehensiveness, redundancy, and factual accuracy. We also release a carefully curated dataset: 100 queries spanning diverse categories and 100 corresponding reports generated by Qwen-DeepResearch to support systematic evaluation.
| 🎯 Quality Scoring | 🔍 Fact Checking | 📈 Redundancy Detection |
|---|---|---|
| Five-dimension scoring | Web content verification | Pairwise paragraph analysis |
- ✅ Five dimensions: Comprehensiveness, Coherence, Clarity, Insightfulness, Overall
- 🔄 Redundancy detection: Smart sampling of paragraph pairs to compute average redundancy score
- 📝 Rationales: Provide reasoned explanations for the scores
- 💾 Checkpointing: Resume from checkpoints to avoid recomputation
- 🌐 Web scraping: Support both Firecrawl and Jina Reader
- 📋 Batch verification: Check each provided context against the scraped page
- 🎯 Ternary scoring: -1 (Not supported) / 0 (Uncertain) / 1 (Supported)
- 📄 Explanations: Provide supporting evidence and analysis
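The paragraph-pair sampling behind the redundancy score can be sketched in a few lines. This is an illustrative sketch only, not the actual `judge_score.py` logic: the function name is ours, and `difflib` similarity (0–1) stands in for the LLM-as-a-Judge redundancy rating.

```python
import itertools
import random
from difflib import SequenceMatcher

def redundancy_score(paragraphs, max_pairs=10, seed=0):
    """Sample paragraph pairs and average a pairwise similarity score.

    SequenceMatcher ratio (0..1) is a cheap stand-in for the LLM
    redundancy judgment applied to each sampled pair.
    """
    pairs = list(itertools.combinations(range(len(paragraphs)), 2))
    if not pairs:
        return 0.0
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(max_pairs, len(pairs)))
    scores = [
        SequenceMatcher(None, paragraphs[i], paragraphs[j]).ratio()
        for i, j in sampled
    ]
    return sum(scores) / len(scores)
```

Sampling caps the quadratic number of pairs, which is why long reports stay cheap to score.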
```
DeepResearch-ReportEval/
├── 📊 judge_score.py    # Main script for quality evaluation
├── 🔍 judge_fact.py     # Main script for fact checking
├── 🛠️ Atools.py         # Utilities and model calls
├── 📝 Aprompts.py       # Prompt templates
├── 📂 data/             # Dataset
│   ├── topic/           # High-quality topics
│   └── report/          # Reports from Qwen-DeepResearch, collected in early September 2025
└── 📋 example/          # Example inputs & outputs
    ├── judge_fact_result/    # Fact-checking examples
    └── judge_score_result/   # Quality & redundancy examples
```
- Python: 3.10+
- OS: Windows / macOS / Linux
```bash
# Clone the project
git clone https://github.com/HKUDS/DeepResearch-Eval.git
cd DeepResearch-Eval

# Install dependencies
pip install openai json-repair firecrawl-python python-dotenv tqdm requests dashscope
```

Create a `.env` file or export environment variables:

```bash
# Required
export OPENAI_API_KEY="your-openai-api-key"
export FIRECRAWL_KEY="your-firecrawl-key"   # or
export JINA_API_KEY="your-jina-api-key"

# Optional
export OPENAI_API_BASE="your-api-base"      # Custom API endpoint
```

Input (JSONL):

```json
{"topic": "Applications of AI in Healthcare", "report": "# Report Title\n\n## Introduction\n..."}
```

Output (JSON):
```json
{
  "file_id": "abc123...",
  "topic": "Applications of AI in Healthcare",
  "comprehensiveness_score": 2,
  "coherence_score": 3,
  "clarity_score": 4,
  "insight_score": 3,
  "overall_score": 3,
  "quality_reason": "The report is well-structured with sufficient arguments...",
  "repeat_score": 3.12,
  "repeat_results": [...]
}
```

Input (JSONL):

```json
{"https://example.com/page": {"contexts": ["Sentence A", "Sentence B", ...]}}
```

Output (JSONL):

```json
{"url": "https://example.com/page", "context": "Sentence A", "label": {"is_factual": 1, "sentence_support": "..."}}
```

📌 Scoring: `is_factual` ∈ { -1 (Not supported), 0 (Uncertain), 1 (Supported) }
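Given the ternary labels, a per-report support rate can be aggregated from a fact-checking output file. A minimal sketch assuming the JSONL output schema above (the function name is ours):

```python
import json

def factuality_summary(jsonl_path):
    """Count ternary is_factual labels in a judge_fact output file.

    Assumes one JSON object per line with label.is_factual in
    {-1, 0, 1}, as in the output schema above.
    """
    counts = {-1: 0, 0: 0, 1: 0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            counts[record["label"]["is_factual"]] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty files
    return {
        "supported": counts[1] / total,
        "uncertain": counts[0] / total,
        "not_supported": counts[-1] / total,
    }
```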
- Quality evaluation: use `data/topic/high_quality_topics.jsonl` or your own JSONL
- Fact checking: refer to `example/judge_fact_result/example_fact_judge_input.jsonl`
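Custom inputs follow the one-JSON-object-per-line format shown earlier. A minimal sketch for writing your own quality-evaluation input file (the filename `my_topics.jsonl` is our own example):

```python
import json

# One record per line: a topic plus the full markdown report to evaluate.
records = [
    {
        "topic": "Applications of AI in Healthcare",
        "report": "# Report Title\n\n## Introduction\n...",
    },
]

with open("my_topics.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Pass the resulting file via `--inputpath`.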
```bash
# Basic run
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results

# Resume from checkpoint
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results \
    --resume

# Clear checkpoint and restart
python judge_score.py \
    --inputpath data/topic/high_quality_topics.jsonl \
    --outputpath exp/score_results \
    --clear_checkpoint
```

```bash
# Judge mode (default)
python judge_fact.py \
    --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
    --outputpath example/judge_fact_result/example_fact_judge_output.jsonl \
    --provider jina \
    --limit 3 \
    --task judge

# Scrape-only mode
python judge_fact.py \
    --inputpath example/judge_fact_result/example_fact_judge_input.jsonl \
    --outputpath example/judge_fact_result/example_fact_scrape.out.jsonl \
    --provider jina \
    --limit 3 \
    --task scrape
```

```
exp/score_results/
├── abc123def456.json    # Evaluation result for topic 1
├── def456ghi789.json    # Evaluation result for topic 2
└── ...

exp/
├── judge.txt            # Detailed logs
├── judge.json           # Progress records
└── checkpoint.json      # Checkpoint state

example/judge_fact_result/
├── example_fact_judge_output.jsonl   # Judging results
└── example_fact_scrape.out.jsonl     # Scraped content
```
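Per-topic results can then be averaged across the result directory. A small sketch assuming the output JSON schema shown earlier (the helper name and directory argument are ours):

```python
import glob
import json
import os
import statistics

# The five quality dimensions emitted by judge_score.py.
SCORE_KEYS = [
    "comprehensiveness_score", "coherence_score",
    "clarity_score", "insight_score", "overall_score",
]

def average_scores(result_dir):
    """Average the five quality dimensions over per-topic result files."""
    rows = []
    for path in glob.glob(os.path.join(result_dir, "*.json")):
        with open(path, encoding="utf-8") as f:
            rows.append(json.load(f))
    if not rows:
        return {}
    return {k: statistics.mean(r[k] for r in rows) for k in SCORE_KEYS}
```

For example, `average_scores("exp/score_results")` would report dataset-level means for each dimension.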
🌐 Failing to scrape pages?
- ✅ Ensure `FIRECRAWL_KEY` or `JINA_API_KEY` is set
- 🔄 Try switching `--provider` (firecrawl/jina)
- ⚡ Some sites may throttle/deny access; try lower concurrency

🤖 Model output parsing errors?
- 🛠️ We use `json_repair` for robust parsing
- 🔄 Built-in retries; failures are logged and skipped
- 📝 Check `./exp/judge.txt` for detailed errors

📊 Zero redundancy pairs?
- 📋 Ensure the report contains first-level headings starting with `##`
- 📏 Increase report length (recommended > 200 characters)
Released under the MIT License
```bibtex
@article{fan2025understanding,
  title={Understanding DeepResearch via Reports},
  author={Fan, Tianyu and Niu, Xinyao and Zheng, Yuxiang and Zhang, Fengji and Huang, Chengen and Chen, Bei and Lin, Junyang and Huang, Chao},
  journal={arXiv preprint arXiv:2510.07861},
  year={2025}
}
```

Thank you for your interest in our work!