ResearchAgent: Iterative Research Idea Generation over Scientific Literature

🚀 Welcome to the official repository of ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models!

Authors: Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang

ResearchAgent uses Large Language Models (LLMs) to help researchers rapidly ideate and refine research problems grounded in existing literature. Starting from a core scientific paper, the system retrieves relevant publications and knowledge entities. It then iteratively proposes and improves problems, methods, and experiment designs, guided by collaborating LLM-based reviewing agents that give structured feedback across multiple dimensions.

Overview

Inputs: a set of Semantic Scholar paper IDs and a knowledge store mined from papers (entities and co-occurrences).
Retrieval: fetch the target paper, pull relevant references via the Semantic Scholar Graph API, and select related entities from the knowledge store.
Problem Identification: generate a candidate research problem and rationale using LLMs.
Problem Validation: obtain reviews and feedback from LLM reviewers on five metrics, with the per-metric reviews run in parallel.
Iteration: refine the problem based on low-scoring aspects and repeat for a few rounds, keeping a concise history.
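
The validate-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the metric names, the `validate` and `run_rounds` helpers, the score threshold, and the stub reviewer and refiner are all assumptions chosen so the example runs without an API key. In the real pipeline the reviewer and refiner would be LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Illustrative metric names (assumption); the repository's five metrics may differ.
METRICS = ["clarity", "relevance", "originality", "feasibility", "significance"]

Reviewer = Callable[[str, str], int]        # (problem, metric) -> score
Refiner = Callable[[str, List[str]], str]   # (problem, weak metrics) -> new problem


def validate(problem: str, review: Reviewer) -> Dict[str, int]:
    """Score one problem on every metric, running the reviews in parallel."""
    with ThreadPoolExecutor(max_workers=len(METRICS)) as pool:
        scores = pool.map(lambda metric: review(problem, metric), METRICS)
        return dict(zip(METRICS, scores))


def run_rounds(
    problem: str,
    refine: Refiner,
    review: Reviewer,
    rounds: int = 3,      # hypothetical default
    threshold: int = 4,   # hypothetical cutoff for a "low" score
) -> Tuple[str, List[dict]]:
    """Alternate validation and refinement, keeping a short history."""
    history: List[dict] = []
    for _ in range(rounds):
        scores = validate(problem, review)
        history.append({"problem": problem, "scores": scores})
        weak = [metric for metric, score in scores.items() if score < threshold]
        if not weak:
            break  # every metric is at or above the threshold
        problem = refine(problem, weak)
    return problem, history


def stub_review(problem: str, metric: str) -> int:
    # Stand-in for an LLM reviewer: scores higher once the metric was addressed.
    return 5 if metric in problem else 3


def stub_refine(problem: str, weak: List[str]) -> str:
    # Stand-in for an LLM refiner: records which aspects were improved.
    return problem + " [improved: " + ", ".join(weak) + "]"


if __name__ == "__main__":
    final_problem, history = run_rounds("base problem", stub_refine, stub_review)
    print(final_problem)
    print(len(history), "rounds recorded")
```

Threads fit here because each review would be an I/O-bound API call; passing only the low-scoring metrics to the refiner mirrors the "refine based on low-scoring aspects" step.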

Repository structure

code/
- main.py — entrypoint to run the end-to-end pipeline
- knowledge/
  - store.py — lightweight knowledge store and entity retrieval
- models/
  - openai.py — OpenAI Chat Completions wrapper with retries/timeouts
- pipelines/
  - research_pipeline.py — orchestration of generate and validate iterations
  - agents/
    - base.py — shared prompt-formatting helpers
    - problem_identifier.py — generates/refines problems
    - problem_validator.py — reviews problems across 5 metrics in parallel
    - ...
- utils/
  - s2.py — Semantic Scholar API helpers (papers, references, embeddings)
  - data_io.py — JSONL loading and ID utilities
  - formatting.py — small text utilities
data/
- papers.jsonl — input list of paper IDs
- knowledge.jsonl — knowledge base (entities/co-occurrence)
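
Both data files are JSONL, meaning one JSON object per line. The sketch below shows one way such files could be read; the `read_jsonl` helper and the `paper_id` field in the usage example are hypothetical, and the actual schema and the code in `utils/data_io.py` may differ.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err.msg})"
                ) from err
```

For example, `[row["paper_id"] for row in read_jsonl("./data/papers.jsonl")]` would collect the IDs if each record carried a `paper_id` key, which is an assumption about the file layout.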

Running

Set your OpenAI key and run the pipeline:

export OPENAI_API_KEY=YOUR_KEY
python ./code/main.py \
	--data-path ./data/papers.jsonl \
	--knowledge-path ./data/knowledge.jsonl \
	--model-name gpt-4o
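
For orientation, here is a minimal sketch of how an entrypoint could accept the three flags shown above and check for the API key before doing any work. It assumes `argparse`; the `parse_args` and `require_api_key` helpers, the help strings, and the `gpt-4o` default are illustrative guesses, not taken from `main.py`.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the command-line flags used in the run command."""
    parser = argparse.ArgumentParser(description="Run the research pipeline.")
    parser.add_argument("--data-path", required=True,
                        help="JSONL file listing input paper IDs")
    parser.add_argument("--knowledge-path", required=True,
                        help="JSONL knowledge store file")
    # Default is an assumption; the real script may require this flag.
    parser.add_argument("--model-name", default="gpt-4o",
                        help="chat model to use")
    return parser.parse_args(argv)


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key, or exit early with a clear message."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("OPENAI_API_KEY is not set; export it before running.")
    return key


if __name__ == "__main__":
    args = parse_args()
    require_api_key()
    print(args.data_path, args.knowledge_path, args.model_name)
```

Failing fast on a missing key avoids starting retrieval work that would only break at the first model call.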

Citation

If you use or build upon this project, please cite:

@inproceedings{Baek2025ResearchAgent,
  title={ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models},
  author={Jinheon Baek and Sujay Kumar Jauhar and Silviu Cucerzan and Sung Ju Hwang},
  booktitle={NAACL},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:269042844}
}
