Interactions with large language models (LLMs) often yield long and detailed responses, leveraging both parametric knowledge and retrieval-augmented generation (RAG). While these responses can provide rich insights, they often include redundant or less engaging content not aligned with user interests. This issue becomes apparent when users specify particular subtopics to include or exclude -- termed **coverage-conditioned ($C^2$)** queries -- as LLMs often struggle to provide tailored responses. To address this challenge, we investigate the role of query outlines, sequences of subqueries designed to guide LLMs in generating responses that meet specific user requirements. To systematically create and evaluate these outlines, we introduce **QTree**, a dataset of 10K hierarchical sets of information-seeking subqueries that define structured boundaries for outline creation and evaluation in $C^2$ scenarios. Additionally, we develop **QPlanner**, a 7B language model trained to generate customized outlines within the boundaries of QTree. We evaluate the effectiveness of the generated outlines through automatic and human judgements, focusing on their impact within RAG systems. Experimental results demonstrate that QPlanner, especially when trained with alignment techniques like DPO, generates higher-quality outlines that better fulfill diverse user needs.
- # of samples: 10,580 [LINK]
- Note: There are three more samples than those specified in the paper.
- Configuration (see the loading sketch after this list)
  - `question`: Base query ($q_{base}$)
  - `instruction`: Coverage query ($q_{cov}$)
  - `background`: Background query
  - `intention`: Intent operation (include/exclude)
  - `tree`: QTree (a hierarchical set of queries)
  - `candidates`: Three candidate query outlines (i.e., four subqueries from QTree) extracted by an LLM
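As a quick sanity check, the fields above can be inspected with 🤗 `datasets`. This is a minimal sketch, not the official loading recipe: the hub ID `ORG/QTree` and the split name `train` are placeholders; substitute the actual repository from the [LINK] above.

```python
from datasets import load_dataset

# Placeholder repository ID and split -- substitute the real ones from [LINK].
qtree = load_dataset("ORG/QTree", split="train")

print(len(qtree))          # expected: 10,580 samples
print(qtree.column_names)  # ['question', 'instruction', 'background',
                           #  'intention', 'tree', 'candidates']

sample = qtree[0]
print(sample["question"])         # base query (q_base)
print(sample["instruction"])      # coverage query (q_cov)
print(sample["intention"])        # "include" or "exclude"
print(len(sample["candidates"]))  # three candidate outlines of four subqueries each
```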
- # of samples: 300 [LINK]
- Configuration (see the filtering sketch after this list)
  - `question`: Base query ($q_{base}$)
  - `instruction`: Coverage query ($q_{cov}$)
  - `background`: Background query
  - `intention`: Intent operation (include/exclude)
  - `tree`: QTree (a hierarchical set of queries)
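Because every sample carries an `intention` flag, the include/exclude subsets of the $C^2$ evaluation queries can be separated with a simple filter. Again a sketch under the same placeholder-ID assumption as above:

```python
from datasets import load_dataset

# Placeholder ID/split, as above -- substitute the real ones from [LINK].
qtree_eval = load_dataset("ORG/QTree", split="test")

# Split the 300 evaluation samples by intent operation.
include_qs = qtree_eval.filter(lambda ex: ex["intention"] == "include")
exclude_qs = qtree_eval.filter(lambda ex: ex["intention"] == "exclude")
print(len(include_qs), len(exclude_qs))  # counts should sum to 300
```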
- Our QTree is based on seed queries from ASQA, Longform, and ExpertQA.
- We appreciate the 🤗 alignment-handbook for providing an easy LM training framework!
@inproceedings{kim-etal-2025-learning,
title = "Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation",
author = "Kim, Takyoung and
Lee, Kyungjae and
Jang, Young Rok and
Cho, Ji Yong and
Kim, Gangwoo and
Cho, Minseok and
Lee, Moontae",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-naacl.80/",
pages = "1460--1480",
ISBN = "979-8-89176-195-7",
abstract = "Interactions with large language models (LLMs) often yield long and detailed responses, leveraging both parametric knowledge and retrieval-augmented generation (RAG). While these responses can provide rich insights, they often include redundant or less engaging content not aligned with user interests. This issue becomes apparent when users specify particular subtopics to include or exclude {--} termed **coverage-conditioned ($C^2$)** queries {--} as LLMs often struggle to provide tailored responses. To address this challenge, we investigate the role of query outlines, sequences of subqueries designed to guide LLMs in generating responses that meet specific user requirements. To systematically create and evaluate these outlines, we introduce **QTree**, a dataset of 10K hierarchical sets of information-seeking subqueries that define structured boundaries for outline creation and evaluation in $C^2$ scenarios. Additionally, we develop **QPlanner**, a 7B language model trained to generate customized outlines within boundaries of QTree. We evaluate the effectiveness of the generated outlines through automatic and human judgements, focusing on their impact within retrieval-augmented generation (RAG) systems. Experimental results demonstrate that QPlanner, especially when trained with alignment techniques like DPO, generates higher-quality outlines that better fulfill diverse user needs."
}