A curated list of papers, datasets, and code repositories for multi-turn interactions with large language models. This repository compiles most research in the multi-turn LLM field, though it may not be fully exhaustive.
⭐⭐⭐ Our detailed review of multi-turn LLMs, covering task types, common improvement methods, and open challenges, is presented in this survey: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models.
If you notice any missing research works or spot inaccuracies, feel free to reach out or open an issue. We also welcome submissions of multi-turn-related work from everyone!
Legend (markers used in the lists below):
- New dataset created in the work.
- Benchmark proposed in the work.
- Reinforcement learning used in the work.
- Other improvement method(s) used in the work.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [NeurIPS 2023] [GitHub]
- (MT-Bench++) Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models [ACL 2024] [Hugging Face]
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [EMNLP 2024] [GitHub]
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [ACL 2024] [GitHub]
- M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models [arXiv] [Hugging Face]
- FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback [arXiv] [GitHub]
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [arXiv] [GitHub] [Hugging Face]
- FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs [ICLR 2025] [GitHub]
- AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability [arXiv] [GitHub]
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [ICLR 2024] [GitHub]
- WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs [arXiv] [GitHub]
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [arXiv] [GitHub]
- SysBench: Can Large Language Models Follow System Messages? [ICLR 2025] [GitHub]
- MathChat: Converse to Tackle Challenging Math Problems with LLM Agents [ICLR 2024 Workshop]
- Building Math Agents with Multi-Turn Iterative Preference Learning [arXiv]
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [arXiv] [GitHub]
- Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step [ACL 2024] [GitHub]
- Steering Large Language Models between Code Execution and Textual Reasoning [ICLR 2025] [GitHub] [Hugging Face]
- From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging [arXiv] [GitHub]
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback [NeurIPS 2023 Datasets and Benchmarks] [GitHub]
- What Makes Large Language Models Reason in (Multi-Turn) Code Generation? [arXiv]
- PyBench: Evaluating LLM Agent on various real-world coding tasks [arXiv] [GitHub] [Hugging Face]
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis [ICLR 2023] [GitHub] [Hugging Face]
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [ICLR 2023] [GitHub] [Hugging Face]
- CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [arXiv] [GitHub] [Hugging Face]
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement [ACL 2024] [GitHub] [Hugging Face]
- Executable Code Actions Elicit Better LLM Agents [ICML 2024] [GitHub] [Hugging Face]
- EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records [EMNLP 2024] [GitHub]
- Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types [IJCNN 2025] [GitHub]
- PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits [ACL 2024] [GitHub]
- CharacterChat: Learning towards Conversational AI with Personalized Social Support [arXiv] [GitHub]
- Better Zero-Shot Reasoning with Role-Play Prompting [ACL 2024] [GitHub]
- PIPPA: A Partially Synthetic Conversational Dataset [arXiv] [Hugging Face]
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [EMNLP 2023] [GitHub]
- PRODIGy: a PROfile-based DIalogue Generation dataset [ACL 2024] [GitHub]
- ChatHaruhi: Reviving Anime Character in Reality via Large Language Model [arXiv] [GitHub]
- CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models [EMNLP 2024] [GitHub]
- RoleCraft-GLM: Advancing Personalized Role-Playing in Large Language Models [ACL 2024]
- Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [ACL 2024] [GitHub]
- Character-LLM: A Trainable Agent for Role-Playing [EMNLP 2023] [GitHub]
- PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer [ACL 2023]
- LLMs + Persona-Plug = Personalized LLMs [arXiv] [Hugging Face]
- Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent [EMNLP 2024] [GitHub]
- Instruct Once, Chat Consistently in Multiple Rounds: An Efficient Tuning Framework for Dialogue [ACL 2024] [GitHub]
- Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning [EMNLP 2023]
- Beyond Retrieval: Embracing Compressive Memory in Real-World Long-Term Conversations [arXiv] [GitHub]
- LaMP: When Large Language Models Meet Personalization [ACL 2024] [GitHub]
- CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation [ACL 2024] [GitHub]
- RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models [arXiv] [GitHub]
- TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models [ACL 2024] [GitHub]
- InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews [ACL 2024] [GitHub]
- RoleInteract: Evaluating the Social Interaction of Role-Playing Agents [arXiv] [GitHub]
- SIMULBENCH: Evaluating Language Models with Creative Simulation Tasks [arXiv] [GitHub] [Hugging Face]
- Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Work [EMNLP 2024] [GitHub]
- Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation [JMIR Med Inform]
- Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [CoRR 2023]
- HuatuoGPT, towards Taming Language Model to Be a Doctor [arXiv] [GitHub]
- DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation [arXiv] [GitHub]
- SMILE: Single-turn to Multi-turn Inclusive Language Expansion via ChatGPT for Mental Health Support [EMNLP 2024] [GitHub]
- Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue [arXiv] [GitHub]
- An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models [arXiv]
- BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT [arXiv] [GitHub]
- Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model [arXiv]
- Towards Conversational Diagnostic AI [arXiv]
- CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling [arXiv] [GitHub]
- Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator [arXiv] [GitHub]
- HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [COLM 2024] [GitHub]
- Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models [arXiv] [Hugging Face]
- T-Agent: A Term-Aware Agent for Medical Dialogue Generation [IJCNN 2024]
- MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [NeurIPS 2024] [Hugging Face] [GitHub]
- BiMediX: Bilingual Medical Mixture of Experts LLM [arXiv] [Hugging Face] [GitHub]
- PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation [IEEE Trans. Comput. Soc.] [GitHub]
- Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System [COLING 2025] [GitHub]
- Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [arXiv] [GitHub]
- SocraticLM: Exploring Socratic Personalized Teaching with Large Language Models [NeurIPS 2024] [Code]
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [EMNLP 2023] [GitHub] [Hugging Face]
- Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching [ACM] [GitHub]
- MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [arXiv] [GitHub]
- Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging [EMNLP 2024] [GitHub]
- Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure [arXiv] [GitHub]
- One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction [arXiv]
- A Step Towards Adaptive Online Learning: Exploring the Role of GPT as Virtual Teaching Assistants in Online Education
- Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots [ACL 2024] [GitHub]
- CourseAssist: Pedagogically Appropriate AI Tutor for Computer Science Education [ACM] [GitHub]
- Designing Safe and Relevant Generative Chats for Math Learning in Intelligent Tutoring Systems [Journal of Educational Data Mining](https://jedm.educationaldatamining.org/index.php/JEDM/article/view/840)
- Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues [arXiv] [GitHub]
- Crescendo: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack [arXiv] [GitHub]
- ActorAttack: Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [arXiv] [GitHub] [Hugging Face]
- Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks [arXiv] [Hugging Face]
- Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue [arXiv]
- Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [arXiv]
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [arXiv]
- RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [arXiv] [GitHub]
- When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers [arXiv] [GitHub]
In our survey paper on multi-turn interactions with large language models (LLMs), we categorize a wide range of tasks, from instruction-following scenarios to more complex conversational engagement. To complement this, we also include an illustration highlighting key open challenges in this domain. For the detailed improvement methods and a deeper discussion of the open challenges, please refer to our Full Paper.
