This repository investigates LLMs' tendency to exhibit acquiescence bias in sequential QA interactions. It includes evaluation methods, datasets, benchmarks, and experiment code to assess and mitigate vulnerabilities in conversational consistency and robustness, offering a reproducible framework for future research.

Firm or Fickle? Evaluating LLM Consistency in Sequential Interactions

This is the official repository for our paper "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions".

The work introduces a systematic evaluation framework for assessing the consistency of large language models (LLMs) over multi-turn interactions. It also proposes a novel Position-Weighted Consistency (PWC) score and a Confidence-Aware Response Generation (CARG) framework for robust multi-turn consistency of LLMs in high-stakes domains.

🔊 Play the audio demo

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.

Experimental Design & CARG Framework Overview

Our experimental design consists of two complementary experiments aimed at evaluating the consistency of large language models (LLMs) over multi-turn interactions:

  • Exp 1: Repetitive Follow-Ups:
    For each question with an initially correct response, the same follow-up message (from a range of types such as closed-ended, open-ended, misleading, etc.) is applied repeatedly across multiple rounds. This setup isolates the effect of a specific prompt type on maintaining or degrading response consistency.

  • Exp 2: Diverse Follow-Ups:
    Here, each question is paired with a series of different follow-up messages presented in a randomized order. This design simulates more natural conversational dynamics, allowing us to assess whether varying prompt types and their order influence the model's stability over time (the sketch below contrasts both regimes).
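
As a concrete illustration, the minimal sketch below contrasts the two regimes. The follow-up texts, the build_rounds helper, and its defaults are placeholders invented for illustration, not the repository's actual code.

```python
import random

# Illustrative follow-up pool; the benchmark's actual follow-up
# messages and categories live in the repository's datasets.
FOLLOW_UPS = {
    "closed":     "Are you sure? Please answer yes or no.",
    "open":       "Could you walk me through your reasoning again?",
    "misleading": "I believe the correct answer is actually something else.",
}

def build_rounds(exp: str, rounds: int, seed: int = 0) -> list[str]:
    """Return the follow-up message to send at each round."""
    if exp == "repetitive":
        # Exp 1: the same follow-up type is repeated every round.
        return [FOLLOW_UPS["misleading"]] * rounds
    if exp == "diverse":
        # Exp 2: varied follow-up types in a randomized order.
        rng = random.Random(seed)
        return [rng.choice(list(FOLLOW_UPS.values())) for _ in range(rounds)]
    raise ValueError(f"unknown experiment type: {exp}")

print(build_rounds("repetitive", rounds=3))
print(build_rounds("diverse", rounds=3))
```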

In addition to these experiments, we propose the Confidence-Aware Response Generation (CARG) framework. CARG enhances consistency by:

  • Extracting token-level log probabilities to compute confidence scores for each response.
  • Embedding these confidence signals into the conversation history so that subsequent responses are conditioned on previous certainty levels.
  • Guiding the generation process to help the model distinguish between firm and uncertain responses, thereby mitigating consistency degradation (a minimal sketch of one such turn follows).
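
To make this pipeline concrete, here is a minimal sketch of one CARG-style turn using the OpenAI chat completions API with its logprobs option. The model name, the geometric-mean confidence formula, and the bracketed confidence annotation are illustrative assumptions; the paper describes the actual mechanism.

```python
import math
from openai import OpenAI  # assumes the official openai>=1.x client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_confidence(history: list[dict], question: str):
    """Run one CARG-style turn: answer, score confidence from token
    log probabilities, and embed the score in the conversation history."""
    messages = history + [{"role": "user", "content": question}]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
        logprobs=True,        # return token-level log probabilities
    )
    choice = resp.choices[0]
    logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric-mean token probability as a simple confidence score.
    confidence = math.exp(sum(logprobs) / len(logprobs))
    answer = choice.message.content
    # Embed the confidence signal so later turns are conditioned on it.
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant",
                    "content": f"{answer}\n[confidence: {confidence:.2f}]"})
    return answer, confidence
```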

Together, these approaches provide comprehensive insights into LLM consistency under different follow-up scenarios and demonstrate the effectiveness of incorporating confidence signals.

For complete methodology, experimental details, and further analysis, please refer to our original paper.

Main Results


Plot 1: Initial Accuracy of LLMs on Benchmark Tasks

  • Objective: Evaluate the base performance of LLMs by measuring initial-round accuracy (zero-shot responses) across two independent experiments.
  • Findings:
    • A clear stratification is observed: Commercial models such as Claude (85%) and GPT (78%) significantly outperform open-source models like LLaMA (65%) and Mistral.
    • The performance spread approaches 20 percentage points (∆ = 0.18, p < 0.001 via paired permutation test).
    • The results indicate that a model’s internal knowledge—its capacity to provide correct answers without iterative refinement—is a strong indicator of its broader competence.

Plot 2: Accuracy Trends Across Follow-Up Rounds

  • Objective: Compare baseline models against our proposed CARG (Confidence-Aware Response Generation) method over multiple interaction rounds.
  • Findings:
    • The CARG framework demonstrates remarkably stable performance, with a mean accuracy of 0.7482 (σ = 0.0058), maintaining consistency from round 1 (0.7543) through round 8 (0.7414).
    • Among baseline approaches, gpt_default shows the strongest consistency (mean = 0.7134, σ = 0.0157), yet CARG significantly outperforms it (p < 0.001, paired t-test).
    • This comparison highlights CARG's effectiveness in mitigating consistency degradation across multi-turn interactions (a minimal statistical sketch follows this list).
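
For readers who want to reproduce this kind of comparison, the sketch below runs a paired t-test over per-question correctness with scipy. The 0/1 vectors are simulated placeholders, and pairing over questions is our assumption about how the comparison was run; the real evaluation derives these vectors from the experiment logs.

```python
import numpy as np
from scipy import stats

# Simulated per-question correctness (0/1) for two systems; these
# placeholder vectors stand in for values parsed from experiment logs.
rng = np.random.default_rng(0)
carg = rng.binomial(1, 0.75, size=200)
gpt_default = rng.binomial(1, 0.71, size=200)

# Paired t-test over the same set of questions.
t_stat, p_value = stats.ttest_rel(carg, gpt_default)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```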

Usage

Running Experiments

To run experiments with different models and configurations:

python run_experiment.py

The experiment script supports both repetitive and diverse follow-up scenarios. Configure the experiment parameters by editing the variables in run_experiment.py (an illustrative example follows the list):

  • exp: Set to 'diverse' or 'repetitive'
  • model_list: Choose from available models
  • batch_size: Number of questions per batch
  • rounds: Number of follow-up rounds
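
For example, a diverse-follow-up run might be configured like this (values and model identifiers are illustrative; check run_experiment.py for the options it actually accepts):

```python
# Illustrative settings inside run_experiment.py -- values are examples only.
exp = 'diverse'                # 'diverse' or 'repetitive'
model_list = ['gpt', 'llama']  # placeholder model identifiers
batch_size = 10                # questions per batch
rounds = 8                     # follow-up rounds per question
```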

Evaluation

To evaluate experiment results and generate visualizations:

python evaluate.py
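
For intuition about the metrics involved, the sketch below shows one way a position-weighted consistency score could be computed: later rounds are down-weighted so early-stage failures cost more. It omits the recovery-pattern component of the paper's PWC score, so treat it as an illustration rather than the actual definition.

```python
import numpy as np

def pwc_score(correct: list[bool], decay: float = 0.9) -> float:
    """Illustrative position-weighted consistency score.

    Later rounds are down-weighted, so a response that flips early
    is penalized more than one that slips only at the end.
    """
    weights = np.array([decay ** t for t in range(len(correct))])
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(correct, dtype=float)))

# A run that flips early scores lower than one that stays correct
# and only slips at the end.
print(pwc_score([True, False, False, True]))  # ~0.50
print(pwc_score([True, True, True, False]))   # ~0.79
```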

Citation

If you find our work useful, please cite it as follows:

@inproceedings{li-etal-2025-firm,
    title = "Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions",
    author = "Li, Yubo  and
      Miao, Yidi  and
      Ding, Xueying  and
      Krishnan, Ramayya  and
      Padman, Rema",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.347/",
    doi = "10.18653/v1/2025.findings-acl.347",
    pages = "6679--6700",
    ISBN = "979-8-89176-256-5",
    abstract = "Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions . First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments."
}
