# llm-as-a-judge-consistency

An experimental framework for analyzing the consistency and calibration of confidence scores across different Large Language Models (LLMs) and confidence representation formats.
This project investigates how consistent LLMs are when expressing confidence in their classifications, comparing three different confidence formats:
- Float: Confidence as a decimal between 0.0-1.0
- Categorical: Confidence as categories (very low, low, medium, high, very high)
- Integer: Confidence as integers between 0-5
We tested on two different classification tasks:
- SST-2: Stanford Sentiment Treebank (positive/negative sentiment)
- SMS Spam: Spam detection in text messages
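As a rough sketch, the three confidence formats above could be expressed as Pydantic response schemas along the following lines (the class and field names here are hypothetical illustrations, not necessarily the ones used in `src/config/base_models.py`):

```python
from enum import Enum

from pydantic import BaseModel, Field


class CategoricalConfidence(str, Enum):
    """Five-level ordinal confidence scale."""
    VERY_LOW = "very low"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    VERY_HIGH = "very high"


class FloatJudgement(BaseModel):
    """Classification with confidence as a decimal in [0.0, 1.0]."""
    label: str
    confidence: float = Field(ge=0.0, le=1.0)


class CategoricalJudgement(BaseModel):
    """Classification with confidence as one of five categories."""
    label: str
    confidence: CategoricalConfidence


class IntegerJudgement(BaseModel):
    """Classification with confidence as an integer between 0 and 5."""
    label: str
    confidence: int = Field(ge=0, le=5)
```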
Figure 1: Complete experimental results showing consistency, calibration, and distribution analysis across models and confidence types
- Confidence Format Impact: Different confidence formats show varying levels of consistency, with float formats generally providing more granular and consistent confidence estimates.
- Model Differences: GPT-4o demonstrates better calibration compared to GPT-4o-mini, with confidence scores more closely matching actual accuracy.
- Task Dependency: Confidence consistency varies significantly between sentiment analysis and spam detection tasks, suggesting task complexity affects confidence reliability.
- Calibration Quality: Models tend to be overconfident, with actual accuracy often lower than expressed confidence levels.
- Python 3.10+
- OpenAI API key (set in a `.env` file)
```bash
# Clone the repository
git clone <repository-url>
cd llm-as-a-judge-consistency

# Activate the virtual environment
source .venv/bin/activate

# Install dependencies
make install

# Create a .env file with your API key
echo "OPENAI_API_KEY=your_api_key_here" > .env
```
```bash
make format       # Format code with ruff
make lint         # Lint code
make test         # Run tests
make run          # Run experiment
make move-assets  # Move results to assets/
```
```
src/
├── config/
│   ├── base_models.py    # Pydantic models and configurations
│   └── constants.py      # Project constants and enums
├── experiment/
│   ├── classifier.py     # LLM judge implementation
│   └── confidence.py     # Experiment runner and analysis
├── logger.py             # Logging configuration
└── main.py               # Main experiment runner
tests/
└── test_judge.py         # Unit tests
assets/
├── plots/                # Generated visualizations
└── data/                 # Experimental data (CSV, JSON)
```
- GPT-4o-mini: Faster, more cost-effective model
- GPT-4o: Larger, more capable model
- Float (0.0-1.0): Continuous confidence scores
- Categorical: Five-level ordinal scale
- Integer (0-5): Discrete confidence levels
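Since calibration metrics such as the Brier score operate on probabilities, the categorical and integer formats need to be placed on a common [0, 1] scale before formats can be compared. The mapping below is one plausible scheme, shown purely for illustration; the project's actual conversion may differ:

```python
# Map each confidence format onto a common [0, 1] scale (illustrative mapping).
CATEGORY_TO_PROB = {
    "very low": 0.1,
    "low": 0.3,
    "medium": 0.5,
    "high": 0.7,
    "very high": 0.9,
}


def normalize_confidence(value: float | int | str, confidence_type: str) -> float:
    """Convert a raw confidence value to a probability-like score in [0, 1]."""
    if confidence_type == "float":
        return float(value)
    if confidence_type == "categorical":
        return CATEGORY_TO_PROB[str(value).lower()]
    if confidence_type == "integer":
        return int(value) / 5.0  # 0-5 scale -> 0.0-1.0
    raise ValueError(f"Unknown confidence type: {confidence_type}")
```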
- Coefficient of Variation (CV): Measures consistency across multiple trials
  - Lower CV indicates higher consistency
- Brier Score: Measures the accuracy of probabilistic predictions
- Calibration Curves: Show the relationship between confidence and actual accuracy
  - Perfect calibration: confidence = accuracy
- Skewness: Asymmetry of confidence distributions
- Kurtosis: Tail heaviness of distributions
- Entropy: Information content of confidence patterns
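As an illustrative sketch (not the project's exact implementation), these metrics could be computed with NumPy/SciPy on a set of normalized confidence scores and their 0/1 correctness outcomes:

```python
import numpy as np
from scipy import stats


def coefficient_of_variation(confidences: np.ndarray) -> float:
    """CV = std / mean across repeated trials; lower means more consistent."""
    return float(np.std(confidences) / np.mean(confidences))


def brier_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared error between confidence and the 0/1 correctness outcome."""
    return float(np.mean((confidences - correct) ** 2))


def distribution_stats(confidences: np.ndarray) -> dict[str, float]:
    """Skewness, kurtosis, and entropy of the confidence distribution."""
    hist, _ = np.histogram(confidences, bins=10, range=(0.0, 1.0))
    probs = hist / hist.sum()
    return {
        "skewness": float(stats.skew(confidences)),
        "kurtosis": float(stats.kurtosis(confidences)),
        "entropy": float(stats.entropy(probs[probs > 0])),
    }
```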
Models show varying consistency across confidence formats:
- Float formats generally provide more consistent confidence estimates
- Categorical formats show moderate consistency
- Integer formats may suffer from discrete choice limitations
- Models tend toward overconfidence
- Calibration varies significantly by task type
- GPT-4o shows better calibration than GPT-4o-mini
ANOVA tests reveal significant differences between:
- Confidence formats (p < 0.05)
- Model types (p < 0.01)
- Task types (p < 0.001)
- Sample Selection: Stratified sampling to maintain class balance
- Multiple Trials: 3 trials per configuration to measure consistency
- Concurrent Processing: Async execution with rate limiting
- Error Handling: Fallback responses for API failures
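A minimal sketch of the concurrency pattern, assuming an async judge object with a `classify` coroutine (a hypothetical interface; the actual runner in `src/experiment/confidence.py` may be structured differently):

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 5  # assumed limit; tune to your API quota


async def classify_with_limit(semaphore: asyncio.Semaphore, judge, sample) -> dict:
    """Run a single classification, falling back gracefully on API errors."""
    async with semaphore:
        try:
            return await judge.classify(sample)  # hypothetical async judge call
        except Exception as exc:  # fallback response on API failure
            return {"label": None, "confidence": None, "error": str(exc)}


async def run_trials(judge, samples: list) -> list[dict]:
    """Classify all samples concurrently with a bound on parallel requests."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    tasks = [classify_with_limit(semaphore, judge, s) for s in samples]
    return await asyncio.gather(*tasks)
```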
- ANOVA: Compare consistency across groups
- Kruskal-Wallis: Non-parametric group comparisons
- Mann-Whitney U: Pairwise comparisons
- Brier Score: Probabilistic prediction accuracy
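These tests map directly onto SciPy functions; the sketch below uses small made-up placeholder arrays purely to show the calls:

```python
import numpy as np
from scipy import stats

# Placeholder per-item CV values, one array per confidence format (illustrative only).
cv_float = np.array([0.05, 0.08, 0.04, 0.06])
cv_categorical = np.array([0.10, 0.12, 0.09, 0.11])
cv_integer = np.array([0.15, 0.13, 0.16, 0.14])

# ANOVA: do mean consistencies differ across the three formats?
f_stat, anova_p = stats.f_oneway(cv_float, cv_categorical, cv_integer)

# Kruskal-Wallis: same question without assuming normality.
h_stat, kw_p = stats.kruskal(cv_float, cv_categorical, cv_integer)

# Mann-Whitney U: pairwise comparison of two formats.
u_stat, mw_p = stats.mannwhitneyu(cv_float, cv_integer)

print(f"ANOVA p={anova_p:.3g}, Kruskal-Wallis p={kw_p:.3g}, Mann-Whitney p={mw_p:.3g}")
```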
```bash
make help         # Show all available commands
make install      # Install dependencies
make format       # Format code with ruff
make lint         # Lint and fix code issues
make test         # Run test suite
make run          # Run full experiment
make move-assets  # Organize generated files
make clean        # Clean up temporary files
make typecheck    # Run type checking with mypy
```
```bash
# Run all tests
make test
```
This project uses:
- Ruff: For fast Python linting and formatting
- Pytest: For comprehensive testing
- MyPy: For static type checking
- Pre-commit hooks: For automated quality checks
Required environment variable (set in the `.env` file):

```bash
OPENAI_API_KEY=your_openai_api_key
```
```python
from src.config.base_models import ExperimentConfig, DatasetChoice

config = ExperimentConfig(
    dataset_choice=DatasetChoice.SMS_SPAM,
    sample_size=100,
    models=["gpt-4o-mini", "gpt-4o"],
    confidence_types=["float", "categorical", "integer"],
)
```
Key dependencies include:
- `langchain-openai`: LLM integration
- `datasets`: HuggingFace dataset loading
- `pandas`: Data manipulation
- `matplotlib`/`seaborn`: Visualization
- `scipy`: Statistical analysis
- `pydantic`: Data validation
- Fork the repository
- Create a feature branch
- Run code quality checks: `make dev`
- Add tests for new features
- Submit a pull request
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Include docstrings for classes and methods
- Maintain test coverage above 80%
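For instance, a new helper that follows these guidelines might look like this (hypothetical function, shown only to illustrate the style):

```python
def mean_confidence_gap(confidences: list[float], accuracy: float) -> float:
    """Return the average gap between expressed confidence and observed accuracy.

    A positive value indicates overconfidence; a negative value indicates
    underconfidence.
    """
    return sum(confidences) / len(confidences) - accuracy
```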
This project is licensed under the MIT License - see the LICENSE file for details.
- Stanford Sentiment Treebank
- SMS Spam Collection Dataset
- LangChain Documentation
- OpenAI API Documentation
- Add support for Anthropic Claude models
- Implement cross-model consistency analysis
- Add confidence interval estimation
- Develop uncertainty quantification metrics
- Create interactive result visualization dashboard
For questions or issues, please open a GitHub issue or contact the maintainers.