A comprehensive evaluation framework for assessing fine-tuned language models on the ARC-Challenge benchmark, including accuracy measurement, calibration analysis, and error fingerprinting with advanced threshold sensitivity analysis.
This project implements a complete evaluation pipeline for assessing model performance on the ARC-Challenge dataset (299 questions). It goes beyond basic accuracy metrics to provide deep insights into model calibration, uncertainty estimation, and abstention behavior.
- Accuracy Measurement: Standard accuracy calculation on 299 ARC-Challenge questions
- Calibration Analysis: Brier Score calculation to assess confidence calibration
- Error Fingerprint: Selective Accuracy and Coverage metrics with confidence-based abstention
- Confidence Distribution Visualization: Histogram showing confidence patterns for correct vs incorrect predictions
- Threshold Sensitivity Analysis: Systematic testing of confidence thresholds to identify optimal abstention behavior
- Calibration Paradox Detection: Identifies cases where the model is more confident on errors than correct answers
- Error Pattern Analysis: Deep dive into low-confidence correct predictions and high-confidence errors
The evaluation demonstrates:
- Accuracy: 89.0% on ARC-Challenge validation set
- Brier Score: 0.1035 (lower is better, 0 = perfect calibration)
- Selective Accuracy: 91.8% (accuracy on high-confidence predictions)
- Coverage: 94.0% (percentage of questions answered with confidence ≥ 0.99)
- Model shows good overall calibration with a 0.0338 confidence gap between correct and incorrect predictions
- Effective abstention: 62.5% of abstained questions were incorrect, showing the model correctly identifies uncertain cases
- Calibration issue detected: Model is more confident on errors (avg: 0.9630) than on some correct answers (avg: 0.7700 for low-confidence corrects)
- Python 3.12
- OpenAI API: For model inference with logprobs
- Hugging Face Datasets: For loading ARC-Challenge dataset
- NumPy: For numerical computations
- Matplotlib: For data visualization
- Jupyter Notebook: For interactive analysis
section 1/
├── ARC_Challenge_Evaluation.ipynb # Main evaluation notebook
└── README.md # This file
-
Install Dependencies
pip install "openai>=1.0.0" "datasets>=2.0.0" numpy matplotlib
-
Configure API Key
- Update the
OPENAI_API_KEYvariable in the notebook with your API key - The notebook uses a fine-tuned model:
ft:gpt-4.1-nano-2025-04-14:algoverse-ai-safety:arc-v2:CrBuBGfj
- Update the
-
Run the Notebook
- Execute cells sequentially from top to bottom
- The evaluation processes 299 questions (takes ~5-10 minutes)
- Formats questions as multiple-choice prompts (handles any number of choices)
- Calls OpenAI API with
temperature=0for deterministic outputs - Flexible answer extraction: The
extract_answer()function dynamically usesquestion['choices']['label']to handle questions with >4 answer choices (A, B, C, D, E, F, or numeric labels like 1, 2, 3, 4) - Compares to ground truth and calculates accuracy as percentage (1 decimal place)
- Requests logprobs from API to extract confidence scores
- Converts log probabilities to probabilities:
confidence = exp(logprob) - Creates binary outcomes: 1 if correct, 0 if incorrect
- Calculates Brier Score:
BS = (1/N) * Σ(confidence - outcome)² - Reports to 4 decimal places
- Applies confidence threshold of 0.99
- Questions with confidence ≥ 0.99 are "answered", others "abstained"
- Calculates Selective Accuracy (accuracy on answered questions)
- Calculates Coverage (percentage of questions answered)
- Both reported as percentages (1 decimal place)
- Confidence Distribution: Visual histogram showing confidence patterns
- Lowest Confidence Analysis: Identifies edge cases (low-confidence correct, high-confidence errors)
- Abstained Questions Breakdown: Analyzes which questions trigger abstention
- Threshold Sensitivity: Tests thresholds from 0.90 to 0.99 to find optimal abstention behavior
- Calibration Paradox: Highlights when model confidence doesn't align with correctness
This project demonstrates research-grade evaluation by:
- Causal Hypothesis Testing: The threshold sensitivity analysis tests whether abstention behavior is stable or threshold-dependent
- Intervention-Based Analysis: Systematically varies the confidence threshold to understand its impact
- Quantitative Insights: Provides data-driven conclusions about model calibration and uncertainty
- Overconfidence Detection: The model shows high confidence (avg: 0.9630) even on incorrect predictions, indicating a calibration issue
- Effective Abstention: The 0.99 threshold effectively filters uncertain cases (62.5% of abstained were incorrect)
- Threshold Stability: Analysis shows how Selective Accuracy and Coverage trade off across different thresholds
- Calibration Paradox: Model is more confident on errors than on some correct answers, highlighting the need for better uncertainty estimation
- Robust Error Handling: Comprehensive try-except blocks for API calls
- Flexible Answer Extraction: The
extract_answer()function usesvalid_labelsparameter to dynamically handle questions with any number of answer choices (A-D, E-F, or numeric labels), fully compliant with Part A's requirement to "flexibly handle cases where there are more than 4 answer choices" - Efficient Processing: Single API call per question with logprobs included
- Data Visualization: Professional matplotlib visualizations with clear legends and labels
- Traceable Examples: Question IDs included for all example cases
- Research-Grade Analysis: Goes beyond requirements with threshold sensitivity testing, reliability diagrams, and ECE calculation
- LLM API integration (OpenAI)
- Model evaluation and benchmarking
- Calibration analysis and uncertainty estimation
- Statistical analysis (Brier Score, confidence metrics)
- Data visualization (matplotlib)
- Error analysis and pattern identification
- Research methodology (hypothesis testing, causal analysis)
- Python programming and Jupyter notebooks
This project was completed as part of the Algoverse AI Safety Fellowship take-home challenge.
- Section 2: Mechanistic Interpretability - Direct Logit Attribution analysis on GPT-2 Small
Note: This evaluation framework can be adapted for other multiple-choice benchmarks and model evaluation tasks.