ARC-Challenge Model Evaluation

A comprehensive evaluation framework for assessing fine-tuned language models on the ARC-Challenge benchmark, including accuracy measurement, calibration analysis, and error fingerprinting with advanced threshold sensitivity analysis.

📋 Overview

This project implements a complete evaluation pipeline for assessing model performance on the ARC-Challenge dataset (299 questions). It goes beyond basic accuracy metrics to provide deep insights into model calibration, uncertainty estimation, and abstention behavior.

🎯 Key Features

Core Evaluation Metrics

Accuracy Measurement: Standard accuracy calculation on 299 ARC-Challenge questions
Calibration Analysis: Brier Score calculation to assess confidence calibration
Error Fingerprint: Selective Accuracy and Coverage metrics with confidence-based abstention

Advanced Analysis

Confidence Distribution Visualization: Histogram showing confidence patterns for correct vs incorrect predictions
Threshold Sensitivity Analysis: Systematic testing of confidence thresholds to identify optimal abstention behavior
Calibration Paradox Detection: Identifies cases where the model is more confident on errors than correct answers
Error Pattern Analysis: Deep dive into low-confidence correct predictions and high-confidence errors

📊 Results

The evaluation demonstrates:

Accuracy: 89.0% on ARC-Challenge validation set
Brier Score: 0.1035 (lower is better, 0 = perfect calibration)
Selective Accuracy: 91.8% (accuracy on high-confidence predictions)
Coverage: 94.0% (percentage of questions answered with confidence ≥ 0.99)

Key Findings

Model shows good overall calibration with a 0.0338 confidence gap between correct and incorrect predictions
Effective abstention: 62.5% of abstained questions were incorrect, showing the model correctly identifies uncertain cases
Calibration issue detected: Model is more confident on errors (avg: 0.9630) than on some correct answers (avg: 0.7700 for low-confidence corrects)

🛠️ Technologies Used

Python 3.12
OpenAI API: For model inference with logprobs
Hugging Face Datasets: For loading ARC-Challenge dataset
NumPy: For numerical computations
Matplotlib: For data visualization
Jupyter Notebook: For interactive analysis

📁 Project Structure

section 1/
├── ARC_Challenge_Evaluation.ipynb  # Main evaluation notebook
└── README.md                        # This file

🚀 Setup Instructions

Install Dependencies

pip install "openai>=1.0.0" "datasets>=2.0.0" numpy matplotlib

Configure API Key
- Update the OPENAI_API_KEY variable in the notebook with your API key
- The notebook uses a fine-tuned model: ft:gpt-4.1-nano-2025-04-14:algoverse-ai-safety:arc-v2:CrBuBGfj
Run the Notebook
- Execute cells sequentially from top to bottom
- The evaluation processes 299 questions (takes ~5-10 minutes)

📈 Methodology

Part A: Accuracy Measurement

Formats questions as multiple-choice prompts (handles any number of choices)
Calls OpenAI API with temperature=0 for deterministic outputs
Flexible answer extraction: The extract_answer() function dynamically uses question['choices']['label'] to handle questions with >4 answer choices (A, B, C, D, E, F, or numeric labels like 1, 2, 3, 4)
Compares to ground truth and calculates accuracy as percentage (1 decimal place)

Part B: Calibration Analysis (Brier Score)

Requests logprobs from API to extract confidence scores
Converts log probabilities to probabilities: confidence = exp(logprob)
Creates binary outcomes: 1 if correct, 0 if incorrect
Calculates Brier Score: BS = (1/N) * Σ(confidence - outcome)²
Reports to 4 decimal places

Part C: Error Fingerprint

Applies confidence threshold of 0.99
Questions with confidence ≥ 0.99 are "answered", others "abstained"
Calculates Selective Accuracy (accuracy on answered questions)
Calculates Coverage (percentage of questions answered)
Both reported as percentages (1 decimal place)

Error Analysis & Threshold Sensitivity

Confidence Distribution: Visual histogram showing confidence patterns
Lowest Confidence Analysis: Identifies edge cases (low-confidence correct, high-confidence errors)
Abstained Questions Breakdown: Analyzes which questions trigger abstention
Threshold Sensitivity: Tests thresholds from 0.90 to 0.99 to find optimal abstention behavior
Calibration Paradox: Highlights when model confidence doesn't align with correctness

🔬 Research Contributions

This project demonstrates research-grade evaluation by:

Causal Hypothesis Testing: The threshold sensitivity analysis tests whether abstention behavior is stable or threshold-dependent
Intervention-Based Analysis: Systematically varies the confidence threshold to understand its impact
Quantitative Insights: Provides data-driven conclusions about model calibration and uncertainty

💡 Key Insights

Overconfidence Detection: The model shows high confidence (avg: 0.9630) even on incorrect predictions, indicating a calibration issue
Effective Abstention: The 0.99 threshold effectively filters uncertain cases (62.5% of abstained were incorrect)
Threshold Stability: Analysis shows how Selective Accuracy and Coverage trade off across different thresholds
Calibration Paradox: Model is more confident on errors than on some correct answers, highlighting the need for better uncertainty estimation

📝 Technical Highlights

Robust Error Handling: Comprehensive try-except blocks for API calls
Flexible Answer Extraction: The extract_answer() function uses valid_labels parameter to dynamically handle questions with any number of answer choices (A-D, E-F, or numeric labels), fully compliant with Part A's requirement to "flexibly handle cases where there are more than 4 answer choices"
Efficient Processing: Single API call per question with logprobs included
Data Visualization: Professional matplotlib visualizations with clear legends and labels
Traceable Examples: Question IDs included for all example cases
Research-Grade Analysis: Goes beyond requirements with threshold sensitivity testing, reliability diagrams, and ECE calculation

🎓 Skills Demonstrated

LLM API integration (OpenAI)
Model evaluation and benchmarking
Calibration analysis and uncertainty estimation
Statistical analysis (Brier Score, confidence metrics)
Data visualization (matplotlib)
Error analysis and pattern identification
Research methodology (hypothesis testing, causal analysis)
Python programming and Jupyter notebooks

📄 License

This project was completed as part of the Algoverse AI Safety Fellowship take-home challenge.

🔗 Related Projects

Section 2: Mechanistic Interpretability - Direct Logit Attribution analysis on GPT-2 Small

Note: This evaluation framework can be adapted for other multiple-choice benchmarks and model evaluation tasks.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
ARC_Challenge_Evaluation.ipynb		ARC_Challenge_Evaluation.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ARC-Challenge Model Evaluation

📋 Overview

🎯 Key Features

Core Evaluation Metrics

Advanced Analysis

📊 Results

Key Findings

🛠️ Technologies Used

📁 Project Structure

🚀 Setup Instructions

📈 Methodology

Part A: Accuracy Measurement

Part B: Calibration Analysis (Brier Score)

Part C: Error Fingerprint

Error Analysis & Threshold Sensitivity

🔬 Research Contributions

💡 Key Insights

📝 Technical Highlights

🎓 Skills Demonstrated

📄 License

🔗 Related Projects

About

Uh oh!

Releases

Packages

Languages

richiectr360/ai-safety-evaluation-tools

Folders and files

Latest commit

History

Repository files navigation

ARC-Challenge Model Evaluation

📋 Overview

🎯 Key Features

Core Evaluation Metrics

Advanced Analysis

📊 Results

Key Findings

🛠️ Technologies Used

📁 Project Structure

🚀 Setup Instructions

📈 Methodology

Part A: Accuracy Measurement

Part B: Calibration Analysis (Brier Score)

Part C: Error Fingerprint

Error Analysis & Threshold Sensitivity

🔬 Research Contributions

💡 Key Insights

📝 Technical Highlights

🎓 Skills Demonstrated

📄 License

🔗 Related Projects

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages