ARC-Challenge Model Evaluation

A comprehensive evaluation framework for assessing fine-tuned language models on the ARC-Challenge benchmark, including accuracy measurement, calibration analysis, and error fingerprinting with advanced threshold sensitivity analysis.

📋 Overview

This project implements a complete evaluation pipeline for assessing model performance on the ARC-Challenge dataset (299 questions). It goes beyond basic accuracy metrics to provide deep insights into model calibration, uncertainty estimation, and abstention behavior.

🎯 Key Features

Core Evaluation Metrics

  • Accuracy Measurement: Standard accuracy calculation on 299 ARC-Challenge questions
  • Calibration Analysis: Brier Score calculation to assess confidence calibration
  • Error Fingerprint: Selective Accuracy and Coverage metrics with confidence-based abstention

Advanced Analysis

  • Confidence Distribution Visualization: Histogram showing confidence patterns for correct vs incorrect predictions
  • Threshold Sensitivity Analysis: Systematic testing of confidence thresholds to identify optimal abstention behavior
  • Calibration Paradox Detection: Identifies cases where the model is more confident on errors than correct answers
  • Error Pattern Analysis: Deep dive into low-confidence correct predictions and high-confidence errors

📊 Results

The evaluation demonstrates:

  • Accuracy: 89.0% on ARC-Challenge validation set
  • Brier Score: 0.1035 (lower is better; a score of 0 would mean perfectly confident, always-correct predictions)
  • Selective Accuracy: 91.8% (accuracy on high-confidence predictions)
  • Coverage: 94.0% (percentage of questions answered with confidence ≥ 0.99)

Key Findings

  • The model shows good overall calibration, with a 0.0338 average confidence gap between correct and incorrect predictions
  • Effective abstention: 62.5% of the questions the model abstained on would have been answered incorrectly, showing that abstention targets genuinely uncertain cases
  • Calibration issue detected: the model is, on average, more confident on its errors (0.9630) than on its low-confidence correct answers (0.7700)

🛠️ Technologies Used

  • Python 3.12
  • OpenAI API: For model inference with logprobs
  • Hugging Face Datasets: For loading ARC-Challenge dataset
  • NumPy: For numerical computations
  • Matplotlib: For data visualization
  • Jupyter Notebook: For interactive analysis

📁 Project Structure

section 1/
├── ARC_Challenge_Evaluation.ipynb  # Main evaluation notebook
└── README.md                        # This file

🚀 Setup Instructions

  1. Install Dependencies

    pip install "openai>=1.0.0" "datasets>=2.0.0" numpy matplotlib
  2. Configure API Key

    • Update the OPENAI_API_KEY variable in the notebook with your API key
    • The notebook uses a fine-tuned model: ft:gpt-4.1-nano-2025-04-14:algoverse-ai-safety:arc-v2:CrBuBGfj (a configuration sketch follows these steps)
  3. Run the Notebook

    • Execute cells sequentially from top to bottom
    • The evaluation processes 299 questions (takes ~5-10 minutes)
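
For orientation, the client setup referenced in steps 2-3 looks roughly like the snippet below. This is a minimal sketch, not the notebook's exact code; only OPENAI_API_KEY and the model ID come from this README, and everything else (variable names, prompt handling) is illustrative.

    from openai import OpenAI

    OPENAI_API_KEY = "sk-..."  # replace with your own key
    MODEL_ID = "ft:gpt-4.1-nano-2025-04-14:algoverse-ai-safety:arc-v2:CrBuBGfj"

    client = OpenAI(api_key=OPENAI_API_KEY)

    # One deterministic call per question, with logprobs for the calibration analysis
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": "formatted ARC-Challenge question goes here"}],
        temperature=0,
        logprobs=True,
    )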

📈 Methodology

Part A: Accuracy Measurement

  • Formats questions as multiple-choice prompts (handles any number of choices)
  • Calls OpenAI API with temperature=0 for deterministic outputs
  • Flexible answer extraction: The extract_answer() function reads question['choices']['label'] at runtime, so it handles questions with more than 4 answer choices (letter labels such as A-F, or numeric labels like 1, 2, 3, 4); a sketch follows this list
  • Compares predictions to ground truth and reports accuracy as a percentage (1 decimal place)
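
A minimal sketch of the extraction and scoring logic described above, assuming the ARC-Challenge fields from Hugging Face (question['choices']['label'] and question['answerKey']); helper names other than extract_answer() are illustrative:

    def extract_answer(model_output: str, valid_labels: list[str]) -> str | None:
        """Return the first valid choice label found in the model output, else None."""
        for token in model_output.strip().upper().replace(".", " ").split():
            if token in valid_labels:
                return token
        return None

    def is_correct(question: dict, model_output: str) -> bool:
        """Compare the extracted answer against the ground-truth answerKey."""
        valid_labels = list(question["choices"]["label"])  # e.g. ["A", "B", "C", "D", "E"]
        return extract_answer(model_output, valid_labels) == question["answerKey"]

    # accuracy = 100 * number of correct predictions / number of questions, reported to 1 decimal place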

Part B: Calibration Analysis (Brier Score)

  • Requests logprobs from API to extract confidence scores
  • Converts log probabilities to probabilities: confidence = exp(logprob)
  • Creates binary outcomes: 1 if correct, 0 if incorrect
  • Calculates Brier Score: BS = (1/N) * Σ(confidence - outcome)²
  • Reports the score to 4 decimal places (see the sketch after this list)
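
A short sketch of the Brier Score computation, assuming the per-question confidences (exp of the answer token's logprob) and binary outcomes have already been collected:

    import numpy as np

    def brier_score(confidences: list[float], outcomes: list[int]) -> float:
        """BS = (1/N) * sum((confidence - outcome)^2), where outcome is 1 if correct else 0."""
        conf = np.asarray(confidences, dtype=float)
        out = np.asarray(outcomes, dtype=float)
        return float(np.mean((conf - out) ** 2))

    # per question: confidence = math.exp(logprob) on the answer token
    # print(f"Brier Score: {brier_score(confidences, outcomes):.4f}")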

Part C: Error Fingerprint

  • Applies confidence threshold of 0.99
  • Questions with confidence ≥ 0.99 are "answered", others "abstained"
  • Calculates Selective Accuracy (accuracy on answered questions)
  • Calculates Coverage (percentage of questions answered)
  • Both are reported as percentages (1 decimal place); a sketch of both metrics follows this list
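
A minimal sketch of the two error-fingerprint metrics under the 0.99 threshold described above (variable names are illustrative):

    def selective_metrics(confidences, outcomes, threshold=0.99):
        """Selective Accuracy = accuracy on answered (confidence >= threshold) questions;
        Coverage = percentage of questions answered."""
        answered = [(c, o) for c, o in zip(confidences, outcomes) if c >= threshold]
        coverage = 100 * len(answered) / len(confidences)
        selective_accuracy = (
            100 * sum(o for _, o in answered) / len(answered) if answered else float("nan")
        )
        return round(selective_accuracy, 1), round(coverage, 1)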

Error Analysis & Threshold Sensitivity

  • Confidence Distribution: Visual histogram showing confidence patterns
  • Lowest Confidence Analysis: Identifies edge cases (low-confidence correct, high-confidence errors)
  • Abstained Questions Breakdown: Analyzes which questions trigger abstention
  • Threshold Sensitivity: Tests thresholds from 0.90 to 0.99 to find the threshold with the best abstention behavior (a sweep sketch follows this list)
  • Calibration Paradox: Highlights when model confidence doesn't align with correctness
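
The threshold sweep itself can be as simple as the loop below, reusing the selective_metrics helper sketched under Part C (the 0.01 step size is an assumption, not a value taken from the notebook):

    import numpy as np

    for threshold in np.arange(0.90, 1.00, 0.01):
        sel_acc, cov = selective_metrics(confidences, outcomes, threshold=round(threshold, 2))
        print(f"threshold={threshold:.2f}  selective_accuracy={sel_acc:.1f}%  coverage={cov:.1f}%")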

🔬 Research Contributions

This project demonstrates research-grade evaluation by:

  1. Causal Hypothesis Testing: The threshold sensitivity analysis tests whether abstention behavior is stable or threshold-dependent
  2. Intervention-Based Analysis: Systematically varies the confidence threshold to understand its impact
  3. Quantitative Insights: Provides data-driven conclusions about model calibration and uncertainty

💡 Key Insights

  1. Overconfidence Detection: The model shows high confidence (avg: 0.9630) even on incorrect predictions, indicating a calibration issue
  2. Effective Abstention: The 0.99 threshold effectively filters uncertain cases (62.5% of the abstained questions would have been answered incorrectly)
  3. Threshold Stability: Analysis shows how Selective Accuracy and Coverage trade off across different thresholds
  4. Calibration Paradox: Model is more confident on errors than on some correct answers, highlighting the need for better uncertainty estimation

📝 Technical Highlights

  • Robust Error Handling: Comprehensive try-except blocks for API calls
  • Flexible Answer Extraction: The extract_answer() function takes a valid_labels parameter so it can handle questions with any number of answer choices (letter labels A-F or numeric labels), satisfying Part A's requirement to "flexibly handle cases where there are more than 4 answer choices"
  • Efficient Processing: Single API call per question with logprobs included
  • Data Visualization: Professional matplotlib visualizations with clear legends and labels
  • Traceable Examples: Question IDs included for all example cases
  • Research-Grade Analysis: Goes beyond the baseline requirements with threshold sensitivity testing, reliability diagrams, and an expected calibration error (ECE) calculation (sketched below)
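
ECE here refers to expected calibration error; below is a minimal binned-ECE sketch consistent with the reliability-diagram analysis (the 10 equal-width bins are an assumption, not a value taken from the notebook):

    import numpy as np

    def expected_calibration_error(confidences, outcomes, n_bins=10):
        """Weighted average of |empirical accuracy - mean confidence| over equal-width bins."""
        conf = np.asarray(confidences, dtype=float)
        out = np.asarray(outcomes, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
            mask = (conf >= lo) & (conf <= hi) if i == 0 else (conf > lo) & (conf <= hi)
            if mask.any():
                ece += mask.mean() * abs(out[mask].mean() - conf[mask].mean())
        return float(ece)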

🎓 Skills Demonstrated

  • LLM API integration (OpenAI)
  • Model evaluation and benchmarking
  • Calibration analysis and uncertainty estimation
  • Statistical analysis (Brier Score, confidence metrics)
  • Data visualization (matplotlib)
  • Error analysis and pattern identification
  • Research methodology (hypothesis testing, causal analysis)
  • Python programming and Jupyter notebooks

📄 License

This project was completed as part of the Algoverse AI Safety Fellowship take-home challenge.

Note: This evaluation framework can be adapted for other multiple-choice benchmarks and model evaluation tasks.
