@gsarti gsarti commented Jan 7, 2026

Summary

This PR adds a benchmark suite for evaluating NNsight skills across three difficulty levels (easy, medium, hard) and six interpretability techniques.

What's included

  • 17 query files organized by difficulty (6 easy, 6 medium, 5 hard) covering:

    • nnsight-basics: Core tracing, saving, interventions
    • logit-lens: Layer-wise prediction decoding
    • activation-patching: Causal intervention via swapping
    • attribution-patching: Gradient-based approximation
    • causal-tracing: Mediation analysis
    • model-steering: Steering vectors, persistent edits
  • Infrastructure:

    • Schema definitions for queries and results (schema.py)
    • Structural validator for API correctness (validators/structural.py)
    • Deprecated-pattern checker that detects pre-0.5 NNsight usage (validators/deprecated.py)
    • Mock, Claude Code CLI, and Claude API runners (runners/)
    • Results analysis and comparison tools (analyze.py)
  • Documentation:

    • Comprehensive README with query taxonomy
    • Evaluation metrics (execution, API correctness, functional correctness, efficiency)
    • A/B testing protocol for skill effectiveness
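
To make the query taxonomy above concrete, here is a minimal sketch of how a query record could be modeled and validated. It is not the contents of schema.py: the class name `BenchmarkQuery`, its fields, and the helper `count_by_difficulty` are hypothetical names chosen for illustration. Only the difficulty levels and technique names come from this PR description.

```python
from collections import Counter
from dataclasses import dataclass

# Difficulty levels and techniques as listed in the PR description.
DIFFICULTIES = ("easy", "medium", "hard")
TECHNIQUES = (
    "nnsight-basics",
    "logit-lens",
    "activation-patching",
    "attribution-patching",
    "causal-tracing",
    "model-steering",
)


@dataclass(frozen=True)
class BenchmarkQuery:
    """Hypothetical query record; the real schema may differ."""

    name: str        # e.g. "extract-hidden-states"
    difficulty: str  # one of DIFFICULTIES
    technique: str   # one of TECHNIQUES
    prompt: str      # task text given to the model under evaluation

    def __post_init__(self) -> None:
        # Reject values outside the taxonomy as early as possible.
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if self.technique not in TECHNIQUES:
            raise ValueError(f"unknown technique: {self.technique!r}")


def count_by_difficulty(queries: list[BenchmarkQuery]) -> Counter:
    """Count queries per difficulty level."""
    return Counter(q.difficulty for q in queries)
```

A count like this could serve as a quick sanity check that the loaded queries match the stated 6/6/5 split.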

Filename changes

Query files now use descriptive hyphen-separated titles based on their content (e.g., extract-hidden-states.yaml, position-specific-head-patching.yaml) instead of generic numeric IDs.
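
The naming convention can be sketched as a small slug function. This is an assumption about how such filenames could be derived from a query title, not the script used in this PR; `query_filename` is a hypothetical helper.

```python
import re


def query_filename(title: str, ext: str = ".yaml") -> str:
    """Turn a query title into a lowercase, hyphen-separated filename."""
    # Collapse every run of non-alphanumeric characters into a single hyphen,
    # then trim hyphens left over at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title contains no alphanumeric characters")
    return slug + ext


print(query_filename("Extract hidden states"))
# extract-hidden-states.yaml
print(query_filename("Position-specific head patching"))
# position-specific-head-patching.yaml
```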

Test plan

  • Verify validators work against reference solutions
  • Run mock benchmark: python skills_benchmark/runners/base.py --num-runs 1
  • Test API runner with Claude (requires API key)

Add a comprehensive benchmark suite for evaluating NNsight skills across
different difficulty levels (easy, medium, hard) and techniques (basics,
logit-lens, activation-patching, attribution-patching, causal-tracing,
model-steering).

The benchmark includes:
- 17 query files testing specific interpretability techniques
- Schema definitions for queries and results
- Structural validation for API correctness
- Deprecated pattern detection (pre-0.5 NNsight)
- Mock, Claude Code CLI, and Claude API runners
- Results analysis and comparison tools

Filenames now use descriptive hyphen-separated titles based on the query
content rather than generic numeric IDs.