17 repositories

roc-n-reroll
Public
Code used for "ROC-n-reroll: How verifier imperfection affects test-time scaling" at ICLR 2026.
language-model test-time-scaling
Jupyter Notebook
•0•2•0•0•Updated Feb 20, 2026
folktexts
Public
Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data!
python machine-learning tabular-data transformers uncertainty fairness large-language-models
Jupyter Notebook
•
MIT License
•5•25•0•1•Updated Dec 14, 2025
mono-multi
Public
Code to reproduce the paper "Monoculture or Multiplicity: Which Is It?"
Jupyter Notebook
•
MIT License
•0•0•0•0•Updated Oct 27, 2025
benchbench
Public
BenchBench is a Python package to evaluate multi-task benchmarks.
Python
•
MIT License
•1•19•1•0•Updated Oct 12, 2025
lm-harmony
Public
Jupyter Notebook
•
MIT License
•0•5•0•0•Updated Sep 22, 2025
benchmark-prediction
Public
Python
•
MIT License
•1•5•0•0•Updated Aug 30, 2025
error-parity
Public
Achieve error-rate fairness between societal groups for any score-based classifier.
Python
•
MIT License
•4•19•0•2•Updated Aug 21, 2025
lm-evaluation-harness
Public
A framework for few-shot evaluation of language models.
Python
•
MIT License
•3.2k•1•0•0•Updated May 4, 2025
causal-features
Public
Code to reproduce the paper "Do causal predictors generalize better to new domains?"
Python
•
Other
•18•15•0•0•Updated Feb 7, 2025
twitter-predictability
Public
Jupyter Notebook
•
MIT License
•0•0•0•0•Updated Jan 22, 2025
surveying-language-models
Public
Code to reproduce the paper "Questioning the Survey Responses of Large Language Models"
Jupyter Notebook
•
MIT License
•2•9•0•0•Updated Dec 8, 2024
training-on-the-test-task
Public
Code to reproduce the experiments in the paper "Training on the Test Task Confounds Evaluation and Emergence".
Jupyter Notebook
•1•11•0•0•Updated Dec 3, 2024
lawma
Public
Lawma: A lightly fine-tuned Llama model for legal classification tasks.
language-model legaltech legaltools
Jupyter Notebook
•1•28•0•0•Updated Sep 14, 2024
folktables
Public
Datasets derived from US census data
Python
•
MIT License
•22•281•7•4•Updated May 15, 2024
tttlm
Public
Test-time-training on nearest neighbors for large language models
Python
•
MIT License
•6•50•0•0•Updated Apr 18, 2024
backward_baselines
Public
Code for "Is your model predicting the past?"
Jupyter Notebook
•
MIT License
•0•2•0•0•Updated Mar 10, 2024
whynot
Public
A Python sandbox for decision making in dynamics
Python
•
MIT License
•44•426•8•2•Updated Aug 21, 2023

Social Foundations of Computation