Trustworthy-ML-Lab repositories

ReflCtrl

Public

Python

• 0 • 0 • 0 • 0 • Updated Dec 19, 2025

ThinkEdit

Public

[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.

deep-learning interpretable-machine-learning large-language-models generative-ai mechanistic-interpretability reasoning-language-models

Python

• 1 • 16 • 0 • 0 • Updated Dec 17, 2025

Efficient-Interpretability-Eval

Public

0 • 0 • 0 • 0 • Updated Dec 2, 2025

posthoc-generative-cbm

Public

[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE) -- efficiently transform any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality

computer-vision deep-learning interpretable-deep-learning concept-bottleneck-models interpretability-and-explainability generative-ai mechanistic-interpretability

Jupyter Notebook

• 2 • 16 • 1 • 0 • Updated Nov 4, 2025

Training_Trustworthy_LRM_with_Refine

Public

A new training framework for Trustworthy Large Reasoning Models

machine-learning deep-learning interpretability trustworthy-ai llms faithfulness llms-reasoning

Python

• 1 • 4 • 0 • 0 • Updated Oct 31, 2025

Concept-Bottleneck-LLM

Public

Python

• 0 • 5 • 0 • 0 • Updated Aug 15, 2025

Robust_HighUtil_Smoothed_DRL

Public

[ICML 24] S-DQN and S-PPO: Robust smoothed deep RL agents without sacrificing performance

deep-learning deep-reinforcement-learning robustness adversarial-machine-learning robust-machine-learning robust-learning randomized-smoothing

Python

• 0 • 5 • 0 • 0 • Updated Aug 15, 2025

CB-LLMs

Public

[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.

natural-language-processing deep-learning interpretable-deep-learning explainable-ai large-language-models mechanistic-interpretability

Python

• 14 • 28 • 0 • 0 • Updated Aug 15, 2025

Neuron_Eval

Public

[ICML 25] A unified mathematical framework to evaluate neuron explanations of deep learning models with sanity tests

deep-neural-networks computer-vision interpretable-deep-learning explainable-ai large-language-models mechanistic-interpretability

Jupyter Notebook

• 0 • 7 • 0 • 0 • Updated Jul 1, 2025

efficient_neuron_eval

Public

0 • 1 • 0 • 0 • Updated Jun 10, 2025

VLG-CBM

Public

[NeurIPS 24] A new training and evaluation framework for learning interpretable deep vision models and benchmarking different interpretable concept-bottleneck-models (CBMs)

deep-neural-networks computer-vision deep-learning explainable-ai interpretable-machine-learning concept-bottleneck-models large-language-models

Jupyter Notebook

• 5 • 27 • 1 • 0 • Updated Jun 5, 2025

Linear-Explanations

Public

[ICML 24] A novel automated neuron explanation framework that can accurately describe poly-semantic concepts in deep neural networks

computer-vision deep-learning interpretable-machine-learning mechanistic-interpretability

Jupyter Notebook

• 0 • 13 • 0 • 0 • Updated May 2, 2025

effective_skill_unlearning

Public

[NAACL 25] Two novel, light-weight, and training-free skill unlearning methods for LLMs

natural-language-processing deep-learning interpretability large-language-model

Python

• 0 • 4 • 0 • 0 • Updated Mar 27, 2025

RAT_MisD

Public

Boosting misclassification detection ability by radius-aware training (RAT)

deep-learning misclassification-detection

Python

• 0 • 0 • 0 • 0 • Updated Mar 21, 2025

Describe-and-Dissect

Public

[TMLR 25] An automated method for explaining complex neuron behaviors in deep vision models using large language models

deep-neural-networks computer-vision deep-learning explainable-ai interpretable-machine-learning large-language-models generative-ai mechanistic-interpretability

Jupyter Notebook

• 2 • 10 • 1 • 0 • Updated Feb 20, 2025

provable-efficient-dataset-distill-KRR

Public

Python

•

Apache License 2.0

• 0 • 1 • 0 • 0 • Updated Dec 10, 2024

Interpretability-Guided-Defense

Public

[ECCV 24] A new and low-cost test-time defense for DNNs based on neuron-level-interpretability methods

computer-vision deep-learning interpretability robustness adversarial-machine-learning adversarial-examples

Python

• 1 • 4 • 0 • 0 • Updated Oct 1, 2024

Audio_Network_Dissection

Public

[ICML 24] AND: the first framework to provide automatic natural language explanations for deep acoustic networks

deep-neural-networks deep-learning interpretable-machine-learning mechanistic-interpretability

Jupyter Notebook

• 0 • 4 • 0 • 0 • Updated Sep 29, 2024

DSC-210-NLA-FA22

Public

Jupyter Notebook

• 0 • 1 • 0 • 0 • Updated Sep 23, 2024

concept-driven-continual-learning

Public

official code repo

Jupyter Notebook

• 1 • 0 • 0 • 0 • Updated Sep 10, 2024

NN-LPK

Public

Python

• 0 • 2 • 0 • 0 • Updated Jun 14, 2024

Provably-Robust-Conformal-Prediction

Public

[ICLR 24] This work proposes RSCP+ to provide robustness guarantee in evaluation, and two novel methods PTT and RCT to robustify conformal predictions with improved efficiency through post-hoc transformation and training.

deep-neural-networks deep-learning robustness adversarial-machine-learning robust-machine-learning

Python

• 1 • 5 • 0 • 0 • Updated Apr 3, 2024

Label-free-CBM

Public

[ICLR 23] A new framework to transform any neural network into an interpretable concept-bottleneck-model (CBM) without needing labeled concept data

deep-neural-networks computer-vision deep-learning interpretability interpretable-deep-learning

Jupyter Notebook

• 28 • 128 • 2 • 0 • Updated Mar 31, 2024

Efficient-LLM-automated-interpretability

Public

[NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs

deep-learning interpretability explainable-ai large-language-models mechanistic-interpretability

Python

• 1 • 5 • 0 • 0 • Updated Dec 23, 2023

CLIP-dissect

Public

[ICLR 23 spotlight] An automatic and efficient tool to describe functionalities of individual neurons in DNNs

deep-neural-networks computer-vision deep-learning interpretable-deep-learning explainable-ai interpretable-machine-learning mechanistic-interpretability

Jupyter Notebook

• 16 • 59 • 0 • 0 • Updated Nov 6, 2023

corrupting_neuron_explanations

Public

[ICCV 23] Evaluating robustness of neuron explanation methods

deep-neural-networks computer-vision deep-learning robustness interpretable-machine-learning robust-machine-learning mechanistic-interpretability

Jupyter Notebook

•

MIT License

• 1 • 4 • 0 • 0 • Updated Sep 30, 2023


26 repositories
