Add Direct Logit Attribution tool for TransformerBridge by TravisHaa · Pull Request #1316 · TransformerLensOrg/TransformerLens

TravisHaa · 2026-05-20T01:33:07Z

Description

Implemented a Direct Logit Attribution (DLA) tool for the new TransformerBridge system. Closes [Proposal] Direct Logit Attribution Tool #1263. Based on the stale PR (Draft) Add DLA function to utils #466, but adapted to use the 3.0 TransformerBridge.
Returns per-component (or per-layer) contributions to the logit difference between the correct and wrong tokens, decomposing the residual stream according to the accumulated bool.
Generated a docstring (according to contributing.md) highlighting important limitations of the DLA tool: it currently does not support hybrid layers such as Mamba, and it requires bridge compatibility mode.
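
To make the computation concrete, here is a minimal pure-Python sketch of the idea, not the code in this PR. It assumes that accumulated=False scores each component's own write to the residual stream and that accumulated=True scores the running sum of those writes; it also ignores the final LayerNorm. The component labels, vectors, and the `attribute` helper are illustrative assumptions.

```python
# Toy direct logit attribution: no model and no TransformerLens involved.
# Every name and number below is an illustrative assumption.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical per-component writes into the final residual stream (d_model = 3).
components = {
    "embed": [0.5, -1.0, 0.25],
    "0_attn_out": [1.0, 0.5, -0.5],
    "0_mlp_out": [-0.25, 2.0, 1.0],
}

# Hypothetical unembedding columns for the correct and wrong tokens.
w_u_correct = [1.0, 0.0, 2.0]
w_u_wrong = [0.0, 1.0, 1.0]

# Logit-difference direction: W_U[:, correct] - W_U[:, wrong].
direction = [c - w for c, w in zip(w_u_correct, w_u_wrong)]

def attribute(components, direction, accumulated=False):
    """Project each component (or the running residual) onto the direction."""
    scores = {}
    running = [0.0] * len(direction)
    for label, vec in components.items():
        if accumulated:
            # Score the residual stream as it stands after this component.
            running = [r + x for r, x in zip(running, vec)]
            scores[label] = dot(running, direction)
        else:
            # Score only what this component wrote.
            scores[label] = dot(vec, direction)
    return scores

per_component = attribute(components, direction)
accumulated = attribute(components, direction, accumulated=True)
```

In this toy setup the per-component scores add up to the final accumulated score, which is the property that lets the logit difference be split across components.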

Acknowledged limitations

Strict mode only. Hybrid architectures (Mamba, SSM, Mixer, LinearAttention) raise NotImplementedError. ActivationCache.decompose_resid only knows how to decompose attn_out + mlp_out per layer; supporting hybrid blocks requires extending that method and is out of scope for this PR (I will be working on this in next steps).
Requires compatibility mode. Raises ValueError if bridge.enable_compatibility_mode() hasn't been called. Without folded LayerNorm weights, the projection direction is wrong and per-component scores don't reflect actual logit contributions.
Excludes unembedding bias. Per-component scores sum to actual_logit_diff − (b_U[correct] − b_U[wrong]). This matches the convention in cache.decompose_resid and PR (Draft) Add DLA function to utils #466; the bias is a constant offset
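
The bias identity above can be checked by hand on a toy example. The sketch below is pure Python with made-up numbers; it assumes logits are computed as resid · W_U[:, tok] + b_U[tok] with no final LayerNorm, which is a simplification of what a real bridge does.

```python
# Toy check of: sum(scores) == logit_diff - (b_U[correct] - b_U[wrong]).
# All vectors and biases here are illustrative assumptions.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

resid_final = [1.25, 1.5, 0.75]                   # hypothetical final residual stream
w_u = {"correct": [1.0, 0.0, 2.0], "wrong": [0.0, 1.0, 1.0]}  # unembedding columns
b_u = {"correct": 0.5, "wrong": -0.25}            # unembedding biases

# Real logits include the bias term.
logits = {tok: dot(resid_final, w_u[tok]) + b_u[tok] for tok in w_u}
logit_diff = logits["correct"] - logits["wrong"]

# DLA projects onto the weight direction only, so the bias never enters.
direction = [c - w for c, w in zip(w_u["correct"], w_u["wrong"])]
dla_total = dot(resid_final, direction)

# The gap between the two is exactly the constant bias offset.
bias_offset = b_u["correct"] - b_u["wrong"]
assert dla_total == logit_diff - bias_offset
```

Because the offset does not depend on the residual stream, it cannot be assigned to any single component, which is consistent with reporting it separately rather than folding it into the scores.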

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Screenshots

Checklist:

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have not rewritten tests relating to key interfaces which would affect backward compatibility

jlarson4

Hi @TravisHaa! I have reviewed your PR and left a few comments. I cannot run the code in its current state until some of these comments are addressed. Let me know if you have any questions.

Also, the PR is marked as "tests added", but I am not seeing any tests in the diff. Could you add tests/unit/tools/test_direct_logit_attribution.py covering at least:

Correct path on a small bridge (e.g., gpt2 with compatibility mode)
ValueError when compatibility mode is off
NotImplementedError when a Mamba-like adapter is present
Both accumulated=True and accumulated=False
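
As a rough illustration of how the two error-path cases in this list could be laid out, here is a self-contained unittest sketch. It does not use the real bridge: `StubBridge`, its `compatibility_mode` flag, the `check_dla_preconditions` guard, and the layer-type labels are all hypothetical stand-ins, and the real tool's interface may differ. The correct-path and accumulated=True/False cases need an actual model and are left out.

```python
import unittest

# Assumed labels for hybrid layer types; the real identifiers may differ.
HYBRID_LAYER_TYPES = {"mamba", "ssm", "mixer", "linear_attention"}


class StubBridge:
    """Hypothetical stand-in for a TransformerBridge (not the real class)."""

    def __init__(self, compatibility_mode, layer_types):
        self.compatibility_mode = compatibility_mode  # assumed flag name
        self._layer_types = list(layer_types)

    def layer_types(self):
        return list(self._layer_types)


def check_dla_preconditions(bridge):
    """Hypothetical guard mirroring the two errors described in the PR."""
    if not bridge.compatibility_mode:
        raise ValueError("enable compatibility mode before running DLA")
    hybrid = HYBRID_LAYER_TYPES.intersection(bridge.layer_types())
    if hybrid:
        raise NotImplementedError(f"hybrid layers not supported: {sorted(hybrid)}")


class TestDlaPreconditions(unittest.TestCase):
    def test_plain_transformer_passes(self):
        bridge = StubBridge(True, ["attention", "attention"])
        self.assertIsNone(check_dla_preconditions(bridge))

    def test_value_error_without_compatibility_mode(self):
        bridge = StubBridge(False, ["attention"])
        with self.assertRaises(ValueError):
            check_dla_preconditions(bridge)

    def test_not_implemented_for_mamba_like_layer(self):
        bridge = StubBridge(True, ["attention", "mamba"])
        with self.assertRaises(NotImplementedError):
            check_dla_preconditions(bridge)
```

Keeping the guard checks separate from the numerical checks means the error-path tests stay fast and do not require downloading a model.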

Feel free to tag me once these edits are in and I will re-review. Thank you for your work on this, it is coming along nicely!

@jlarson4

Resolved review feedback from @jlarson4. Added tests on a distilgpt2 bridge in compatibility mode covering reconstruction invariants and arguments, asserting sum(scores) == logit_diff - (b_U[correct] - b_U[wrong]) against the model's real logits, plus labels/shape and batch-averaging checks. Added additional hardening:
- Fix a latent direction-shape bug: replace the fragile answer_tokens.numel()==1 branch with a robust reshape so single-prompt, single-token inputs are handled correctly
- Detect hybrid blocks via bridge.layer_types(), the codebase's own semantic mechanism, instead of substring matching on named_modules()
- Import get_act_name from transformer_lens.utilities to avoid the transformer_lens.utils DeprecationWarning; drop the invalid return_type kwarg to run_with_cache
- Register the analysis subpackage in tools/__init__.py
Closes TransformerLensOrg#1263.

TravisHaa · 2026-06-07T07:12:48Z

@jlarson4 I apologize for the delay once again. I spent a lot of time exploring the codebase to understand how I could improve this PR beyond your suggested fixes. I will make sure to iterate within a couple of hours of your review of this new commit 👍

Add Direct Logit Attribution tool for TransformerBridge

b7fc39a

jlarson4 reviewed May 20, 2026


jlarson4 force-pushed the dev branch from da370b8 to f6192fa Compare May 22, 2026 21:12

TravisHaa wants to merge 2 commits into TransformerLensOrg:dev from TravisHaa:feat/dla-tool-1263


2 participants
