[moe] Add residual bottleneck variant for 10T gate experiments #4061

Open
claude[bot] wants to merge 1 commit into main from agent/20260323-fix-4035
Conversation

claude[bot] (Contributor) commented Mar 23, 2026

Create the experiments/grug/moe_resid_bottleneck/ variant from the MoE base. It adds per-layer learnable residual scaling (init 1.0, applied before each block) and per-head zero-init sigmoid attention gates (gate = 2 * sigmoid(W @ x[:12])). Both features are toggled in the config via the use_residual_lambdas and use_attention_gates flags. The existing grug variant contract tests auto-discover and validate the new variant.
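For readers skimming the description, here is a minimal pure-Python sketch of the two mechanisms. It is not the code in this PR. Only the flag names use_residual_lambdas and use_attention_gates and the formula gate = 2 * sigmoid(W @ x[:12]) come from the description. The names VariantConfig, head_gates, and layer_step are hypothetical, as is the assumption that each head's output has the same width as x and that the gates read the already-scaled x. The point it illustrates is that both features start as the identity: a residual scale of 1.0 leaves x unchanged, and a zero-initialized W gives 2 * sigmoid(0) = 1.0 for every head.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantConfig:  # hypothetical container; only the two flag names come from the PR text
    use_residual_lambdas: bool = True
    use_attention_gates: bool = True


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_gates(x, W, n_in=12):
    """One gate per head: 2 * sigmoid(W @ x[:n_in]). W has one row per head."""
    x_in = x[:n_in]
    return [2.0 * sigmoid(sum(w * v for w, v in zip(row, x_in))) for row in W]


def layer_step(x, head_outputs, lam, W, cfg):
    """Scale the residual stream, then add the gated per-head outputs.

    Simplification (assumption): each head output has the same length as x,
    and the gates are computed from the scaled x.
    """
    if cfg.use_residual_lambdas:
        x = [lam * v for v in x]
    if cfg.use_attention_gates:
        gates = head_gates(x, W)
    else:
        gates = [1.0] * len(head_outputs)
    out = list(x)
    for g, h in zip(gates, head_outputs):
        out = [o + g * hv for o, hv in zip(out, h)]
    return out


num_heads, d_model = 2, 16
x = [0.5] * d_model
head_outputs = [[1.0] * d_model, [2.0] * d_model]
W_zero = [[0.0] * 12 for _ in range(num_heads)]  # zero-init gate weights

on = layer_step(x, head_outputs, 1.0, W_zero, VariantConfig())
off = layer_step(x, head_outputs, 1.0, W_zero, VariantConfig(False, False))
print(on == off)
```

With lam = 1.0 and zero gate weights, the toggled-on path matches the toggled-off path, which is presumably why these initial values were chosen: enabling the flags should not change the model at step 0.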

Fixes #4035

…tion gates

Create experiments/grug/moe_resid_bottleneck/ variant from MoE base adding
per-layer learnable residual scaling (init 1.0) and per-head zero-init sigmoid
attention gates. Both features are config-toggled. For the great 10T gate
residual bottleneck experiments.

Fixes #4035

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude[bot] added the agent-generated label Mar 23, 2026

github-actions[bot] (Contributor) commented

🤖 Grug variant diff report

| New Variant | Closest Existing Variant | Distance Score | Diff |
| --- | --- | --- | --- |
| moe_resid_bottleneck | moe | 123 | Open |

Artifact fallback: Download report bundle

This was referenced Mar 30, 2026
github-actions[bot] (Contributor) commented

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

github-actions[bot] added the stale label Apr 16, 2026

Development

Successfully merging this pull request may close these issues.

[moe] Great 10T: residual bottleneck experiments