
Bug: beam search crashes with HybridMambaAttentionDynamicCache (4 bugs in modeling_nemotron_h.py) (NemotronH 30B A3B) #142

@FahdSeddik

Description


Beam search (num_beams > 1, use_cache=True) is broken due to 4 bugs in modeling_nemotron_h.py. Tested with transformers 5.5.0, causal-conv1d 1.6.1, mamba-ssm 2.3.1, PyTorch 2.6, CUDA 12.6, trust_remote_code=True (commit cbd3fa9f).

Bug 1 -- HybridMambaAttentionDynamicCache.__init__ (line 177) computes conv_kernel_size = config.conv_kernel as a local variable but never stores it on self. cuda_kernels_forward (line 461) accesses cache_params.conv_kernel_size and crashes:

AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'

Fix: add self.conv_kernel_size = config.conv_kernel in __init__.
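A minimal sketch of the fix; only `conv_kernel_size` and `config.conv_kernel` come from the issue, and the rest of the real `__init__` is omitted here:

```python
class HybridMambaAttentionDynamicCache:
    def __init__(self, config):
        # Before: conv_kernel_size = config.conv_kernel  (local, then discarded)
        # After: keep it on self so cuda_kernels_forward can read
        # cache_params.conv_kernel_size without an AttributeError.
        self.conv_kernel_size = config.conv_kernel
```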

Bug 2 -- update_conv_state (lines 249, 252) and update_ssm_state (line 256) access self.conv_states.device / self.ssm_states.device, but both attributes are Python lists of per-layer tensors:

AttributeError: 'list' object has no attribute 'device'

Fix: use self.conv_states[layer_idx].device / self.ssm_states[layer_idx].device.

Bug 3 -- __init__ allocates conv_states with intermediate_size = mamba_num_heads * mamba_head_dim (4096 for this model) but the mixer stores hidden_states_B_C of size conv_dim = intermediate_size + 2 * n_groups * ssm_state_size (6144). The CUDA kernel detects the mismatch:

RuntimeError: weight must have shape (dim, width)

Fix: allocate conv_states with conv_dim instead of intermediate_size.
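The arithmetic behind the mismatch; the head/group split below is illustrative, only the 4096 and 6144 totals come from the issue:

```python
# Illustrative config values chosen to reproduce the issue's totals.
mamba_num_heads, mamba_head_dim = 64, 64
n_groups, ssm_state_size = 8, 128

intermediate_size = mamba_num_heads * mamba_head_dim          # 4096
conv_dim = intermediate_size + 2 * n_groups * ssm_state_size  # 6144

# Before: conv_states allocated with channel dim intermediate_size (4096).
# After:  allocate with conv_dim (6144), matching the hidden_states_B_C
# tensor the mixer actually writes, so causal_conv1d sees consistent shapes.
```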

Bug 4 -- With device_map="auto" across multiple GPUs, update_conv_state(cache_init=True) calls .to(self.conv_states[layer_idx].device) which moves the tensor back to the initialisation device (cuda:0), even for layers running on cuda:1. The next decode step then runs causal_conv1d_update with x/weight on cuda:1 and conv_state on cuda:0. The CUDA kernel reports this as the same shape error from Bug 3, which made it hard to diagnose.
Fix: for cache_init=True, assign directly without .to() so the tensor stays on the device where the mixer ran. Same applies to update_ssm_state.

All four bugs are in the incremental decode path (cache_position[0] > 0), which is only hit during cached multi-step generation. Greedy and sampling with a fresh cache never reach it.
