Bug: beam search crashes with HybridMambaAttentionDynamicCache (4 bugs in modeling_nemotron_h.py)
Beam search (num_beams > 1, use_cache=True) is broken due to 4 bugs in modeling_nemotron_h.py. Tested with transformers 5.5.0, causal-conv1d 1.6.1, mamba-ssm 2.3.1, PyTorch 2.6, CUDA 12.6, trust_remote_code=True (commit cbd3fa9f).
Bug 1 -- HybridMambaAttentionDynamicCache.__init__ (line 177) computes conv_kernel_size = config.conv_kernel as a local variable but never stores it on self. cuda_kernels_forward (line 461) accesses cache_params.conv_kernel_size and crashes:
AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'
Fix: add self.conv_kernel_size = config.conv_kernel in __init__.
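A minimal sketch of the Bug 1 fix (the surrounding class body is illustrative, not the real modeling_nemotron_h.py code):

```python
# Bug 1 fix sketch: persist conv_kernel_size on self so the CUDA-kernel
# path can later read cache_params.conv_kernel_size. Before the fix it
# was only a local variable inside __init__ and was lost immediately.
class HybridMambaAttentionDynamicCache:
    def __init__(self, config):
        # broken version:
        #   conv_kernel_size = config.conv_kernel   # local, never stored
        self.conv_kernel_size = config.conv_kernel  # fix: store on self


class Cfg:  # stand-in for the model config used in this demo
    conv_kernel = 4


cache = HybridMambaAttentionDynamicCache(Cfg())
print(cache.conv_kernel_size)  # 4
```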
Bug 2 -- update_conv_state (lines 249, 252) and update_ssm_state (line 256) call self.conv_states.device / self.ssm_states.device but both are Python lists:
AttributeError: 'list' object has no attribute 'device'
Fix: use self.conv_states[layer_idx].device / self.ssm_states[layer_idx].device.
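The Bug 2 fix in isolation, assuming (as in the report) that conv_states is a plain Python list holding one tensor per layer:

```python
import torch

# Bug 2 fix sketch: conv_states / ssm_states are Python lists of per-layer
# tensors, so .device must be read from an indexed element, never from the
# list object itself.
conv_states = [torch.zeros(2, 3) for _ in range(4)]  # one tensor per layer
layer_idx = 1

# broken: conv_states.device  ->  AttributeError: 'list' object has no attribute 'device'
# fixed: index the layer first, then read that tensor's device
device = conv_states[layer_idx].device
new_state = torch.ones(2, 3).to(device)
conv_states[layer_idx].copy_(new_state)
print(conv_states[layer_idx].sum().item())  # 6.0
```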
Bug 3 -- __init__ allocates conv_states with intermediate_size = mamba_num_heads * mamba_head_dim (4096 for this model) but the mixer stores hidden_states_B_C of size conv_dim = intermediate_size + 2 * n_groups * ssm_state_size (6144). The CUDA kernel detects the mismatch:
RuntimeError: weight must have shape (dim, width)
Fix: allocate conv_states with conv_dim instead of intermediate_size.
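The size mismatch behind Bug 3 can be reproduced with plain arithmetic. The head and group counts below are assumptions chosen to match the 4096/6144 figures cited above, not values read from the actual config:

```python
# Bug 3 fix sketch: the conv state must be allocated with conv_dim (the
# width of hidden_states_B_C that the mixer writes), not intermediate_size.
mamba_num_heads, mamba_head_dim = 64, 64  # assumed split; product = 4096
n_groups, ssm_state_size = 8, 128         # assumed values; add 2048 more

intermediate_size = mamba_num_heads * mamba_head_dim           # 4096 (too small)
conv_dim = intermediate_size + 2 * n_groups * ssm_state_size   # 6144 (correct)
print(intermediate_size, conv_dim)  # 4096 6144
```

Allocating with intermediate_size leaves the cache 2 * n_groups * ssm_state_size channels short, which the CUDA kernel surfaces as the "weight must have shape (dim, width)" error.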
Bug 4 -- With device_map="auto" across multiple GPUs, update_conv_state(cache_init=True) calls .to(self.conv_states[layer_idx].device) which moves the tensor back to the initialisation device (cuda:0), even for layers running on cuda:1. The next decode step then runs causal_conv1d_update with x/weight on cuda:1 and conv_state on cuda:0. The CUDA kernel reports this as the same shape error from Bug 3, which made it hard to diagnose.
Fix: for cache_init=True, assign directly without .to() so the tensor stays on the device where the mixer ran. Same applies to update_ssm_state.
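A simplified stand-in for update_conv_state showing the Bug 4 fix; the real method has more arguments, and the CPU tensors here stand in for layers spread across cuda:0 and cuda:1:

```python
import torch

# Bug 4 fix sketch: on cache_init, assign the incoming tensor directly
# instead of calling .to(self.conv_states[layer_idx].device), so the
# cached state stays on whatever device that layer's mixer ran on.
class Cache:
    def __init__(self, num_layers):
        # states start on the initialisation device (CPU in this demo;
        # cuda:0 in the multi-GPU scenario from the report)
        self.conv_states = [torch.zeros(2, 3) for _ in range(num_layers)]

    def update_conv_state(self, layer_idx, new_state, cache_init=False):
        if cache_init:
            # fix: keep new_state's device; no .to(old initialisation device)
            self.conv_states[layer_idx] = new_state
        else:
            # incremental update: move onto the (now correct) cached device
            self.conv_states[layer_idx].copy_(
                new_state.to(self.conv_states[layer_idx].device)
            )


cache = Cache(num_layers=2)
t = torch.ones(2, 3)  # imagine this lives on cuda:1 under device_map="auto"
cache.update_conv_state(1, t, cache_init=True)
print(cache.conv_states[1].device.type)
```

With the fix, the next decode step finds conv_state on the same device as x and weight, so causal_conv1d_update no longer sees mismatched inputs.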
All four bugs are in the incremental decode path (cache_position[0] > 0), which is only hit during cached multi-step generation. Greedy and sampling with a fresh cache never reach it.