Bug: beam search crashes with HybridMambaAttentionDynamicCache (4 bugs in modeling_nemotron_h.py)
Beam search (num_beams > 1, use_cache=True) is broken due to 4 bugs in modeling_nemotron_h.py. Tested with transformers 5.5.0, causal-conv1d 1.6.1, mamba-ssm 2.3.1, PyTorch 2.6, CUDA 12.6, trust_remote_code=True (commit cbd3fa9f).
Bug 1 -- HybridMambaAttentionDynamicCache.__init__ (line 177) computes conv_kernel_size = config.conv_kernel as a local variable but never stores it on self. cuda_kernels_forward (line 461) accesses cache_params.conv_kernel_size and crashes:
AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'
Fix: add self.conv_kernel_size = config.conv_kernel in __init__.
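A minimal sketch of the Bug 1 fix (the surrounding class body is illustrative, not the real modeling_nemotron_h.py code):

```python
# Bug 1 fix sketch: persist conv_kernel_size on self so the CUDA-kernel
# path can later read cache_params.conv_kernel_size. Before the fix it
# was only a local variable inside __init__ and was lost immediately.
class HybridMambaAttentionDynamicCache:
    def __init__(self, config):
        # broken version:
        #   conv_kernel_size = config.conv_kernel   # local, never stored
        self.conv_kernel_size = config.conv_kernel  # fix: store on self


class Cfg:  # stand-in for the model config used in this demo
    conv_kernel = 4


cache = HybridMambaAttentionDynamicCache(Cfg())
print(cache.conv_kernel_size)  # 4
```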
Bug 2 -- update_conv_state (lines 249, 252) and update_ssm_state (line 256) call self.conv_states.device / self.ssm_states.device but both are Python lists:
AttributeError: 'list' object has no attribute 'device'
Fix: use self.conv_states[layer_idx].device / self.ssm_states[layer_idx].device.
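The Bug 2 fix in isolation, assuming (as in the report) that conv_states is a plain Python list holding one tensor per layer:

```python
import torch

# Bug 2 fix sketch: conv_states / ssm_states are Python lists of per-layer
# tensors, so .device must be read from an indexed element, never from the
# list object itself.
conv_states = [torch.zeros(2, 3) for _ in range(4)]  # one tensor per layer
layer_idx = 1

# broken: conv_states.device  ->  AttributeError: 'list' object has no attribute 'device'
# fixed: index the layer first, then read that tensor's device
device = conv_states[layer_idx].device
new_state = torch.ones(2, 3).to(device)
conv_states[layer_idx].copy_(new_state)
print(conv_states[layer_idx].sum().item())  # 6.0
```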
Bug 3 -- __init__ allocates conv_states with intermediate_size = mamba_num_heads * mamba_head_dim (4096 for this model) but the mixer stores hidden_states_B_C of size conv_dim = intermediate_size + 2 * n_groups * ssm_state_size (6144). The CUDA kernel detects the mismatch:
RuntimeError: weight must have shape (dim, width)
Fix: allocate conv_states with conv_dim instead of intermediate_size.
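The size mismatch behind Bug 3 can be reproduced with plain arithmetic. The head and group counts below are assumptions chosen to match the 4096/6144 figures cited above, not values read from the actual config:

```python
# Bug 3 fix sketch: the conv state must be allocated with conv_dim (the
# width of hidden_states_B_C that the mixer writes), not intermediate_size.
mamba_num_heads, mamba_head_dim = 64, 64  # assumed split; product = 4096
n_groups, ssm_state_size = 8, 128         # assumed values; add 2048 more

intermediate_size = mamba_num_heads * mamba_head_dim           # 4096 (too small)
conv_dim = intermediate_size + 2 * n_groups * ssm_state_size   # 6144 (correct)
print(intermediate_size, conv_dim)  # 4096 6144
```

Allocating with intermediate_size leaves the cache 2 * n_groups * ssm_state_size channels short, which the CUDA kernel surfaces as the "weight must have shape (dim, width)" error.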
Bug 4 -- With device_map="auto" across multiple GPUs, update_conv_state(cache_init=True) calls .to(self.conv_states[layer_idx].device) which moves the tensor back to the initialisation device (cuda:0), even for layers running on cuda:1. The next decode step then runs causal_conv1d_update with x/weight on cuda:1 and conv_state on cuda:0. The CUDA kernel reports this as the same shape error from Bug 3, which made it hard to diagnose.
Fix: for cache_init=True, assign directly without .to() so the tensor stays on the device where the mixer ran. Same applies to update_ssm_state.
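A simplified stand-in for update_conv_state showing the Bug 4 fix; the real method has more arguments, and the CPU tensors here stand in for layers spread across cuda:0 and cuda:1:

```python
import torch

# Bug 4 fix sketch: on cache_init, assign the incoming tensor directly
# instead of calling .to(self.conv_states[layer_idx].device), so the
# cached state stays on whatever device that layer's mixer ran on.
class Cache:
    def __init__(self, num_layers):
        # states start on the initialisation device (CPU in this demo;
        # cuda:0 in the multi-GPU scenario from the report)
        self.conv_states = [torch.zeros(2, 3) for _ in range(num_layers)]

    def update_conv_state(self, layer_idx, new_state, cache_init=False):
        if cache_init:
            # fix: keep new_state's device; no .to(old initialisation device)
            self.conv_states[layer_idx] = new_state
        else:
            # incremental update: move onto the (now correct) cached device
            self.conv_states[layer_idx].copy_(
                new_state.to(self.conv_states[layer_idx].device)
            )


cache = Cache(num_layers=2)
t = torch.ones(2, 3)  # imagine this lives on cuda:1 under device_map="auto"
cache.update_conv_state(1, t, cache_init=True)
print(cache.conv_states[1].device.type)
```

With the fix, the next decode step finds conv_state on the same device as x and weight, so causal_conv1d_update no longer sees mismatched inputs.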
All four bugs are in the incremental decode path (cache_position[0] > 0), which is only hit during cached multi-step generation. Greedy and sampling with a fresh cache never reach it.