Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…e/transformers into fix-aligned-data-ptr-grouped-mm
  else:
      # (S, input_dim) @ grouped (num_experts, output_dim, input_dim).T -> (S, output_dim)
-     out = _grouped_mm(input, weight.transpose(-2, -1), offs=offs)
+     out = _grouped_mm(input, weight.transpose(-2, -1).contiguous(), offs=offs)
Adding .contiguous() after .transpose(-2, -1) in _grouped_linear ensures the weight tensor's memory layout is contiguous before passing it to _grouped_mm, fixing RuntimeError: expected data_ptr to be aligned to 16 bytes. Alternatively, we could maybe enforce Forced16BytesAlignment in the weight converter, as is done for MoE — that way no other model would hit this issue?
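To see why the transpose breaks the alignment expectation, here is a minimal plain-Python sketch of torch's stride semantics (names and shapes are illustrative, not taken from the PR): `transpose(-2, -1)` only swaps shape/stride metadata without copying data, so the resulting view is no longer row-major contiguous.

```python
# Hedged sketch: model torch's row-major stride layout in plain Python.

def row_major_strides(shape):
    """Strides (in elements) of a contiguous row-major tensor of this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def is_contiguous(shape, strides):
    return list(strides) == row_major_strides(shape)

# An illustrative (num_experts, output_dim, input_dim) weight, stored contiguously.
shape = (4, 8, 16)
strides = row_major_strides(shape)                 # [128, 16, 1]

# transpose(-2, -1) swaps the last two dims WITHOUT copying: only the
# metadata changes, so the underlying data is no longer contiguous.
t_shape = (shape[0], shape[2], shape[1])           # (4, 16, 8)
t_strides = [strides[0], strides[2], strides[1]]   # [128, 1, 16]

assert is_contiguous(shape, strides)
assert not is_contiguous(t_shape, t_strides)
# .contiguous() materializes a fresh row-major copy, which is what restores
# the layout (and hence the pointer alignment) that _grouped_mm expects.
```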
  partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
  dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
+ dim = int(dim * partial_rotary_factor)  # Mistral4 doesn't apply RoPE to the full attention head
This was failing test_model_rope_scaling_frequencies with AssertionError: The values for attribute 'shape' do not match: torch.Size([1, 64]) != torch.Size([1, 128]). Mistral4 does not apply RoPE to the full attention head (cf. qk_rope and qk_nope).
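The shape mismatch above is consistent with a partial rotary factor of 0.5 on a 128-dim head. A quick hedged arithmetic sketch (the config values are illustrative, not the actual Mistral4 ones):

```python
# Hedged sketch of the partial-RoPE dim computation; values are illustrative.
hidden_size = 4096
num_attention_heads = 32
partial_rotary_factor = 0.5  # assumed: RoPE applied to half the head dim

head_dim = hidden_size // num_attention_heads    # 128 (full head)
rotary_dim = int(head_dim * partial_rotary_factor)  # 64 (RoPE'd slice)

# Matches the failing assertion: the RoPE frequencies have shape [1, 64],
# not [1, 128], because only rotary_dim of each head gets rotated.
assert (head_dim, rotary_dim) == (128, 64)
```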
@@ -44,12 +44,21 @@
  class Mistral4ModelTester(CausalLMModelTester):
test_model_is_small fails because the config inherited from the common tester made the tiny test model too large (1,233,664 params); it needs to be < 1,000,000.
  _supports_flex_attn = True
- _can_compile_fullgraph = True
+ _can_compile_fullgraph = False
This hits a TorchInductor error, so full-graph compilation is disabled for now 🙈
…e/transformers into fix-aligned-data-ptr-grouped-mm
- cache_position = torch.arange(query_states.shape[2], device=query_states.device) + past_seen_tokens
+ position_ids = kwargs.get("position_ids")
+ if position_ids is None:
+     position_ids = torch.arange(query_states.shape[2], device=query_states.device) + past_seen_tokens
We need to reuse RoPE's position_ids: because we recompute torch.arange(seq_len) every time, we otherwise end up with positions that differ from RoPE's position_ids for those tokens, which breaks generation.
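A hedged plain-Python illustration of the divergence: with left padding, the position_ids RoPE uses skip over pad tokens, while a naively recomputed arange does not, so the two disagree on real-token positions (the values here are illustrative, not from the PR):

```python
# Hedged sketch: recomputed arange vs. RoPE's actual position_ids.
seq_len = 5
past_seen_tokens = 0

# Illustrative left-padded prompt: 3 pad tokens, then 2 real tokens.
# RoPE's position_ids (derived from the attention mask) keep pads at 0
# and start counting at the first real token.
rope_position_ids = [0, 0, 0, 0, 1]

# Naive recomputation ignores padding and counts every slot.
recomputed = [past_seen_tokens + i for i in range(seq_len)]  # [0, 1, 2, 3, 4]

# The real tokens get different positions under the two schemes,
# so queries/keys are rotated inconsistently and generation degrades.
assert rope_position_ids != recomputed
```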
[For maintainers] Suggested jobs to run (before merge):
run-slow: auto, mistral4
This comment contains models: ["models/auto", "models/mistral4"]
CI Results / Commit Info
Model CI Report: ❌ 2 new failed tests from this PR 😭
#44825