Proposal: ggml_rope_v2 #1401

@ngxson

Motivation

While working with Devstral 2 and, more recently, GLM4V, I noticed that both the RoPE kernels and the ggml_rope_ext API are quite difficult to work with. I also spotted some opportunities to improve performance a bit, so I'm writing this proposal.

Problems with the existing kernel:

  • It repeats some redundant calculations, for example theta_scale, or the mscale calculation in the YaRN code path. These can be calculated once on the host, with the result passed into the kernel
  • The M-RoPE code paths are difficult to follow (this got worse after interleaved M-RoPE was added alongside Qwen3-VL)

Problems with the API:

  • ggml_rope_ext has too many arguments
  • ggml_rope_multi has even more arguments and half of them are hardly used in practice
  • mode was supposed to be a bit field, but got messy with M-RoPE
  • attn_factor is not aligned with how transformers handles it, which made the recent Devstral 2 debugging quite tricky

Proposal

My proposal has 2 parts: the API and the kernel

For the API, I propose a set of calls like this:

// rope Qcur with pos
struct ggml_tensor * roped = ggml_rope_v2(ctx, Qcur, pos, n_dims, freq_base);
// with yarn
ggml_rope_v2_set_yarn(ctx, roped, ext_factor, mscale, beta_fast, beta_slow);
// with m-rope
ggml_rope_v2_set_mrope(ctx, roped, sections);
// with ordering (NEOX or NORMAL)
ggml_rope_v2_set_ordering(ctx, roped, GGML_ROPE_ORDERING_NEOX);

Behind the scenes, these calls will pre-calculate as many things as possible before passing args to the kernel. For example, theta_scale = powf(freq_base, -2.0f/n_dims) will be calculated inside ggml_rope_v2's implementation
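
To illustrate the host-side pre-calculation, here is a minimal sketch. It is not ggml code: `rope_v2_params`, `rope_v2_make_params` and `rope_v2_angles` are hypothetical names made up for this example. Only the theta_scale formula comes from the proposal. The idea is that the `powf` runs once when the op is built, and the kernel then only needs a multiply per dimension pair.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical parameter block (not part of ggml): everything the kernel
// needs, with derived values already computed on the host.
struct rope_v2_params {
    int   n_dims;
    float freq_base;
    float theta_scale; // powf(freq_base, -2.0f/n_dims), computed once
};

// Hypothetical helper: build the params once, when the op is created.
rope_v2_params rope_v2_make_params(int n_dims, float freq_base) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    rope_v2_params p;
    p.n_dims      = n_dims;
    p.freq_base   = freq_base;
    p.theta_scale = std::pow(freq_base, -2.0f / n_dims);
    return p;
}

// What a kernel would do per row: theta_i = pos * theta_scale^i,
// built incrementally so no pow call is needed here.
std::vector<float> rope_v2_angles(const rope_v2_params & p, int pos) {
    std::vector<float> theta(p.n_dims / 2);
    float t = (float) pos;
    for (int i = 0; i < p.n_dims / 2; ++i) {
        theta[i] = t;
        t *= p.theta_scale;
    }
    return theta;
}
```

With n_dims = 4 and freq_base = 10000, theta_scale is 10000^(-1/2) = 0.01, so position 3 gives angles of 3 and roughly 0.03 for the two dimension pairs.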

For the kernel, we can statically compile templated kernels with the combination of:

  • 2 x direction: forward or backward
  • 2 x types: f16, f32
  • 2 x ordering: normal, neox --> the indexing will be controlled via an input arg
  • 2 x modes: normal, M-RoPE

Note: the vision mode (aka 2D-RoPE with NeoX) will instead become a composed op of 2 x ggml_rope_v2
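
A rough sketch of what "composed op" could mean, written on plain float arrays rather than ggml tensors. The function names are hypothetical, and the split (first half of the head rotated by the y position, second half by the x position) is an assumption for illustration. The actual layout would have to match the existing vision mode.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// NeoX-style 1D rotation over n_dims contiguous values:
// pair i is (x[i], x[i + n_dims/2]).
static void rope_neox_1d(float * x, int n_dims, int pos, float freq_base) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    const int   half        = n_dims / 2;
    const float theta_scale = std::pow(freq_base, -2.0f / n_dims);
    float theta = (float) pos;
    for (int i = 0; i < half; ++i) {
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + half];
        x[i]        = x0 * c - x1 * s;
        x[i + half] = x0 * s + x1 * c;
        theta *= theta_scale;
    }
}

// 2D RoPE expressed as two independent 1D ropes (assumed split:
// first half of the head uses pos_y, second half uses pos_x).
void rope_2d_composed(std::vector<float> & head, int pos_y, int pos_x, float freq_base) {
    assert(!head.empty() && head.size() % 4 == 0);
    const int half = (int) head.size() / 2;
    rope_neox_1d(head.data(),        half, pos_y, freq_base);
    rope_neox_1d(head.data() + half, half, pos_x, freq_base);
}
```

In graph terms this would presumably be two views of the input, one ggml_rope_v2 call on each, and a concat of the results, so the kernel itself never needs a dedicated vision mode.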

Since the ordering is selected through an input arg rather than a template parameter, this gives 2 x 2 x 2 = 8 statically compiled kernels in total
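
A CPU-side sketch of how such a templated kernel could look, under a few assumptions: direction, element type and mode are template parameters, while ordering is a plain run-time argument that only changes the pair indexing. All names here (`rope_v2_row`, `rope_ordering`) are hypothetical, the M-RoPE section lookup is a simplified non-interleaved variant, and YaRN is left out entirely. An f16 instantiation would additionally need a half type, which this example does not define.

```cpp
#include <cassert>
#include <cmath>

enum rope_ordering { ROPE_ORDERING_NORMAL, ROPE_ORDERING_NEOX }; // hypothetical names

// forward / T / is_mrope are compile-time choices; ordering is a run-time arg.
template <bool forward, typename T, bool is_mrope>
void rope_v2_row(T * x, int n_dims, float theta_scale,
                 const int * pos,       // 1 position (normal) or 4 (M-RoPE)
                 const int * sections,  // 4 section sizes, only read for M-RoPE
                 rope_ordering ordering) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    const int n_pairs = n_dims / 2;
    float freq = 1.0f; // theta_scale^i, built incrementally
    for (int i = 0; i < n_pairs; ++i) {
        int p = pos[0];
        if (is_mrope) {
            // Simplified M-RoPE: each pair picks the position stream of its section.
            const int total = sections[0] + sections[1] + sections[2] + sections[3];
            assert(total > 0);
            const int sector = i % total;
            int bound = 0;
            for (int k = 0; k < 4; ++k) {
                bound += sections[k];
                if (sector < bound) { p = pos[k]; break; }
            }
        }
        const float theta = p * freq;
        const float cs = std::cos(theta);
        // The backward pass rotates by -theta, i.e. flips the sign of sin.
        const float sn = (forward ? 1.0f : -1.0f) * std::sin(theta);

        // Ordering only affects which two elements form a pair.
        const bool neox = ordering == ROPE_ORDERING_NEOX;
        const int  i0   = neox ? i           : 2 * i;
        const int  i1   = neox ? i + n_pairs : 2 * i + 1;

        const float x0 = (float) x[i0];
        const float x1 = (float) x[i1];
        x[i0] = (T) (x0 * cs - x1 * sn);
        x[i1] = (T) (x0 * sn + x1 * cs);
        freq *= theta_scale;
    }
}
```

For example, `rope_v2_row<true, float, false>(row, n_dims, theta_scale, &pos, nullptr, ROPE_ORDERING_NEOX)` would be the forward f32 non-M-RoPE variant. Running the `<false, ...>` variant with the same arguments afterwards should undo the rotation up to rounding.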
