mtmd: add NVIDIA LocateAnything-3B vision support by sfallah · Pull Request #24749 · ggml-org/llama.cpp

sfallah · 2026-06-18T05:20:44Z

This adds support for NVIDIA LocateAnything-3B,
a visual grounding / detection VLM.

This PR covers the model itself (autoregressive decode). Its Parallel Box Decoding ("fast mode")
will come in a follow-up PR.

What it reuses

- Vision encoder: MoonViT-SO-400M, the same as Kimi-K2.5. clip_graph_locateanything subclasses clip_graph_kimik25 and overrides only build().
- Connector: an "Eagle MLP" (LayerNorm(4608) → Linear → GELU → Linear) on the shared mm_projector tensors. The LayerNorm runs over the merged 4608-dim feature, which is the only difference from Kimi-K2.5.
- Text: plain Qwen2.5-3B via the existing Qwen2Model. No src/ changes.
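
For readers unfamiliar with the connector shape, here is a minimal pure-Python sketch of a LayerNorm → Linear → GELU → Linear projector applied to one merged feature vector. It is not the graph code from this PR. The toy dimensions, the parameter names (ln_g, w1, ...), the erf-based GELU variant, the eps value, and the patch concatenation order are all assumptions made for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply the learned scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(v):
    """Exact (erf-based) GELU; the real model may use a different variant."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def eagle_mlp(merged, p):
    """LayerNorm -> Linear -> GELU -> Linear over one merged feature vector."""
    h = layer_norm(merged, p["ln_g"], p["ln_b"])
    h = [gelu(v) for v in linear(h, p["w1"], p["b1"])]
    return linear(h, p["w2"], p["b2"])

# Toy sizes: four 2-dim patch features are concatenated into one 8-dim merged
# vector (standing in for the 4608-dim feature), then projected to 3 dims.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merged = [v for patch in patches for v in patch]
params = {
    "ln_g": [1.0] * 8, "ln_b": [0.0] * 8,
    "w1": [[0.1] * 8 for _ in range(4)], "b1": [0.0] * 4,
    "w2": [[0.5] * 4 for _ in range(3)], "b2": [0.0] * 3,
}
out = eagle_mlp(merged, params)
print(len(out))
```

The point of the sketch is the ordering: normalization happens after the patch features are merged, so the LayerNorm statistics are computed over the full merged vector rather than per patch.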

What's new

- PROJECTOR_TYPE_LOCATEANYTHING and its graph.
- Converter conversion/locateanything.py: renames vision_model. to vision_tower., maps the Eagle MLP to mm_projector.*, permutes the fused wqkv to split Q/K, and writes the pixel budget from in_token_limit.
- image_resize_round_up hparam (off by default, set only here): the HF processor rounds the resize up to a multiple of patch*merge, but calc_size_preserved_ratio rounds to nearest, which squished images and caused repeated-box loops.
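
The rounding difference is easy to see on a single dimension. The sketch below is a simplified stand-in, not the actual calc_size_preserved_ratio or HF processor code: it only contrasts round-to-nearest with round-up for a multiple of patch*merge. The patch size of 14 and merge factor of 2 are assumed values for illustration, and the helper names are hypothetical.

```python
import math

def round_nearest(value, multiple):
    """Round to the nearest multiple (halves round up), never below one multiple."""
    return max(multiple, math.floor(value / multiple + 0.5) * multiple)

def round_up(value, multiple):
    """Round up (ceil) to the next multiple."""
    return math.ceil(value / multiple) * multiple

def aligned_size(width, height, align, round_up_flag):
    """Align both dimensions using the selected rounding rule."""
    pick = round_up if round_up_flag else round_nearest
    return pick(width, align), pick(height, align)

ALIGN = 14 * 2  # assumed: patch size 14, merge factor 2

nearest = aligned_size(1009, 700, ALIGN, round_up_flag=False)
ceiled = aligned_size(1009, 700, ALIGN, round_up_flag=True)
print(nearest, ceiled)
```

With these assumed values a width of 1009 becomes 1008 under round-to-nearest but 1036 under round-up, while a height of 700 is already aligned and stays unchanged. The two rules therefore produce different patch grids for the same input image, which is the mismatch the new flag is meant to remove.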

Converting and running

python convert_hf_to_gguf.py /path/to/LocateAnything-3B --outtype bf16            # text (qwen2)
python convert_hf_to_gguf.py /path/to/LocateAnything-3B --mmproj --outtype bf16   # mmproj

llama-mtmd-cli -m la-text-bf16.gguf --mmproj la-mmproj-bf16.gguf \
  --image image.jpg --chat-template chatml \
  -p "<image 1><__media__>Locate all the instances that matches the following description: dog."

Output is <box><x1><y1><x2><y2></box>, or <box><x><y></box> for a point, or <box>None</box>
for no match. Coordinates are control tokens <0>..<1000>; convert to pixels with
px = coord / 1000 * image_dim.
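
As an illustration of the output format, the following sketch parses the generated text and converts coordinates to pixels with the formula above. It assumes the control tokens appear in the decoded text literally as `<123>`. The function name and return shape are hypothetical and not part of llama.cpp.

```python
import re

BOX_RE = re.compile(r"<box>(.*?)</box>")
COORD_RE = re.compile(r"<(\d+)>")

def parse_detections(text, image_width, image_height):
    """Return a list of ("box", (x1, y1, x2, y2)) or ("point", (x, y)) in pixels.

    <box>None</box> entries are skipped, so no match yields an empty list.
    """
    dims = (image_width, image_height)
    results = []
    for body in BOX_RE.findall(text):
        if body.strip() == "None":
            continue
        coords = [int(c) for c in COORD_RE.findall(body)]
        if len(coords) not in (2, 4):
            continue  # malformed entry; ignore rather than guess
        # px = coord / 1000 * image_dim, alternating x (width) and y (height)
        pixels = tuple(c * dims[i % 2] / 1000 for i, c in enumerate(coords))
        results.append(("box" if len(coords) == 4 else "point", pixels))
    return results

sample = "<box><100><200><500><800></box><box><500><500></box><box>None</box>"
detections = parse_detections(sample, image_width=640, image_height=480)
print(detections)
```

For a 640x480 image, the first entry maps to the box (64, 96, 320, 384) and the second to the point (320, 240).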

Testing

I compared the output against the HF reference (CPU sdpa, bf16, greedy) on a few images covering
detection, no-match, point, and natural-image detection. The boxes match token-for-token, apart
from a couple of coordinate tokens off by 1-2px on small boxes, which is bf16 backend noise.
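
A comparison of this kind can be expressed as a small tolerance check. The sketch below is not the script used for the tests above. It assumes each output has already been reduced to a list of integer coordinate tuples, and the tolerance is measured in coordinate-token units.

```python
def coords_match(reference, candidate, tol=0):
    """True if both outputs have the same entries, each coordinate within tol."""
    if len(reference) != len(candidate):
        return False
    return all(
        len(ref) == len(got) and all(abs(a - b) <= tol for a, b in zip(ref, got))
        for ref, got in zip(reference, candidate)
    )

hf_boxes = [(100, 200, 500, 800)]      # hypothetical reference output
gguf_boxes = [(100, 201, 499, 800)]    # hypothetical llama.cpp output

exact = coords_match(hf_boxes, gguf_boxes)
close = coords_match(hf_boxes, gguf_boxes, tol=2)
print(exact, close)
```

With tol=0 this is a strict token-for-token comparison, and a small positive tolerance accepts the minor coordinate drift described above.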

License

The weights use NVIDIA's non-commercial license. The code here is MIT like the rest of llama.cpp.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

github-actions bot added the examples and python (python script changes) labels Jun 18, 2026

sfallah force-pushed the sf/locateanything-3b branch from 1ad84e5 to 5adaaaa on June 18, 2026 05:24

sfallah added 2 commits June 18, 2026 07:28

mtmd: add NVIDIA LocateAnything-3B vision support

dc08cd1

- MoonViT-SO-400M encoder (same as Kimi-K2.5) + Eagle MLP connector + Qwen2.5-3B text
- locateanything graph reuses the Kimi-K2.5 path; only the connector LayerNorm differs (merged 4608-dim)
- text auto-routes to Qwen2, no src/ changes

mtmd: ceil LocateAnything image size to match the HF processor

f8c60ee

- HF rounds the resize up to a multiple of patch*merge; the shared preprocessor rounds to nearest
- add opt-in image_resize_round_up flag (default off), set for LocateAnything

sfallah force-pushed the sf/locateanything-3b branch from 5adaaaa to f8c60ee on June 18, 2026 05:28

sfallah wants to merge 2 commits into ggml-org:master from sfallah:sf/locateanything-3b
