mtmd: add NVIDIA LocateAnything-3B vision support #24749

Draft
sfallah wants to merge 2 commits into ggml-org:master from sfallah:sf/locateanything-3b

@sfallah sfallah commented Jun 18, 2026

This adds support for NVIDIA LocateAnything-3B,
a visual grounding / detection VLM.

This PR covers the model itself (autoregressive decode). Its Parallel Box Decoding ("fast mode")
will come in a follow-up PR.

What it reuses

  • Vision encoder: MoonViT-SO-400M, same as Kimi-K2.5. clip_graph_locateanything subclasses
    clip_graph_kimik25 and only overrides build().
  • Connector: an "Eagle MLP" (LayerNorm(4608) → Linear → GELU → Linear) on the shared
    mm_projector tensors. The LayerNorm runs over the merged 4608-dim feature, which is the only difference from Kimi-K2.5.
  • Text: plain Qwen2.5-3B via the existing Qwen2Model. No src/ changes.
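
To make the connector's order of operations concrete, here is a minimal pure-Python sketch of LayerNorm → Linear → GELU → Linear. It is not the graph code from this PR. The function names are hypothetical, the LayerNorm scale and bias are omitted, the exact (erf) GELU is an assumption, and a toy width of 4 stands in for the real 4608-dim merged feature.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize over the whole merged feature vector (no learnable scale/bias here).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    # Exact GELU; assumption: the real model may use a different variant.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    # weight is a list of output rows, each as long as x.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def eagle_mlp(merged, w1, b1, w2, b2):
    # LayerNorm over the merged feature, then Linear -> GELU -> Linear.
    h = layer_norm(merged)
    h = [gelu(v) for v in linear(h, w1, b1)]
    return linear(h, w2, b2)

# Toy usage: identity weights and zero biases on a 4-dim "merged" feature.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [0.0] * 4
out = eagle_mlp([1.0, 2.0, 3.0, 4.0], identity, zeros, identity, zeros)
```

The point of the sketch is only that the normalization statistics are computed across the full merged vector before the first projection.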

What's new

  • PROJECTOR_TYPE_LOCATEANYTHING and its graph.
  • Converter conversion/locateanything.py: renames vision_model. to vision_tower., maps the
    Eagle MLP to mm_projector.*, permutes the fused wqkv to split Q/K, writes the pixel budget
    from in_token_limit.
  • image_resize_round_up hparam (off by default, set only here): the HF processor rounds the
    resized dimensions up to a multiple of patch*merge, whereas calc_size_preserved_ratio rounds
    to the nearest multiple. The mismatch squished images and caused repeated-box loops.
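
The rounding mismatch described in the last bullet can be illustrated with a small sketch. This is not the preprocessor code. `round_to_multiple` is a hypothetical helper, and the patch size of 14 with a merge factor of 2 (alignment 28) is an assumed value used only for the arithmetic.

```python
import math

PATCH = 14            # assumed patch size, for illustration only
MERGE = 2             # assumed merge factor, for illustration only
ALIGN = PATCH * MERGE

def round_to_multiple(value, multiple, round_up):
    """Snap value to a multiple of `multiple`, rounding up or to the nearest one."""
    if round_up:
        return math.ceil(value / multiple) * multiple
    # Round half up, and never collapse to zero.
    return max(multiple, int(value / multiple + 0.5) * multiple)

w, h = 500, 375
nearest = (round_to_multiple(w, ALIGN, False), round_to_multiple(h, ALIGN, False))
rounded_up = (round_to_multiple(w, ALIGN, True), round_to_multiple(h, ALIGN, True))
print(nearest, rounded_up)  # (504, 364) (504, 392)
```

In this example the width agrees but the height differs by one full alignment step, so the two paths produce images with different aspect ratios.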

Converting and running

python convert_hf_to_gguf.py /path/to/LocateAnything-3B --outtype bf16            # text (qwen2)
python convert_hf_to_gguf.py /path/to/LocateAnything-3B --mmproj --outtype bf16   # mmproj

llama-mtmd-cli -m la-text-bf16.gguf --mmproj la-mmproj-bf16.gguf \
  --image image.jpg --chat-template chatml \
  -p "<image 1><__media__>Locate all the instances that matches the following description: dog."

Output is <box><x1><y1><x2><y2></box>, or <box><x><y></box> for a point, or <box>None</box>
for no match. Coordinates are control tokens <0>..<1000>; convert to pixels with
px = coord / 1000 * image_dim.
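
As a rough illustration of the conversion above, the following sketch parses the model's text output and scales the 0..1000 coordinates to pixels. `parse_boxes` is a hypothetical helper, not part of this PR, and it assumes the output arrives as plain text in exactly the format described.

```python
import re

# One <box>...</box> group; the body is either "None" or a run of <int> tokens.
BOX_RE = re.compile(r"<box>(.*?)</box>")
COORD_RE = re.compile(r"<(\d+)>")

def parse_boxes(text, width, height):
    """Return pixel-space tuples: (x1, y1, x2, y2) for boxes, (x, y) for points."""
    results = []
    for body in BOX_RE.findall(text):
        coords = [int(c) for c in COORD_RE.findall(body)]
        if len(coords) not in (2, 4):
            continue  # <box>None</box> or anything malformed is skipped
        # x values scale by width, y values by height.
        dims = (width, height) * (len(coords) // 2)
        results.append(tuple(c * d / 1000 for c, d in zip(coords, dims)))
    return results

boxes = parse_boxes("<box><100><200><500><1000></box>", 640, 480)
point = parse_boxes("<box><250><500></box>", 640, 480)
empty = parse_boxes("<box>None</box>", 640, 480)
```

Use the dimensions of the original image here, not the resized one, if the boxes are to be drawn on the original.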

Testing

I compared the output against the HF reference (CPU sdpa, bf16, greedy) on a few images covering
detection, no-match, point, and natural-image detection. The boxes match token-for-token, apart
from a couple of coordinate tokens off by 1-2px on small boxes, which is bf16 backend noise.

License

The weights use NVIDIA's non-commercial license. The code here is MIT like the rest of llama.cpp.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

@github-actions github-actions Bot added examples python python script changes labels Jun 18, 2026
@sfallah sfallah force-pushed the sf/locateanything-3b branch from 1ad84e5 to 5adaaaa Compare June 18, 2026 05:24
sfallah added 2 commits June 18, 2026 07:28
- MoonViT-SO-400M encoder (same as Kimi-K2.5) + Eagle MLP connector + Qwen2.5-3B text
- locateanything graph reuses the Kimi-K2.5 path; only the connector LayerNorm differs (merged 4608-dim)
- text auto-routes to Qwen2, no src/ changes
- HF rounds the resize up to a multiple of patch*merge; the shared preprocessor rounds to nearest
- add opt-in image_resize_round_up flag (default off), set for LocateAnything
@sfallah sfallah force-pushed the sf/locateanything-3b branch from 5adaaaa to f8c60ee Compare June 18, 2026 05:28