
Inconsistent documentation and lack of support for LLaMA-Nemotron-VL #15023

@adithya-s-k


Describe the bug

There is inconsistent documentation and missing implementation support for LLaMA-Nemotron-VL within the NVIDIA NeMo framework. The current official documentation references model setup, configuration, and inference examples that are either incomplete, outdated, or unsupported in the actual codebase. Additionally, critical pull requests (such as PR #13819) that were intended to address this issue remain unmerged, leaving the LLaMA-Nemotron-VL pipeline in a partially functional state.

This leads to confusion for users attempting to load or finetune the model, as the documented components (e.g., model config paths, tokenizer references, and visual encoder integration) do not align with the available code in the latest release of NeMo.


Steps/Code to reproduce bug

  1. Follow the setup as per the official LLaMA-Nemotron-VL documentation.
  2. Attempt to initialize the model as shown:
    from nemo.collections.multimodal.models.vlms import LlamaNemotronVLModel
    model = LlamaNemotronVLModel.from_pretrained("nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1")
  3. Observe that the import either fails or the model cannot be loaded due to missing components in the vlm module or missing configuration entries.
  4. Attempting to run training/inference scripts from the documentation results in attribute errors or unresolved references (e.g., missing visual_backbone config keys).
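Before debugging further, it can help to confirm whether the documented import path resolves at all in the installed NeMo release. A minimal diagnostic sketch (the dotted path below is the one the documentation cites; whether it actually exists is exactly what this issue is about):

```python
import importlib.util

def module_available(dotted_path: str) -> bool:
    """Return True if the dotted module path can be resolved.

    find_spec() imports parent packages along the way, so a missing
    parent raises ModuleNotFoundError; treat that as "not available".
    """
    try:
        return importlib.util.find_spec(dotted_path) is not None
    except ModuleNotFoundError:
        return False

# Import path as cited in the LLaMA-Nemotron-VL documentation (assumed,
# since the point of this check is that it may not exist in the release):
documented = "nemo.collections.multimodal.models.vlms"
status = "found" if module_available(documented) else "MISSING"
print(f"{documented}: {status}")
```

Running this against the pip-installed `nemo_toolkit['all']` versus a source checkout of `main` makes it easy to state precisely which install is missing the module.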

Expected behavior

The LLaMA-Nemotron-VL model should be fully supported in NeMo, with:

  • Corresponding class definitions and configuration files in the vlm submodule.
  • A reproducible pipeline for inference and finetuning as described in the documentation.
  • A validated pretrained model checkpoint loadable through the from_pretrained() interface.
  • Updated examples aligned with the current repository structure and dependencies.
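The checklist above could be captured as a small smoke test that fails loudly when the documented surface is absent. The module and class names below are taken from the documented (and possibly unimplemented) API, so treat them as assumptions rather than a verified interface:

```python
import importlib

# Names the documentation cites; assumptions about what a complete
# implementation would expose, not verified API.
CHECKS = [
    ("nemo.collections.multimodal.models.vlms", "LlamaNemotronVLModel"),
]

def class_exposed(module_path: str, class_name: str) -> tuple[bool, str]:
    """Check that the module imports and exposes the named class with a
    callable from_pretrained, reporting the first failure found."""
    try:
        mod = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        return False, f"module missing: {exc}"
    cls = getattr(mod, class_name, None)
    if cls is None:
        return False, f"{class_name} not exposed by {module_path}"
    if not callable(getattr(cls, "from_pretrained", None)):
        return False, f"{class_name}.from_pretrained not callable"
    return True, "ok"

for mod_path, cls_name in CHECKS:
    ok, detail = class_exposed(mod_path, cls_name)
    print(f"{'PASS' if ok else 'FAIL'} {mod_path}.{cls_name}: {detail}")
```

A check like this in CI would keep the documentation and the repository from drifting apart again after support lands.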

Environment overview (please complete the following information)

  • Environment location: Cloud (Azure VM)
  • Method of NeMo install: pip install nemo_toolkit['all']
  • Additional attempts: Installed from source (latest main branch, as of Nov 2025)
  • PR reference: Llama Nemotron VL #13819

Environment details

  • OS version: Ubuntu 22.04
  • PyTorch version: 2.4.1
  • Python version: 3.10.14
  • CUDA version: 12.2
  • GPU model: NVIDIA A100 80GB

Additional context

  • PR #13819 appears to introduce partial support for LLaMA-Nemotron-VL but has not been merged, leaving users unable to replicate the documented workflows.
  • The documentation page still lists configuration options and usage patterns referencing unimplemented modules.
  • Several users have reported similar issues on GitHub Discussions, but no stable release or example notebook currently demonstrates a working multimodal inference setup for this model.

This gap between the documentation and repository codebase severely limits reproducibility and adoption of LLaMA-Nemotron-VL within the NeMo ecosystem.
