Add Molmo (7B-D, 7B-O, 70B) #33962

Open · wants to merge 203 commits into main

Conversation

@molbap (Contributor) commented Oct 4, 2024

What does this PR do?

As mentioned in issue #33710, this is a draft to add native support for Molmo in transformers.
It also uses the new modular framework introduced in #33248.

Molmo has several existing variants:

  • MolmoE, a mixture-of-experts multimodal model, which is not covered in this PR but will be in a follow-up one.
  • Molmo-7B-D, based on Qwen2 + CLIP.
  • Molmo-7B-O, based on a yet-to-be-released Olmo model, plus CLIP.
  • Molmo-70B, a scaled-up version.

The last three models share the same modeling code and are thus all covered by this PR.

Regarding the modular framework:

Choose a base model that's as close as possible to the one you're porting.

In my case, I'm using Llava as a reference. The differences I identify at a glance include the 2D pooling.
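
For context, a modular file is essentially a set of imports from the reference models plus subclasses that override only what changes; the converter then expands it into a flat modeling file. Below is a minimal sketch of that layout (class bodies elided; the exact imports in this PR may differ):

# modular_molmo.py, sketched: subclass the reference implementations and override
# only what differs; the converter generates the flat modeling_molmo.py from this.
from ..clip.modeling_clip import CLIPVisionEmbeddings
from ..llava.modeling_llava import LlavaMultiModalProjector


class MolmoMultiModalProjector(LlavaMultiModalProjector):
    ...  # fully redefined below (gated projection)


class MolmoVisionEmbeddings(CLIPVisionEmbeddings):
    ...  # only the position embedding changes (see below)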

Figure out the differences.

Some differences amount to a complete modification of the original module; in that case, the whole module has to be redefined.

class MolmoMultiModalProjector(LlavaMultiModalProjector):
    def __init__(self, config: MolmoConfig):
        super().__init__()
        self.linear_1 = nn.Linear(
            config.vision_config.hidden_size,
            config.text_config.intermediate_size // 2,
            bias=False,
        )
        self.linear_2 = nn.Linear(
            config.text_config.intermediate_size // 2,
            config.text_config.hidden_size,
            bias=False,
        )
        self.linear_3 = nn.Linear(
            config.vision_config.hidden_size,
            config.text_config.intermediate_size // 2,
            bias=False,
        )

    def forward(self, image_features):
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        intermediate_states = self.linear_3(image_features)
        # gated (SwiGLU-style) combination of the two branches before the output projection
        hidden_states = self.linear_2(hidden_states * intermediate_states)
        return hidden_states

Some differences will be very tiny: some layers are identical but initialized from a different configuration key. The position embeddings, for instance, differ only slightly.

class MolmoVisionEmbeddings(CLIPVisionEmbeddings):
    def __init__(self, config):
        super().__init__(config)
        # same layer as CLIP, but sized from a Molmo-specific configuration key
        self.position_embedding = nn.Embedding(config.num_image_positions, config.hidden_size)

Preserve inheritance across model component renames.

For instance, the code above will trigger:

python utils/modular_model_converter.py --files_to_parse src/transformers/models/molmo/modular_molmo.py  --old_model_name="Llava" --new_model_name="Molmo"

> ValueError: Unable to find dependencies for CLIPVisionEmbeddings in transformers.models.clip.modeling_clip. Here are the dependencies found: {'molmo_loss': {'contrastive_loss'}, 'MOLMOVisionModelOutput': {'ModelOutput'}, 'MOLMOTextModelOutput': {'ModelOutput'}, 'MOLMOOutput': {'ModelOutput'}, 'MOLMOVisionEmbeddings': {'nn.Module'},

This happens because the supported pattern currently searches for a caps-based model name. Even so, using modular is very promising and makes for a much smaller modeling file to review.
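
To illustrate the caps-based renaming issue (a hypothetical helper, not the converter's actual code): uppercasing the new model name when rewriting class names produces an all-caps prefix that no longer matches the expected CamelCase name, so the dependency lookup fails.

# Hypothetical illustration of the renaming issue, not the converter's real logic:
# replacing the old prefix with the new name uppercased yields MOLMO... instead
# of Molmo..., so the generated name no longer matches what the lookup expects.
import re

def rename_class(name: str, old_model: str, new_model: str) -> str:
    return re.sub(old_model.upper(), new_model.upper(), name)

print(rename_class("CLIPVisionEmbeddings", "CLIP", "Molmo"))  # MOLMOVisionEmbeddings
# The converter would need "MolmoVisionEmbeddings" here to resolve the dependency.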

I'll write down hurdles encountered here for future reference so that adding multimodal models to transformers ends up being a breeze.

@ArthurZucker (Collaborator) left a comment:

Wow looks super nice! Will finish #33859 asap to let you continue!

@molbap (Contributor, Author) commented Oct 8, 2024

Still seeing some duplicate imports in the modeling code:

from ...modeling_outputs import (
    BaseModelOutputWithPast,
    CausalLMOutputWithPast,
)
from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS
from ...modeling_utils import PreTrainedModel
from ...utils import (
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    is_flash_attn_2_available,
    is_flash_attn_greater_or_equal_2_10,
    logging,
    replace_return_docstrings,
)
from .configuration_molmo import MolmoConfig


if is_flash_attn_2_available():
    from ...modeling_flash_attention_utils import _flash_attention_forward


from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ModelOutput
from ...utils import (
    ModelOutput,
    is_flash_attn_2_available,
    torch_int,
)
from .configuration_molmo import MOLMOConfig, MOLMOVisionConfig

One quick-and-dirty solution would be to do a pass over the imports once the modular converter has finished, so that imports from the various source modules get merged and normalized (a rough sketch of such a pass follows the snippet below). Strangely, some wrongly capitalized model names also remain, like MOLMOEncoder where we should get MolmoEncoder:

class MolmoVisionTransformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        embed_dim = config.hidden_size
        self.embeddings = MolmoVisionEmbeddings(config)
        self.pre_layrnorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
        self.encoder = MOLMOEncoder(config)  #  wut 
        self.post_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps, bias=True)
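
As a rough sketch (illustrative only, not part of this PR), the import-merging pass mentioned above could simply group the generated `from X import ...` statements per module and drop duplicates:

# Illustrative sketch of an import-merging pass over a generated modeling file
# (not part of this PR): group every `from X import a, b` by module and emit
# each module once with the union of imported names.
import ast
from collections import defaultdict

def merge_from_imports(source: str) -> dict[str, set[str]]:
    merged: dict[str, set[str]] = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            module = "." * node.level + (node.module or "")
            merged[module].update(alias.name for alias in node.names)
    return merged

generated = (
    "from ...utils import logging, is_flash_attn_2_available\n"
    "from ...utils import ModelOutput, is_flash_attn_2_available\n"
)
for module, names in merge_from_imports(generated).items():
    print(f"from {module} import {', '.join(sorted(names))}")
# -> from ...utils import ModelOutput, is_flash_attn_2_available, logging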

Getting there, however!

@ArthurZucker (Collaborator):

Do you need a review? 🤗

@d-rau commented Oct 15, 2024

Maybe a bit premature, but when using the script to convert the model to HF I got mismatch issues here:

q_proj, k_proj, v_proj = torch.split(fused_qkv, fused_dims, 0)
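
For context, one common source of such mismatches is that the split sizes must match how the checkpoint fused Q, K, and V, which differ when the text backbone uses grouped-query attention. A minimal sketch with made-up, Qwen2-7B-like shapes (not the conversion script itself):

# Illustration only (hypothetical shapes, not the conversion script): with
# grouped-query attention, K and V are smaller than Q, so the split sizes must
# reflect the actual head layout or torch.split raises a size mismatch.
import torch

hidden_size, head_dim, num_heads, num_kv_heads = 3584, 128, 28, 4  # Qwen2-7B-like values
fused_qkv = torch.randn((num_heads + 2 * num_kv_heads) * head_dim, hidden_size)
fused_dims = (num_heads * head_dim, num_kv_heads * head_dim, num_kv_heads * head_dim)
q_proj, k_proj, v_proj = torch.split(fused_qkv, fused_dims, 0)
print(q_proj.shape, k_proj.shape, v_proj.shape)
# torch.Size([3584, 3584]) torch.Size([512, 3584]) torch.Size([512, 3584])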

@htahboub (Contributor):

Hi @molbap, was just wondering if you had an ETA on this? Great work here by the way!

@qubvel qubvel removed request for muellerzr and qubvel May 5, 2025 18:32
@molbap molbap requested a review from ArthurZucker May 12, 2025 16:39
@molbap molbap requested a review from zucchini-nlp June 17, 2025 08:38
@zucchini-nlp (Member) left a comment:

Thanks a lot for working on it! Left a few tiny comments here and there, overall looks good to me

Comment on lines +1344 to +1349

valid_positions = image_token_indices_flat >= 0
valid_indices = image_token_indices_flat[valid_positions].long()
valid_features = image_features_flat[valid_positions.to(image_features_flat.device)]
valid_batch_indices = valid_batch_indices_expanded[
    valid_positions.to(valid_batch_indices_expanded.device)
].long()
@zucchini-nlp (Member):

I don't remember anymore why we needed this hehe. Is it possible for us to hide it somewhere in processing so that the model does simple embeds.masked_scatter(ids == image_id, image_features)?
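
For reference, the simpler pattern being suggested looks roughly like this (illustrative shapes and names; the token id and feature counts are made up):

# Rough sketch of the suggested masked_scatter pattern: place precomputed image
# features at the positions of the image placeholder token, instead of tracking
# valid indices manually.
import torch

image_token_id = 99
input_ids = torch.tensor([[1, 99, 99, 99, 2, 3]])  # (batch, seq_len)
inputs_embeds = torch.zeros(1, 6, 4)               # (batch, seq_len, hidden)
image_features = torch.randn(3, 4)                 # one row per placeholder token

special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)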

@molbap (Contributor, Author):

haha, I don't remember either 😆 I'm sure it's doable yes!
