What does this PR do?
As mentioned in issue #33710, this is a draft to add support for Molmo natively in `transformers`. It also uses the new modular framework introduced in #33248. Molmo has several existing variants; the last three models share the same modeling code, and thus will be covered by this PR.
Regarding the modular framework:

- Choose a base model that's as close as possible to the one you're porting. In my case, I'm using Llava as a reference. The differences I identify at a glance include the 2D pooling.
- Figure out the differences:
  - Some differences are a complete modification of the original module; in that case, the whole module has to be redefined.
  - Some differences are very tiny. For instance, some layers might be the same but initialized with a different configuration key, and the position embeddings are slightly different.
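As a minimal sketch of what such a modular definition might look like (the class name, config keys, and the 2x2 concatenate-then-project pooling below are hypothetical illustrations, not the actual PR code):

```python
# modular_molmo.py -- hypothetical sketch; the modular converter expands
# inherited Llava components into a full modeling file, while fully
# redefined modules (like the pooling projector below) are kept as-is.
import torch
import torch.nn as nn


class MolmoMultiModalProjector(nn.Module):
    """Unlike Llava's projector, this one pools 2x2 windows of image
    patches before projecting them into the text embedding space."""

    def __init__(self, config):
        super().__init__()
        # hypothetical config keys, for illustration only
        self.linear = nn.Linear(config.vision_hidden_size * 4,
                                config.text_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, seq, hidden) with seq = side * side patches
        batch, seq, hidden = image_features.shape
        side = int(seq ** 0.5)
        # group the patch grid into 2x2 windows and concatenate each window
        x = image_features.view(batch, side // 2, 2, side // 2, 2, hidden)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            batch, (side // 2) ** 2, 4 * hidden)
        return self.linear(x)


# quick shape check with a toy config
class _Cfg:
    vision_hidden_size = 8
    text_hidden_size = 16

proj = MolmoMultiModalProjector(_Cfg())
out = proj(torch.randn(1, 16, 8))  # 4x4 patch grid -> 4 pooled windows
print(out.shape)  # torch.Size([1, 4, 16])
```

In a modular file, a fully-redefined module like this sits next to plain `pass`-body classes that inherit the unchanged Llava components.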
- Preserve inheritance across renames of model components. For instance, the code above will trigger an error, because the supported pattern currently searches for a caps-based model name.

However, using modular is very promising and makes for a much smaller modeling file to review. I'll write down the hurdles encountered here for future reference, so that adding multimodal models to `transformers` ends up being a breeze.
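To make the renaming limitation concrete, here is a simplified sketch (not the actual converter code) of what a caps-based substitution amounts to: only the capitalized form of the model name is matched, so identifiers using another casing keep the old name:

```python
def rename_model_components(source: str, old: str, new: str) -> str:
    """Simplified sketch of the converter's renaming step: a plain,
    case-sensitive substitution of the CamelCase model name."""
    return source.replace(old, new)


code = "class LlavaVisionAttention(nn.Module): ...\nLLAVA_INPUTS_DOCSTRING = r''"
renamed = rename_model_components(code, "Llava", "Molmo")
# 'LlavaVisionAttention' becomes 'MolmoVisionAttention', but the
# all-caps 'LLAVA_INPUTS_DOCSTRING' is left untouched
```

This is why a component whose name doesn't follow the expected CamelCase pattern breaks the inheritance chain during conversion.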