Major VLM tracker (standardize the API) #33948
Labels: Discussion, Vision, WIP, Feature request
This issue will track general plans for VLMs and composite models so that we can align with work in TGI and other libraries. I already have several trackers, so in this one I'll lay out a bigger picture with links to the respective discussions/topics.
Motivation
We already have pretty good working standards for language models: when adding a new model, a few "Copied from" statements usually do the work, and our test suite covers most LM cases. But for the wave of multimodal models we still lack any form of standardization or a uniform API. Each new model added to the library introduces something new, which forces us to accept it as is until we figure out how to handle it later.
So we need to standardize those models, starting with VLMs. VLMs are the most commonly added models right now, but we may have more audio+text or fully multimodal ones in the future. For now we start with VLMs and see how things fit into the general API.
Your contribution
The major changes we are working on, or planning to work on, are:
Standardization for Processors:
Standardization in terms of modeling code:

- Standardize the `merge_embeds` method and cover VLMs with more generation-related tests, as we were getting many issues after small changes. Slow tests unfortunately don't cover everything and are not run every time a PR is merged. That is being tracked in Track progress for VLMs refactoring #33374
- Add a `get_image_features` method for all VLMs so we can have more modularity and probably make the code much cleaner. This was proposed by one of the community contributors, and I'll handle propagating the change to all models. See Refactor image features selection in LlaVa #33696

Standardization for chat templates:

- Support `(tokenize=True, return_tensors="pt")` kwargs in the processor's `apply_chat_template`, so that the method returns already vectorized outputs. Similar to tokenizers, the main point is to feed in a chat history and get tensor inputs ready for generation/training. The only difference is that users will have to explicitly add an image file/url or `ImageInput` so we can process it internally and turn it into `pixel_values`. Below is the general design. No work has started yet; I am planning to make a PR some time in October.
- Make `SpecialTokensMixin` more flexible so that we can simply change the class attribute `SPECIAL_TOKENS_ATTRIBUTES` and everything else will work out of the box. This seems to me the easiest way to expand special tokens for multimodal cases without flooding simple language model tokenizers.
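To illustrate the `get_image_features` / `merge_embeds` split discussed above, here is a toy, framework-free sketch. All names, shapes, and values are illustrative assumptions; the real transformers code operates on torch tensors with model-specific vision towers and projectors.

```python
# Toy sketch of the image-feature path in a VLM forward pass.
# Everything here is illustrative, not the actual transformers code.

IMAGE_TOKEN = "<image>"

def get_image_features(pixel_values):
    # Stand-in for: vision tower -> feature selection -> multimodal projector.
    # Here each "pixel" just becomes one single-dimensional patch embedding.
    return [[float(p)] for p in pixel_values]

def merge_embeds(input_ids, text_embeds, image_features):
    # Replace each IMAGE_TOKEN placeholder's embedding with the
    # corresponding image patch embeddings, keeping text order intact.
    merged = []
    img_iter = iter(image_features)
    for tok, emb in zip(input_ids, text_embeds):
        if tok == IMAGE_TOKEN:
            merged.extend(next(img_iter))
        else:
            merged.append(emb)
    return merged

input_ids = ["Hello", IMAGE_TOKEN, "world"]
text_embeds = [[1.0], [0.0], [2.0]]            # [0.0] is the placeholder slot
image_features = [get_image_features([7, 8])]  # one image -> two patch embeddings

merged = merge_embeds(input_ids, text_embeds, image_features)
print(merged)  # [[1.0], [7.0], [8.0], [2.0]]
```

Factoring the vision path into a shared `get_image_features` means per-model code only has to customize feature selection, while the merge step stays uniform.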
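The proposed chat-template flow (chat history in, generation-ready inputs out) could look roughly like the mock below. This is a plain-Python stand-in for the intended API shape, not the actual processor implementation, and the exact message schema is an assumption.

```python
# Mock of the proposed processor.apply_chat_template(tokenize=True, ...) flow.
# A real processor would render a Jinja chat template, run the tokenizer and
# image processor, and return torch tensors; we fake all of that here.

def apply_chat_template(messages, tokenize=False, return_tensors=None):
    # return_tensors is ignored in this mock; the real method would use it
    # to pick the tensor framework (e.g. "pt" for torch).
    rendered = ""
    images = []
    for msg in messages:
        for part in msg["content"]:
            if part["type"] == "image":
                images.append(part["url"])  # would be loaded + preprocessed
                rendered += "<image>"
            else:
                rendered += part["text"]
        rendered += "\n"
    if not tokenize:
        return rendered
    return {
        "input_ids": rendered.split(),  # whitespace "tokenizer" stand-in
        "pixel_values": images,         # real code: processed image tensors
    }

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "http://example.com/cat.png"},
        {"type": "text", "text": " What is in this image?"},
    ]},
]

out = apply_chat_template(messages, tokenize=True, return_tensors="pt")
print(out["input_ids"])
```

The key point of the design is that the user passes the image reference explicitly in the chat history, and the processor turns it into `pixel_values` internally instead of requiring a separate preprocessing call.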
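The `SPECIAL_TOKENS_ATTRIBUTES` idea can be sketched with a toy mixin: a multimodal subclass only extends one class attribute and everything driven by it keeps working. The class and attribute names below mirror the real ones, but the implementation is a simplified assumption.

```python
# Toy version of special-token handling driven by a single class attribute.
# Simplified sketch; the real transformers SpecialTokensMixin does much more.

class SpecialTokensMixinToy:
    SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token"]

    def __init__(self, **kwargs):
        # Every listed attribute becomes a settable special token.
        for name in self.SPECIAL_TOKENS_ATTRIBUTES:
            setattr(self, name, kwargs.get(name))

    @property
    def all_special_tokens(self):
        return [
            getattr(self, name)
            for name in self.SPECIAL_TOKENS_ATTRIBUTES
            if getattr(self, name) is not None
        ]

class MultimodalTokenizerToy(SpecialTokensMixinToy):
    # Extending the class attribute is all a multimodal tokenizer needs;
    # plain language-model tokenizers stay untouched.
    SPECIAL_TOKENS_ATTRIBUTES = (
        SpecialTokensMixinToy.SPECIAL_TOKENS_ATTRIBUTES + ["image_token"]
    )

tok = MultimodalTokenizerToy(bos_token="<s>", eos_token="</s>", image_token="<image>")
print(tok.all_special_tokens)  # ['<s>', '</s>', '<image>']
```

This keeps multimodal special tokens out of the base tokenizer API while letting subclasses opt in declaratively.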