Major VLM tracker (standardize the API) #33948

Open
zucchini-nlp opened this issue Oct 4, 2024 · 1 comment
Assignees: zucchini-nlp
Labels: Discussion, Vision, WIP

Comments

@zucchini-nlp (Member)
Feature request

This issue will track the general plans for VLMs and composite models so that we can align with the work in TGI and other libraries. I already have some trackers, so in this one I'll lay out the bigger picture with links to the respective discussions/topics.

Motivation

We already have pretty good working standards for language models: when adding a new model, a few "copy from" statements will usually do the work, and our test suite covers most LM cases. But for the wave of multimodal models we still lack any form of standardization or a uniform API. Each new model added to the library introduces something new that we are forced to accept as is until we figure out how to handle it later.

So we need to standardize those models, starting with VLMs. VLMs are the most commonly added models right now, but we may see more audio+text or fully multimodal ones in the future. For now we start by working on VLMs and see how things fit into the general API.

Your contribution

The major changes we are working on and planning to work on are:

  • Standardization for Processors:

    • We have ongoing work on uniform processor kwargs, which will let us enable pipelines for VLMs and thus have the correct automodel tag on the Hub. The work is in progress by @yonigozlan and @molbap
    • In parallel, I will work on separating out video models under a new class (VideoProcessor) and handling the deprecation cycle for the processing config files. At the end we should have a separate file/separate class for video processing that saves its params in its own config file. That is tracked in Video Processor as a separate class #33504, with discussions with Amy in the issue linked there
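To illustrate the direction, a standalone video processor could persist its parameters in its own config file, separate from the image processor's. The sketch below is a toy, framework-free version of that idea: the class name mirrors the proposed VideoProcessor, but the file name, parameters, and method bodies are illustrative assumptions, not the final API.

```python
import json
import os

# Toy sketch of a standalone VideoProcessor keeping its parameters in a
# separate config file (names and parameters here are hypothetical).
class VideoProcessor:
    config_name = "video_preprocessor_config.json"

    def __init__(self, num_frames=8, size=224):
        self.num_frames = num_frames
        self.size = size

    def save_pretrained(self, save_dir):
        # video params go to their own file, not preprocessor_config.json
        with open(os.path.join(save_dir, self.config_name), "w") as f:
            json.dump({"num_frames": self.num_frames, "size": self.size}, f)

    @classmethod
    def from_pretrained(cls, save_dir):
        with open(os.path.join(save_dir, cls.config_name)) as f:
            return cls(**json.load(f))
```

Keeping the video params in a dedicated file is what allows the deprecation cycle to move them out of the shared processing config without breaking existing image processors.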
  • Standardization in terms of modeling code:

    • One major task was to get rid of the buggy merge_embeds method and cover VLMs with more generation-related tests, as we were getting many issues after even a small change. Slow tests unfortunately don't cover everything and are not run every time a PR is merged. That is tracked in Track progress for VLMs refactoring #33374
    • Another major topic is setting the attention implementation for composite models (not only VLMs), which will fix the red CI and add uniformity to how we work with composite models in general. After that PR, each composite model should be required to have a separate PretrainedConfig for each model backbone in its architecture, and each sub-config should be part of one main ModelConfig, which may hold attributes specific to the composite model only (not its sub-backbones). See Attn implementation for composite models #32238
    • Separate out the get_image_features method for all VLMs so we get more modularity and probably much cleaner code. This was proposed by a community contributor, and I'll handle propagating the change to all models. See Refactor image features selection in LlaVa #33696
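The get_image_features separation can be sketched with a framework-free toy model (all class and attribute names below are illustrative stand-ins, not the library's actual modules): the point is that feature extraction lives in its own method that forward() calls, so per-model feature selection can be overridden without touching the merge logic.

```python
# Toy, framework-free sketch of factoring image-feature extraction into a
# dedicated get_image_features method (all names illustrative).
class ToyVLM:
    def __init__(self):
        self.vision_scale = 2.0    # stand-in for the vision backbone
        self.projector_bias = 1.0  # stand-in for the multimodal projector

    def get_image_features(self, pixel_values):
        # Isolated so subclasses can override feature selection cleanly,
        # without duplicating the rest of forward().
        features = [p * self.vision_scale for p in pixel_values]
        return [f + self.projector_bias for f in features]

    def forward(self, pixel_values):
        image_features = self.get_image_features(pixel_values)
        # ...merging with text embeddings would happen here...
        return image_features
```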
  • Standardization for chat templates:

    • We can support (tokenize=True, return_tensors="pt") kwargs in the processor's apply_chat_template so that the method returns already-vectorized outputs. As with tokenizers, the main point is to feed in a chat history and get tensor inputs ready for generation/training. The only difference is that users will have to explicitly add an image file/URL or ImageInput so we can process it internally and turn it into pixel_values. Below is the general design. No work has started yet; I am planning to make a PR some time in October
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"url": "https://...."}},
            {"type": "text", "text": "What do you see here?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Stop sign [...]"},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"path": "my_image.png"}},
            {"type": "text", "text": "What color is the cat?"},
        ]
    },
]
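To make the proposed flow concrete, here is a toy mock of what a processor's apply_chat_template could do with such messages: image blocks become placeholder tokens that the processor would later expand into pixel_values, and tokenize=True returns id sequences instead of a string. The class, its template logic, and the toy vocabulary are all illustrative assumptions, not the planned implementation.

```python
# Toy mock of the proposed processor.apply_chat_template behavior.
# Everything here (class name, placeholder token, toy vocab) is illustrative.
class MockProcessor:
    image_token = "<image>"

    def apply_chat_template(self, messages, tokenize=False):
        parts = []
        for msg in messages:
            for block in msg["content"]:
                if block["type"] == "image":
                    # an image block turns into a placeholder token that the
                    # processor would later expand using pixel_values
                    parts.append(self.image_token)
                else:
                    parts.append(block["text"])
        prompt = " ".join(parts)
        if not tokenize:
            return prompt
        # stand-in for real tokenization: deterministic toy vocabulary ids
        vocab, ids = {}, []
        for tok in prompt.split():
            ids.append(vocab.setdefault(tok, len(vocab)))
        return {"input_ids": ids}
```

With tokenize=True (and, in the real API, return_tensors="pt"), the output would be ready to pass straight into generate() or a training step.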
  • Standardization for tokenizers:
    • We can have new special tokens added to the tokenizers when they are loaded from a VLM model repo. Currently I plan to add at least 3 new special tokens (image, boi and eoi), but given the wave of new models I might expand that list. I had a PR previously, but that was a very basic design (Make special image tokens attribute of tokenizer #31967). I am currently working on making SpecialTokensMixin more flexible so that we can simply change the class attribute SPECIAL_TOKENS_ATTRIBUTES and everything else will work out of the box. That seems to me the easiest way to expand special tokens for multimodal cases without flooding plain language model tokenizers.
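The "change one class attribute and everything else works" idea can be shown with a toy mixin. The SPECIAL_TOKENS_ATTRIBUTES name mirrors the attribute on transformers' SpecialTokensMixin, but the rest of this sketch (method bodies, token names) is a simplified illustration of the pattern, not the library's code.

```python
# Toy sketch: multimodal special tokens enabled by extending one class
# attribute on a SpecialTokensMixin-style base (implementation illustrative).
class ToySpecialTokensMixin:
    SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token"]

    def __init__(self, **tokens):
        # every listed attribute is settable via kwargs, defaulting to None
        for name in self.SPECIAL_TOKENS_ATTRIBUTES:
            setattr(self, name, tokens.get(name))

    @property
    def all_special_tokens(self):
        return [t for n in self.SPECIAL_TOKENS_ATTRIBUTES
                if (t := getattr(self, n)) is not None]

class ToyMultimodalTokenizer(ToySpecialTokensMixin):
    # only the class attribute changes; __init__ and all_special_tokens
    # pick up the new tokens with no further code
    SPECIAL_TOKENS_ATTRIBUTES = ToySpecialTokensMixin.SPECIAL_TOKENS_ATTRIBUTES + [
        "image_token", "boi_token", "eoi_token"
    ]

tok = ToyMultimodalTokenizer(bos_token="<s>", image_token="<image>",
                             boi_token="<boi>", eoi_token="<eoi>")
```

Plain language model tokenizers keep the short attribute list and never see the multimodal tokens, which is the point of doing this at the class-attribute level.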
@zucchini-nlp added the WIP, Vision, and Discussion labels and removed the Feature request label on Oct 4, 2024
@zucchini-nlp (Member, Author)

cc @ArthurZucker, here is the general plan I have. Let me know if something is missing or not very clear 😄

Feedback/ideas are welcome :D

@zucchini-nlp zucchini-nlp changed the title Major VLM tracker Major VLM tracker (standardize the API) Oct 4, 2024
@zucchini-nlp zucchini-nlp self-assigned this Oct 4, 2024