Implement LlamaGen for Image Generation #33905

Open
ighoshsubho opened this issue Oct 3, 2024 · 4 comments
Labels
Feature request · New model · Vision

Comments

@ighoshsubho

Feature request

Add support for LlamaGen, an autoregressive image generation model, to the Transformers library. LlamaGen applies the next-token prediction paradigm of large language models to visual generation.

Paper: https://arxiv.org/abs/2406.06525
Code: https://github.com/FoundationVision/LlamaGen

Key components to implement:

  1. Image tokenizer
  2. Autoregressive image generation model (based on Llama architecture)
  3. Class-conditional and text-conditional image generation
  4. Classifier-free guidance for sampling (sketched below)
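
For context on item 4: classifier-free guidance in the autoregressive setting amounts to running two forward passes per step (conditional and unconditional) and mixing the logits with the standard CFG combination. A minimal sketch in PyTorch, assuming a Hugging Face-style causal LM over image-token ids; `model`, `cond_ids`, and `uncond_ids` are placeholders, not an actual API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cfg_sample_step(model, cond_ids, uncond_ids, cfg_scale=4.0, temperature=1.0, top_k=0):
    """One autoregressive sampling step with classifier-free guidance.

    `model` is any causal LM over image-token ids; `cond_ids` carries the
    class/text condition prefix, `uncond_ids` the null condition. Both are
    hypothetical stand-ins for however the final API exposes conditioning.
    """
    cond_logits = model(cond_ids).logits[:, -1, :]
    uncond_logits = model(uncond_ids).logits[:, -1, :]
    # Standard CFG: push the distribution away from the unconditional one.
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
    logits = logits / temperature
    if top_k > 0:
        # Mask everything below the k-th largest logit.
        kth = torch.topk(logits, top_k).values[:, -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

In practice the two passes can be batched together (conditional and unconditional sequences stacked along the batch dimension) so each step costs a single forward call.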

Motivation

LlamaGen demonstrates that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art image generation performance. Implementing it in Transformers would enable easier experimentation and integration with existing language models.

Your contribution

I can help by contributing this model, and can provide examples and detailed explanations of the model architecture and training process if needed.

@ighoshsubho ighoshsubho added the Feature request label Oct 3, 2024
@SOGeKING-NUL

This looks like an incredible feature, Shubho! Please allow me to work with you on this as my open-source contribution for Hacktoberfest.

@LysandreJik
Member

Thanks for the request! cc @qubvel, @molbap, what do you think?

@qubvel
Member

qubvel commented Oct 4, 2024

Very interesting! As far as I know, we don't have image-generation models in transformers yet, or am I missing something? So I'm wondering which is the better place for such a model: transformers or diffusers (though it's not a diffusion model).
cc @sayakpaul maybe

@zucchini-nlp
Member

Hey! Just saw this issue. I've been working on/reviewing some VLM models that can generate image or text from image+text. TBH we only have ImageGPT, a very old architecture for image generation, very similar to llama-gen iiuc. And two more PRs are open for VLMs with image generation: Chameleon's decoder VQ-VAE support, which went stale because the contributor got busy, and Emu3, which I can hopefully work on in the next weeks.

I like Llama-Gen and I think it can be a nice addition. From what I see, the model doesn't take an image as input, so no inpainting or other tasks, only generation from text. It shouldn't be hard to fit in the general model API. Do we need any controlled/structured generation, e.g. limiting the generated tokens to a specific subset and length? It would be super nice if that kind of control could be done with existing LogitsProcessors; adding new processors is going to add more maintenance burden for us.
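
For reference, the subset-and-length constraint described above looks expressible with the existing machinery. A minimal sketch, assuming the image-token ids are known up front; the processor class below is illustrative, not an existing transformers class (and iirc CFG itself may already be covered by the library's `ClassifierFreeGuidanceLogitsProcessor` added for MusicGen):

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class AllowedTokensLogitsProcessor(LogitsProcessor):
    """Masks every token outside `allowed_ids` so sampling stays inside the
    image-token sub-vocabulary. Hypothetical helper, not an existing
    transformers processor."""

    def __init__(self, allowed_ids, vocab_size):
        mask = torch.full((vocab_size,), float("-inf"))
        mask[allowed_ids] = 0.0
        self.mask = mask

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Broadcast the (vocab,) mask over the (batch, vocab) scores.
        return scores + self.mask.to(scores.device)

# A fixed output length (e.g. 16x16 = 256 image tokens for a 256x256 image at
# downsample factor 16) can then be enforced with existing generate() kwargs:
# model.generate(..., logits_processor=LogitsProcessorList([processor]),
#                min_new_tokens=256, max_new_tokens=256)
```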
