Reproduce VGT-AR Pretraining (InternVL3) #12

@DingShizhe

Description

I’m pretraining VGT InternVL3 0.6B (448px) using the official pretraining pipeline, but the generation quality is much lower than expected. Training appears stable and the loss decreases normally, but visual quality remains poor and improves only slightly over time.

I attach generated samples from 2k, 50k, and 100k iterations, as well as the loss curves.

  • samples from 2k iterations [image]
  • samples from 50k iterations [image]
  • samples from 100k iterations [image]
  • loss curves [image]

Training Setup

  • 100k iterations, 8 GPUs (DDP)
  • Global batch size: 256
  • LR: 3e-4 peak, cosine decay to 1e-4
  • Warmup: 1k iters
  • AdamW (0.9, 0.95), weight decay 0.05
  • EMA start 10k (momentum 0.0002)
  • REPA loss weight 0.5

Config: configs/pretrain/vgt_internvl3_0_6B_448px_pretrain.py
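For reference, the warmup/cosine schedule and EMA rule described above can be sketched in plain Python. This is a minimal sketch assuming linear warmup and a standard cosine decay; the constant names are illustrative, and the authoritative values live in `configs/pretrain/vgt_internvl3_0_6B_448px_pretrain.py`:

```python
import math

# Values taken from the setup above; names are illustrative.
PEAK_LR, FINAL_LR = 3e-4, 1e-4
WARMUP_ITERS, TOTAL_ITERS = 1_000, 100_000

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to the final LR."""
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))

EMA_START, EMA_MOMENTUM = 10_000, 0.0002

def ema_update(ema: float, current: float, step: int) -> float:
    """Per-parameter EMA, active only after EMA_START iterations."""
    if step < EMA_START:
        return current  # before the start iteration, EMA just tracks the raw weights
    return (1.0 - EMA_MOMENTUM) * ema + EMA_MOMENTUM * current
```

With these constants, `lr_at` hits 3e-4 exactly at iteration 1k and decays to 1e-4 at 100k; whether warmup is linear in the actual config is an assumption.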


Dataset

Mixed training data:

  • megalith10m
  • text2image2m
  • imagenet1k_t2i_qwenvl_flux
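To pin down what "mixed training data" means here, size-proportional sampling across the three sources could be sketched as below. The mixing ratios are an assumption (the config may use different weights), and the ImageNet-1k count is the standard train-set size, not taken from this issue:

```python
import random

# Hypothetical per-source sizes used as sampling weights.
SOURCES = {
    "megalith10m": 10_000_000,
    "text2image2m": 2_000_000,
    "imagenet1k_t2i_qwenvl_flux": 1_281_167,  # assumed ImageNet-1k train size
}

def sample_source(rng: random.Random) -> str:
    """Draw one source, with probability proportional to its (assumed) size."""
    names = list(SOURCES)
    weights = [SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Under this scheme megalith10m dominates (roughly three quarters of samples), which is worth confirming against the actual config if data balance is a suspect.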

Question

Is this level of generation quality expected for this setup, or is there anything important I should check or adjust?

Thanks — happy to share more details if needed.
