The max text sequence length for this model is 512. The image sequence length is usually between 1024 and 4096, depending on resolution. This means that a significant fraction of the processed tokens are text tokens, but due to Chroma's design, most of them are masked unless you have very long prompts.
Masked tokens can be removed:
seq_lengths = bool_attention_mask.sum(dim=1)
max_seq_length = seq_lengths.max().item()
text_encoder_output = text_encoder_output[:, :max_seq_length, :]
bool_attention_mask = bool_attention_mask[:, :max_seq_length]
(This has to be applied after the mask has been expanded; otherwise max_seq_length is off by 1.)
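For illustration, here is a self-contained sketch of the same trimming on dummy tensors. The helper name `trim_masked_text_tokens`, the batch shapes, and the hidden size are made up for the example and are not taken from the repository. It also assumes the unmasked tokens sit at the start of each sequence (right padding), so slicing to the longest unmasked length drops only masked positions.

```python
import torch


def trim_masked_text_tokens(text_encoder_output, bool_attention_mask):
    """Cut trailing masked positions that no sample in the batch uses.

    Hypothetical helper; assumes right-padded sequences.
    """
    # Number of unmasked tokens per sample
    seq_lengths = bool_attention_mask.sum(dim=1)
    # Longest unmasked length in the batch
    max_seq_length = int(seq_lengths.max().item())
    trimmed_output = text_encoder_output[:, :max_seq_length, :]
    trimmed_mask = bool_attention_mask[:, :max_seq_length]
    return trimmed_output, trimmed_mask


# Dummy batch: 2 prompts padded to 512 tokens, hidden size 8 (illustrative only)
text_encoder_output = torch.randn(2, 512, 8)
bool_attention_mask = torch.zeros(2, 512, dtype=torch.bool)
bool_attention_mask[0, :20] = True  # first prompt: 20 unmasked tokens
bool_attention_mask[1, :35] = True  # second prompt: 35 unmasked tokens

trimmed_output, trimmed_mask = trim_masked_text_tokens(
    text_encoder_output, bool_attention_mask
)
print(trimmed_output.shape)  # torch.Size([2, 35, 8])
print(trimmed_mask.shape)    # torch.Size([2, 35])
```

The trimmed length is set by the longest prompt in the batch, so batches that mix short and long prompts benefit less than batches of uniformly short prompts.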
Training and inference take about 25% less time at 512 px. The time saving is smaller at higher resolutions.
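A rough way to see why the saving shrinks with resolution is to compare the share of text tokens in the full sequence at the two image sequence lengths mentioned above. This is only a token-count estimate with a made-up helper name (`text_token_fraction`), not a runtime prediction, since attention cost does not scale linearly with sequence length.

```python
MAX_TEXT_TOKENS = 512  # max text sequence length stated above


def text_token_fraction(image_tokens, text_tokens=MAX_TEXT_TOKENS):
    """Share of the combined sequence taken up by (untrimmed) text tokens."""
    return text_tokens / (text_tokens + image_tokens)


# Image sequence lengths at the low and high end of the stated range
for image_tokens in (1024, 4096):
    fraction = text_token_fraction(image_tokens)
    print(f"{image_tokens} image tokens: {fraction:.1%} of the sequence is text")
# 1024 image tokens: 33.3% of the sequence is text
# 4096 image tokens: 11.1% of the sequence is text
```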