This document explains the end-to-end Ideogram 4 inference pipeline conceptually. For the architecture spec and code pointers, see model_architecture.md.
Ideogram 4 is a flow-matching text-to-image model built on a single-stream DiT (Diffusion Transformer). The pipeline has four main components:
┌─────────────┐ ┌──────────────────────┐ ┌──────────────┐ ┌───────────┐
│ Qwen3-VL │ │ Ideogram4 │ │ KL VAE │ │ │
│ Text ├──►│ Transformer (DiT) ├──►│ VAE ├──►│ Image │
│ Encoder │ │ + Euler Sampler │ │ Decoder │ │ │
└─────────────┘ └──────────────────────┘ └──────────────┘ └───────────┘
frozen trainable frozen
The text encoder is a frozen Qwen3-VL-8B-Instruct vision-language model, used in text-only mode (no vision inputs).
What it does:
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- Extracts hidden states from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a multi-scale text representation.
Why multi-layer extraction? Different layers capture different levels of abstraction — early layers encode surface-level token information, while later layers encode deeper semantic meaning. Concatenating them gives the DiT access to the full spectrum.
Output: A tensor of shape (batch, num_text_tokens, hidden_dim * 13).
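The extraction and concatenation steps above can be sketched in plain Python. This is a minimal illustration that uses toy nested lists rather than real Qwen3-VL outputs. The function name `concat_layer_features` and the toy sizes are hypothetical; only the layer indices come from the list above.

```python
# Layer indices listed in the text above (13 of the encoder's hidden states).
SELECTED_LAYERS = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35]


def concat_layer_features(hidden_states, layers=SELECTED_LAYERS):
    """Concatenate per-token features from the selected layers.

    hidden_states: list indexed by layer; each entry is a list of tokens,
    and each token is a list of hidden_dim floats (a toy stand-in for a tensor).
    Returns one list of tokens, each of length hidden_dim * len(layers).
    """
    num_tokens = len(hidden_states[layers[0]])
    output = []
    for token_index in range(num_tokens):
        features = []
        for layer in layers:
            features.extend(hidden_states[layer][token_index])
        output.append(features)
    return output


# Toy example: 36 layers, 5 tokens, hidden_dim = 4 (hypothetical sizes).
toy_hidden = [[[float(layer)] * 4 for _ in range(5)] for layer in range(36)]
toy_features = concat_layer_features(toy_hidden)
```

Each token in `toy_features` carries 4 × 13 = 52 values, matching the `hidden_dim * 13` feature width described above.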
The core generative model is a 34-layer single-stream Diffusion Transformer.
Text tokens and image latent tokens are concatenated into one sequence and processed through the same self-attention layers.
Sequence layout (per sample):
┌───────────────────┬────────────────────────┐
│ text tokens │ image latent tokens │
│ (up to 2048) │ (grid_h × grid_w) │
└───────────────────┴────────────────────────┘
▲ ▲
Qwen3-VL features noisy latents z_t
- Self-attention with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The positional encoding is 3-dimensional: for text tokens it uses a 1D position broadcast to 3 axes; for image tokens it uses (temporal, height, width) coordinates. This lets text and image tokens coexist in a unified positional space.
- SwiGLU MLP — the feed-forward layer uses a gated linear unit with SiLU activation.
- Adaptive Layer Norm (AdaLN): the timestep t is embedded as a scalar and generates per-block scale and gate parameters. This conditions every layer on the current noise level.
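To make the positional layout from the first bullet concrete, here is a small sketch of how 3-axis position IDs could be assigned. It rests on assumptions: the helper `build_position_ids` is hypothetical, and offsetting image coordinates by the number of text tokens is a guess made for illustration. This document does not state the real offset convention.

```python
def build_position_ids(num_text_tokens, grid_h, grid_w):
    """Return one (temporal, height, width) triple per token in the sequence.

    Assumption for illustration: image coordinates are offset by the number
    of text tokens so the two ranges do not overlap.
    """
    # Text tokens: the same 1D index is broadcast to all three axes.
    positions = [(i, i, i) for i in range(num_text_tokens)]
    offset = num_text_tokens
    # Image tokens: row-major grid; the temporal axis is constant for one image.
    for row in range(grid_h):
        for col in range(grid_w):
            positions.append((offset, offset + row, offset + col))
    return positions


position_ids = build_position_ids(num_text_tokens=3, grid_h=2, grid_w=2)
```

The first three triples have identical components (text), while the remaining four vary only along the height and width axes (image).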
The model is trained with a flow-matching objective. Instead of predicting
noise (as in DDPM), the model predicts a velocity field v(z_t, t) that
defines the ODE:
dz/dt = v(z_t, t)
At inference time, we start from pure Gaussian noise z_1 and integrate
backward to z_0 (the clean image) using the Euler method:
z_{t-dt} = z_t - v(z_t, t) * dt
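The backward integration can be tried on a toy problem. The sketch below is not the model's sampler: `toy_velocity` stands in for the trained DiT, and it assumes a straight-line path z_t = (1 - t) * x0 + t * noise, which this document does not state explicitly. Moving from time t to a smaller time s changes z by v * (s - t).

```python
def euler_sample(velocity, z_start, times):
    """Integrate dz/dt = velocity(z, t) along a decreasing list of times."""
    z = z_start
    for t, s in zip(times[:-1], times[1:]):
        # s < t, so (s - t) is negative and the update moves toward t = 0.
        z = z + velocity(z, t) * (s - t)
    return z


# Toy 1D problem. Assumption: the noisy sample follows the straight path
# z_t = (1 - t) * x0 + t * noise, whose velocity toward a single known
# target x0 is (z_t - x0) / t.
x0 = 2.0


def toy_velocity(z, t):
    return (z - x0) / t


num_steps = 8
times = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 down to 0.0
z_final = euler_sample(toy_velocity, z_start=-0.7, times=times)
```

Because the toy velocity field is exact for a single target, the integration lands on `x0` up to floating-point error. A learned velocity field only approximates this, which is why the number of steps matters in practice.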
The timestep distribution follows a logit-normal schedule parameterized by
(mu, sigma). The mean mu controls how much time the sampler spends at
different noise levels — higher mu shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
where base_pixels = 512 * 512.
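The resolution adjustment is simple enough to compute directly. In the sketch below, `adjusted_mu` follows the formula above, while `logit_normal_timesteps` is a hypothetical discretization: it assumes the timesteps are the sigmoid of evenly spaced normal quantiles. How the schedule is actually discretized is not specified here.

```python
import math
from statistics import NormalDist


def adjusted_mu(mu_base, height, width, base_pixels=512 * 512):
    """Resolution shift from the formula above."""
    return mu_base + 0.5 * math.log((height * width) / base_pixels)


def logit_normal_timesteps(num_steps, mu, sigma):
    """Hypothetical schedule: logit-normal quantiles, from high noise to low."""
    normal = NormalDist(mu=mu, sigma=sigma)
    timesteps = []
    for i in range(num_steps):
        u = 1.0 - (i + 0.5) / num_steps  # quantile level, descending
        logit = normal.inv_cdf(u)  # point in logit space
        timesteps.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid into (0, 1)
    return timesteps


mu_1024 = adjusted_mu(mu_base=0.0, height=1024, width=1024)
steps_512 = logit_normal_timesteps(8, mu=0.0, sigma=1.0)
steps_1024 = logit_normal_timesteps(8, mu=mu_1024, sigma=1.0)
```

Going from 512×512 to 1024×1024 quadruples the pixel count, so mu rises by 0.5 * log(4) = log(2). Under the assumed discretization, every timestep in `steps_1024` sits at a higher noise level than its counterpart in `steps_512`.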
At each sampling step, two forward passes are run through the DiT:
- Conditional (positive): full text features + noisy image latents.
- Unconditional (negative): zeroed text features + noisy image latents (image-only tokens, asymmetric CFG).
The guided velocity is a weighted combination:
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
where gw is the per-step guidance weight. With
gw > 1, the model amplifies the text-conditional signal and suppresses the
unconditional prediction, producing images that follow the prompt more
faithfully.
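A small numeric sketch of the combination, using plain lists in place of velocity tensors. The function name `cfg_combine` and the sample values are illustrative only.

```python
def cfg_combine(v_cond, v_uncond, gw):
    """Blend conditional and unconditional velocities element-wise."""
    return [gw * c + (1 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, -2.0, 0.5]
v_uncond = [0.5, -1.0, 0.5]

same_as_cond = cfg_combine(v_cond, v_uncond, gw=1.0)  # no guidance
amplified = cfg_combine(v_cond, v_uncond, gw=7.0)  # strong guidance
# Equivalent form: start at the unconditional prediction and move gw times
# the (conditional - unconditional) difference.
extrapolated = [u + 7.0 * (c - u) for c, u in zip(v_cond, v_uncond)]
```

With gw = 1 the result equals the conditional prediction. With gw = 7 each component is pushed seven times as far from the unconditional value as the conditional prediction sits. Components where the two predictions agree are left unchanged.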
Asymmetric CFG: The unconditional branch only processes image tokens (no text padding), making it computationally cheaper than a full-sequence negative pass.
Per-step schedules: The guidance weight can vary across steps. The
V4_QUALITY_48 preset, for example, uses gw=7 for the first 45 steps and
gw=3 for the final 3 "polish" steps near t=0.
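A per-step schedule of this shape can be represented as a plain list with one weight per step. The helper `two_phase_guidance` below is hypothetical and not part of any real API; only the numbers come from the preset description above.

```python
def two_phase_guidance(num_steps, main_gw, polish_gw, polish_steps):
    """Per-step guidance weights in sampling order (first step first)."""
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    return [main_gw] * (num_steps - polish_steps) + [polish_gw] * polish_steps


# Values quoted above for the V4_QUALITY_48 preset.
gw_schedule = two_phase_guidance(
    num_steps=48, main_gw=7.0, polish_gw=3.0, polish_steps=3
)
```

The list is in sampling order, so a loop that counts its step index in a different direction needs to index it accordingly.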
The denoised latent z_0 is decoded to pixel space using a frozen KL
autoencoder.
What it does:
- Unpatching: The DiT works with 2×2 patches of latent pixels. The decoder input is reshaped from (batch, grid_h * grid_w, channels * 4) to (batch, channels, grid_h * 2, grid_w * 2).
- Denormalization: Per-channel shift and scale are applied to undo the latent normalization used during training.
- Decoding: The VAE decoder maps latents to RGB pixels.
- Clipping: Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
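The unpatching and clipping steps can be illustrated for a single sample with nested lists. Two details here are assumptions, since the text does not specify them: the feature ordering inside a token, taken as (channel, patch_row, patch_col), and the exact rescale formula, taken as a linear map followed by rounding. Both helper names are hypothetical. `unpatch` applies to latents before the decoder, and `to_uint8` applies to decoded pixel values.

```python
def unpatch(tokens, grid_h, grid_w, channels):
    """Rearrange (grid_h * grid_w, channels * 4) into (channels, grid_h * 2, grid_w * 2).

    Assumption: within a token, features are ordered (channel, patch_row, patch_col).
    """
    out = [[[0.0] * (grid_w * 2) for _ in range(grid_h * 2)] for _ in range(channels)]
    for i in range(grid_h):
        for j in range(grid_w):
            token = tokens[i * grid_w + j]
            for c in range(channels):
                for di in range(2):
                    for dj in range(2):
                        out[c][2 * i + di][2 * j + dj] = token[c * 4 + di * 2 + dj]
    return out


def to_uint8(value):
    """Clamp to [-1, 1], then map linearly onto the integers 0..255 (assumed formula)."""
    clamped = max(-1.0, min(1.0, value))
    return int(round((clamped + 1.0) * 127.5))


# Toy input: a 1x2 grid with a single channel, so each token holds one 2x2 patch.
toy_tokens = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
toy_latent = unpatch(toy_tokens, grid_h=1, grid_w=2, channels=1)
```

In the toy result, the two patches sit side by side in a 2×4 latent plane, which is the spatial layout the VAE decoder expects.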
Compression factor: The autoencoder provides 8× spatial compression on each axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image is represented as a 64×64 grid of latent tokens, each with 128 channels (32 base channels × 2² patch).
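The arithmetic above generalizes to other resolutions. The helper below is a hypothetical sketch; its requirement that both sides be multiples of 16 is an assumption made so the division is exact, not a documented constraint.

```python
def latent_token_grid(height, width, vae_factor=8, patch=2, base_channels=32):
    """Return (grid_h, grid_w, token_channels) for an image of the given size."""
    stride = vae_factor * patch  # 16 pixels per token along each axis
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return height // stride, width // stride, base_channels * patch * patch


grid_1024 = latent_token_grid(1024, 1024)
grid_wide = latent_token_grid(512, 768)
```

For 1024×1024 this gives a 64×64 grid with 128 channels per token, which is 4096 image tokens in the DiT sequence.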
# Pseudocode for one generation call:
import torch
# 1. Encode text
text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)
# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1
# 3. Euler integration from t=1 to t=0
for step in reversed(range(num_steps)):
    t = schedule(step)      # current noise level
    s = schedule(step - 1)  # next (lower) noise level; assumed to be 0 when step == 0
    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)
    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(torch.zeros_like(text_features), z, t)
    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond
    # Euler step (s < t, so this moves toward the clean latent)
    z = z + v * (s - t)
# 4. Decode to pixels
image = vae.decode(z)