Checklist
Motivation
Earlier this year, LLaDA released the first diffusion LLM (dLLM), immediately capturing significant attention from both the academic and industrial communities. But there was no production-ready dLLM serving engine.
We plan to implement the most performant, production-ready dLLM framework in SGLang and make dLLM serving robust!
Features
[DLLM] Add documentation for diffusion LLMs #14358
[DLLM] Add CI for diffusion LLMs #14723
For RL
VL-dLLM
More supported models
More Hardware
More Parallelism
Kernel Optimization for dLLM
dLLM inference makes the fused-MoE kernel a bottleneck, so we need to optimize the fused-MoE implementation for dLLM scenarios.
Optimize block-wise causal attention for dLLM prefill.
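For context, a minimal sketch of the block-wise causal attention pattern used in dLLM prefill: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is only an illustrative mask construction (function name and shapes are ours); the actual SGLang kernels fuse this logic rather than materializing a dense mask.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build a boolean attention mask for block-wise causal attention.

    mask[i, j] is True iff query token i may attend to key token j:
    - tokens in the same block attend to each other bidirectionally,
    - tokens attend causally to every earlier block,
    - tokens never attend to later blocks.
    """
    # Block index of each position, e.g. block_size=4 -> [0,0,0,0,1,1,...]
    blocks = np.arange(seq_len) // block_size
    # Allowed iff the key's block is not later than the query's block.
    return blocks[None, :] <= blocks[:, None]

mask = block_causal_mask(seq_len=8, block_size=4)
```

Unlike strict token-level causal masking, this keeps full bidirectional attention inside each diffusion block, which is what makes the prefill attention pattern different from standard autoregressive LLMs and worth a dedicated kernel.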
More disaggregation
PD (prefill-decode) disaggregation is not suitable for dLLM, but AFD (attention-FFN disaggregation) might be a viable option.
More Tests
Better streaming output
RFC
#12766
Related resources