feat: Add LLaDA Diffusion Model Support #14


Open
wants to merge 5 commits into base: main
Conversation

@cavit99 (Contributor) commented Mar 15, 2025

Overview

This PR introduces support for diffusion-based language models (e.g., LLaDA) in mlx_lm, extending the framework beyond autoregressive generation to cover diffusion paradigms for further research. The implementation adds a new generate_diffusion function in utils.py, updates the generation pipeline to handle both model types, and resolves issues that arose during integration, taking care not to affect the autoregressive path. The changes stay aligned with mlx-lm's principles while providing robust initial functionality for LLaDA and, potentially, other diffusion models going forward.

Key Changes

  1. New generate_diffusion Function in utils.py

    • Added a new function to support diffusion-based text generation, tailored for models like LLaDA.
    • Key features:
      • Progressive token unmasking with configurable steps (steps), generation length (gen_length), and block sizes (block_length).
      • Supports Gumbel noise sampling (noise_temp) for stochasticity and classifier-free guidance (cfg) for controlled generation.
      • Implements semi-autoregressive block-wise diffusion with configurable unmasking strategies (topk or random).
      • Yields intermediate progress updates in verbose mode or final text otherwise, integrated with the GenerationResponse dataclass.
    • Optimizations:
      • Uses mx.compile for the sampling step to leverage MLX’s performance benefits.
      • Efficiently handles batch size of 1 (current limitation) with plans for future expansion.
  2. Updated stream_generate in utils.py

    • Modified to detect diffusion models via model.args.model_type == "llada" and delegate to generate_diffusion.
    • Preserves existing autoregressive paths (generate_step and speculative_generate_step) for non-diffusion models.
    • Added argument filtering to route diffusion-specific kwargs (e.g., steps, gen_length) to generate_diffusion and autoregressive kwargs (e.g., max_tokens, sampler) to their respective functions.
    • Ensures consistent streaming behavior across model types using the GenerationResponse interface.
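The kwarg routing might look roughly like the following. The key names mirror the PR's CLI flags, but the helper function and set-based split are hypothetical, not the PR's actual implementation.

```python
# Illustrative sketch: diffusion-specific keys are routed to
# generate_diffusion, everything else to the autoregressive path.
DIFFUSION_KEYS = {"steps", "gen_length", "noise_temp", "cfg", "block_length", "unmasking"}

def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    diffusion = {k: v for k, v in kwargs.items() if k in DIFFUSION_KEYS}
    autoregressive = {k: v for k, v in kwargs.items() if k not in DIFFUSION_KEYS}
    return diffusion, autoregressive

d, a = split_kwargs({"steps": 32, "gen_length": 64, "max_tokens": 256})
print(d)  # {'steps': 32, 'gen_length': 64}
print(a)  # {'max_tokens': 256}
```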
  3. Extended CLI in generate.py

    • Added diffusion-specific arguments with sensible defaults:
      • --steps (default: 32): Number of diffusion steps.
      • --gen-length (default: 64): Length of the generated sequence.
      • --noise-temp (default: 0.0): Temperature for Gumbel noise sampling.
      • --cfg (default: 1.0): Classifier-free guidance scale.
      • --block-length (default: None): Size of semi-autoregressive blocks.
      • --unmasking (default: "topk"): Strategy for unmasking tokens (topk or random).
    • Integrated these into the generate call, ensuring seamless invocation of diffusion generation when using an LLaDA model.
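The flags above map onto standard argparse additions. The sketch below is illustrative parser wiring with the defaults taken from the list; it is not a copy of generate.py.

```python
import argparse

# Sketch of the diffusion-specific CLI additions; flag names and defaults
# mirror the PR description, the surrounding parser setup is assumed.
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=32, help="Number of diffusion steps.")
parser.add_argument("--gen-length", type=int, default=64, help="Length of the generated sequence.")
parser.add_argument("--noise-temp", type=float, default=0.0, help="Temperature for Gumbel noise sampling.")
parser.add_argument("--cfg", type=float, default=1.0, help="Classifier-free guidance scale.")
parser.add_argument("--block-length", type=int, default=None, help="Size of semi-autoregressive blocks.")
parser.add_argument("--unmasking", choices=["topk", "random"], default="topk", help="Unmasking strategy.")

args = parser.parse_args(["--steps", "16", "--cfg", "2.0"])
print(args.steps, args.cfg, args.unmasking)  # 16 2.0 topk
```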

Implementation Details

  • Model Detection: Relies on model.args.model_type == "llada" to switch to diffusion mode, ensuring compatibility with custom LLaDA implementations (e.g., the repo's llada.py).
  • Performance: Leverages MLX's fast operations where possible (including mx.compile for the sampling step).
  • Backward Compatibility: Autoregressive generation paths (including speculative decoding) remain unchanged, with diffusion-specific logic isolated to new code paths.
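A minimal sketch of the model-type check, assuming model.args.model_type is the field consulted; the Model/Args stubs and the pick_generator helper are stand-ins for illustration, not mlx_lm's actual classes.

```python
# Hypothetical dispatch: a "llada" model_type selects the diffusion path,
# anything else falls through to the unchanged autoregressive path.
class Args:
    def __init__(self, model_type: str):
        self.model_type = model_type

class Model:
    def __init__(self, model_type: str):
        self.args = Args(model_type)

def pick_generator(model) -> str:
    if getattr(model.args, "model_type", None) == "llada":
        return "generate_diffusion"
    return "generate_step"

print(pick_generator(Model("llada")))  # generate_diffusion
print(pick_generator(Model("llama")))  # generate_step
```

Because the check is an exact string match, every existing model type falls through to generate_step, which is how the autoregressive paths stay untouched.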

Testing

  • Weights Conversion
    • Works as normal, including quantization.
    • Example:
      mlx_lm.convert --hf-path <llada-hf-repo> --mlx-path ./llada-mlx --quantize --q-bits 4
  • Diffusion:
    • Command:
      mlx_lm.generate \
          --model mlx-community/LLaDA-8B-Instruct-mlx-fp16 \
          --prompt "Tell me about Leonardo da Vinci." \
          --gen-length 32 \
          --steps 32 \
          --noise-temp 0.3 \
          --cfg 1.0 \
          --verbose true
    • Recommended: use the pre-converted weights mlx-community/LLaDA-8B-Instruct-mlx-fp16, or the quantized variants also on mlx-community.
    • Output:
      • Verbose mode shows block-wise progress (e.g., "Block 1/1 | Step 32/32 | Unmasked 32/32") with intermediate text.
      • Non-verbose mode prints only the final generated text.

Impact

  • New Feature: Users can now generate text with the LLaDA diffusion model, broadening mlx_lm's applicability.
  • Preserved Behaviour: Autoregressive models are unaffected, with minimal intrusion into the existing codebase.

Known Limitations

  • Verbose output in diffusion mode uses ANSI escape codes (\033[2J\033[H), which may not render perfectly in all terminals.

Future Work

  • Optimise performance
  • New research on diffusion LLMs drops weekly; there is plenty of scope for further improvements

@Blaizzy (Contributor) commented Mar 15, 2025

Well done @cavit99, this looks awesome!
