mlxcel has three distributed or multi-device surfaces. They share code under
`src/distributed/`, but their maturity differs by mode and model family.
| Mode | Purpose | Maturity |
|---|---|---|
| Tensor parallelism (TP) | Shard tensor operations across in-process ranks. | Implemented for selected dense text families; validate per model. |
| Pipeline parallelism (PP) | Split layer ranges across stages. | Best validated on Llama-family text models and two-stage topologies. |
| Disaggregated inference (DI) | Split prefill/decode roles while each node holds the model. | Infrastructure exists; treat as experimental unless validated for your topology. |
```text
Model fits on one device?
├── yes
│   ├── latency-sensitive single-user serving → single device
│   └── many concurrent users → consider DI after validation
└── no
    ├── high-bandwidth local devices → TP or PP
    └── multi-host / uneven memory → PP with explicit layer ranges
```
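The decision tree above can also be written as a small lookup, which is handy in a deployment script. This is a minimal sketch: `suggest_mode` and its argument labels are hypothetical names chosen for illustration and are not part of mlxcel.

```shell
#!/bin/sh
# Hypothetical helper: encodes the decision tree as a lookup.
# The argument labels are illustrative, not mlxcel options.
#   $1: "yes" | "no"   (does the model fit on one device?)
#   $2: "single-user" | "many-users" | "local" | "multi-host"
suggest_mode() {
  case "$1:$2" in
    yes:single-user) echo "single device" ;;
    yes:many-users)  echo "DI (after validation)" ;;
    no:local)        echo "TP or PP" ;;
    no:multi-host)   echo "PP with explicit layer ranges" ;;
    *)               echo "unknown"; return 1 ;;
  esac
}

suggest_mode no multi-host
```

The function only restates the tree; it does not replace validating the chosen mode on your own hardware.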
TP shards weights inside transformer layers and synchronizes row-parallel outputs. The public knobs are:
```bash
mlxcel generate -m models/<checkpoint> \
  --tp-size 2 \
  -p "Hello" -n 100

mlxcel-server -m models/<checkpoint> \
  --tp-size 2 \
  --port 8080
```

Related options include `--tp-moe-mode`, `--tp-embedding-mode`, and
`--tp-lm-head-mode`. The current runtime requires replicated embedding and LM
head modes for many families.
The help text in `src/main.rs` and `src/bin/mlx_server.rs` is the source of
truth for the currently advertised TP family list. At the time of this docs
pass, it includes dense Llama, Qwen 2/2.5/3/3.5 text, Gemma 3/4 text,
ERNIE 4.5, and Hunyuan v1 Dense, with additional implementation pieces for
other families.
Limitations:
- The model must shard cleanly across the selected rank count.
- Some server batching and VLM paths are intentionally conservative under TP.
- Benchmark and correctness validation should be repeated for every model family and rank count you intend to run.
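A quick way to catch the first limitation before launching is to check that the dimensions you expect to be sharded divide evenly by the rank count. The sketch below is a hypothetical pre-flight check, not part of mlxcel. Which dimensions the runtime actually shards is an assumption here (attention heads, KV heads, and MLP width are used as examples), and the numbers are illustrative.

```shell
#!/bin/sh
# Hypothetical pre-flight check: do the given dimensions divide evenly
# by the TP rank count? The choice of dimensions is an assumption;
# consult the model config and the mlxcel help text for the real ones.
#   $1: rank count, $2...: dimension sizes
shards_cleanly() {
  ranks=$1; shift
  [ "$ranks" -gt 0 ] || return 2
  for dim in "$@"; do
    if [ $((dim % ranks)) -ne 0 ]; then
      echo "dimension $dim is not divisible by $ranks ranks" >&2
      return 1
    fi
  done
  return 0
}

# Illustrative values only: 32 attention heads, 8 KV heads, MLP width 14336.
if shards_cleanly 2 32 8 14336; then echo "ok for 2 ranks"; fi
```

Passing this check is necessary but not sufficient: the family still has to be supported by the TP implementation.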
PP splits the model by layer range. It is useful when a model exceeds a single device's memory or when hosts have uneven memory capacity.
```bash
mlxcel generate -m models/<checkpoint> \
  --pp-size 2 \
  --pp-micro-batch-size 4 \
  -p "Hello" -n 100
```

You can provide explicit layer ranges instead of relying on auto partitioning:
```bash
mlxcel generate -m models/<checkpoint> \
  --pp-layers 0-15,16-31 \
  --pp-micro-batch-size 4 \
  -p "Hello" -n 100
```

The server uses `--distributed-config` with a TOML cluster configuration. The
repository includes helper scripts and examples under `examples/distributed/`
and `scripts/benchmark_pipeline_remote_rollout.sh`; inspect those before
operating a real cluster.
A minimal shape looks like this:
```bash
# Stage process.
mlxcel-server -m models/<checkpoint> \
  --distributed-config examples/distributed/generated_pipeline_remote_2node_tcp.toml \
  --node-id stage-1 \
  --host 0.0.0.0 --port 18081 --no-warmup

# Coordinator / serving process.
mlxcel-server -m models/<checkpoint> \
  --distributed-config examples/distributed/generated_pipeline_remote_2node_tcp.toml \
  --node-id coordinator \
  --host 0.0.0.0 --port 18080 \
  --parallel 2 --max-batch-size 2 --pp-micro-batch-size 2
```

`--pp-auto N` can generate a zero-config pipeline plan and is mutually exclusive
with `--distributed-config`. For production, prefer checking in an explicit TOML
once the topology is known.
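Explicit layer ranges such as the `0-15,16-31` example are easy to mistype, so it can help to sanity-check a spec before starting the stages. The following is a minimal sketch of such a check. It assumes inclusive ranges that start at 0, are contiguous, and end at the last layer, which matches the example but is not taken from mlxcel's own parser. `check_pp_layers` is a hypothetical name, and the sketch does not guard against non-numeric input.

```shell
#!/bin/sh
# Hypothetical validator for a layer-range spec such as "0-15,16-31".
# Assumes inclusive, contiguous ranges covering layers 0..total-1.
#   $1: spec, $2: total layer count
check_pp_layers() {
  spec=$1; total=$2; expected=0
  old_ifs=$IFS; IFS=,
  set -- $spec            # split the spec on commas into positional params
  IFS=$old_ifs
  for range in "$@"; do
    start=${range%-*}; end=${range#*-}
    if [ "$start" -ne "$expected" ] || [ "$end" -lt "$start" ]; then
      echo "bad range: $range (expected start $expected)" >&2
      return 1
    fi
    expected=$((end + 1))
  done
  # Succeeds only if the last range ended on the final layer.
  [ "$expected" -eq "$total" ]
}

check_pp_layers "0-15,16-31" 32 && echo "ranges cover all 32 layers"
```

The total layer count has to come from the checkpoint's own configuration; 32 is used here only because it matches the example ranges.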
| Transport | Notes |
|---|---|
| TCP | Default IP transport. |
| Thunderbolt | macOS Thunderbolt Bridge selection on top of the shared TCP core. |
| RDMA | Backend exists with capability probing and fallback behavior; validate on the target OS/hardware before relying on acceleration. |
mDNS/static discovery options are available for zero-config startup. Static configuration is the safer choice across subnets or locked-down networks.
DI separates prefill and decode roles. Unlike PP, it does not reduce per-node model memory: each role still needs the model loaded. The intended use case is throughput tuning, not making an oversized model fit.
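The memory difference between PP and DI can be made concrete with rough arithmetic. This sketch assumes that weights dominate memory and that layers are equally sized; it ignores KV cache, activations, and embeddings. The function names and the 40 GB figure are illustrative, not measurements.

```shell
#!/bin/sh
# Back-of-the-envelope weight memory per node, in whole GB.
# Assumptions: equal-sized layers, weights only. Numbers are illustrative.
#   $1: total weight size in GB, $2: number of nodes
pp_weights_per_node() { echo $(( ($1 + $2 - 1) / $2 )); }  # ceil(total / nodes)
di_weights_per_node() { echo "$1"; }                       # full copy on every node

echo "PP, 2 nodes: $(pp_weights_per_node 40 2) GB per node"
echo "DI, 2 nodes: $(di_weights_per_node 40 2) GB per node"
```

Under these assumptions a two-stage PP split roughly halves the per-node weight footprint, while DI leaves it unchanged.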
The code shares the same cluster config, registry, transport, and metrics infrastructure as PP. Treat it as a topology-specific feature: run a live test with your traffic shape before publishing performance claims.
Limitations that apply across all modes:
- Distributed support is not uniform across model families.
- VLM partitioning is partial; text-only paths are better covered.
- Multi-host CI coverage is limited compared with single-host unit tests.
- Transport performance depends heavily on the physical interconnect and OS network configuration.
See supported models for the maintained support summary.