Skip to content

Latest commit

 

History

History
136 lines (103 loc) · 4.92 KB

File metadata and controls

136 lines (103 loc) · 4.92 KB

Distributed inference

mlxcel has three distributed or multi-device surfaces. They share code under src/distributed/, but their maturity differs by mode and model family.

Mode Purpose Maturity
Tensor parallelism (TP) Shard tensor operations across in-process ranks. Implemented for selected dense text families; validate per model.
Pipeline parallelism (PP) Split layer ranges across stages. Best validated on Llama-family text models and two-stage topologies.
Disaggregated inference (DI) Split prefill/decode roles while each node holds the model. Infrastructure exists; treat as experimental unless validated for your topology.

Choosing a mode

Model fits on one device?
├── yes
│   ├── latency-sensitive single-user serving → single device
│   └── many concurrent users                 → consider DI after validation
└── no
    ├── high-bandwidth local devices          → TP or PP
    └── multi-host / uneven memory            → PP with explicit layer ranges

Tensor parallelism

TP shards weights inside transformer layers and synchronizes row-parallel outputs. The public knobs are:

mlxcel generate -m models/<checkpoint> \
    --tp-size 2 \
    -p "Hello" -n 100

mlxcel-server -m models/<checkpoint> \
    --tp-size 2 \
    --port 8080

Related options include --tp-moe-mode, --tp-embedding-mode, and --tp-lm-head-mode. The current runtime requires replicated embedding and LM head modes for many families.

The help text in src/main.rs and src/bin/mlx_server.rs is the source of truth for the currently advertised TP family list. At the time of this docs pass, it includes dense Llama, Qwen 2/2.5/3/3.5 text, Gemma 3/4 text, ERNIE 4.5, and Hunyuan v1 Dense, with additional implementation pieces for other families.

Limitations:

  • The model must shard cleanly across the selected rank count.
  • Some server batching and VLM paths are intentionally conservative under TP.
  • Benchmark and correctness validation should be repeated for every model family and rank count you intend to run.

Pipeline parallelism

PP splits the model by layer range. It is useful when a model exceeds a single device's memory or when hosts have uneven memory capacity.

In-process CLI path

mlxcel generate -m models/<checkpoint> \
    --pp-size 2 \
    --pp-micro-batch-size 4 \
    -p "Hello" -n 100

You can provide explicit layer ranges instead of relying on auto partitioning:

mlxcel generate -m models/<checkpoint> \
    --pp-layers 0-15,16-31 \
    --pp-micro-batch-size 4 \
    -p "Hello" -n 100

Server / multi-host path

The server uses --distributed-config with a TOML cluster configuration. The repository includes helper scripts and examples under examples/distributed/ and scripts/benchmark_pipeline_remote_rollout.sh; inspect those before operating a real cluster.

A minimal shape looks like this:

# Stage process.
mlxcel-server -m models/<checkpoint> \
    --distributed-config examples/distributed/generated_pipeline_remote_2node_tcp.toml \
    --node-id stage-1 \
    --host 0.0.0.0 --port 18081 --no-warmup

# Coordinator / serving process.
mlxcel-server -m models/<checkpoint> \
    --distributed-config examples/distributed/generated_pipeline_remote_2node_tcp.toml \
    --node-id coordinator \
    --host 0.0.0.0 --port 18080 \
    --parallel 2 --max-batch-size 2 --pp-micro-batch-size 2

--pp-auto N can generate a zero-config pipeline plan and is mutually exclusive with --distributed-config. For production, prefer checking in an explicit TOML once the topology is known.

Transports

Transport Notes
TCP Default IP transport.
Thunderbolt macOS Thunderbolt Bridge selection on top of the shared TCP core.
RDMA Backend exists with capability probing and fallback behavior; validate on the target OS/hardware before relying on acceleration.

mDNS/static discovery options are available for zero-config startup. Static configuration is the safer choice across subnets or locked-down networks.

Disaggregated inference

DI separates prefill and decode roles. Unlike PP, it does not reduce per-node model memory: each role still needs the model loaded. The intended use case is throughput tuning, not making an oversized model fit.

The code shares the same cluster config, registry, transport, and metrics infrastructure as PP. Treat it as a topology-specific feature: run a live test with your traffic shape before publishing performance claims.

Common limitations

  • Distributed support is not uniform across model families.
  • VLM partitioning is partial; text-only paths are better covered.
  • Multi-host CI coverage is limited compared with single-host unit tests.
  • Transport performance depends heavily on the physical interconnect and OS network configuration.

See supported models for the maintained support summary.