
Investigate Why Ringed Pipeline favors k>1 at 70B Q4 #73

@andthattoo

Description


MLX Ring Pipeline Memory Investigation

The dnet ring pipeline fails to run Llama-3.3 70B Q4 on 32 GB + 24 GB (56 GB total) with k=1 (one ring round per token), while distributed-llama (tensor parallel + Q80 pipes) fits the same class of model on 2×24 GB (48 GB). This task instruments and tests the pipeline to narrow down the root causes (inter-stage dtype, KV cache, embedding/LM-head placement, lack of tensor parallelism, pools) and proposes concrete fixes.
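For orientation, a back-of-envelope per-stage budget (a sketch assuming standard Llama-70B dimensions and ~4.5 effective bits/weight for Q4 with group scales; the exact figures should be read off the real checkpoint):

```python
# Back-of-envelope budget for a 2-stage ring split of Llama-70B Q4.
# Dimensions are standard Llama-70B values; 4.5 bits/weight (4-bit weights +
# group scales) and fp16 KV are assumptions to confirm against the checkpoint.
GiB = 1024 ** 3

n_layers, hidden = 80, 8192
n_kv_heads, head_dim = 8, 128
ffn_dim, seq_len = 28_672, 4096

params_per_layer = (
    hidden * hidden                          # wq
    + 2 * hidden * n_kv_heads * head_dim     # wk, wv (GQA)
    + hidden * hidden                        # wo
    + 3 * hidden * ffn_dim                   # w1, w2, w3
)
layer_q4_gib = params_per_layer * 4.5 / 8 / GiB
kv_fp16_gib  = 2 * n_layers * n_kv_heads * head_dim * seq_len * 2 / GiB

print(f"one decoder layer (Q4):      {layer_q4_gib:.2f} GiB")
print(f"e.g. 40 layers on one stage: {40 * layer_q4_gib:.1f} GiB")
print(f"fp16 KV cache at 4k context: {kv_fp16_gib:.2f} GiB across all 80 layers")
```

Roughly 0.45 GiB per Q4 decoder layer plus ~1.3 GiB of fp16 KV at 4k context already leaves little slack on a 24 GB stage once pools, buffers, and a possibly dense embedding/LM head are added on top.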

Background

Distributed Llama (DLlama)

  • Tensor-parallel sharding of all large matrices in every layer (Q/K/V and W1/W3 row-sharded; W2 and the logits projection column-sharded)
  • Inter-node pipes carry Q80 activations; matmuls are Q80×Q40→F32; KV heads are sharded across nodes
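For reference, a toy numpy sketch of the row/column split (illustrative only; ReLU stands in for the gated SiLU and the W3 gate is omitted, so this is not DLlama's actual kernel layout):

```python
# Toy 2-way tensor-parallel MLP. W1 is split along its output (ffn) dimension,
# W2 along its input (ffn) dimension, so each node stores half of each matrix
# and only the partial outputs need an all-reduce.
import numpy as np

hidden, ffn, world = 512, 2048, 2
x  = np.random.randn(1, hidden)
W1 = np.random.randn(hidden, ffn)
W2 = np.random.randn(ffn, hidden)

def mlp_shard(rank):
    cols = ffn // world
    w1 = W1[:, rank * cols:(rank + 1) * cols]   # this node's slice of W1
    w2 = W2[rank * cols:(rank + 1) * cols, :]   # matching slice of W2
    return np.maximum(x @ w1, 0.0) @ w2         # partial sum of the full output

# summing (all-reducing) the partials reproduces the unsharded result
assert np.allclose(sum(mlp_shard(r) for r in range(world)),
                   np.maximum(x @ W1, 0.0) @ W2)
print("2-way TP matches the unsharded MLP")
```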

DNET

| Component | Current behavior | References |
| --- | --- | --- |
| Inter-stage dtype | fp16 by default (TransportConfig.wire_dtype; the runtime forces casts). Compression exists (qsparse8_v1) but the ring path doesn't use it. | dnet/shard/config.py, dnet/shard/runtime.py, dnet/shard/codec.py, dnet/core/tensor.py, dnet/compression/wire.py |
| KV cache | Defaults to fp16 unless the API explicitly sets 8-bit/4-bit. | dnet/shard/runtime.py (load_model_core kv_bits handling) |
| Parallelism | No tensor sharding; pure pipeline parallelism via assigned_layers. | dnet/api/strategies/ring.py, dnet/shard/runtime.py |
| Embedding/LM head | Embedding and LM head are split across the edge stages; dtype follows the checkpoint (can be dense). | dnet/utils/model.py (load_embeddings, load_lm_head) |
| Pools | Input/output pools default to 512 MB each per shard. | dnet/shard/config.py |
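To check the "dtype follows checkpoint" row (and E3's first bullet) without loading any weights, the safetensors header can be read directly. The path is a placeholder and the "U32 plus companion .scales/.biases tensors means quantized" heuristic follows MLX conventions but is an assumption to double-check:

```python
# List tensor dtypes/shapes in a shard checkpoint without loading the data:
# a safetensors file starts with a little-endian u64 header length followed
# by a JSON header describing every tensor.
import json, struct, sys

DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "U32": 4, "I32": 4, "U8": 1}

def list_tensors(path: str) -> None:
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        n = 1
        for d in meta["shape"]:
            n *= d
        mib = n * DTYPE_BYTES.get(meta["dtype"], 2) / 2**20
        # F16/BF16 on the embedding or lm_head tensors means they are still dense
        print(f"{name:60s} {meta['dtype']:5s} {str(meta['shape']):>20s} ~{mib:8.1f} MiB")

if __name__ == "__main__":
    list_tensors(sys.argv[1])   # path/to/model-shard.safetensors
```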

Hypotheses

| ID | Hypothesis |
| --- | --- |
| H1 | Inter-stage fp16 doubles activation traffic and buffer footprint vs Q8, pushing peak stage memory over 24 GB. |
| H2 | Embedding/LM head reside unsharded on the edge stages; if dense, they spike stage memory. |
| H3 | Pools and stream buffers add nontrivial overhead on tight 24 GB stages. |
| H4 | Absence of TP concentrates full layer weights on a single stage; DLlama's TP flattens the peaks. |
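Rough numbers behind H2/H3, assuming Llama-70B dimensions (vocab 128256, hidden 8192), an untied LM head, and the 512 MB pool default listed above:

```python
# Quick sizing for H2 (dense embedding/LM head) and H3 (pool overhead).
GiB = 1024 ** 3
vocab, hidden = 128_256, 8192

dense_embed = vocab * hidden * 2 / GiB     # fp16 embedding table
dense_head  = vocab * hidden * 2 / GiB     # fp16 LM head, if untied
pools       = 2 * 512 * 1024**2 / GiB      # input + output pool per shard

print(f"H2: dense embedding + LM head ≈ {dense_embed + dense_head:.1f} GiB on the edge stages")
print(f"H3: default pools reserve ≈ {pools:.1f} GiB per shard before any weights load")
```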

Goals

  • Produce a stage-wise memory and comm budget that explains the OOM
  • Validate which factors (H1–H4) materially impact peak stage memory
  • Deliver minimal changes to get closer to DLlama's footprint (or document why TP is required)
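A minimal probe for the stage-wise budget could look like the sketch below (not dnet's existing instrumentation). mx.metal.get_peak_memory/reset_peak_memory are available in recent MLX releases, though the exact names may differ across versions; ru_maxrss is kept as a version-independent fallback.

```python
# Sketch of a per-stage memory probe for each shard process.
import resource
import mlx.core as mx

GiB = 1024 ** 3

def snapshot(tag: str) -> None:
    """Log MLX active/peak GPU memory plus the process RSS high-water mark."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / GiB  # bytes on macOS
    print(f"[{tag}] mlx_active={mx.metal.get_active_memory() / GiB:.2f} GiB "
          f"mlx_peak={mx.metal.get_peak_memory() / GiB:.2f} GiB "
          f"rss_peak={rss:.2f} GiB")

# intended use inside a shard:
snapshot("after model load")
mx.metal.reset_peak_memory()
# ... run prefill / decode for one request ...
snapshot("after prefill")
```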

Experiments

E1: Inter-stage dtype

| Variant | Description |
| --- | --- |
| A | Baseline fp16 wire (current). |
| B | Enable qsparse8_v1 by default; compare peak memory and bytes/token. |
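Expected wire traffic per hop under the two variants (hidden size 8192 assumed; the ~6% overhead for q8 group scales/metadata is a guess and should be checked against the framing in dnet/compression/wire.py):

```python
# Activation traffic per pipeline hop: decode vs full-prompt prefill.
hidden, prefill_tokens = 8192, 4096
KiB, MiB = 1024, 1024**2

fp16_decode  = hidden * 2 / KiB                    # one hidden vector per new token
q8_decode    = hidden * 1.06 / KiB                 # ~6% assumed scale/metadata overhead
fp16_prefill = prefill_tokens * hidden * 2 / MiB   # whole prompt in one shot
q8_prefill   = prefill_tokens * hidden * 1.06 / MiB

print(f"decode:  {fp16_decode:.0f} KiB/token fp16  vs ~{q8_decode:.1f} KiB/token q8")
print(f"prefill: {fp16_prefill:.0f} MiB/hop  fp16  vs ~{q8_prefill:.0f} MiB/hop  q8")
```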

E2: Pools

Reduce pools from 512 MB → 128 MB → 64 MB; measure headroom.

E3: Embedding/LM head

  • Verify the embedding/LM-head dtype (dense vs quantized)
  • If dense, use quantized tensors if the checkpoint provides them, or relocate the embedding to the larger device
  • Shard the LM head by vocab for sampling (see the sketch below)
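For the last bullet, a toy sketch of greedy sampling with a vocab-sharded LM head (numpy, reduced sizes; names are illustrative and not dnet APIs):

```python
# Greedy sampling with a vocab-sharded LM head (2 shards). Each shard holds
# rows [lo:hi) of the head, scores only its vocab slice, and the per-shard
# winners are reduced, so no device ever materializes the full logit row.
import numpy as np

vocab, hidden, shards = 32_000, 1024, 2
W_head = np.random.randn(vocab, hidden) * 0.02
h = np.random.randn(hidden)

def local_top1(rank):
    lo, hi = rank * vocab // shards, (rank + 1) * vocab // shards
    logits = W_head[lo:hi] @ h
    j = int(np.argmax(logits))
    return logits[j], lo + j          # (best local logit, global token id)

best_logit, token = max(local_top1(r) for r in range(shards))
assert token == int(np.argmax(W_head @ h))   # agrees with the unsharded argmax
print("next token id:", token)
```

Temperature sampling needs one extra exchange (each shard's max and sum of exponentials) to normalize the softmax, but the full logit row still never has to live on a single device.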

Test Matrix

| Parameter | Values |
| --- | --- |
| Hardware | MacBook Pro M1 Max 32 GB + Mac mini M4 Pro 24 GB (or 2×24 GB) |
| Model | Llama-3.1 70B Q4, seq_len=4096, batch=1 |
| Wire | fp16 vs qsparse8_v1 |
| KV | fp16 vs 8-bit |
| Pools | 512 vs 128 vs 64 MB |
| Embed/Head | Dense vs quantized (if available), placement |

Metadata

Labels: performance (Performance optimizations and resource efficiency)
