Commit 1449816

add chunked prefill and continuous batching writeup (#64)
Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
1 parent 34fb3fe commit 1449816

File tree

9 files changed, +341 −24 lines changed


book/src/SUMMARY.md

Lines changed: 1 addition & 2 deletions

```diff
@@ -17,8 +17,7 @@
   - [Key-Value Cache](./week2-01-kv-cache.md)
   - [Quantized Matmul (2 Days)]()
   - [Flash Attention (2 Days)]()
-  - [Chunked Prefill]()
-  - [Continuous Batching]()
+  - [Continuous Batching (2 Days)](./week2-06-prefill-and-batch.md)
 - [Week 3: Serving]()
```
---
Lines changed: 113 additions & 0 deletions
# Week 2 Day 6 and 7: Chunked Prefill and Continuous Batching

In this chapter, we will implement **continuous batching**. The idea is to batch multiple requests together so we can make full use of the compute resources.

So far, we have assumed that the model only processes a single batch each time it is called. However, a single batch is usually not enough to saturate the compute resources. To address this, we can process multiple requests at the same time.

The first question is how to batch requests. A naive approach would be to select a fixed number of prompts (for example, 5) from the request queue and perform decoding as before. The problem is that different prompts produce sequences of different lengths. It is possible that 4 out of 5 requests finish decoding quickly, while the remaining one takes much longer. This leads to wasted compute resources and stalls all other requests.

A smarter approach is **continuous batching**. That is, we set the maximum number of requests we can process at once. When one request finishes, we replace its slot (i.e., its KV cache) with another request. In this way, the pipeline remains fully utilized.

Another challenge is how to handle decoding and prefilling at the same time. In this chapter, we adopt a simplified approach: we prefill one request, then decode one token for each request in progress. The general idea can be described with the following pseudocode:
```python
while requests_in_queue_or_in_progress:
    if prefill_request exists:
        prefill_request.try_prefill()  # perform a chunk of chunked prefill
        if prefill_request.ready:
            if kv_cache.try_add(prefill_request):
                prefill_request = next(requests)
    tokens = decode(model, kv_cache)
    requests.append(tokens)
```

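To make the slot-recycling behavior concrete, here is a toy, model-free simulation of the schedule above (all names here are hypothetical and only for illustration): each request needs a fixed number of decode steps, and as soon as a request finishes, its slot is refilled from the queue on the next iteration.

```python
from collections import deque


def simulate(decode_steps: list[int], batch_size: int) -> list[tuple[int, int]]:
    """Return (request id, step at which it finished) in completion order."""
    queue = deque(enumerate(decode_steps))  # (request id, remaining steps)
    slots: list[tuple[int, int] | None] = [None] * batch_size
    finished = []
    step = 0
    while queue or any(s is not None for s in slots):
        # refill idle slots from the queue (the "continuous" part)
        for i in range(batch_size):
            if slots[i] is None and queue:
                slots[i] = queue.popleft()
        # one decode step for every active slot
        step += 1
        for i in range(batch_size):
            if slots[i] is None:
                continue
            rid, remaining = slots[i]
            remaining -= 1
            if remaining == 0:
                finished.append((rid, step))
                slots[i] = None  # slot freed; refilled on the next iteration
            else:
                slots[i] = (rid, remaining)
    return finished


print(simulate([2, 5, 1], batch_size=2))  # [(0, 2), (2, 3), (1, 5)]
```

Note how request 2 starts as soon as request 0 finishes, instead of waiting for the whole batch to drain.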
We will also implement **chunked prefill** in this chapter. Prefilling a long prompt can take a significant amount of time. Since we are interleaving prefills and decodes, we want to reduce the latency of producing the next token. Ideally, the time slots for prefill and decode should be roughly equal. To achieve this, we can prefill a portion of the request at a time, using multiple slots to finish the entire prefill.

For prefilling, this essentially means providing a chunk of tokens to the model to populate the KV cache. For example:

```python
# assume prompt_tokens is a list of 400 tokens and the prefill chunk size is 128
_step(model, prompt_tokens[0:128], 0, kv_cache)      # offset = 0
_step(model, prompt_tokens[128:256], 128, kv_cache)  # offset = 128
_step(model, prompt_tokens[256:384], 256, kv_cache)  # offset = 256
_step(model, prompt_tokens[384:400], 384, kv_cache)  # offset = 384
```

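The chunk boundaries above can be computed with a small helper (a hypothetical function, not part of the starter code):

```python
def prefill_chunks(num_tokens: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a prompt into (start, end) offset pairs of at most chunk_size tokens."""
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


print(prefill_chunks(400, 128))  # [(0, 128), (128, 256), (256, 384), (384, 400)]
```
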
Note that the causal mask generated during prefilling has the shape `LxS`, where `L` is the number of new tokens and `S` is the total sequence length including the tokens already in the KV cache. For example, assume we already have 2 tokens in the KV cache and want to prefill 3 tokens, so `L = 3` and `S = 5`. The mask should look like this:

```
0 0 0 -inf -inf
0 0 0 0 -inf
0 0 0 0 0
```

This is the same masking logic you implemented in Week 1.

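A minimal sketch of this mask for the general `L != S` case (plain Python floats rather than `mx.array`, purely for illustration): row `i` may attend to positions `0..S-L+i`.

```python
def causal_mask(L: int, S: int) -> list[list[float]]:
    """L x S causal mask: 0.0 where attention is allowed, -inf where it is not."""
    return [
        [0.0 if j <= S - L + i else float("-inf") for j in range(S)]
        for i in range(L)
    ]


for row in causal_mask(3, 5):
    print(row)
```

With `L == S` this reduces to the usual lower-triangular mask from Week 1.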
## Task 1: Batch RoPE and Causal Mask for Prefill

```
src/tiny_llm/positional_encoding.py
src/tiny_llm/attention.py::causal_mask
```

Ensure your RoPE implementation accepts a list of offsets. Also, make sure your mask implementation correctly handles the case where `L != S`.

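One way to picture the per-offset requirement: every request applies the same rotation schedule, but starting at its own position offset. A numpy sketch of this idea (illustrative only; the actual implementation uses mlx arrays and your Week 1 RoPE code):

```python
import numpy as np


def rope(x: np.ndarray, offsets: list[int], base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to x of shape (B, L, D); request b's tokens start at offsets[b]."""
    B, L, D = x.shape
    inv_freq = 1.0 / (base ** (np.arange(0, D, 2) / D))  # (D/2,)
    out = np.empty_like(x)
    for b, off in enumerate(offsets):
        pos = np.arange(off, off + L)[:, None]           # absolute positions (L, 1)
        theta = pos * inv_freq[None, :]                  # (L, D/2)
        cos, sin = np.cos(theta), np.sin(theta)
        x1, x2 = x[b, :, 0::2], x[b, :, 1::2]
        out[b, :, 0::2] = x1 * cos - x2 * sin            # rotate each 2D pair
        out[b, :, 1::2] = x1 * sin + x2 * cos
    return out
```

At offset 0 the rotation angle is 0, so the first token is unchanged; two requests at the same offset produce identical rotations regardless of their batch position.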
## Task 2: Batch KV Cache

```
src/tiny_llm/kv_cache.py::BatchingKvCache
```

The batch KV cache is a collection of KV caches, one for each request. A challenge here is generating a `BxHxLxS` mask for the batch, since requests can have different lengths.

```
S = max(S_i of the batch)
L = mask_length (input parameter)
keys: 1, H, S_i, D
values: 1, H, S_i, D
batched_keys: B, H, S, D
batched_values: B, H, S, D
mask: B, 1, L, S
```

You should fill the `batched_keys` and `batched_values` arrays so that each request’s data is aligned at the end:

```python
batched_keys[i, :, (S - S_i):S, :] = keys[i, :, :, :]
batched_values[i, :, (S - S_i):S, :] = values[i, :, :, :]
mask[i, :, 0:L, (S - S_i):S] = causal_mask(L, S_i)
```

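To see the end-alignment concretely, here is a toy numpy demo (shapes chosen arbitrarily for illustration) that right-aligns two requests of different lengths into one batched buffer and masks out the front padding:

```python
import numpy as np

lengths = [2, 4]                       # S_i for each request
B, H, D, S = len(lengths), 1, 1, max(lengths)
L = 1                                  # decoding one token per request
batched_keys = np.zeros((B, H, S, D))
mask = np.full((B, 1, L, S), -np.inf)
for i, S_i in enumerate(lengths):
    keys_i = np.ones((1, H, S_i, D)) * (i + 1)  # stand-in KV data for request i
    batched_keys[i, :, S - S_i:S, :] = keys_i[0]  # align at the end
    mask[i, :, :, S - S_i:S] = 0.0                # attend only to real tokens

print(batched_keys[0, 0, :, 0])  # front-padded: [0. 0. 1. 1.]
print(batched_keys[1, 0, :, 0])  # full length:  [2. 2. 2. 2.]
```

The `-inf` entries in the mask ensure the padded positions contribute nothing after softmax.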
## Task 3: Handle Batches in the Model

```
src/tiny_llm/qwen2_week2.py
```

Ensure your model can handle multiple requests simultaneously. You should also use the masks returned by the batch KV cache.

## Task 4: Batch Generate

```
src/tiny_llm/batch.py
```

Implement `try_prefill` so that it prefills an entire request at once. Then implement the rest of the code as described in the starter code.

## Task 5: Chunked Prefill

```
src/tiny_llm/batch.py
```

Modify `try_prefill` so that it performs prefilling in chunks, rather than all at once.

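As a rough illustration of the state a chunked `try_prefill` needs to track, here is a model-free sketch (the `FakePrefill` class and its fields are hypothetical stand-ins for `Request`; this is not the reference solution):

```python
class FakePrefill:
    """Tracks chunked prefill progress over a token list."""

    def __init__(self, tokens: list[int], prefill_max_step: int = 128):
        self.tokens = tokens
        self.prefill_max_step = prefill_max_step
        self.offset = 0
        self.is_prefill_done = False

    def try_prefill(self):
        if self.is_prefill_done:
            raise ValueError("prefill called after done")
        end = min(self.offset + self.prefill_max_step, len(self.tokens))
        # a real implementation would run the chunk through the model here,
        # e.g. _step(model, chunk, [self.offset], kv_cache)
        self.offset = end
        if self.offset == len(self.tokens):
            self.is_prefill_done = True


r = FakePrefill(list(range(300)), prefill_max_step=128)
steps = 0
while not r.is_prefill_done:
    r.try_prefill()
    steps += 1
print(steps, r.offset)  # 300 tokens in 128-token chunks: 3 steps, offset 300
```

Each call advances `offset` by at most one chunk, so decode steps for other requests can be interleaved between calls.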
You can test your implementation by running:

```bash
pdm run batch-main
```

This will use the `qwen2-0.5b` model with a batch size of 5 to process a fixed set of prompts.

{{#include copyright.md}}

src/tiny_llm/attention.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -53,5 +53,6 @@ def flash_attention(
     key: mx.array,
     value: mx.array,
     scale: float | None = None,
+    mask: mx.array | None = None,
 ) -> mx.array:
     pass
```

src/tiny_llm/batch.py

Lines changed: 171 additions & 0 deletions
```python
import mlx.core as mx
from mlx_lm.tokenizer_utils import TokenizerWrapper
from .kv_cache import *
from .qwen2_week2 import Qwen2ModelWeek2
from typing import Callable
from datetime import datetime


def _step(model, y, offsets, kv_cache):
    logits = model(y, offsets, kv_cache)
    logits = logits[:, -1, :]
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    sampler = lambda x: mx.argmax(x, axis=-1)
    y = sampler(logprobs)
    return y


class Request:
    def __init__(
        self,
        model: any,
        tokenizer: TokenizerWrapper,
        prompt: str,
        prefill_max_step: int = 128,
        prompt_idx: int = 0,
    ):
        self.prompt = prompt
        self.kv_cache = [TinyKvFullCache() for _ in range(model.num_hidden_layers)]
        self.model = model
        self.detokenizer = tokenizer.detokenizer.__class__(tokenizer._tokenizer)
        self.prefill_tokens = mx.array(
            tokenizer.encode(prompt, add_special_tokens=False)
        )
        self.prefill_max_step = prefill_max_step
        self.is_done = False
        self.is_prefill_done = False
        self.eos_token_id = tokenizer.eos_token_id
        self.next_token = None
        self.offset = 0
        self.prompt_idx = prompt_idx

    def try_prefill(self):
        """
        Prefill this request up to prefill_max_step tokens; returns None if prefill is not done.
        """
        if self.is_prefill_done:
            raise ValueError("prefill called after done")
        # TODO: in task 4, prefill the full request at once; in task 5, prefill a chunk at a time

    def decode_done(self, token, update_offset=True):
        if self.is_done:
            raise ValueError("decode called after done")
        if token == self.eos_token_id:
            self.is_done = True
            return
        # TODO: update the offset and add the token to the detokenizer

    def text(self):
        return self.detokenizer.text


def _print_progress(
    requests: list[Request | None],
    is_idle: list[bool],
    pending_prefill_request: Request | None,
    queue_size: int,
    progress_cnt: int,
    start_time: datetime,
):
    print(f" --- {datetime.now() - start_time}")
    animation_frames = ["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]
    animation_frame = animation_frames[progress_cnt % len(animation_frames)]
    for i in range(len(requests)):
        if is_idle[i]:
            print(f"  Decode #{i}: idle", flush=True)
        else:
            text_preview = requests[i].text()[-80:].replace("\n", " ")
            print(
                f"{animation_frame} Decode [req {requests[i].prompt_idx}, {requests[i].offset}]: {text_preview}",
                flush=True,
            )
    if pending_prefill_request is not None:
        if pending_prefill_request.is_prefill_done:
            print(
                f"  Prefill [req {pending_prefill_request.prompt_idx}]: done, waiting for slot, {queue_size} requests in queue",
                flush=True,
            )
            return
        percentage = (
            pending_prefill_request.offset / pending_prefill_request.prefill_tokens.size
        ) * 100
        print(
            f"{animation_frame} Prefill [req {pending_prefill_request.prompt_idx}]: {percentage:.2f}% ({pending_prefill_request.prefill_tokens.size - pending_prefill_request.offset} remaining tokens)",
            flush=True,
        )
    else:
        print(f"  Prefill: idle, {queue_size} requests in queue", flush=True)


def batch_generate(
    model: any,
    tokenizer: TokenizerWrapper,
    prompts: list[str],
    max_seq_len=512,
    batch_size=5,
    prefill_step=128,
):
    decode_requests: list[Request] = [None] * batch_size
    is_idle = [True] * batch_size
    kv_cache = [
        BatchingKvCache(max_active_requests=batch_size, max_seq_len=max_seq_len)
        for _ in range(model.num_hidden_layers)
    ]
    result = []
    pending_prefill_request = None
    next_request_idx = 0
    progress_cnt = 0
    start_time = datetime.now()

    while True:
        if len(prompts) == 0 and all(is_idle):
            break
        # prefill until no idle slots
        if len(prompts) > 0 and pending_prefill_request is None:
            prompt = prompts.pop(0)
            pending_prefill_request = Request(
                model, tokenizer, prompt, prefill_step, next_request_idx
            )
            next_request_idx += 1

        # In every iteration, we do a prefill first
        if pending_prefill_request is not None:
            made_progress = False
            if not pending_prefill_request.is_prefill_done:
                pending_prefill_request.try_prefill()
                made_progress = True
            if pending_prefill_request.is_prefill_done:
                # TODO: find an idle slot and add the request to the decode requests
                pass
            if made_progress:
                _print_progress(
                    decode_requests,
                    is_idle,
                    pending_prefill_request,
                    len(prompts),
                    progress_cnt,
                    start_time,
                )
                progress_cnt += 1

        # After the prefill request moves forward one step, we do the decode
        if not all(is_idle):
            next_tokens = []
            offsets = []
            # TODO: collect the next tokens and offsets from the decode requests
            # (next_tokens should end up as an mx.array before the reshape below)
            next_tokens = _step(model, next_tokens.reshape(-1, 1), offsets, kv_cache)
            for i in range(batch_size):
                # TODO: check if the decode has finished by comparing EOS or the seq
                # length. If so, remove the request from the decode requests and add
                # the result to the result list; otherwise, call `decode_done` to
                # update the offset and add the token to the detokenizer
                pass
            _print_progress(
                decode_requests,
                is_idle,
                pending_prefill_request,
                len(prompts),
                progress_cnt,
                start_time,
            )
            progress_cnt += 1
    return result
```
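The `_step` helper above greedily samples the next token from the last position's logits. A numpy sketch of the same sampling logic (illustrative only; the real helper operates on mlx arrays):

```python
import numpy as np


def greedy_next_token(logits: np.ndarray) -> np.ndarray:
    """logits: (B, L, V). Keep the last position, normalize, argmax per batch row."""
    last = logits[:, -1, :]                                            # (B, V)
    logprobs = last - np.log(np.sum(np.exp(last), axis=-1, keepdims=True))
    return np.argmax(logprobs, axis=-1)                                # (B,)


logits = np.array([[[0.1, 2.0, 0.3]], [[1.5, 0.2, 0.1]]])  # B=2, L=1, V=3
print(greedy_next_token(logits))  # [1 0]
```

Since argmax is unaffected by the shift, the log-softmax normalization matters only if you later switch to probabilistic sampling.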

src/tiny_llm/generate.py

Lines changed: 0 additions & 11 deletions

```diff
@@ -20,14 +20,3 @@ def simple_generate_with_kv_cache(
 ) -> str:
     def _step(model, y, offset, kv_cache):
         pass
-
-
-def batch_generate(
-    model: any,
-    tokenizer: TokenizerWrapper,
-    prompts: list[str],
-    max_seq_len=512,
-    batch_size=5,
-    prefill_step=128,
-):
-    pass
```

src/tiny_llm/kv_cache.py

Lines changed: 52 additions & 7 deletions

```diff
@@ -31,12 +31,52 @@ def update_and_fetch(
 
 class BatchingKvCache(TinyKvCache):
     def __init__(self, max_active_requests: int, max_seq_len: int):
-        pass
+        self.max_active_requests = max_active_requests
+        self.max_seq_len = max_seq_len
+        self.kv_caches: list[TinyKvCache] = [None] * max_active_requests
+        self.HD = None
 
     def update_and_fetch(
-        self, key: mx.array, value: mx.array
-    ) -> tuple[mx.array, mx.array, int]:
-        pass
+        self,
+        keys: mx.array,
+        values: mx.array,
+        mask_length: int | None = None,
+        mask: mx.array | str | None = None,
+    ) -> tuple[mx.array, mx.array, int, Optional[mx.array]]:
+        B, H, S, D = keys.shape
+        assert keys.shape == values.shape
+        assert S <= self.max_seq_len
+        assert self.HD == (H, D), f"expect {self.HD} but got {H, D}"
+        assert B == self.max_active_requests
+        # Step 1: append the result to the cache
+        data = []
+        for b in range(B):
+            if self.kv_caches[b] is None:
+                data.append(None)
+                continue
+            key, value = keys[b : b + 1], values[b : b + 1]
+            new_key, new_value, seq_len, mask = self.kv_caches[b].update_and_fetch(
+                key, value
+            )
+            data.append((new_key[0], new_value[0], seq_len, mask))
+
+        # Step 2: compute seq_len of this batch
+        def get_seq_len(data):
+            if data is None:
+                return 0
+            _, _, seq_len, _ = data
+            return seq_len
+
+        seq_len = max(map(get_seq_len, data))
+
+        # Step 3: generate masks and a single array of keys and values
+        keys = mx.zeros((self.max_active_requests, H, seq_len, D), dtype=key.dtype)
+        values = mx.zeros((self.max_active_requests, H, seq_len, D), dtype=value.dtype)
+        masks = mx.full(
+            (self.max_active_requests, mask_length, seq_len), -mx.inf, dtype=key.dtype
+        )
+        # TODO: generate masks and a single array of keys and values
+        return keys, values, None, masks.reshape(B, 1, mask_length, seq_len)
 
     def add_request(self, prefilled: TinyKvCache, id: int):
         pass
@@ -47,9 +87,14 @@ def remove_request(self, id: int):
 
 class TinyKvFullCache(TinyKvCache):
     def __init__(self):
-        pass
+        self.key_values = None
+        self.offset = 0
 
     def update_and_fetch(
-        self, key: mx.array, value: mx.array
-    ) -> tuple[mx.array, mx.array, int]:
+        self,
+        key: mx.array,
+        value: mx.array,
+        mask_length: int | None = None,
+        mask: mx.array | str | None = None,
+    ) -> tuple[mx.array, mx.array, int, Optional[mx.array]]:
         pass
```
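The `TinyKvFullCache` exercise boils down to concatenating new keys and values along the sequence axis and tracking the offset. A numpy sketch of that idea (a hypothetical stand-in for the mlx version, not the reference solution):

```python
import numpy as np


class FullCacheSketch:
    """Append-only KV cache: concatenates along the sequence axis (axis=2)."""

    def __init__(self):
        self.key_values = None  # (keys, values), each of shape (B, H, S, D)
        self.offset = 0         # number of tokens cached so far

    def update_and_fetch(self, key, value):
        if self.key_values is None:
            self.key_values = (key, value)
        else:
            ks, vs = self.key_values
            self.key_values = (
                np.concatenate([ks, key], axis=2),
                np.concatenate([vs, value], axis=2),
            )
        self.offset += key.shape[2]
        return self.key_values[0], self.key_values[1], self.offset


cache = FullCacheSketch()
k = np.ones((1, 2, 3, 4))   # prefill 3 tokens
cache.update_and_fetch(k, k)
k2 = np.ones((1, 2, 1, 4))  # decode 1 more token
keys, values, offset = cache.update_and_fetch(k2, k2)
print(keys.shape, offset)   # (1, 2, 4, 4) 4
```
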
