This PR:
- Aligns the Metal paged-attention KV cache dtype with the model's
dtype (fixes batched-decode parity for #119).
- Computes KV cache byte sizes via `torch.dtype.itemsize` instead of
allocating temporary tensors.
Notes:
- `tests/test_metal_kernel_paged.py::test_batched_decode_matches` now
passes.
- `tests/test_metal_kernel_paged.py::test_greedy_output_matches` remains
xfailed (tracked in #119). It covers a single-request greedy parity
mismatch between the paged-kernel path and the standard path; fixing it
likely requires deeper kernel/offset-semantics work, so it stays out of
this PR to keep the scope tight.
Quick manual smoke test:
Terminal 1:
```bash
vllm serve Qwen/Qwen3-0.6B --host 127.0.0.1 --port 8000 --max-model-len 2048
```
Terminal 2 (single request):
```bash
curl -fsS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Write a 2-sentence apple story."}],"max_tokens":512,"temperature":0.8}' \
| jq -r '.choices[0].message.content'
```
Terminal 2 (concurrent 4 requests):
```bash
for i in 1 2 3 4; do
(
echo "===== req $i ====="
curl -fsS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a 2-sentence apple story (${i}).\"}],\"max_tokens\":256,\"temperature\":0.8}" \
| jq -r '.choices[0].message.content'
echo
) &
done
wait
```
Related: #119
---------
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>