
Commit 87212c8

Fix link to vLLM Native CPU Offloading documentation (llm-d#928)

* Fix link to vLLM Native CPU Offloading documentation

Signed-off-by: Pete Cheslock <petecheslock@users.noreply.github.com>

* Fix header/indenting

Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>

---------

Signed-off-by: Pete Cheslock <petecheslock@users.noreply.github.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
1 parent 37b1843 commit 87212c8

1 file changed: README.md (2 additions, 2 deletions)

@@ -52,15 +52,15 @@ llm-d accelerates distributed inference by integrating industry-standard open te
 </picture>
 </p>
 
-llm-d adds:
+### llm-d adds:
 
 - [**Model Server Optimizations in vLLM:**](https://github.com/vllm-project/vllm) The llm-d team contributes and maintains high performance distributed serving optimizations in upstream vLLM, including disaggregated serving, KV connector interfaces, support for frontier OSS mixture of experts models, and production-ready observability and resiliency.
 
 - [**Inference Scheduler:**](https://github.com/llm-d/llm-d-inference-scheduler) llm-d uses the Envoy proxy and its extensible balancing policies to make customizable “smart” load-balancing decisions specifically for LLMs without reimplementing a full featured load balancer. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms to make decisions with P/D-, KV-cache-, SLA-, and load-awareness. Advanced users can implement their own scorers to further customize the algorithm while benefiting from IGW features like flow control and latency-aware balancing. The control plane for the load balancer is the Kubernetes API but can also be run standalone.
 
 - [**Disaggregated Serving Sidecar:**](https://github.com/llm-d/llm-d-inference-scheduler/tree/main/cmd/pd-sidecar) llm-d orchestrates prefill and decode phases onto independent instances - the scheduler decides which instances should receive a given request, and the transaction is coordinated via a sidecar alongside decode instances. The sidecar instructs vLLM to provide point to point KV cache transfer over fast interconnects (IB/RoCE RDMA, TPU ICI, and DCN) via NIXL.
 
-- [**vLLM Native CPU Offloading**](https://docs.vllm.ai/en/latest/examples/offline_inference/basic/#cpu-offload) and [**llm-d filesystem backend**:](https://github.com/llm-d/llm-d-kv-cache/tree/main/kv_connectors/llmd_fs_backend) llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache, Mooncake, and KVBM.
+- [**vLLM Native CPU Offloading**](https://docs.vllm.ai/en/latest/examples/basic/offline_inference/#cpu-offload) and [**llm-d filesystem backend**:](https://github.com/llm-d/llm-d-kv-cache/tree/main/kv_connectors/llmd_fs_backend) llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache, Mooncake, and KVBM.
 
 - [**Variant Autoscaling over Hardware, Workload, and Traffic**](https://github.com/llm-d-incubation/ig-wva): A traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency.
 
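The Disaggregated Serving Sidecar bullet in the diff above builds on vLLM's KV-connector plumbing. The sketch below only illustrates the general shape of enabling a KV-transfer connector on a vLLM engine; the connector name, role value, and model are illustrative assumptions, not the sidecar's actual configuration.

```python
# Rough sketch: enabling a KV-transfer connector on a vLLM engine so KV blocks
# can be moved between prefill and decode instances over NIXL. The connector
# name, kv_role value, and model are illustrative assumptions; the llm-d
# sidecar handles the real orchestration between instances.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",  # assumed NIXL-backed connector name
        kv_role="kv_both",             # instance may both produce and consume KV
    ),
)
```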
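For context on the corrected link: it points at the CPU-offload section of vLLM's basic offline-inference example. Below is a minimal, non-authoritative sketch of that usage, assuming vLLM's offline `LLM` API with its `cpu_offload_gb` option; the model name and prompt are placeholders.

```python
# Minimal sketch of the CPU-offload option shown in vLLM's basic
# offline-inference example (the target of the corrected link).
# Assumes the vLLM package is installed; model and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,  # spill up to ~4 GiB of model state to host (CPU) memory
)

outputs = llm.generate(
    ["Explain KV cache offloading in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```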
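The Variant Autoscaling bullet reduces to capacity-versus-demand arithmetic. The toy calculation below is a simplification of that idea rather than the ig-wva algorithm: it assumes measured per-replica prefill and decode throughputs plus a recent traffic mix, and sizes each pool independently.

```python
# Toy capacity math illustrating the autoscaling idea (not the ig-wva algorithm).
# The per-replica throughputs and traffic mix are made-up, assumed numbers.
import math

PREFILL_TOKS_PER_SEC = 20_000   # assumed measured prefill capacity per replica
DECODE_TOKS_PER_SEC = 4_000     # assumed measured decode capacity per replica

traffic_mix = [
    # (requests/sec, avg prompt tokens, avg output tokens)
    (10.0, 2_000, 200),   # long-prompt, short-answer traffic
    (40.0, 300, 400),     # chat-style traffic
]

prefill_load = sum(qps * prompt for qps, prompt, _ in traffic_mix)  # tokens/sec
decode_load = sum(qps * output for qps, _, output in traffic_mix)   # tokens/sec

prefill_replicas = math.ceil(prefill_load / PREFILL_TOKS_PER_SEC)
decode_replicas = math.ceil(decode_load / DECODE_TOKS_PER_SEC)
print(f"prefill replicas: {prefill_replicas}, decode replicas: {decode_replicas}")
```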
