
Commit 87212c8

Fix link to vLLM Native CPU Offloading documentation (llm-d#928)

* Fix link to vLLM Native CPU Offloading documentation

Signed-off-by: Pete Cheslock <petecheslock@users.noreply.github.com>

* Fix header/indenting

Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>

---------

Signed-off-by: Pete Cheslock <petecheslock@users.noreply.github.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
1 parent 37b1843 commit 87212c8

1 file changed: README.md (2 additions, 2 deletions)

@@ -52,15 +52,15 @@ llm-d accelerates distributed inference by integrating industry-standard open te
 </picture>
 </p>
 
-llm-d adds:
+### llm-d adds:
 
 - [**Model Server Optimizations in vLLM:**](https://github.com/vllm-project/vllm) The llm-d team contributes and maintains high performance distributed serving optimizations in upstream vLLM, including disaggregated serving, KV connector interfaces, support for frontier OSS mixture of experts models, and production-ready observability and resiliency.
 
 - [**Inference Scheduler:**](https://github.com/llm-d/llm-d-inference-scheduler) llm-d uses the Envoy proxy and its extensible balancing policies to make customizable “smart” load-balancing decisions specifically for LLMs without reimplementing a full featured load balancer. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms to make decisions with P/D-, KV-cache-, SLA-, and load-awareness. Advanced users can implement their own scorers to further customize the algorithm while benefiting from IGW features like flow control and latency-aware balancing. The control plane for the load balancer is the Kubernetes API but can also be run standalone.
 
 - [**Disaggregated Serving Sidecar:**](https://github.com/llm-d/llm-d-inference-scheduler/tree/main/cmd/pd-sidecar) llm-d orchestrates prefill and decode phases onto independent instances - the scheduler decides which instances should receive a given request, and the transaction is coordinated via a sidecar alongside decode instances. The sidecar instructs vLLM to provide point to point KV cache transfer over fast interconnects (IB/RoCE RDMA, TPU ICI, and DCN) via NIXL.
 
-- [**vLLM Native CPU Offloading**](https://docs.vllm.ai/en/latest/examples/offline_inference/basic/#cpu-offload) and [**llm-d filesystem backend**:](https://github.com/llm-d/llm-d-kv-cache/tree/main/kv_connectors/llmd_fs_backend) llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache, Mooncake, and KVBM.
+- [**vLLM Native CPU Offloading**](https://docs.vllm.ai/en/latest/examples/basic/offline_inference/#cpu-offload) and [**llm-d filesystem backend**:](https://github.com/llm-d/llm-d-kv-cache/tree/main/kv_connectors/llmd_fs_backend) llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache, Mooncake, and KVBM.
 
 - [**Variant Autoscaling over Hardware, Workload, and Traffic**](https://github.com/llm-d-incubation/ig-wva): A traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency.
 
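The Disaggregated Serving Sidecar bullet in the diff above builds on vLLM's KV-connector plumbing. The sketch below only illustrates the general shape of enabling a KV-transfer connector on a vLLM engine; the connector name, role value, and model are illustrative assumptions, not the sidecar's actual configuration.

```python
# Rough sketch: enabling a KV-transfer connector on a vLLM engine so KV blocks
# can be moved between prefill and decode instances over NIXL. The connector
# name, kv_role value, and model are illustrative assumptions; the llm-d
# sidecar handles the real orchestration between instances.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",  # assumed NIXL-backed connector name
        kv_role="kv_both",             # instance may both produce and consume KV
    ),
)
```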
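For context on the corrected link: it points at the CPU-offload section of vLLM's basic offline-inference example. Below is a minimal, non-authoritative sketch of that usage, assuming vLLM's offline `LLM` API with its `cpu_offload_gb` option; the model name and prompt are placeholders.

```python
# Minimal sketch of the CPU-offload option shown in vLLM's basic
# offline-inference example (the target of the corrected link).
# Assumes the vLLM package is installed; model and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,  # spill up to ~4 GiB of model state to host (CPU) memory
)

outputs = llm.generate(
    ["Explain KV cache offloading in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```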
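The Variant Autoscaling bullet reduces to capacity-versus-demand arithmetic. The toy calculation below is a simplification of that idea rather than the ig-wva algorithm: it assumes measured per-replica prefill and decode throughputs plus a recent traffic mix, and sizes each pool independently.

```python
# Toy capacity math illustrating the autoscaling idea (not the ig-wva algorithm).
# The per-replica throughputs and traffic mix are made-up, assumed numbers.
import math

PREFILL_TOKS_PER_SEC = 20_000   # assumed measured prefill capacity per replica
DECODE_TOKS_PER_SEC = 4_000     # assumed measured decode capacity per replica

traffic_mix = [
    # (requests/sec, avg prompt tokens, avg output tokens)
    (10.0, 2_000, 200),   # long-prompt, short-answer traffic
    (40.0, 300, 400),     # chat-style traffic
]

prefill_load = sum(qps * prompt for qps, prompt, _ in traffic_mix)  # tokens/sec
decode_load = sum(qps * output for qps, _, output in traffic_mix)   # tokens/sec

prefill_replicas = math.ceil(prefill_load / PREFILL_TOKS_PER_SEC)
decode_replicas = math.ceil(decode_load / DECODE_TOKS_PER_SEC)
print(f"prefill replicas: {prefill_replicas}, decode replicas: {decode_replicas}")
```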
