Commit 6a5d90e

Merge remote-tracking branch 'upstream/main' into sync/upstream-ff5f8eab
2 parents 7f7943e + ff37a55 commit 6a5d90e

27 files changed

Lines changed: 633 additions & 63 deletions

.github/ISSUE_TEMPLATE/new-release.md

Lines changed: 12 additions & 12 deletions
@@ -1,8 +1,8 @@
 ---
 name: New Release
 about: Propose a new release
-title: Release v0.x.0
-labels: ''
+title: Release vX.Y.Z
+labels: kind/release
 assignees: ''
 
 ---
@@ -49,7 +49,7 @@ This document defines the process for releasing llm-d-router.
 
 ### Create or Checkout branch
 
-1. If you already have the repo cloned, ensure its up-to-date and your local branch is clean.
+1. If you already have the repo cloned, ensure it's up-to-date and your local branch is clean.
 
 1. Release Branch Handling:
 - For a Release Candidate:
@@ -63,7 +63,7 @@ This document defines the process for releasing llm-d-router.
 A release branch should already exist. In this case, check out the existing branch:
 
 ```shell
-git checkout -b release-${MAJOR}.${MINOR} ${REMOTE}/release-${MAJOR}.${MINOR}
+git checkout release-${MAJOR}.${MINOR} ${REMOTE}/release-${MAJOR}.${MINOR}
 ```
 
 1. Push your release branch to the llm-d-router remote.
@@ -79,13 +79,13 @@ This document defines the process for releasing llm-d-router.
 For a release candidate:
 
 ```shell
-git tag -s -a v${MAJOR}.${MINOR}.${PATCH}-rc.${RC} -m 'llm-d-router v${MAJOR}.${MINOR}.${PATCH}-rc.${RC} Release Candidate'
+git tag -s -a v${MAJOR}.${MINOR}.${PATCH}-rc.${RC} -m "llm-d-router v${MAJOR}.${MINOR}.${PATCH}-rc.${RC} Release Candidate"
 ```
 
 For a major, minor or patch release:
 
 ```shell
-git tag -s -a v${MAJOR}.${MINOR}.${PATCH} -m 'llm-d-router v${MAJOR}.${MINOR}.${PATCH} Release'
+git tag -s -a v${MAJOR}.${MINOR}.${PATCH} -m "llm-d-router v${MAJOR}.${MINOR}.${PATCH} Release"
 ```
 
 1. Push the tag to the llm-d-router repo.
@@ -102,16 +102,17 @@ This document defines the process for releasing llm-d-router.
 git push ${REMOTE} v${MAJOR}.${MINOR}.${PATCH}
 ```
 
-1. Pushing the tag triggers CI action to build and publish the [EPP image] and [sidecar image] to the [ghcr registry].
-1. Test the steps in the tagged quickstart guide after the PR merges. TODO add e2e tests! <!-- link to an e2e tests once we have such one -->
+1. Pushing the tag triggers CI action to build and publish the EPP image (`ghcr.io/llm-d/llm-d-router-endpoint-picker`) and sidecar image (`ghcr.io/llm-d/llm-d-router-disagg-sidecar`) to the [ghcr registry].
+1. Verify the [CI release workflow] completed successfully before proceeding.
+1. Test the steps in the tagged quickstart guide after the PR merges.
 
 ### Create the release!
 
 1. Create a [new release]:
 1. Choose the tag that you created for the release.
-1. Use the tag as the release title, i.e. `v0.1.0` refer to previous release for the content of the release body.
+1. Use the tag as the release title, e.g. `v0.1.0`.
 1. Click "Generate release notes" and preview the release body.
-1. Go to Gateway Inference Extension latest release and make sure to include the highlights in llm-d-router as well.
+1. Ensure the release body includes: highlights, breaking changes (if any), known issues, and upgrade steps.
 1. If this is a release candidate, select the "This is a pre-release" checkbox.
 1. If you find any bugs in this process, create an [issue].
 
@@ -131,7 +132,6 @@ Use the following steps to announce the release.
 
 [repo]: https://github.com/llm-d/llm-d-router
 [ghcr registry]: https://github.com/orgs/llm-d/packages?repo_name=llm-d-router
-[EPP image]: https://github.com/llm-d/llm-d-router/pkgs/container/llm-d-router-endpoint-picker
-[sidecar image]: https://github.com/llm-d/llm-d-router/pkgs/container/llm-d-router-disagg-sidecar
 [new release]: https://github.com/llm-d/llm-d-router/releases/new
 [issue]: https://github.com/llm-d/llm-d-router/issues/new/choose
+[CI release workflow]: https://github.com/llm-d/llm-d-router/actions/workflows/ci-release.yaml
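
Note on the quoting change in the `git tag` commands: a POSIX shell does not expand `${MAJOR}` and friends inside single quotes, so the old form would have recorded the literal placeholders in the tag message. The following Go sketch shows the tag name and message the double-quoted commands are meant to produce. The function names are hypothetical, and treating `rc <= 0` as "final release" is an assumption of this sketch, not something the release document defines.

```go
package main

import "fmt"

// ReleaseTag formats a tag name in the v${MAJOR}.${MINOR}.${PATCH}[-rc.${RC}]
// shape used by the release steps. Treating rc <= 0 as a final release is an
// assumption made for this sketch only.
func ReleaseTag(major, minor, patch, rc int) string {
	tag := fmt.Sprintf("v%d.%d.%d", major, minor, patch)
	if rc > 0 {
		tag += fmt.Sprintf("-rc.%d", rc)
	}
	return tag
}

// TagMessage builds the annotated-tag message passed to `git tag -m`,
// following the two message templates shown in the diff.
func TagMessage(major, minor, patch, rc int) string {
	tag := ReleaseTag(major, minor, patch, rc)
	if rc > 0 {
		return "llm-d-router " + tag + " Release Candidate"
	}
	return "llm-d-router " + tag + " Release"
}
```

For example, `ReleaseTag(0, 1, 0, 1)` yields `v0.1.0-rc.1`, and `TagMessage(0, 1, 0, 0)` yields `llm-d-router v0.1.0 Release`.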

.github/actions/docker-build-and-push/action.yml

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ runs:
 tags: |
 ${{ inputs.registry }}/${{ inputs.image-name }}:${{ inputs.tag }}
 ${{ inputs.push == 'true' && inputs.prerelease != 'true' && format('{0}/{1}:latest', inputs.registry, inputs.image-name) || '' }}
+${{ inputs.commit-sha != '' && format('{0}/{1}:{2}', inputs.registry, inputs.image-name, inputs.commit-sha) || '' }}
 build-args: |
 LDFLAGS=-s -w
 COMMIT_SHA=${{ inputs.commit-sha || 'unknown' }}
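
The resulting tag list can be read as three rules: always tag with `inputs.tag`, add `latest` only for pushed non-prerelease builds, and (new in this commit) add a tag named after the commit SHA when one is supplied. A minimal Go sketch of that logic follows. The `TagInputs` type and `ImageTags` function are hypothetical and not part of the action; the action compares the strings `'true'` and `''`, which the sketch models as booleans and an empty string.

```go
package main

import "fmt"

// TagInputs mirrors the action inputs referenced by the tags expression.
// The type and field names are illustrative only.
type TagInputs struct {
	Registry   string
	ImageName  string
	Tag        string
	Push       bool // stands in for inputs.push == 'true'
	Prerelease bool // stands in for inputs.prerelease == 'true'
	CommitSHA  string
}

// ImageTags returns the image references the tags expression would emit.
func ImageTags(in TagInputs) []string {
	base := fmt.Sprintf("%s/%s", in.Registry, in.ImageName)
	tags := []string{base + ":" + in.Tag}
	// "latest" is only added for pushed, non-prerelease builds.
	if in.Push && !in.Prerelease {
		tags = append(tags, base+":latest")
	}
	// The line added in this commit: tag by commit SHA when one is provided.
	if in.CommitSHA != "" {
		tags = append(tags, base+":"+in.CommitSHA)
	}
	return tags
}
```

With a prerelease build and no commit SHA, only the `inputs.tag` reference is produced; the other two expressions evaluate to empty strings in the action.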

config/charts/routerlib/templates/_latency-predictor.tpl

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ Latency Predictor Sidecar Containers
 image: {{ $.Values.router.latencyPredictor.predictionServers.image.registry }}/{{ $.Values.router.latencyPredictor.predictionServers.image.repository }}:{{ $.Values.router.latencyPredictor.predictionServers.image.tag }}
 imagePullPolicy: {{ $.Values.router.latencyPredictor.predictionServers.image.pullPolicy }}
 command: ["uvicorn"]
-args: ["prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.router.latencyPredictor.predictionServers.startPort $i }}"]
+args: ["llm_d_latency_predictor.prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.router.latencyPredictor.predictionServers.startPort $i }}"]
 ports:
 - containerPort: {{ add $.Values.router.latencyPredictor.predictionServers.startPort $i }}
 name: predict-port-{{ add $i 1 }}

go.mod

Lines changed: 7 additions & 7 deletions
@@ -38,12 +38,12 @@ require (
 google.golang.org/genproto/googleapis/api v0.0.0-20260526163538-3dc84a4a5aaa
 google.golang.org/grpc v1.81.1
 google.golang.org/protobuf v1.36.11
-k8s.io/api v0.35.5
-k8s.io/apiextensions-apiserver v0.35.5
-k8s.io/apimachinery v0.35.5
-k8s.io/client-go v0.35.5
-k8s.io/code-generator v0.35.5
-k8s.io/component-base v0.35.5
+k8s.io/api v0.35.6
+k8s.io/apiextensions-apiserver v0.35.6
+k8s.io/apimachinery v0.35.6
+k8s.io/client-go v0.35.6
+k8s.io/code-generator v0.35.6
+k8s.io/component-base v0.35.6
 k8s.io/utils v0.0.0-20260108192941-914a6e750570
 sigs.k8s.io/controller-runtime v0.23.3
 sigs.k8s.io/controller-tools v0.20.1
@@ -141,7 +141,7 @@ require (
 gopkg.in/inf.v0 v0.9.1 // indirect
 gopkg.in/yaml.v2 v2.4.0 // indirect
 gopkg.in/yaml.v3 v3.0.1 // indirect
-k8s.io/apiserver v0.35.5 // indirect
+k8s.io/apiserver v0.35.6 // indirect
 k8s.io/gengo/v2 v2.0.0-20250922181213-ec3ebc5fd46b // indirect
 k8s.io/klog/v2 v2.140.0 // indirect
 k8s.io/kube-openapi v0.0.0-20260127142750-a19766b6e2d4 // indirect

go.sum

Lines changed: 14 additions & 14 deletions
@@ -334,20 +334,20 @@ gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
 gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
-k8s.io/api v0.35.5 h1:BrFeUDGY/LBtlA1R5RoxhlYRHs76RnQBc6xbm/y7hsQ=
-k8s.io/api v0.35.5/go.mod h1:xWkFhMnoPZdTAQh95Rlw3zZpUUNVlFHcuESUYd06BWM=
-k8s.io/apiextensions-apiserver v0.35.5 h1:HttlJjgsx3ddLsASCqklkKvfBlwUoXma8VLpeMG5YL8=
-k8s.io/apiextensions-apiserver v0.35.5/go.mod h1:4xbAgP/jbt8sVHE3H4DfE1gSPLUoSzXrNqhZz1lTHKc=
-k8s.io/apimachinery v0.35.5 h1:lbjjjUfVeVqFbiOpyhqZHc8DhiYkWOxSNij7lHx2U8Y=
-k8s.io/apimachinery v0.35.5/go.mod h1:NNi1taPOpep0jOj+oRha3mBJPqvi0hGdaV8TCqGQ+cc=
-k8s.io/apiserver v0.35.5 h1:ZtFpSEmxf/VmOdbL3bo7hLxyNRorRegqOLmYSW0mxEo=
-k8s.io/apiserver v0.35.5/go.mod h1:6NNWFTq/UosCwUmqhQDC+3ApzSx5ekeYMIwzSG+49VU=
-k8s.io/client-go v0.35.5 h1:wUrgqVSmFRw75bgSHY7X0G/hZM/QYpV0Hg7SYYOYpFk=
-k8s.io/client-go v0.35.5/go.mod h1:Z0mDcAJsX1Y7RQfuQlJipiRtqf8Mhk2VDu1/JvRqdGo=
-k8s.io/code-generator v0.35.5 h1:g2ZIw7LCjmX2p5WDjtkVYwmvtx+pDF0Pq1dfgCoHkhQ=
-k8s.io/code-generator v0.35.5/go.mod h1:W46pDvFxY2SlphV3MBI/6KDZ2JDMhHXGVgPQXMoYFiM=
-k8s.io/component-base v0.35.5 h1:1y1xxfpFNkNi4RMi6bvPNN4aDr9VhOijtEfrqnhPijs=
-k8s.io/component-base v0.35.5/go.mod h1:n/+aL98XYINubqIu/Okh6mS/kZT2nMeN4IQkQR4VXRg=
+k8s.io/api v0.35.6 h1:phPzP79F3kcONsD2TzmDiITNCV6/1Z5U3CCEcjtsXzI=
+k8s.io/api v0.35.6/go.mod h1:GWKUaIp24fuDFigAgnhr9EJOKDqspnwPjYlpDca5B4U=
+k8s.io/apiextensions-apiserver v0.35.6 h1:fyzp3i+PAbB/jSNau9LF0rMuUTUUyybR02BpYhT1YKI=
+k8s.io/apiextensions-apiserver v0.35.6/go.mod h1:kkCbFS495cT53wOqNwWnQei759bkvgn6OqE0R8b3DEA=
+k8s.io/apimachinery v0.35.6 h1:ASSpfmmsOArKb2Hsu8gGlIcbIcEMVTboI3FfsfYuQ8k=
+k8s.io/apimachinery v0.35.6/go.mod h1:NNi1taPOpep0jOj+oRha3mBJPqvi0hGdaV8TCqGQ+cc=
+k8s.io/apiserver v0.35.6 h1:VWYg2S0wlAmN3URFpVeuLa4PP2RCpTFg1nvlUHOy2C8=
+k8s.io/apiserver v0.35.6/go.mod h1:wajGSrXO9w+lx69jYq4SaE4Xxw5KxxwvVD1zbttYA2E=
+k8s.io/client-go v0.35.6 h1:qZQv9a5B4YlIpXhFBwsI9qPOOJC6Z8lk9lkEWmrmus8=
+k8s.io/client-go v0.35.6/go.mod h1:LOO6N1EhxdQAzYIZ/73cJVyb3gixrMY6ZDJcJ/ANfsY=
+k8s.io/code-generator v0.35.6 h1:QXxmfS8diVF5jeEIdO9MUSyMsD3OnXfypj9zw4wfJic=
+k8s.io/code-generator v0.35.6/go.mod h1:QCFzJL445DiaE6t1wnHpvfctz1EeaNP0Ms3XpsqoqFw=
+k8s.io/component-base v0.35.6 h1:dTkck9uefkIrKn7wRCEYiDWNUvHd8UdwZCcVafmHgL4=
+k8s.io/component-base v0.35.6/go.mod h1:qcNKrspACsqR+vgUJXkWzwtgUGkURcnrus41o92jjpk=
 k8s.io/gengo/v2 v2.0.0-20250922181213-ec3ebc5fd46b h1:gMplByicHV/TJBizHd9aVEsTYoJBnnUAT5MHlTkbjhQ=
 k8s.io/gengo/v2 v2.0.0-20250922181213-ec3ebc5fd46b/go.mod h1:CgujABENc3KuTrcsdpGmrrASjtQsWCT7R99mEV4U/fM=
 k8s.io/klog/v2 v2.140.0 h1:Tf+J3AH7xnUzZyVVXhTgGhEKnFqye14aadWv7bzXdzc=

pkg/epp/framework/interface/requestcontrol/plugins.go

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,8 @@ import (
 
 const (
 PreAdmissionExtensionPoint = "PreAdmission"
+AdmissionExtensionPoint = "Admission"
+DataProducerExtensionPoint = "DataProducer"
 PreRequestExtensionPoint = "PreRequest"
 ResponseReceivedExtensionPoint = "ResponseReceived"
 ResponseStreamingExtensionPoint = "ResponseStreaming"

pkg/epp/framework/interface/requesthandling/plugins.go

Lines changed: 5 additions & 0 deletions
@@ -24,6 +24,11 @@ import (
 fwkplugin "github.com/llm-d/llm-d-router/pkg/epp/framework/interface/plugin"
 )
 
+const (
+RequestParsingExtensionPoint = "RequestParsing"
+ResponseParsingExtensionPoint = "ResponseParsing"
+)
+
 // Parser defines the interface for parsing payload(requests and responses).
 type Parser interface {
 fwkplugin.Plugin

pkg/epp/framework/plugins/scheduling/filter/prefixcacheaffinity/README.md

Lines changed: 2 additions & 2 deletions
@@ -56,7 +56,7 @@ Can be instantiated multiple times with different thresholds (e.g., 0.99 for glo
 
 - Keep only endpoints with prefix cache score >= `affinityThreshold`
 - If no endpoints pass, all are kept (no-op)
-- With probability `explorationProbability` (default 1%), skip the gate entirely for exploration
+- With probability `explorationProbability` (default 0, disabled), skip the gate entirely for exploration
 - TTFT load gate: if best sticky endpoint's TTFT exceeds best non-sticky by more than
 `maxTTFTPenaltyMs`, break stickiness and keep all endpoints (0 = always stick). The
 per-endpoint TTFT is estimated from in-flight tokens as
@@ -72,7 +72,7 @@ Can be instantiated multiple times with different thresholds (e.g., 0.99 for glo
 | Parameter | Type | Required | Default | Description |
 |-----------|------|----------|---------|-------------|
 | `affinityThreshold` | `float64` | No | `0.80` | Prefix cache score threshold for stickiness |
-| `explorationProbability` | `float64` | No | `0.01` | Probability of skipping the gate |
+| `explorationProbability` | `float64` | No | `0` | Probability of skipping the gate |
 | `maxTTFTPenaltyMs` | `float64` | No | `18000` | Max TTFT penalty (ms) before breaking stickiness. 0 = always stick |
 | `ttftSource` | `string` | No | `prefillThroughput` | TTFT source for the load gate: `prefillThroughput` or `latencyPredictor` |
 | `peakPrefillThroughput` | `float64` | No | `15928` | Peak prefill throughput (tokens/sec), used to estimate TTFT when `ttftSource` is `prefillThroughput` |
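
To make the effect of the new default concrete, here is a minimal Go sketch of the threshold and exploration rules listed in the README bullets. It is not the plugin's code: the `Endpoint` type, the `AffinityGate` function, and the injected `randFloat` source are hypothetical, and the TTFT load gate is deliberately left out.

```go
package main

// Endpoint pairs an endpoint name with its prefix cache score.
// The type is a hypothetical stand-in for the plugin's endpoint data.
type Endpoint struct {
	Name  string
	Score float64
}

// AffinityGate applies the threshold and exploration rules from the README.
// randFloat must return a value in [0, 1), for example rand.Float64 from
// math/rand; it is injected here so the behavior is easy to test.
func AffinityGate(endpoints []Endpoint, threshold, explorationProbability float64, randFloat func() float64) []Endpoint {
	// Exploration: with the given probability, skip the gate entirely.
	// With the new default of 0 this branch is never taken.
	if explorationProbability > 0 && randFloat() < explorationProbability {
		return endpoints
	}
	// Keep only endpoints whose prefix cache score reaches the threshold.
	var sticky []Endpoint
	for _, ep := range endpoints {
		if ep.Score >= threshold {
			sticky = append(sticky, ep)
		}
	}
	// If no endpoint passes, the filter is a no-op and all are kept.
	if len(sticky) == 0 {
		return endpoints
	}
	return sticky
}
```

With `explorationProbability` at 0, the result depends only on the scores and the threshold, so the filter becomes deterministic unless exploration is explicitly enabled.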

pkg/epp/framework/plugins/scheduling/filter/prefixcacheaffinity/plugin.go

Lines changed: 2 additions & 2 deletions
@@ -64,7 +64,7 @@ type Config struct {
 AffinityThreshold float64 `json:"affinityThreshold,omitempty"`
 
 // ExplorationProbability is the probability of skipping the gate entirely,
-// keeping all endpoints for exploration. Range: [0, 1]. Default: 0.01.
+// keeping all endpoints for exploration. Range: [0, 1]. Default: 0.
 ExplorationProbability float64 `json:"explorationProbability,omitempty"`
 
 // MaxTTFTPenaltyMs is the max TTFT penalty (ms) before breaking stickiness.
@@ -92,7 +92,7 @@ type Config struct {
 
 var DefaultConfig = Config{
 AffinityThreshold: 0.80,
-ExplorationProbability: 0.01,
+ExplorationProbability: 0,
 MaxTTFTPenaltyMs: 18000,
 TTFTSource: TTFTSourcePrefillThroughput,
 

pkg/epp/framework/plugins/scheduling/filter/sessionaffinity/README.md

Lines changed: 25 additions & 1 deletion
@@ -7,17 +7,41 @@ Pins subsequent requests in a session to the same pod the first request was sent
 The session is carried in a request header whose value is the base64-encoded `namespace/name` of the previously selected pod. As a [`ResponseHeaderProcessor`](../../../../interface/requestcontrol/plugins.go), the filter writes that same header on the response so the client can echo it back on the next request.
 
 ## Parameters
-
+
 | Name | Type | Default | Description |
 |---|---|---|---|
 | `headerName` | string | `x-session-token` | Request and response header carrying the session token. When set, only this header is read; the default is ignored. |
+| `profileName` | string | | The name of the profile this instance is associated with. When set (e.g. `prefill`), the plugin looks up the target pod from the results of that profile in `SchedulingResult` during the response received phase. When empty, it defaults to the primary (decode) pod. |
+
+### Default Configuration (without PD disaggregation)
 
 ```yaml
 - type: session-affinity-filter
 parameters:
 headerName: x-session-token
 ```
 
+### PD Disaggregation Configuration
+
+To support session affinity with PD disaggregation, configure two separate instances of the filter: one for decode and one for prefill.
+
+```yaml
+# Instance for the decode profile (pins decode requests)
+- name: session-affinity-decode
+type: session-affinity-filter
+parameters:
+headerName: x-session-token
+
+# Instance for the prefill profile (pins prefill requests)
+- name: session-affinity-prefill
+type: session-affinity-filter
+parameters:
+headerName: x-session-token-prefill
+profileName: prefill
+```
+
+The decode instance uses the default behavior (writing the decode pod to `x-session-token`). The prefill instance uses `profileName: prefill` to look up the prefill pod from the scheduling results and write it to `x-session-token-prefill`. This ensures that subsequent requests in the same session target both the same prefill pod and the same decode pod.
+
 ## Relationship to the session affinity scorer
 
 The [session affinity scorer](../../scorer/sessionaffinity/README.md) (`session-affinity-scorer`) provides the same affinity behavior as a soft preference and writes the same response header.
