feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present by kyanokashi · Pull Request #509 · llm-d/llm-d-inference-scheduler

kyanokashi · 2025-12-09T23:23:57Z

This PR updates the lmcache connector on the sidecar to decode first when cache_hit_threshold is present in the completion request.

The new flow goes as follows:

If cache_hit_threshold is present, decode first.
If the decode succeeds, write the success response and return.
If the finish reason is cache_threshold, the decode node did not meet the cache hit threshold.
In that case, continue with the original flow: prefill, then decode.
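
For readers skimming the description, the flow above can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `runFlow`, `decodeFunc`, and `prefillFunc` are hypothetical names, and the sketch assumes a prefill worker is available for the request.

```go
package main

import "fmt"

// finishReasonCacheThreshold is the finish reason that, per the description,
// signals the decode node did not meet the cache hit threshold.
const finishReasonCacheThreshold = "cache_threshold"

// decodeFunc and prefillFunc are hypothetical stand-ins for the sidecar's
// calls to the decode and prefill workers.
type decodeFunc func() (finishReason string, err error)
type prefillFunc func() error

// runFlow returns the ordered list of steps it executed.
func runFlow(hasThreshold bool, decode decodeFunc, prefill prefillFunc) ([]string, error) {
	var steps []string
	if hasThreshold {
		// Decode first; stop here unless the decoder reports cache_threshold.
		steps = append(steps, "decode")
		reason, err := decode()
		if err != nil {
			return steps, err
		}
		if reason != finishReasonCacheThreshold {
			return steps, nil
		}
	}
	// Original flow: prefill, then decode.
	steps = append(steps, "prefill")
	if err := prefill(); err != nil {
		return steps, err
	}
	steps = append(steps, "decode")
	_, err := decode()
	return steps, err
}

func main() {
	calls := 0
	decode := func() (string, error) {
		calls++
		if calls == 1 {
			return finishReasonCacheThreshold, nil // first attempt misses the threshold
		}
		return "stop", nil
	}
	steps, err := runFlow(true, decode, func() error { return nil })
	fmt.Println(steps, err) // [decode prefill decode] <nil>
}
```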

Test Report

Environment:

Kind Cluster: llm-d-inference-scheduler-dev
vLLM Simulator Image: kyanokashi/llm-d-inference-sim:dev
Sidecar Connector: lmcache
Log Level: --zap-log-level=4

Test Scenarios

Scenario 1: Non-Streaming WITHOUT `cache_hit_threshold`

Description: Basic request without lmcache protocol activation.

Result: ✅ PASS - Request proxied directly to decoder.

Scenario 2: Non-Streaming WITH `cache_hit_threshold`

Description: Request with cache_hit_threshold field, no cache_threshold triggered.

Result: ✅ PASS - LMCache protocol runs, response returned successfully.

Scenario 3: Non-Streaming WITH `cache_hit_threshold` AND `X-Cache-Threshold: true`

Description: Request configured to trigger cache_threshold finish_reason.

Result: ✅ PASS - LMCache protocol detects cache_threshold, triggers prefill→decode.

Scenario 4: Streaming WITHOUT `cache_hit_threshold`

Description: Streaming request without lmcache protocol activation.

Result: ✅ PASS - SSE chunks streamed directly from decoder.

Scenario 5: Streaming WITH `cache_hit_threshold`

Description: Streaming request with cache_hit_threshold, no header.

Result: ✅ PASS - LMCache protocol parses SSE chunks, streams through when no cache_threshold found.

Scenario 6: Streaming WITH `cache_hit_threshold` AND `X-Cache-Threshold: true`

Description: Streaming request with full lmcache configuration.

Result: ✅ PASS - LMCache protocol detects cache_threshold in SSE, triggers prefill→decode.
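
To make the streaming scenarios concrete, here is a rough sketch of how a cache_threshold finish reason could be spotted in SSE chunks. It assumes OpenAI-style `data: {...}` lines terminated by `data: [DONE]`; `hasCacheThreshold` and the `chunk` type are hypothetical and not taken from the PR.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk models only the field this sketch needs from a streamed completion.
type chunk struct {
	Choices []struct {
		FinishReason *string `json:"finish_reason"`
	} `json:"choices"`
}

// hasCacheThreshold reports whether any SSE data line carries a first choice
// whose finish_reason is "cache_threshold".
// Note: bufio.Scanner caps line length at 64 KiB by default.
func hasCacheThreshold(stream string) bool {
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		payload, ok := strings.CutPrefix(sc.Text(), "data:")
		if !ok {
			continue // blank separators, comments, other SSE fields
		}
		payload = strings.TrimSpace(payload)
		if payload == "" || payload == "[DONE]" {
			continue
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			continue // this sketch skips malformed chunks
		}
		if len(c.Choices) > 0 && c.Choices[0].FinishReason != nil &&
			*c.Choices[0].FinishReason == "cache_threshold" {
			return true
		}
	}
	return false
}

func main() {
	hit := "data: {\"choices\":[{\"finish_reason\":\"cache_threshold\"}]}\n\ndata: [DONE]\n\n"
	miss := "data: {\"choices\":[{\"finish_reason\":null}]}\n\ndata: [DONE]\n\n"
	fmt.Println(hasCacheThreshold(hit), hasCacheThreshold(miss)) // true false
}
```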

Failure Injection Test

To enable failure injection on the simulator:

--failure-injection-rate=100
--failure-types=rate_limit

Scenario 7: Streaming Decode Error (tryDecodeStreaming)

Request:

curl -s http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "food-review",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 5,
    "cache_hit_threshold": 0.5,
    "stream": true
  }'

Path: tryDecodeStreaming → error

Result: ✅ PASS

HTTP Status: 429
Error returned as JSON (not SSE)

Response:

{
  "error": {
    "code": 429,
    "message": "Rate limit reached for food-review...",
    "type": "RateLimitError"
  }
}
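
One way to get the behaviour observed in Scenario 7 (a JSON error instead of an SSE stream) is to look at the upstream status before committing to a streaming content type. The sketch below only illustrates that idea under that assumption; `relay` and `relayResult` are hypothetical helpers, not functions from the sidecar.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// relay forwards an upstream status and body. Error statuses are sent as
// plain JSON; only successful responses are labelled as an event stream.
func relay(w http.ResponseWriter, status int, body string) {
	if status >= http.StatusBadRequest {
		w.Header().Set("Content-Type", "application/json")
	} else {
		w.Header().Set("Content-Type", "text/event-stream")
	}
	w.WriteHeader(status)
	_, _ = w.Write([]byte(body))
}

// relayResult runs relay against an in-memory recorder and returns what a
// client would observe.
func relayResult(status int, body string) (int, string) {
	rec := httptest.NewRecorder()
	relay(rec, status, body)
	return rec.Code, rec.Header().Get("Content-Type")
}

func main() {
	code, ctype := relayResult(http.StatusTooManyRequests, `{"error":{"code":429}}`)
	fmt.Println(code, ctype) // 429 application/json
}
```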

Summary

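| Scenario | Streaming | cache_hit_threshold | X-Cache-Threshold | Result |
| --- | --- | --- | --- | --- |
| 1 | No | No | No | ✅ PASS |
| 2 | No | Yes | No | ✅ PASS |
| 3 | No | Yes | Yes | ✅ PASS |
| 4 | Yes | No | No | ✅ PASS |
| 5 | Yes | Yes | No | ✅ PASS |
| 6 | Yes | Yes | Yes | ✅ PASS |
| 7 | Yes | Yes | No (error) | ✅ PASS |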

All scenarios passed successfully.

github-actions · 2025-12-09T23:24:07Z

🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits,
please see GitHub Documentation.

- if cache_hit_threshold field is present in completion request, then we perform a decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com>


pkg/sidecar/proxy/connector_lmcache.go


kfirwolfson

Nice implementation, @kyanokashi. Please see the inline comments; there are some things I think should be modified for it to work properly.

pkg/sidecar/proxy/connector_lmcache.go

kfirwolfson · 2025-12-14T14:09:33Z

pkg/sidecar/proxy/connector_lmcache.go

 	}

-	// Create prefiller request. Set max_tokens to 1.
+	if s.forwardDataParallel && s.dataParallelHandler(w, r) {


This whole "if" code block and its contents should be removed. We don't want to start with "prefill". We want to start with "tryDecode" flow.

This is to account for the temporary workaround @shmuelk mentioned he put in place for something related to Istio. In that case we would only prefill, so we don't want to call tryDecode.

I believe that if DP is in use, it's only for the Decode phase, and prefill works normally. @shmuelk please correct me if I am mistaken. Assuming this code will be removed soon, maybe we can avoid having a special handling for s.forwardDataParallel == true

This code is for the workaround we have for Istio 1.28.0.

Istio 1.28.1 has a fix for this issue. We will be removing this code in the future.

Should it be when sending to prefill or decode, @shmuelk?

So the Istio workaround specifically intends to avoid decoding and only prefill for some scenarios.

I guess the question is whether the condition caused by the Istio workaround and cache_hit_threshold being present in the request could occur together. Technically it shouldn't, because that field is only relevant for decoding. Correct me if I'm wrong, @kfirwolfson.

I am not sure anyone will try running the connector_lmcache.go code with s.forwardDataParallel==True, let alone with cache_hit_threshold>0, before the s.forwardDataParallel code is removed, so I don't think it matters much.
But for now, let's just remove this whole "if" statement. It's inaccurate when s.forwardDataParallel==True and irrelevant when s.forwardDataParallel==False.

pkg/sidecar/proxy/connector_lmcache.go

kfirwolfson · 2025-12-14T16:22:10Z

I suggest focusing on the d-first use case in this PR, and handling the Decode preemption problem (explained in detail in vLLM issue 24256) in a separate PR.

…w decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com>

- decrease verbosity for common log - add cache_hit_threshold attribute Signed-off-by: kyano <kyanokashi2@gmail.com>

…marshal decode response Signed-off-by: kyano <kyanokashi2@gmail.com>


…regardless of cache condition Signed-off-by: kyano <kyanokashi2@gmail.com>
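
The commit messages in this PR mention setting cache_hit_threshold to 0 on the prefill request, and the diff comment further down mentions setting max_tokens to 1 for the prefiller. A minimal sketch of such a body rewrite, assuming a JSON completion request, could look like this; `buildPrefillBody` and `fieldOf` are hypothetical helpers, not the connector's functions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// buildPrefillBody copies a completion request and overrides two fields:
// max_tokens becomes 1 and cache_hit_threshold becomes 0. All other fields
// are carried over verbatim as raw JSON.
func buildPrefillBody(original []byte) ([]byte, error) {
	var req map[string]json.RawMessage
	if err := json.Unmarshal(original, &req); err != nil {
		return nil, err
	}
	if req == nil {
		return nil, errors.New("request body is not a JSON object")
	}
	req["max_tokens"] = json.RawMessage("1")
	req["cache_hit_threshold"] = json.RawMessage("0")
	return json.Marshal(req)
}

// fieldOf returns the raw JSON of a top-level field, or "" if it is absent.
func fieldOf(body []byte, key string) string {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(body, &m); err != nil {
		return ""
	}
	return string(m[key])
}

func main() {
	in := []byte(`{"model":"food-review","max_tokens":5,"cache_hit_threshold":0.5}`)
	out, err := buildPrefillBody(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fieldOf(out, "max_tokens"), fieldOf(out, "cache_hit_threshold"))
}
```

Keeping the untouched fields as `json.RawMessage` avoids re-encoding numbers or nested objects the sidecar does not need to interpret.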

pkg/sidecar/proxy/connector_lmcache.go

kfirwolfson · 2025-12-15T14:16:33Z

pkg/sidecar/proxy/connector_lmcache.go

Not sure how this would work with Streaming (vLLM online inference with partial responses). Would it?

Hmmm this is tricky.

Previously we were passing a pass-through writer to the decode proxy, which was responsible for writing responses back to the client.

Now, because we need to parse the response to determine the finish reason, we use a buffered writer so we can read the first choice.

Let me explore some options. I'm thinking of either updating bufferedResponseWriter to support streaming or implementing a new writer type that handles the cache_threshold case specifically.
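
For context, the buffering idea for the non-streaming case can be sketched as below. This is not the PR's bufferedResponseWriter; `bufferingWriter` and `firstFinishReason` are hypothetical names used only to show why buffering makes the finish reason readable before anything reaches the client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// bufferingWriter captures a handler's response instead of sending it on.
type bufferingWriter struct {
	header http.Header
	status int
	body   bytes.Buffer
}

// Compile-time check that the type satisfies http.ResponseWriter.
var _ http.ResponseWriter = (*bufferingWriter)(nil)

func newBufferingWriter() *bufferingWriter {
	return &bufferingWriter{header: http.Header{}, status: http.StatusOK}
}

func (b *bufferingWriter) Header() http.Header         { return b.header }
func (b *bufferingWriter) WriteHeader(code int)        { b.status = code }
func (b *bufferingWriter) Write(p []byte) (int, error) { return b.body.Write(p) }

// firstFinishReason parses the buffered body and returns the finish_reason
// of the first choice.
func (b *bufferingWriter) firstFinishReason() (string, error) {
	var resp struct {
		Choices []struct {
			FinishReason string `json:"finish_reason"`
		} `json:"choices"`
	}
	if err := json.Unmarshal(b.body.Bytes(), &resp); err != nil {
		return "", err
	}
	if len(resp.Choices) == 0 {
		return "", errors.New("no choices in response")
	}
	return resp.Choices[0].FinishReason, nil
}

func main() {
	w := newBufferingWriter()
	_, _ = w.Write([]byte(`{"choices":[{"finish_reason":"cache_threshold"}]}`))
	reason, err := w.firstFinishReason()
	fmt.Println(reason, err) // cache_threshold <nil>
}
```

A writer like this holds the whole body until it is parsed, which is exactly why it does not work for streaming as-is.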

Here's what I ended up doing: 88739c6

It's a bit complex, but I couldn't think of a better way to do it.

I still need to test this.

@kfirwolfson


kfirwolfson · 2026-01-18T14:25:48Z

pkg/sidecar/proxy/proxy.go


-	// ConnectorLMCache enables (now deprecated) P/D LMCache protocol
-	ConnectorLMCache = "lmcache"
+	// ConnectorSharedStorage enables (now deprecated) P/D Shared Storage protocol


I suggest removing the "(now deprecated)" text. We're working to enable it.

I'll remove it in the follow-up refactor PR. That PR has massive refactoring and I prefer not to change the base if I can avoid it.

kfirwolfson · 2026-01-18T14:30:59Z

Overall, both logic and code look good to me, @elevran.

kyanokashi · 2026-01-31T05:31:40Z

@kfirwolfson @elevran @vMaroon what's left before we can merge this?

kfirwolfson · 2026-01-31T05:33:09Z

@kfirwolfson @elevran @vMaroon what's left before we can merge this?

Looks great to me.


kyanokashi · 2026-02-01T02:49:29Z

I went ahead and added coverage in the e2e tests: d71b10b

elevran · 2026-02-05T17:32:27Z

@kyanokashi please fix the lint errors and ping me for approval. Other than that, I think this is ready to go in.
@kfirwolfson you may need to dismiss the Request Changes review using the GitHub UI before this can merge.


kyanokashi · 2026-02-05T18:56:51Z

@elevran I think you were referring to the test errors? Anyways they are fixed now

kfirwolfson

@elevran done. Approved from my pov.

elevran · 2026-02-06T10:28:24Z

@elevran I think you were referring to the test errors? Anyways they are fixed now

No, I was referring to the lint errors in the e2e test code introduced by this PR. Check the lint-and-test action:

Error: test/e2e/e2e_test.go:620:23: Error return value of `resp.Body.Close` is not checked (errcheck)
  	defer resp.Body.Close()
  	                     ^
  Error: test/e2e/e2e_test.go:647:23: Error return value of `resp.Body.Close` is not checked (errcheck)
  	defer resp.Body.Close()
  	                     ^
  Error: test/e2e/e2e_test.go:681:23: Error return value of `resp.Body.Close` is not checked (errcheck)
  	defer resp.Body.Close()
  	                     ^
  Error: test/e2e/e2e_test.go:715:23: Error return value of `resp.Body.Close` is not checked (errcheck)
  	defer resp.Body.Close()
  	                     ^
  Error: test/e2e/e2e_test.go:771:12: string-format: fmt.Sprintf can be replaced with string concatenation (perfsprint)
  	ginkgo.By(fmt.Sprintf("Getting request count from prefill pod: %s", prefillPodName))
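
For anyone hitting the same two linters, here is one common way to satisfy them. It is only a sketch and not necessarily the fix that landed in the e2e tests; `fetch`, `describe`, and `roundTrip` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch reads a response body. The deferred closure checks the error from
// resp.Body.Close, which is what errcheck asks for.
func fetch(url string) (body string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer func() {
		if cerr := resp.Body.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// describe builds the step message with plain concatenation, which is what
// perfsprint suggests in place of fmt.Sprintf with a single %s.
func describe(prefillPodName string) string {
	return "Getting request count from prefill pod: " + prefillPodName
}

// roundTrip serves payload from an in-memory test server and fetches it.
func roundTrip(payload string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		_, _ = w.Write([]byte(payload))
	}))
	defer srv.Close()
	return fetch(srv.URL)
}

func main() {
	body, err := roundTrip("ok")
	fmt.Println(body, err)
	fmt.Println(describe("prefill-0"))
}
```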


kyanokashi · 2026-02-06T15:01:38Z

Ok, fixed now. There's an issue with the CI runs, where after they fail and I push a new commit, they will show as pending and I don't have access to the previous failed runs.

I tried running make lint locally which didn't show any errors because the linter was configured to not run on new code for some reason. I went ahead and made that configurable as well.

@elevran

elevran · 2026-02-06T17:12:01Z

/lgtm
/approve

* feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present (llm-d#509)
  * feat: implement decode first flow on lmcache connector - if cache_hit_threshold field is present in completion request, then we perform a decode first flow
  * fix: error handling
  * chore: add back todo comment
  * refactor: reduce code complexity and duplication
  * refactor: improve header copying
  * chore: add comment explaning the cache_hit_threshold field and the new decode first flow
  * refactor: enhance logging for cache hit threshold in decode flow - decrease verbosity for common log - add cache_hit_threshold attribute
  * refactor: improve error handling and observability when failing to unmarshal decode response
  * chore: add deleted informational comments
  * typo
  * refactor: make error logs more descriptive of the failure reason
  * feat: add cache hit threshold to prefill request so prefill executes regardless of cache condition
  * fix: typo
  * refactor: assign 0 cache_hit_threshold before final decode attempt
  * chore: update comment according to feedback
  * chore: remove istio workaround
  * fix: set cache hit threshold to 0 in prefill request for consistent execution
  * refactor: update the log
  * feat: support online decoding
  * fix: preserve request body in lmcache connector
  * fix: support sse format for streamed decode
  * chore: add and improve log descriptions
  * fix: typo
  * nit: undo capitalization
  * fix: typos
  * chore: improve error log observability
  * refactor: encapsulate http error checking in function and reuse
  * refactor: encapsulate and reuse code better
  * fix: lint error
  * refactor: improve code encapsulation and reduce duplication
  * refactor: rename and simplify SSE event signaling logic
  * refactor: rename lmcache to shared storage protocol
  * fix: remove unused function
  * test: e2e tests
  * chore: claude gitignore
  * fix: sim deployment
  * feat: make linter running on new code configurable
  * fix: lint errors

Updates `github.com/onsi/ginkgo/v2` from 2.27.3 to 2.27.4 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.3...v2.27.4) Updates `github.com/onsi/gomega` from 1.38.3 to 1.39.0 - [Release notes](https://github.com/onsi/gomega/releases) - [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md) - [Commits](onsi/gomega@v1.38.3...v1.39.0) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.27.4 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: github.com/onsi/gomega dependency-version: 1.39.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * deps(actions): bump crate-ci/typos from 1.41.0 to 1.42.0 (llm-d#557) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.41.0 to 1.42.0. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.41.0...v1.42.0) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.42.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * deps(actions): bump actions/checkout from 4 to 6 (llm-d#556) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. 
- [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * update auto-assign logic (llm-d#560) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * remove newline in unsigned commit message (llm-d#561) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * bump gie to v1.3.0 rc2 (llm-d#562) * update OWNERS (llm-d#559) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * refactor: Makefile, update docs (llm-d#463) * refactor: Makefile, update docs - split Makefile 1. tools: include install tools, check tools, download dependency(gcc etc) and tokenizer. these will be download into "bin" folder than global path 2. cluster: include k8s and ocp 3. 
kind - rename "openshift-base" to "kubernetes-base" to be clear for purpose - uplift Go lint version to 2.1.6 to align with the same one set in Github Action - rename make targets for better visibility, deprcating old ones - add more print in "make env" Signed-off-by: Wen Zhou <wenzhou@redhat.com> * update: code review - move image tags from Makefile.tools.mk back to Makefile - update docuement to reflact how image and tag are created - do not export image tag env variables IMG_TAG - fix patch-deployments.yaml after EPP_TAG is not used but should only use EPP_IMAGE - fix kubernetes-dev-env.sh for EPP_IMAGE - remove flag on golangci_lint fmt Signed-off-by: Wen Zhou <wenzhou@redhat.com> * code review: - revert back to 1.3.0 - remove comments - set default as default namespace Signed-off-by: Wen Zhou <wenzhou@redhat.com> * Update Makefile Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Wen Zhou <wenzhou@redhat.com> * docs: fix broken link in the docs Signed-off-by: Wen Zhou <wenzhou@redhat.com> --------- Signed-off-by: Wen Zhou <wenzhou@redhat.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> * feat: add metrics validation in e2e test (llm-d#529) Signed-off-by: CYJiang <googs1025@gmail.com> * feat: make no-hit-lru P/D-aware (llm-d#522) * feat: make no-hit-lru P/D-aware Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> * hardcode prefill profile Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> * remove spammy log Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> * apply suggestions Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> --------- Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> * Update disaggregated Prefill/Decode inference serving documentation (llm-d#571) * update pd docs Signed-off-by: Maya Barnea <mayab@il.ibm.com> * typos Signed-off-by: Maya Barnea <mayab@il.ibm.com> * typo Signed-off-by: Maya Barnea <mayab@il.ibm.com> --------- Signed-off-by: Maya 
Barnea <mayab@il.ibm.com> * deps(actions): bump crate-ci/typos from 1.42.0 to 1.42.1 (llm-d#572) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.42.0 to 1.42.1. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.42.0...v1.42.1) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.42.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * deps(go): bump github.com/onsi/ginkgo/v2 in the go-dependencies group (llm-d#573) Bumps the go-dependencies group with 1 update: [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo). Updates `github.com/onsi/ginkgo/v2` from 2.27.4 to 2.27.5 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.4...v2.27.5) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.27.5 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix reviewers auto assign minor bug (llm-d#575) * fix(scorer): make active request pd aware (llm-d#569) * fix: decrement all pods on request complete instead of only final pod Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * fix: append all pod endpoints from profile results Signed-off-by: kyanokashi <kyanokashi2@gmail.com> --------- Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * test(e2e): cleanup kind cluster (llm-d#563) - if e2e-tests cluster exist, it fails to run "make test-e2e" - main cleanup should be done in AfterSuite() call - in certain case(kill/terminate) cluster might remain locally this PR is to add trap to preperly clean i up Signed-off-by: Wen Zhou <wenzhou@redhat.com> * refactor: add early validation in DP profile handler (llm-d#554) - validate number of schedulingProfiles in EPP to be 1 otherwise return empty map to reduce computation on filter and scores. - add unit test Signed-off-by: Wen Zhou <wenzhou@redhat.com> * deps(go): bump the kubernetes group with 2 updates (llm-d#574) Bumps the kubernetes group with 2 updates: [sigs.k8s.io/controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) and [sigs.k8s.io/gateway-api-inference-extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension). 
Updates `sigs.k8s.io/controller-runtime` from 0.22.4 to 0.22.5 - [Release notes](https://github.com/kubernetes-sigs/controller-runtime/releases) - [Changelog](https://github.com/kubernetes-sigs/controller-runtime/blob/main/RELEASE.md) - [Commits](kubernetes-sigs/controller-runtime@v0.22.4...v0.22.5) Updates `sigs.k8s.io/gateway-api-inference-extension` from 1.3.0-rc.2 to 1.3.0-rc.3 - [Release notes](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases) - [Changelog](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/RELEASE.md) - [Commits](kubernetes-sigs/gateway-api-inference-extension@v1.3.0-rc.2...v1.3.0-rc.3) --- updated-dependencies: - dependency-name: sigs.k8s.io/controller-runtime dependency-version: 0.22.5 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: kubernetes - dependency-name: sigs.k8s.io/gateway-api-inference-extension dependency-version: 1.3.0-rc.3 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: kubernetes ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * refactor: kv cache manager repo (llm-d#570) * refactor: kv cache manager repo name Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * go mod tidy Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * fetch kv cache upstream instead of my fork Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * revert dockerfile to fetch kv cache manager from upstream instead of go mod replace Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * update chat preprocessing structs Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * update kv cache manager version Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * refactor kvblock.Key to kvblock.BlockHash Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * add context Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * add parent block key Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * refactor encode Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * validate model name Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * run setup.sh Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * clone vllm into build Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * edit Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * edit lint Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * delete fetch-python-wrapper.sh Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * edit git workflow Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * edit Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * refactor TokenProcessorConfig in config Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * fix kv cache repo name in docker file Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * fix e2e tests Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * add ignore Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> * update architecture docs Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> --------- Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Signed-off-by: 
HyunKyun Moon <mhg5303@gmail.com> Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Co-authored-by: HyunKyun Moon <mhg5303@gmail.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com> * bumping IGW version to the full released version (llm-d#583) Signed-off-by: Kellen Swain <kfswain@google.com> * Enable prefix-cache awareness in active-active multi-replica scheduler deployments (llm-d#578) * - active-active-ha support Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * Update docs/architecture.md Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maroon Ayoub <Maroonay@gmail.com> * lint Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> --------- Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: Maroon Ayoub <Maroonay@gmail.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> * Switch to pre-built vLLM wheels for CPU builds (llm-d#582) * try use official vllm wheels in dockerfile.epp Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * wip Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * use wheels in makefile Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * wip Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * write permissions to setup.sh Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * update kv cache manager commit Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * try instal py deps wo sudo Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * CR changes Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> --------- Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * update llm-d-kv-cache import to v0.5.0-RC1 (llm-d#584) * update kvc version import Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * add go.mod to testable changes Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> --------- Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * Use 1.3.0 CRDs (llm-d#586) Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * free disk space on ci-release (llm-d#587) Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * feat: use Tinyllama as 
the "model" for kind test and switch to use precise-prefix-cache-score in config (llm-d#581) * feat: use Tinyllama as the "model" for kind test - in order to test precies-prefix-cache-score we cannot use fool-reviewer since it need call kv-cache-manager to get tokenizer by getting a real model from HF - the change is to switch the "default model" to TinyLlama - also to make tokenizer folder writable need change permission to the USER in Dockerfile - rename dp-epp-config.yaml sim-dp-epp-config.yaml as it is used for local test Signed-off-by: Wen Zhou <wenzhou@redhat.com> * update: revert back some config to keep using prefix-cache-scorer - revert file renaming Signed-off-by: Wen Zhou <wenzhou@redhat.com> --------- Signed-off-by: Wen Zhou <wenzhou@redhat.com> * Update linter configuration (llm-d#588) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * fix: config should use new precise-prefix-cache-scorer (llm-d#576) - we have rename prefix-cache-scorer to precise-prefix-cache-scorer in 0.3.0, configs need migrate from the old one to the new one with spec. - rename plugin name - remove parameters.autoTune and parameters.mode: cache_tracking and lruCapacityPerServer - move hashBlockSize, maxPrefixBlocksToMatch under indexrConfig - for config using food-review keep old prefix-cache-scorer - keep pd-epp-config and sim-pd-epp-config with prefix-cache-scorer as KV and PD need both be enabled which is not done yet Signed-off-by: Wen Zhou <wenzhou@redhat.com> * deps(actions): bump crate-ci/typos from 1.42.1 to 1.42.2 (llm-d#589) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.42.1 to 1.42.2. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.42.1...v1.42.2) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.42.2 dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Updated to more recent GIE (llm-d#592) * Updated to more recent GIE Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Updated to latest GIE and chnages due to review comments Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Added a true mock SchedulerProfile Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Exploited mock SchedulerProfile Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> --------- Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * pull kvc v0.5.0 libs (llm-d#595) Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * deps(actions): bump crate-ci/typos from 1.42.2 to 1.43.0 (llm-d#596) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.42.2 to 1.43.0. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.42.2...v1.43.0) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.43.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * address nil,nil return linter error in test mock (llm-d#598) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * deps(go): bump the go-dependencies group with 2 updates (llm-d#597) Bumps the go-dependencies group with 2 updates: [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) and [github.com/onsi/gomega](https://github.com/onsi/gomega). 
Updates `github.com/onsi/ginkgo/v2` from 2.27.5 to 2.28.1 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.5...v2.28.1) Updates `github.com/onsi/gomega` from 1.39.0 to 1.39.1 - [Release notes](https://github.com/onsi/gomega/releases) - [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md) - [Commits](onsi/gomega@v1.39.0...v1.39.1) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.28.1 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/onsi/gomega dependency-version: 1.39.1 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Models extractor (llm-d#553) * Models extractor Signed-off-by: irar2 <irar@il.ibm.com> * Update register.go Signed-off-by: Ira Rosen <irar@il.ibm.com> * Updated for the newer GIE Signed-off-by: irar2 <irar@il.ibm.com> * Review comments Signed-off-by: irar2 <irar@il.ibm.com> * Check the scheme Signed-off-by: irar2 <irar@il.ibm.com> --------- Signed-off-by: irar2 <irar@il.ibm.com> Signed-off-by: Ira Rosen <irar@il.ibm.com> * feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present (llm-d#509) * feat: implement decode first flow on lmcache connector - if cache_hit_threshold field is present in completion request, then we perform a decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: error handling Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add back todo comment Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: reduce code complexity and duplication Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve 
header copying Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add comment explaning the cache_hit_threshold field and the new decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: enhance logging for cache hit threshold in decode flow - decrease verbosity for common log - add cache_hit_threshold attribute Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve error handling and observability when failing to unmarshal decode response Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add deleted informational comments Signed-off-by: kyano <kyanokashi2@gmail.com> * typo Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: make error logs more descriptive of the failure reason Signed-off-by: kyano <kyanokashi2@gmail.com> * feat: add cache hit threshold to prefill request so prefill executes regardless of cache condition Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typo Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: assign 0 cache_hit_threshold before final decode attempt Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: update comment according to feedback Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: remove istio workaround Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: set cache hit threshold to 0 in prefill request for consistent execution Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: update the log Signed-off-by: kyano <kyanokashi2@gmail.com> * feat: support online decoding Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: preserve request body in lmcache connector Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: support sse format for streamed decode Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add and improve log descriptions Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typo Signed-off-by: kyano <kyanokashi2@gmail.com> * nit: undo capitalization Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typos Signed-off-by: kyano <kyanokashi2@gmail.com> * 
chore: improve error log observability Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: encapsulate http error checking in function and reuse Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: encapsulate and reuse code better Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: lint error Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve code encapsulation and reduce duplication Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: rename and simplify SSE event signaling logic Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: rename lmcache to shared storage protocol Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: remove unused function Signed-off-by: kyano <kyanokashi2@gmail.com> * test: e2e tests Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * chore: claude gitignore Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * fix: sim deployment Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * feat: make linter running on new code configurable Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * fix: lint errors Signed-off-by: kyanokashi <kyanokashi2@gmail.com> --------- Signed-off-by: kyano <kyanokashi2@gmail.com> Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * Extend support for different ways to decide if disaggregated PD is required (llm-d#531) * Initial step of a configurable pd decider which is responsible for decision whether disaggregation is required, use data added in prefix scorer plugin in PrepareRequestData Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update version of GIE + fix lint Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update yaml and the test according prefix plugin configuration change (blockSize replaced by blockSizeTokens) Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update docs/architecture.md Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * code review 
Signed-off-by: Maya Barnea <mayab@il.ibm.com> * code review Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update version of GIE, update prefix_disagr_decider accordingly Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix typo Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix PD for short inputs Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update docs/architecture.md Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/always_disaggr_decider.go Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/always_disaggr_decider.go Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/prefix_disagg_decider.go Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * updates according the PR comments Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix test Signed-off-by: Maya Barnea <mayab@il.ibm.com> * create pd decider plugin type with 2 implementations (for prefix based and test always), update deploy configuration according the new structure Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix e2e tests Signed-off-by: Maya Barnea <mayab@il.ibm.com> * changes according the pr comments Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix e2e test Signed-off-by: Maya Barnea <mayab@il.ibm.com> * add explanation about pd deciders to disagg_pd doc Signed-off-by: Maya Barnea <mayab@il.ibm.com> * rename always_disaggr_decider to always_disagg_decider Signed-off-by: Maya Barnea <mayab@il.ibm.com> --------- Signed-off-by: Maya Barnea <mayab@il.ibm.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> * chore: fix wrong port for NIXL (llm-d#593) - start with vLLM 0.11.1, default port for NIXL has been updated to 5600 - leave ZMQ to use 5557 Signed-off-by: Wen Zhou 
<wenzhou@redhat.com> * fix: resolve JSON serialization error in active-request-scorer debug logs (llm-d#602) * fix: resolve JSON serialization error in active-request-scorer debug logs Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * feat: Add raw scores to debug Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> --------- Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * Implement "LGTM" ChatOps Workflow. Signed-off-by: Revital Sur <eres@il.ibm.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * Lgtm2 (#17) * Implement "LGTM" ChatOps Workflow. Signed-off-by: Revital Sur <eres@il.ibm.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> --------- Signed-off-by: Revital Sur <eres@il.ibm.com> * test * test: automated LGTM workflow test (#19) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#20) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#21) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#22) This PR tests the /lgtm command workflow automation. Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test (#24) This PR tests the /lgtm command workflow automation. 
Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test (#26) This PR tests the /lgtm command workflow automation. Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * Address review comments. Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test This PR tests the /lgtm command workflow automation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Revital Sur <eres@il.ibm.com> --------- Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Antonio Cardace <acardace@redhat.com> Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com> Signed-off-by: MregXN <mregxn@gmail.com> Signed-off-by: HyunKyun Moon <mhg5303@gmail.com> Signed-off-by: CYJiang <googs1025@gmail.com> Signed-off-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: Wen Zhou <wenzhou@redhat.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> Signed-off-by: kyanokashi <kyanokashi2@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Signed-off-by: Kellen Swain <kfswain@google.com> Signed-off-by: Maroon Ayoub <Maroonay@gmail.com> Signed-off-by: irar2 <irar@il.ibm.com> Signed-off-by: Ira Rosen <irar@il.ibm.com> Signed-off-by: kyano <kyanokashi2@gmail.com> Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Nir Rozenbaum <nirro@il.ibm.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> 
Co-authored-by: Antonio Cardace <anto.cardace@gmail.com> Co-authored-by: Edoardo Vacchi <evacchi@users.noreply.github.com> Co-authored-by: MregXN <46479059+MregXN@users.noreply.github.com> Co-authored-by: Hyunkyun Moon <mhg5303@gmail.com> Co-authored-by: CYJiang <86391540+googs1025@users.noreply.github.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> Co-authored-by: David Breitgand <davidbreitgand@users.noreply.github.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com> Co-authored-by: Wen Zhou <wenzhou@redhat.com> Co-authored-by: Maya Barnea <mayab@il.ibm.com> Co-authored-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com> Co-authored-by: Kellen Swain <kfswain@google.com> Co-authored-by: Ira Rosen <irar@il.ibm.com> Co-authored-by: alberto <aperdomo@redhat.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* update llm-d-kv-cache import to v0.5.0-RC1 (llm-d#584) * update kvc version import Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * add go.mod to testable changes Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> --------- Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * Use 1.3.0 CRDs (llm-d#586) Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * free disk space on ci-release (llm-d#587) Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * feat: use Tinyllama as the "model" for kind test and switch to use precise-prefix-cache-score in config (llm-d#581) * feat: use Tinyllama as the "model" for kind test - in order to test precies-prefix-cache-score we cannot use fool-reviewer since it need call kv-cache-manager to get tokenizer by getting a real model from HF - the change is to switch the "default model" to TinyLlama - also to make tokenizer folder writable need change permission to the USER in Dockerfile - rename dp-epp-config.yaml sim-dp-epp-config.yaml as it is used for local test Signed-off-by: Wen Zhou <wenzhou@redhat.com> * update: revert back some config to keep using prefix-cache-scorer - revert file renaming Signed-off-by: Wen Zhou <wenzhou@redhat.com> --------- Signed-off-by: Wen Zhou <wenzhou@redhat.com> * Update linter configuration (llm-d#588) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * fix: config should use new precise-prefix-cache-scorer (llm-d#576) - we have rename prefix-cache-scorer to precise-prefix-cache-scorer in 0.3.0, configs need migrate from the old one to the new one with spec. 
- rename plugin name - remove parameters.autoTune and parameters.mode: cache_tracking and lruCapacityPerServer - move hashBlockSize, maxPrefixBlocksToMatch under indexrConfig - for config using food-review keep old prefix-cache-scorer - keep pd-epp-config and sim-pd-epp-config with prefix-cache-scorer as KV and PD need both be enabled which is not done yet Signed-off-by: Wen Zhou <wenzhou@redhat.com> * deps(actions): bump crate-ci/typos from 1.42.1 to 1.42.2 (llm-d#589) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.42.1 to 1.42.2. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.42.1...v1.42.2) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.42.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Updated to more recent GIE (llm-d#592) * Updated to more recent GIE Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Updated to latest GIE and chnages due to review comments Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Added a true mock SchedulerProfile Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Exploited mock SchedulerProfile Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> --------- Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * pull kvc v0.5.0 libs (llm-d#595) Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * deps(actions): bump crate-ci/typos from 1.42.2 to 1.43.0 (llm-d#596) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.42.2 to 1.43.0. 
- [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.42.2...v1.43.0) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-version: 1.43.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * address nil,nil return linter error in test mock (llm-d#598) Signed-off-by: Etai Lev Ran <elevran@gmail.com> * deps(go): bump the go-dependencies group with 2 updates (llm-d#597) Bumps the go-dependencies group with 2 updates: [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) and [github.com/onsi/gomega](https://github.com/onsi/gomega). Updates `github.com/onsi/ginkgo/v2` from 2.27.5 to 2.28.1 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.5...v2.28.1) Updates `github.com/onsi/gomega` from 1.39.0 to 1.39.1 - [Release notes](https://github.com/onsi/gomega/releases) - [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md) - [Commits](onsi/gomega@v1.39.0...v1.39.1) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.28.1 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/onsi/gomega dependency-version: 1.39.1 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Models extractor (llm-d#553) * Models extractor Signed-off-by: irar2 <irar@il.ibm.com> * Update register.go Signed-off-by: Ira Rosen <irar@il.ibm.com> * Updated for the newer GIE Signed-off-by: irar2 <irar@il.ibm.com> * Review comments Signed-off-by: irar2 <irar@il.ibm.com> * Check the scheme Signed-off-by: irar2 <irar@il.ibm.com> --------- Signed-off-by: irar2 <irar@il.ibm.com> Signed-off-by: Ira Rosen <irar@il.ibm.com> * feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present (llm-d#509) * feat: implement decode first flow on lmcache connector - if cache_hit_threshold field is present in completion request, then we perform a decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: error handling Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add back todo comment Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: reduce code complexity and duplication Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve header copying Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add comment explaning the cache_hit_threshold field and the new decode first flow Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: enhance logging for cache hit threshold in decode flow - decrease verbosity for common log - add cache_hit_threshold attribute Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve error handling and observability when failing to unmarshal decode response Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add deleted informational comments Signed-off-by: kyano <kyanokashi2@gmail.com> * typo Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: make error logs more descriptive of the failure reason Signed-off-by: kyano <kyanokashi2@gmail.com> * feat: add cache hit threshold to prefill request so prefill executes regardless 
of cache condition Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typo Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: assign 0 cache_hit_threshold before final decode attempt Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: update comment according to feedback Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: remove istio workaround Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: set cache hit threshold to 0 in prefill request for consistent execution Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: update the log Signed-off-by: kyano <kyanokashi2@gmail.com> * feat: support online decoding Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: preserve request body in lmcache connector Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: support sse format for streamed decode Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: add and improve log descriptions Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typo Signed-off-by: kyano <kyanokashi2@gmail.com> * nit: undo capitalization Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: typos Signed-off-by: kyano <kyanokashi2@gmail.com> * chore: improve error log observability Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: encapsulate http error checking in function and reuse Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: encapsulate and reuse code better Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: lint error Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: improve code encapsulation and reduce duplication Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: rename and simplify SSE event signaling logic Signed-off-by: kyano <kyanokashi2@gmail.com> * refactor: rename lmcache to shared storage protocol Signed-off-by: kyano <kyanokashi2@gmail.com> * fix: remove unused function Signed-off-by: kyano <kyanokashi2@gmail.com> * test: e2e tests Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * chore: claude gitignore Signed-off-by: kyanokashi 
<kyanokashi2@gmail.com> * fix: sim deployment Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * feat: make linter running on new code configurable Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * fix: lint errors Signed-off-by: kyanokashi <kyanokashi2@gmail.com> --------- Signed-off-by: kyano <kyanokashi2@gmail.com> Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Signed-off-by: kyanokashi <kyanokashi2@gmail.com> * Extend support for different ways to decide if disaggregated PD is required (llm-d#531) * Initial step of a configurable pd decider which is responsible for decision whether disaggregation is required, use data added in prefix scorer plugin in PrepareRequestData Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update version of GIE + fix lint Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update yaml and the test according prefix plugin configuration change (blockSize replaced by blockSizeTokens) Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update docs/architecture.md Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * code review Signed-off-by: Maya Barnea <mayab@il.ibm.com> * code review Signed-off-by: Maya Barnea <mayab@il.ibm.com> * update version of GIE, update prefix_disagr_decider accordingly Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix typo Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix PD for short inputs Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update docs/architecture.md Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/always_disaggr_decider.go Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/always_disaggr_decider.go Co-authored-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * Update pkg/plugins/profile/prefix_disagg_decider.go Co-authored-by: Etai Lev Ran 
<elevran@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> * updates according the PR comments Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix test Signed-off-by: Maya Barnea <mayab@il.ibm.com> * create pd decider plugin type with 2 implementations (for prefix based and test always), update deploy configuration according the new structure Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix e2e tests Signed-off-by: Maya Barnea <mayab@il.ibm.com> * changes according the pr comments Signed-off-by: Maya Barnea <mayab@il.ibm.com> * fix e2e test Signed-off-by: Maya Barnea <mayab@il.ibm.com> * add explanation about pd deciders to disagg_pd doc Signed-off-by: Maya Barnea <mayab@il.ibm.com> * rename always_disaggr_decider to always_disagg_decider Signed-off-by: Maya Barnea <mayab@il.ibm.com> --------- Signed-off-by: Maya Barnea <mayab@il.ibm.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> * chore: fix wrong port for NIXL (llm-d#593) - start with vLLM 0.11.1, default port for NIXL has been updated to 5600 - leave ZMQ to use 5557 Signed-off-by: Wen Zhou <wenzhou@redhat.com> * fix: resolve JSON serialization error in active-request-scorer debug logs (llm-d#602) * fix: resolve JSON serialization error in active-request-scorer debug logs Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * feat: Add raw scores to debug Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> --------- Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> * Match documentation with default model in scripts (llm-d#615) Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> * Test: LGTM Workflow Automation (#32) * Implement "LGTM" ChatOps Workflow. Signed-off-by: Revital Sur <eres@il.ibm.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * Lgtm2 (#17) * Implement "LGTM" ChatOps Workflow. Signed-off-by: Revital Sur <eres@il.ibm.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> --------- Signed-off-by: Revital Sur <eres@il.ibm.com> * test * test: automated LGTM workflow test (#19) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#20) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#21) This PR tests the /lgtm command workflow automation. Test suite: all Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test: automated LGTM workflow test (#22) This PR tests the /lgtm command workflow automation. Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test (#24) This PR tests the /lgtm command workflow automation. 
Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test (#26) This PR tests the /lgtm command workflow automation. Test suite: reset Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * Address review comments. Signed-off-by: Revital Sur <eres@il.ibm.com> * test: automated LGTM workflow test This PR tests the /lgtm command workflow automation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Revital Sur <eres@il.ibm.com> --------- Signed-off-by: Wen Zhou <wenzhou@redhat.com> Signed-off-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: irar2 <irar@il.ibm.com> Signed-off-by: Ira Rosen <irar@il.ibm.com> Signed-off-by: kyano <kyanokashi2@gmail.com> Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Signed-off-by: kyanokashi <kyanokashi2@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Wen Zhou <wenzhou@redhat.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com> Co-authored-by: Ira Rosen <irar@il.ibm.com> Co-authored-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Co-authored-by: Maya Barnea <mayab@il.ibm.com> Co-authored-by: alberto <aperdomo@redhat.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * test Signed-off-by: Revital Sur <eres@il.ibm.com> * test: open-pr Tests that opening a 
PR triggers gatekeeper which blocks without lgtm label. Test timestamp: 1771188042 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: Shmuel Kallner <kallner@il.ibm.com> Signed-off-by: Wen Zhou <wenzhou@redhat.com> Signed-off-by: Etai Lev Ran <elevran@gmail.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: irar2 <irar@il.ibm.com> Signed-off-by: Ira Rosen <irar@il.ibm.com> Signed-off-by: kyano <kyanokashi2@gmail.com> Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Signed-off-by: kyanokashi <kyanokashi2@gmail.com> Signed-off-by: Maya Barnea <mayab@il.ibm.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com> Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com> Co-authored-by: Shmuel Kallner <kallner@il.ibm.com> Co-authored-by: Wen Zhou <wenzhou@redhat.com> Co-authored-by: Etai Lev Ran <elevran@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ira Rosen <irar@il.ibm.com> Co-authored-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com> Co-authored-by: Maya Barnea <mayab@il.ibm.com> Co-authored-by: alberto <aperdomo@redhat.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

github-project-automation bot added this to llm-d-inference-scheduler Dec 9, 2025

kyanokashi changed the title ~~feat: implement decode first flow on lmcache connector~~ feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present Dec 9, 2025

kyanokashi force-pushed the feat/sidecar/lmcache-connector/decode-first branch from 77b64c5 to 2a6e437 Compare December 9, 2025 23:37

feat: implement decode first flow on lmcache connector

b228bb7

- if cache_hit_threshold field is present in completion request, then we perform a decode first flow

Signed-off-by: kyano <kyanokashi2@gmail.com>

kyanokashi force-pushed the feat/sidecar/lmcache-connector/decode-first branch from 2a6e437 to b228bb7 Compare December 9, 2025 23:46

kyanokashi mentioned this pull request Dec 9, 2025

Support conditional prefill based on actual decode kv cache state #382 (Open)

kyanokashi and others added 2 commits December 9, 2025 18:55

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

a436b50

Signed-off-by: kyanokashi <71283892+kyanokashi@users.noreply.github.com>

fix: error handling

a6ae771

Signed-off-by: kyano <kyanokashi2@gmail.com>

kyanokashi force-pushed the feat/sidecar/lmcache-connector/decode-first branch from a120186 to a6ae771 Compare December 10, 2025 17:34

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

58388eb

elevran reviewed Dec 11, 2025

kyanokashi added 3 commits December 11, 2025 08:46

chore: add back todo comment

04b7ffd

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: reduce code complexity and duplication

ce74f50

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: improve header copying

1de6035

Signed-off-by: kyano <kyanokashi2@gmail.com>

elevran moved this to In review in llm-d-inference-scheduler Dec 11, 2025

kfirwolfson suggested changes Dec 14, 2025

kyanokashi added 7 commits December 14, 2025 20:33

chore: add comment explaning the cache_hit_threshold field and the ne…

c0ac69e

…w decode first flow

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: enhance logging for cache hit threshold in decode flow

7ce5e19

- decrease verbosity for common log
- add cache_hit_threshold attribute

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: improve error handling and observability when failing to un…

6430a02

…marshal decode response

Signed-off-by: kyano <kyanokashi2@gmail.com>

chore: add deleted informational comments

4c15d95

Signed-off-by: kyano <kyanokashi2@gmail.com>

typo

cac084f

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: make error logs more descriptive of the failure reason

91c7a06

Signed-off-by: kyano <kyanokashi2@gmail.com>

feat: add cache hit threshold to prefill request so prefill executes regardless of cache condition

69d30b5

Signed-off-by: kyano <kyanokashi2@gmail.com>

kfirwolfson reviewed Dec 15, 2025

kyanokashi added 2 commits December 17, 2025 13:41

fix: typo

1ed1d89

Signed-off-by: kyano <kyanokashi2@gmail.com>

refactor: assign 0 cache_hit_threshold before final decode attempt

515b385

Signed-off-by: kyano <kyanokashi2@gmail.com>

kyanokashi force-pushed the feat/sidecar/lmcache-connector/decode-first branch from 2ebde52 to 515b385 on December 17, 2025 18:54

chore: update comment according to feedback

4c8659e

Signed-off-by: kyano <kyanokashi2@gmail.com>
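
Read together with the PR description, commits 69d30b5 and 515b385 suggest the control flow sketched below. This is a minimal Python illustration under stated assumptions, not the sidecar's code: `decode_first`, `missed_threshold`, and the callable backends are hypothetical names, the `choices[].finish_reason` response shape is assumed, and setting the threshold to 0 on the prefill request is inferred from the commit titles only.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]
Backend = Callable[[Request], Response]

# finish_reason named in the PR description for a missed cache hit threshold
CACHE_THRESHOLD = "cache_threshold"


def missed_threshold(response: Response) -> bool:
    """Return True if any choice finished with the cache_threshold reason."""
    return any(
        choice.get("finish_reason") == CACHE_THRESHOLD
        for choice in response.get("choices", [])
    )


def decode_first(request: Request, decode: Backend, prefill: Backend) -> Response:
    """Hypothetical decode-first flow for a request carrying cache_hit_threshold."""
    first = decode(request)
    if not missed_threshold(first):
        return first  # decode succeeded, so its response is returned as-is

    # Assumption: a threshold of 0 lets prefill and the final decode run
    # regardless of the cache state.
    forced = dict(request, cache_hit_threshold=0)
    prefill(forced)
    return decode(forced)


# Usage with stand-in backends that record the order of calls.
calls = []


def fake_decode(req: Request) -> Response:
    calls.append(("decode", req["cache_hit_threshold"]))
    reason = CACHE_THRESHOLD if req["cache_hit_threshold"] > 0 else "stop"
    return {"choices": [{"finish_reason": reason}]}


def fake_prefill(req: Request) -> Response:
    calls.append(("prefill", req["cache_hit_threshold"]))
    return {"choices": [{"finish_reason": "length"}]}


result = decode_first(
    {"prompt": "hi", "cache_hit_threshold": 0.8}, fake_decode, fake_prefill
)
print(calls)  # [('decode', 0.8), ('prefill', 0), ('decode', 0)]
```

The stand-in decoder reports `cache_threshold` whenever the threshold is above 0, so the first attempt misses and the fallback runs prefill followed by a second decode with the threshold zeroed.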

kfirwolfson reviewed Jan 18, 2026

elevran added this to the v0.6 milestone Jan 22, 2026

kyanokashi added 2 commits January 22, 2026 19:03

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

ea3f1da

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

2e35d85

kyanokashi added 2 commits January 31, 2026 16:46

test: e2e tests

d71b10b

Signed-off-by: kyanokashi <kyanokashi2@gmail.com>

chore: claude gitignore

e34600f

Signed-off-by: kyanokashi <kyanokashi2@gmail.com>

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

efd638a

kyanokashi requested a review from elevran February 2, 2026 19:40

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

5e988e5

kyanokashi and others added 2 commits February 5, 2026 12:47

fix: sim deployment

fbb12fc

Signed-off-by: kyanokashi <kyanokashi2@gmail.com>

Merge branch 'main' into feat/sidecar/lmcache-connector/decode-first

903249f

kfirwolfson approved these changes Feb 5, 2026

kyanokashi added 2 commits February 6, 2026 09:57

feat: make linter running on new code configurable

f6024d1

Signed-off-by: kyanokashi <kyanokashi2@gmail.com>

fix: lint errors

5f042e7

Signed-off-by: kyanokashi <kyanokashi2@gmail.com>

github-actions bot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Feb 6, 2026

github-actions bot approved these changes Feb 6, 2026

github-actions bot merged commit dc96b95 into llm-d:main Feb 6, 2026
8 checks passed

github-project-automation bot moved this from In review to Done in llm-d-inference-scheduler Feb 6, 2026

Conversation

kyanokashi commented Dec 9, 2025 • edited

Test Report

Test Scenarios

Scenario 1: Non-Streaming WITHOUT cache_hit_threshold

Scenario 2: Non-Streaming WITH cache_hit_threshold

Scenario 3: Non-Streaming WITH cache_hit_threshold AND X-Cache-Threshold: true

Scenario 4: Streaming WITHOUT cache_hit_threshold

Scenario 5: Streaming WITH cache_hit_threshold

Scenario 6: Streaming WITH cache_hit_threshold AND X-Cache-Threshold: true

Failure Injection Test

Scenario 7: Streaming Decode Error (tryDecodeStreaming)

Summary

github-actions bot commented Dec 9, 2025

kfirwolfson left a comment

kfirwolfson Dec 15, 2025 • edited

kfirwolfson commented Dec 14, 2025

kyanokashi Dec 22, 2025 • edited

kfirwolfson commented Jan 18, 2026 • edited

kyanokashi commented Jan 31, 2026

kfirwolfson commented Jan 31, 2026

kyanokashi commented Feb 1, 2026

elevran commented Feb 5, 2026

kyanokashi commented Feb 6, 2026 • edited