Skip to content

Chat Completions support; WIP#410

Closed
guygir wants to merge 34 commits intollm-d:mainfrom
guygir:feature/v16-prep
Closed

Chat Completions support; WIP#410
guygir wants to merge 34 commits intollm-d:mainfrom
guygir:feature/v16-prep

Conversation

@guygir
Copy link
Contributor

@guygir guygir commented Oct 30, 2025

  1. Dockerfile
  • Minimal Python-enabled runtime to support chat templating.
  • Builder: installs Python 3.12 headers; fetches preprocessing requirements and render_jinja_template_wrapper.py.
  • Runtime: use of microdnf because a package manager is required to install Python 3.12 and zeromq at runtime, and microdnf is lighter than dnf (fewer dependencies), so it keeps the image smaller.; copies the wrapper into /usr/local/lib/python3.12/site-packages/.
  • Python deps installed to a single site-packages, --no-cache-dir to keep image size smaller; torch filtered out manually for now.
  • Env set for reliable imports and HF cache (PYTHONPATH, HF_HOME).
  1. Precise Prefix Cache (scorer)
  • Adds real chat-completions preprocessing before scoring.
  • Builds RenderJinjaTemplateRequest from messages (uses msg.Content.Raw), initializes processor, fetches model chat template, renders, returns flattened prompt to KV-cache scoring.
  • Regular completions use prompt directly.
  • Includes PREPROCESSING: INFO/DEBUG logs; will be trimmed post-WIP.
  1. Submodules
  • Adds empty placeholders for llm-d-kv-cache-manager and gateway-api-inference-extension.
  • Purpose: keep upstream references during integration without pulling large repos.
  • Final build relies on Go modules and fetches Python files during Docker build; submodule content not required.
  • Documents where things came from and makes it easier to work with if needed later; can be removed if it becomes unnecessary later.

@elevran
Copy link
Collaborator

elevran commented Oct 30, 2025

@guygir also note that DCO check is failing and should be fixed

@vMaroon
Copy link
Member

vMaroon commented Oct 30, 2025

This PR should mimic as closely as possible the work done in this kv-cache-manager one: llm-d/llm-d-kv-cache#92

This includes:

  • Build changes: CI, Makefile and Dockerfile updates
  • The precise-prefix-cache-scorer plugin to operate similarly to the online kvevents example
  • Tests and documentation

/hold

@github-actions github-actions bot added the hold PRs that are blocked on design, other features, release cycle, etc. label Oct 30, 2025
guygir and others added 24 commits November 2, 2025 17:14
- Add local clones of gateway-api-inference-extension and llm-d-kv-cache-manager
- Update go.mod with replace directives for local dependencies
- Modify Dockerfile to include Python 3.12 runtime and dependencies
- Add chat completions preprocessing to precise prefix cache scorer
- Add chat completions preprocessing to PD profile handler
- Update main.go to register custom plugins
- Add comprehensive README with build instructions and troubleshooting

This enables KV-cache aware routing for chat completion requests by converting
them to flattened templated prompts before performing cache similarity matching.

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Signed-off-by: guygir <guygir@gmail.com>
Signed-off-by: guygir <guygir@gmail.com>
Signed-off-by: guygir <guygir@gmail.com>
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.35.5 to 1.35.7.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.35.5...v1.35.7)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-version: 1.35.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* added comment to stale issues to make sure author doesn't miss the closing of the issue

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* typo

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* Update .github/workflows/stale.yaml

Co-authored-by: Etai Lev Ran <elevran@gmail.com>
Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

---------

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
Co-authored-by: Etai Lev Ran <elevran@gmail.com>
* fixed missing dependencies in makefile

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* fixed comment in Makefile

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

---------

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
* Update RBAC for latest IGW

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* No longer create an InferenceModel object

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

---------

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
Bumps [actions/stale](https://github.com/actions/stale) from 9 to 10.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@v9...v10)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-version: '10'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 5 to 6.
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.35.7 to 1.36.2.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.35.7...v1.36.2)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-version: 1.36.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…llm-d#358)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Nir Rozenbaum <nirro@il.ibm.com>
Signed-off-by: Kellen Swain <kfswain@google.com>
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.36.2 to 1.38.1.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.36.2...v1.38.1)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-version: 1.38.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix multi-architecture image issues with Kind

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Review fixes

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

---------

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
…heduler repo (llm-d#379)

* Moved prefill header definition to common import

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Moved Routing Sidecar into this repo

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Moved Routing Sidecar tests into this repo

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Moved Routing Sidecar Dockerfile into this repo

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Added Routing Sidecar to Makefile

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Added Routing Sidecar to CI stream

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Fixed lint error

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Review fixes and added version info

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Test Nixl V2 instead of the deleted Nixl V1

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Fixed lint errors

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

---------

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
…letions support

Removed local repository clones - the upstream v0.3.2 already includes
chat completions preprocessing functionality. Simplified Dockerfile to
download Python requirements from upstream repository instead of
copying local files. This makes the build process cleaner and aligns
with upstream practices while maintaining the chat completions feature.

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
- Updated to show merged state with upstream (31 commits integrated)
- Removed outdated local repository clone information
- Documented use of upstream llm-d-kv-cache-manager v0.3.2
- Updated troubleshooting section to reflect simplified build process
- Clarified that Docker build is recommended for local development
- Updated API changes (request.Body vs request.Data)
- Documented proper dependency versions

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
The upstream v1.1.0 API changed Content from string to struct with Raw field.
Fixed both precise_prefix_cache.go and pd_profile_handler.go to use msg.Content.Raw.
Also removed unused prompt variable in pd_profile_handler.go

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
guygir and others added 10 commits November 2, 2025 17:18
Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Python packages (torch, transformers, etc) are already installed in builder
stage and copied to runtime. Removed the redundant pip install that was
downloading 175MB torch package twice. Also removed python3.12-pip from
runtime since we don't need pip in production image.

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Explains that 404MB image had NO chat completions,
233MB adds ALL necessary Python deps for chat preprocessing,
and breaks down exactly which packages add how much size.

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
Signed-off-by: Guy Girmonsky <guygir@gmail.com>
* Make sure that max_completion_tokens=1 in Prefill

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

* Remove/undo setting of max_completion_tokens to 1, for decode

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>

---------

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
… from PR branch

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
…ories, add Python wrapper script

- Restore all Dockerfile comments from OG Dockerfile
- Remove empty placeholder directories gateway-api-inference-extension and llm-d-kv-cache-manager
- Add scripts/fetch-python-wrapper.sh for reusable Python wrapper fetching
- Finalize precise_prefix_cache.go changes (logging and comment cleanup)

Signed-off-by: Guy Girmonsky <guygir@gmail.com>
@guygir
Copy link
Contributor Author

guygir commented Nov 2, 2025

Updated the PR, and all previous comments have been addressed.
Regarding the last comment by @vMaroon - I will re-run the benchmark (including a more reliable workload, as discussed with @elevran) to produce a detailed doc similar to the one in kv-cache manager.

Note on DCO: Two commits in the branch (ec6d849 by learner0810, c40cc15 by Morgan Foster) are currently failing DCO checks. These are upstream commits that came into this branch by merging upstream, and they have the same signoff issues in upstream/main itself (missing/incorrect signoffs). Guidance on how to handle these upstream commits - should they be addressed in upstream/main, or is there another approach that I should take in my branch?

@elevran
Copy link
Collaborator

elevran commented Nov 4, 2025

@guygir would you kindly

  • rebase your work on top of main (do NOT merge commit from main which pulls in the all changes and makes it impossible to know which files yout changed and which changed on main. Doubt that you need to change 83 files and 5000 lines...)
  • resolve any remaining conflicts after the rebase
    Thanks!

@elevran
Copy link
Collaborator

elevran commented Nov 4, 2025

closing this in favor of clean (rebased) PR

@elevran elevran closed this Nov 4, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in llm-d-inference-scheduler Nov 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hold PRs that are blocked on design, other features, release cycle, etc.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

9 participants