guide: pointing agents to model running on the cluster with vllm and kserve#51
guide: pointing agents to model running on the cluster with vllm and kserve#51jehlum11 wants to merge 3 commits into
Conversation
…-on-cluster-with-vllm-and-kserve simple guide to cut across all agent templates
📝 WalkthroughWalkthroughNew documentation guide describing how to run a local agent while serving its model from vLLM on an OpenShift AI cluster via KServe, including ServingRuntime and InferenceService YAMLs, vLLM runtime args, multi-GPU/chat template notes, and an OpenShift Route exposure workaround. Changes
Sequence Diagram(s)sequenceDiagram
participant LocalAgent as Local Agent
participant Route as OpenShift Route / ClusterIP
participant KServe as KServe InferenceService
participant vLLM as vLLM ServingRuntime Pod
participant Storage as Model Storage (PVC/URI)
LocalAgent->>Route: HTTP request to model endpoint
Route->>KServe: Forward request to InferenceService
KServe->>vLLM: Route inference request
vLLM->>Storage: Mount/read model from storageUri
vLLM-->>KServe: Return prediction/stream
KServe-->>Route: Relay response
Route-->>LocalAgent: Deliver response
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
guide-local-agent-to-vllm-on-cluster.md (2)
18-53: Consider adding language specifier to YAML code block.For better syntax highlighting and linting support, add
yamlas the language specifier.📝 Proposed improvement
-``` +```yaml apiVersion: serving.kserve.io/v1alpha1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guide-local-agent-to-vllm-on-cluster.md` around lines 18 - 53, The fenced code block showing the ServingRuntime manifest lacks a language tag; update the opening triple-backtick for that block to specify yaml (i.e., change ``` to ```yaml) so editors and linters will apply YAML highlighting and validation for the ServingRuntime manifest, model args, ports, and supportedModelFormats sections.
91-114: Consider adding language specifier to YAML code block.For consistency with the first code block and better syntax highlighting, add
yamlas the language specifier.📝 Proposed improvement
-``` +```yaml apiVersion: serving.kserve.io/v1beta1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guide-local-agent-to-vllm-on-cluster.md` around lines 91 - 114, Add a language specifier to the fenced code block that defines the InferenceService so the YAML (apiVersion: serving.kserve.io/v1beta1, kind: InferenceService, spec.predictor.model.runtime: vllm-runtime, etc.) is highlighted consistently; update the opening fence from ``` to ```yaml so the entire block (including storageUri, resources, metadata.name: llama-3-3-70b) is parsed and rendered as YAML.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@guide-local-agent-to-vllm-on-cluster.md`:
- Around line 116-118: The "Expose the Model Externally" section is incomplete
and must include concrete steps and examples to create a ClusterIP Service that
targets the vllm RawDeployment pods (replacing the headless Service) and to
create an OpenShift Route that points to that ClusterIP Service; update the
section to (1) explain how to create a ClusterIP Service (kubectl expose or a
Service YAML referencing the vllm RawDeployment selector and port names) and
include a sample Service manifest or kubectl command, and (2) show how to create
an OpenShift Route (oc create route or a Route YAML) that targets the new
Service with correct service name and port, TLS/hostname examples, and any
necessary annotations for KServe; reference the RawDeployment/Service selector
names and the Route/service names used in the diff so readers can plug them into
their manifests.
- Around line 120-122: The "3. Update app code to point to vllm + KServe on OAI"
section is incomplete—add three concrete subsections: (1) "Client configuration"
showing exact example values for endpoint URL, authentication (bearer/API key),
and required headers for an OpenAI-compatible vLLM+KServe endpoint; (2) "Code
examples" with short before/after snippets demonstrating how to switch an
OpenAI-compatible client (e.g., code that constructs a client, sets base_url,
headers, and sends a completion/request) from a local dev URL to the
cluster-served vLLM URL and how to enable TLS/auth; and (3) "Why langgraph"
explaining in 2–3 sentences why you migrated from Claude/Anthropic SDK to
langgraph/pure Python agents (compatibility with OpenAI-compatible endpoints,
lighter weight for custom deployment workflows, and easier integration with
KServe). Reference the section title "Update app code to point to vllm + KServe
on OAI" and include placeholder examples for URL/auth so readers can
copy-and-paste and adapt to their cluster.
- Around line 55-57: Fix the typo in the parser example: replace the stray
backtick at the end of 'openai\`' with a closing single quote so the example
reads 'openai'; update the sentence that references the parser flag
(--tool-call-parser=llama3_json) and the model names (Mistral-Small-4-119B-2603,
openai/gpt-oss-120b) to ensure the quotes around 'mistral' and 'openai' are
proper single quotes.
---
Nitpick comments:
In `@guide-local-agent-to-vllm-on-cluster.md`:
- Around line 18-53: The fenced code block showing the ServingRuntime manifest
lacks a language tag; update the opening triple-backtick for that block to
specify yaml (i.e., change ``` to ```yaml) so editors and linters will apply
YAML highlighting and validation for the ServingRuntime manifest, model args,
ports, and supportedModelFormats sections.
- Around line 91-114: Add a language specifier to the fenced code block that
defines the InferenceService so the YAML (apiVersion: serving.kserve.io/v1beta1,
kind: InferenceService, spec.predictor.model.runtime: vllm-runtime, etc.) is
highlighted consistently; update the opening fence from ``` to ```yaml so the
entire block (including storageUri, resources, metadata.name: llama-3-3-70b) is
parsed and rendered as YAML.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ab2e3a6f-c9a3-4181-91f1-e6bca58128ef
📒 Files selected for processing (1)
guide-local-agent-to-vllm-on-cluster.md
| #### | ||
|
|
||
| Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b’ will expect ‘openai\`. |
There was a problem hiding this comment.
Fix typo in parser example.
Line 57 has a typo: `'openai`` should end with a closing single quote instead of a backtick.
✍️ Proposed fix
-Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai\`.
+Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai'.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #### | |
| Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b’ will expect ‘openai\`. | |
| #### | |
| Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai'. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@guide-local-agent-to-vllm-on-cluster.md` around lines 55 - 57, Fix the typo
in the parser example: replace the stray backtick at the end of 'openai\`' with
a closing single quote so the example reads 'openai'; update the sentence that
references the parser flag (--tool-call-parser=llama3_json) and the model names
(Mistral-Small-4-119B-2603, openai/gpt-oss-120b) to ensure the quotes around
'mistral' and 'openai' are proper single quotes.
| ## 2\. Expose the Model Externally | ||
|
|
||
| When deploying vllm with KServe using RawDeployment, it creates a **headless Service** (clusterIP: None). To expose the model externally, I needed to expose an OpenShift Route. But, OpenShift Routes cannot point to headless Services, so I needed a workaround to create a ClusterIP service. Using the product dashboard will let you do this too. |
There was a problem hiding this comment.
Complete the "Expose the Model Externally" section.
This section mentions a workaround but provides no implementation details. Users cannot complete the workflow without concrete steps to:
- Create the ClusterIP service
- Create and configure the OpenShift Route
Please add the YAML examples or CLI commands needed to expose the model externally.
Would you like me to help draft the missing content based on standard KServe/OpenShift patterns?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@guide-local-agent-to-vllm-on-cluster.md` around lines 116 - 118, The "Expose
the Model Externally" section is incomplete and must include concrete steps and
examples to create a ClusterIP Service that targets the vllm RawDeployment pods
(replacing the headless Service) and to create an OpenShift Route that points to
that ClusterIP Service; update the section to (1) explain how to create a
ClusterIP Service (kubectl expose or a Service YAML referencing the vllm
RawDeployment selector and port names) and include a sample Service manifest or
kubectl command, and (2) show how to create an OpenShift Route (oc create route
or a Route YAML) that targets the new Service with correct service name and
port, TLS/hostname examples, and any necessary annotations for KServe; reference
the RawDeployment/Service selector names and the Route/service names used in the
diff so readers can plug them into their manifests.
| 3. Update app code to point to vllm \+ KServe on OAI | ||
|
|
||
| This was one of the bigger changes that I’ve captured here \- initially using Claude & Anthropic’s Agent SDK and changed it to langgraph/pure python agents for this exercise. |
There was a problem hiding this comment.
Complete the "Update app code" section with concrete examples.
This section is incomplete—it mentions "bigger changes" that were "captured here" but provides no actual content. To fulfill the guide's promise of an end-to-end workflow, please add:
- Client configuration examples showing how to point the agent to the vLLM + KServe endpoint (URL, authentication, headers)
- Code snippets demonstrating the transition from local to cluster-served models
- Explanation of why you switched from Claude/Anthropic SDK to langgraph, and how it relates to this deployment pattern
Without this section, users cannot complete the workflow described in the guide's title.
Would you like me to help draft example code showing how to configure an OpenAI-compatible client to point to the vLLM endpoint on OpenShift AI?
🧰 Tools
🪛 LanguageTool
[style] ~122-~122: The word ‘bigger’ tends to be overused. Consider an alternative.
Context: ...m + KServe on OAI This was one of the bigger changes that I’ve captured here - initially us...
(BIG_DIFFERENCE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@guide-local-agent-to-vllm-on-cluster.md` around lines 120 - 122, The "3.
Update app code to point to vllm + KServe on OAI" section is incomplete—add
three concrete subsections: (1) "Client configuration" showing exact example
values for endpoint URL, authentication (bearer/API key), and required headers
for an OpenAI-compatible vLLM+KServe endpoint; (2) "Code examples" with short
before/after snippets demonstrating how to switch an OpenAI-compatible client
(e.g., code that constructs a client, sets base_url, headers, and sends a
completion/request) from a local dev URL to the cluster-served vLLM URL and how
to enable TLS/auth; and (3) "Why langgraph" explaining in 2–3 sentences why you
migrated from Claude/Anthropic SDK to langgraph/pure Python agents
(compatibility with OpenAI-compatible endpoints, lighter weight for custom
deployment workflows, and easier integration with KServe). Reference the section
title "Update app code to point to vllm + KServe on OAI" and include placeholder
examples for URL/auth so readers can copy-and-paste and adapt to their cluster.
Removed unnecessary details
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
guide-local-agent-to-vllm-on-cluster.md (2)
57-57:⚠️ Potential issue | 🟡 MinorFix malformed quoting in parser example.
Line 57 still has mismatched smart quotes and a stray backtick in `'openai``; this can be copy-pasted incorrectly by readers.
Proposed fix
-Note: In this case, I also used ' --tool-call-parser=llama3_json' - each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b’ will expect ‘openai`. +Note: In this case, I also used '--tool-call-parser=llama3_json' — each model uses a different parser. For example, Mistral-Small-4-119B-2603 expects 'mistral', and 'openai/gpt-oss-120b' expects 'openai'.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guide-local-agent-to-vllm-on-cluster.md` at line 57, Fix the malformed quoting in the parser example: replace the mismatched smart quotes and the stray backtick so the example consistently uses plain backticks and correct parser names — e.g., show `--tool-call-parser=llama3_json`, then list parser names as `mistral`, `openai/gpt-oss-120b`, and `openai` (remove the stray backtick after openai and any smart quotes).
116-118:⚠️ Potential issue | 🟠 MajorAdd concrete Route workaround steps (Service + Route).
The section explains the problem but still lacks executable steps/manifests, so users cannot complete external exposure from this guide.
Proposed content to add
## 2. Expose the Model Externally When deploying vllm with KServe using RawDeployment, it creates a **headless Service** (clusterIP: None). To expose the model externally, I needed to expose an OpenShift Route. But, OpenShift Routes cannot point to headless Services, so I needed a workaround to create a ClusterIP service. Using the product dashboard will let you do this too. + +Create a ClusterIP Service targeting the same pods: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: llama-3-3-70b-clusterip +spec: + type: ClusterIP + selector: + serving.kserve.io/inferenceservice: llama-3-3-70b + ports: + - name: http1 + port: 80 + targetPort: 8080 +``` + +Then create a Route to that ClusterIP Service: + +```bash +oc create route edge llama-3-3-70b-route \ + --service=llama-3-3-70b-clusterip \ + --port=http1 +``` + +Get the external host: + +```bash +oc get route llama-3-3-70b-route -o jsonpath='{.spec.host}{"\n"}' +```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@guide-local-agent-to-vllm-on-cluster.md` around lines 116 - 118, Add concrete, executable steps to the "2. Expose the Model Externally" section: include a ClusterIP Service manifest (name: llama-3-3-70b-clusterip) with selector serving.kserve.io/inferenceservice: llama-3-3-70b and a port mapping (name http1, port 80 -> targetPort 8080), then show the oc create route edge command to create an OpenShift Route (name: llama-3-3-70b-route) pointing to that service with --port=http1, and finally include the oc get route ... jsonpath command to print the external host; place these concrete manifest and commands right after the explanation about headless Services so users can apply them directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@guide-local-agent-to-vllm-on-cluster.md`:
- Around line 18-53: The markdown code fences for both YAML examples (the
ServingRuntime block containing "kind: ServingRuntime" and the InferenceService
block containing "kind: InferenceService") are missing language identifiers;
update each opening fence from ``` to ```yaml so the blocks are recognized as
YAML (apply the same change for the additional YAML block referenced later
around the InferenceService example).
---
Duplicate comments:
In `@guide-local-agent-to-vllm-on-cluster.md`:
- Line 57: Fix the malformed quoting in the parser example: replace the
mismatched smart quotes and the stray backtick so the example consistently uses
plain backticks and correct parser names — e.g., show
`--tool-call-parser=llama3_json`, then list parser names as `mistral`,
`openai/gpt-oss-120b`, and `openai` (remove the stray backtick after openai and
any smart quotes).
- Around line 116-118: Add concrete, executable steps to the "2. Expose the
Model Externally" section: include a ClusterIP Service manifest (name:
llama-3-3-70b-clusterip) with selector serving.kserve.io/inferenceservice:
llama-3-3-70b and a port mapping (name http1, port 80 -> targetPort 8080), then
show the oc create route edge command to create an OpenShift Route (name:
llama-3-3-70b-route) pointing to that service with --port=http1, and finally
include the oc get route ... jsonpath command to print the external host; place
these concrete manifest and commands right after the explanation about headless
Services so users can apply them directly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 12316f44-2b0f-4ed1-be83-1d116686b39d
📒 Files selected for processing (1)
guide-local-agent-to-vllm-on-cluster.md
| ``` | ||
| apiVersion: serving.kserve.io/v1alpha1 | ||
| kind: ServingRuntime | ||
| metadata: | ||
| name: vllm-runtime | ||
| spec: | ||
| containers: | ||
| - name: kserve-container | ||
| image: quay.io/modh/vllm #pin to version you need | ||
| args: | ||
| # --- Core (required) --- | ||
| - --port=8080 # KServe expects this port | ||
| - --model=/mnt/models # KServe mounts weights here | ||
| - --served-model-name={{.Name}} # matches InferenceService name | ||
|
|
||
| # --- Tool calling (required for agentic use cases) --- | ||
| - --enable-auto-tool-choice # enables tool call detection | ||
| - --tool-call-parser=llama3_json # model-specific | ||
|
|
||
| # --- Memory management (adjust per GPU) --- | ||
| - --max-model-len=16384 # caps context window to reduce KV cache VRAM | ||
| - --gpu-memory-utilization=0.9 # fraction of VRAM vLLM will use (default 0.9) | ||
|
|
||
| # --- Multi-GPU (if needed) --- | ||
| # - --tensor-parallel-size=4 # split model across N GPUs | ||
|
|
||
| # --- Optional --- | ||
| # - --chat-template=/path/to/template.jinja # only if model lacks built-in chat templates (see below) | ||
| # - --tool-parser-plugin=/path/to/plugin.py # for custom parsers (e.g., Nemotron) | ||
| ports: | ||
| - containerPort: 8080 | ||
| protocol: TCP | ||
| supportedModelFormats: | ||
| - name: vLLM | ||
| autoSelect: true | ||
| ``` |
There was a problem hiding this comment.
Specify fenced code block languages for lint compliance.
Both YAML blocks are missing fence languages (MD040), which will keep markdownlint warninging in CI.
Proposed fix
-```
+```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
...
-```
+```
-```
+```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
...
-```
+```Also applies to: 91-114
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 18-18: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@guide-local-agent-to-vllm-on-cluster.md` around lines 18 - 53, The markdown
code fences for both YAML examples (the ServingRuntime block containing "kind:
ServingRuntime" and the InferenceService block containing "kind:
InferenceService") are missing language identifiers; update each opening fence
from ``` to ```yaml so the blocks are recognized as YAML (apply the same change
for the additional YAML block referenced later around the InferenceService
example).
mpk-droid
left a comment
There was a problem hiding this comment.
added a comment. lmk what you think.
| @@ -0,0 +1,118 @@ | |||
| # Running an Agent Locally with a Model Served on vLLM on OpenShift AI | |||
There was a problem hiding this comment.
Thanks for documenting this — the content itself is useful. However, I think this doc targets the platform engineer persona (creating ServingRuntime/InferenceService CRs, tuning vLLM memory, exposing Routes), whereas this repo has so far focused on the AI engineer persona.
From the AI engineer's perspective, they just need to point their agent at a LlamaStack URL to access the model — the infrastructure behind it is abstracted away.
Before adding platform-focused content to this repo, I think we'd need to establish a clear pattern for how we organize and scope docs across personas. Otherwise we risk mixing concerns and making the repo harder to navigate for our primary audience.
There was a problem hiding this comment.
Good point, I largely agree. I would say though that the separation isn't as strict.
The platform engineer would deploy the operator (Kserve, llama-stack), then an end-user (an AIE etc) would need to still create instances against that operator (i.e. the CRs - Serving Runtime/inference serving, lls etc).
Wdyt?
There was a problem hiding this comment.
Ah, I see. Thanks for the clarification. In my mind, the Platform engineer would also create the CRs for the operator.
Drawing on my personal experience, i feel like a clear line between AI eng and Platform Eng would be that Plat eng handles all things cluster and exposes URIs for various resources and the AI Eng uses those resources to carry out some actions. I feel like this creates a cleaner boundaries in their responsibilities. Lets imagine that one of the resources is crash looping, with the line drawn as above, its clear that platform engineer would resolve it. wdyt?
…ed-hat-data-services#31, red-hat-data-services#51) - Fail fast with a clear error when image.repository is empty instead of rendering an invalid ":latest" image reference - Add checksum/secret annotation to pod template so pods auto-restart when secret values change (e.g. API_KEY rotation via make deploy) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No description provided.