Skip to content
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
123 changes: 123 additions & 0 deletions guide-local-agent-to-vllm-on-cluster.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Running an Agent Locally with a Model Served on vLLM on OpenShift AI
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting this — the content itself is useful. However, I think this doc targets the platform engineer persona (creating ServingRuntime/InferenceService CRs, tuning vLLM memory, exposing Routes), whereas this repo has so far focused on the AI engineer persona.

From the AI engineer's perspective, they just need to point their agent at a LlamaStack URL to access the model — the infrastructure behind it is abstracted away.

Before adding platform-focused content to this repo, I think we'd need to establish a clear pattern for how we organize and scope docs across personas. Otherwise we risk mixing concerns and making the repo harder to navigate for our primary audience.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I largely agree. I would say though that the separation isn't as strict.
The platform engineer would deploy the operator (Kserve, llama-stack), then an end-user (an AIE etc) would need to still create instances against that operator (i.e. the CRs - Serving Runtime/inference serving, lls etc).
Wdyt?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks for the clarification. In my mind, the Platform engineer would also create the CRs for the operator.

Drawing on my personal experience, i feel like a clear line between AI eng and Platform Eng would be that Plat eng handles all things cluster and exposes URIs for various resources and the AI Eng uses those resources to carry out some actions. I feel like this creates a cleaner boundaries in their responsibilities. Lets imagine that one of the resources is crash looping, with the line drawn as above, its clear that platform engineer would resolve it. wdyt?


This guide covers the changes required to go from everything running locally (agent \+ model on your laptop) to agent running locally, model served on a cluster via vLLM \+ KServe on OpenShift AI.

## 1\. Deploy the Model on OpenShift AI

Two Custom Resources from **two different CRDs** work together:

- **ServingRuntime** (serving.kserve.io/v1alpha1) — defines how to run vLLM (container image, args, ports). Created once per inference engine. OpenShift AI ships several pre-installed.
- **InferenceService** (serving.kserve.io/v1beta1) — defines which model to deploy (storage, GPU count, memory). Created once per model, references a ServingRuntime.

### ServingRuntime CR

The ServingRuntime carries **all server-level vLLM args**, including tool calling, chat template, and memory management:

Specifically, \--enable-auto-tool-choice and \--tool-call-parser need to be enabled in the ServingRuntime Custom Resource because they control how the vLLM HTTP server parses the model's raw text output into structured tool\_calls in the OpenAI-compatible API response.

```
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: vllm-runtime
spec:
containers:
- name: kserve-container
image: quay.io/modh/vllm #pin to version you need
args:
# --- Core (required) ---
- --port=8080 # KServe expects this port
- --model=/mnt/models # KServe mounts weights here
- --served-model-name={{.Name}} # matches InferenceService name

# --- Tool calling (required for agentic use cases) ---
- --enable-auto-tool-choice # enables tool call detection
- --tool-call-parser=llama3_json # model-specific

# --- Memory management (adjust per GPU) ---
- --max-model-len=16384 # caps context window to reduce KV cache VRAM
- --gpu-memory-utilization=0.9 # fraction of VRAM vLLM will use (default 0.9)

# --- Multi-GPU (if needed) ---
# - --tensor-parallel-size=4 # split model across N GPUs

# --- Optional ---
# - --chat-template=/path/to/template.jinja # only if model lacks built-in chat templates (see below)
# - --tool-parser-plugin=/path/to/plugin.py # for custom parsers (e.g., Nemotron)
ports:
- containerPort: 8080
protocol: TCP
supportedModelFormats:
- name: vLLM
autoSelect: true
```
Comment on lines +18 to +53
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Specify fenced code block languages for lint compliance.

Both YAML blocks are missing fence languages (MD040), which will keep markdownlint warninging in CI.

Proposed fix
-```
+```yaml
 apiVersion: serving.kserve.io/v1alpha1
 kind: ServingRuntime
 ...
-```
+```

-```
+```yaml
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 ...
-```
+```

Also applies to: 91-114

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 18-18: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guide-local-agent-to-vllm-on-cluster.md` around lines 18 - 53, The markdown
code fences for both YAML examples (the ServingRuntime block containing "kind:
ServingRuntime" and the InferenceService block containing "kind:
InferenceService") are missing language identifiers; update each opening fence
from ``` to ```yaml so the blocks are recognized as YAML (apply the same change
for the additional YAML block referenced later around the InferenceService
example).


####

Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b’ will expect ‘openai\`.
Comment on lines +55 to +57
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo in parser example.

Line 57 has a typo: `'openai`` should end with a closing single quote instead of a backtick.

✍️ Proposed fix
-Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai\`.
+Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai'.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
####
Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b will expect openai\`.
####
Note: In this case, I also used ' \--tool-call-parser=llama3\_json' \- each model will use different parsers. For example, Mistral-Small-4-119B-2603 will expect 'mistral', 'openai/gpt-oss-120b' will expect 'openai'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guide-local-agent-to-vllm-on-cluster.md` around lines 55 - 57, Fix the typo
in the parser example: replace the stray backtick at the end of 'openai\`' with
a closing single quote so the example reads 'openai'; update the sentence that
references the parser flag (--tool-call-parser=llama3_json) and the model names
(Mistral-Small-4-119B-2603, openai/gpt-oss-120b) to ensure the quotes around
'mistral' and 'openai' are proper single quotes.


#### Memory Management Args

\--max-model-len and \--gpu-memory-utilization are also **server-level flags** because they control how much VRAM the vLLM process allocates, not what the model weights contain.

Why \--max-model-len matters: The model may support 131K tokens (baked into weights), but the KV cache for the full context window may exceed GPU memory. This flag caps the context window, reducing KV cache allocation. Typically,
KV cache VRAM \= tokens × 2 × layers × kv\_heads × head\_dim × bytes\_per\_param

When to adjust:
\- If vLLM crashes with \`Free memory on device ... less than desired GPU memory utilization\` → lower \`--max-model-len\`
\- If still OOM → lower \`--gpu-memory-utilization\`
\- The error message includes exact VRAM numbers — use them to calculate the right \`--max-model-len\`

Chat Template:

The chat template (Jinja2) formats conversations into the token format the model expects. Most instruct models embed it in \`tokenizer\_config.json\` and vLLM auto-loads it — no \`--chat-template\` flag needed.

How to verify vLLM auto-loaded it (after deploying):

oc logs $(oc get pods \-l serving.kserve.io/inferenceservice=\<isvc-name\> \-o name) \\
| grep \-i "chat.template\\|jinja\\|tokenizer\_config"

vLLM prints one of:

- "Using default chat template from tokenizer\_config.json" — auto-loaded, no action needed
- "Using supplied chat template" — loaded from \--chat-template flag
- A warning if no template was found — tool calls will likely fail


### Apply the InferenceService CR

The InferenceService carries **per-model concerns** — which runtime, where the weights are, how much GPU:

```
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-3-70b
annotations:
serving.kserve.io/deploymentMode: RawDeployment
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: vllm-runtime # ← references the ServingRuntime
storageUri: pvc://llama-70b-pvc # or s3://bucket/path
resources:
requests:
cpu: "2"
memory: "48Gi"
nvidia.com/gpu: "4"
limits:
cpu: "4"
memory: "96Gi"
nvidia.com/gpu: "4"
```

## 2\. Expose the Model Externally

When deploying vllm with KServe using RawDeployment, it creates a **headless Service** (clusterIP: None). To expose the model externally, I needed to expose an OpenShift Route. But, OpenShift Routes cannot point to headless Services, so I needed a workaround to create a ClusterIP service. Using the product dashboard will let you do this too.
Comment on lines +116 to +118
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Complete the "Expose the Model Externally" section.

This section mentions a workaround but provides no implementation details. Users cannot complete the workflow without concrete steps to:

  1. Create the ClusterIP service
  2. Create and configure the OpenShift Route

Please add the YAML examples or CLI commands needed to expose the model externally.

Would you like me to help draft the missing content based on standard KServe/OpenShift patterns?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guide-local-agent-to-vllm-on-cluster.md` around lines 116 - 118, The "Expose
the Model Externally" section is incomplete and must include concrete steps and
examples to create a ClusterIP Service that targets the vllm RawDeployment pods
(replacing the headless Service) and to create an OpenShift Route that points to
that ClusterIP Service; update the section to (1) explain how to create a
ClusterIP Service (kubectl expose or a Service YAML referencing the vllm
RawDeployment selector and port names) and include a sample Service manifest or
kubectl command, and (2) show how to create an OpenShift Route (oc create route
or a Route YAML) that targets the new Service with correct service name and
port, TLS/hostname examples, and any necessary annotations for KServe; reference
the RawDeployment/Service selector names and the Route/service names used in the
diff so readers can plug them into their manifests.


3. Update app code to point to vllm \+ KServe on OAI

This was one of the bigger changes that I’ve captured here \- initially using Claude & Anthropic’s Agent SDK and changed it to langgraph/pure python agents for this exercise.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Complete the "Update app code" section with concrete examples.

This section is incomplete—it mentions "bigger changes" that were "captured here" but provides no actual content. To fulfill the guide's promise of an end-to-end workflow, please add:

  1. Client configuration examples showing how to point the agent to the vLLM + KServe endpoint (URL, authentication, headers)
  2. Code snippets demonstrating the transition from local to cluster-served models
  3. Explanation of why you switched from Claude/Anthropic SDK to langgraph, and how it relates to this deployment pattern

Without this section, users cannot complete the workflow described in the guide's title.

Would you like me to help draft example code showing how to configure an OpenAI-compatible client to point to the vLLM endpoint on OpenShift AI?

🧰 Tools
🪛 LanguageTool

[style] ~122-~122: The word ‘bigger’ tends to be overused. Consider an alternative.
Context: ...m + KServe on OAI This was one of the bigger changes that I’ve captured here - initially us...

(BIG_DIFFERENCE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@guide-local-agent-to-vllm-on-cluster.md` around lines 120 - 122, The "3.
Update app code to point to vllm + KServe on OAI" section is incomplete—add
three concrete subsections: (1) "Client configuration" showing exact example
values for endpoint URL, authentication (bearer/API key), and required headers
for an OpenAI-compatible vLLM+KServe endpoint; (2) "Code examples" with short
before/after snippets demonstrating how to switch an OpenAI-compatible client
(e.g., code that constructs a client, sets base_url, headers, and sends a
completion/request) from a local dev URL to the cluster-served vLLM URL and how
to enable TLS/auth; and (3) "Why langgraph" explaining in 2–3 sentences why you
migrated from Claude/Anthropic SDK to langgraph/pure Python agents
(compatibility with OpenAI-compatible endpoints, lighter weight for custom
deployment workflows, and easier integration with KServe). Reference the section
title "Update app code to point to vllm + KServe on OAI" and include placeholder
examples for URL/auth so readers can copy-and-paste and adapt to their cluster.