Commit e639f70

Merge remote-tracking branch 'upstream/rhoai'
2 parents: b975235 + 8a13f1b

29 files changed

Lines changed: 1939 additions & 223 deletions

cmd/mcp-server/INTEGRATION_TEST.md

Lines changed: 67 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,67 @@
1+
# MCP Server Integration Tests
2+
3+
Manual test guide for the 7 MCP server tools against a live cluster.
4+
5+
## Prerequisites
6+
7+
| Requirement | Details |
8+
|---|---|
9+
| Cluster | OpenShift/Kubernetes with `KUBECONFIG` set |
10+
| ODH operator | Installed in `opendatahub-operator-system` |
11+
| CRs | DSCI + DSC created, at least one component `Managed` |
12+
| Tools | `jq` installed |
13+
| Binary | Built via `make mcp-server` |
14+
15+
## Setup
16+
17+
Run all commands from the repository root. Add these helpers to your shell:
18+
19+
```bash
20+
# For tools returning JSON
21+
call_tool() {
22+
echo "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"$1\",\"arguments\":${2:-\{\}}}}" \
23+
| ./bin/mcp-server | jq -r '.result.content[0].text' | jq .
24+
}
25+
26+
# For tools returning plaintext (pod_logs)
27+
call_tool_text() {
28+
echo "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"$1\",\"arguments\":${2:-\{\}}}}" \
29+
| ./bin/mcp-server | jq -r '.result.content[0].text'
30+
}
31+
32+
# Get operator pod name for later tests
33+
POD=$(kubectl get pods -n opendatahub-operator-system -o jsonpath='{.items[0].metadata.name}')
34+
```
35+
36+
## Test Cases
37+
38+
| # | Tool | Command | Expected |
39+
|---|------|---------|----------|
40+
| 1a | `platform_health` | `call_tool platform_health '{}'` | JSON with all sections (nodes, deployments, pods, etc.) |
41+
| 1b | `platform_health` | `call_tool platform_health '{"sections":"nodes,operator"}'` | JSON with only `nodes` and `operator` |
42+
| 1c | `platform_health` | `call_tool platform_health '{"layer":"infrastructure"}'` | JSON with infrastructure-layer sections only |
43+
| 2a | `operator_dependencies` | `call_tool operator_dependencies '{}'` | JSON array with status per dependency |
44+
| 2b | `operator_dependencies` | `call_tool operator_dependencies '{"name":"cert-manager"}'` | Single-entry JSON array for cert-manager |
45+
| 3a | `describe_resource` | `call_tool describe_resource '{"apiVersion":"dscinitialization.opendatahub.io/v2","kind":"DSCInitialization","name":"default-dsci"}'` | Full DSCI resource JSON (sensitive data redacted) |
46+
| 3b | `describe_resource` | `call_tool describe_resource "{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"${POD}\",\"namespace\":\"opendatahub-operator-system\"}"` | Full pod resource JSON |
47+
| 4a | `recent_events` | `call_tool recent_events '{}'` | JSON array of events (may be empty if healthy) |
48+
| 4b | `recent_events` | `call_tool recent_events '{"namespace":"opendatahub","since":"1h"}'` | Events from `opendatahub` namespace, last hour |
49+
| 5 | `classify_failure` | `call_tool classify_failure '{}'` | JSON with category, subcategory, error_code, evidence, confidence |
50+
| 6 | `component_status` | `call_tool component_status '{"component":"dashboard"}'` | JSON with CR conditions, pod statuses, deployment readiness |
51+
| 7 | `pod_logs` | `call_tool_text pod_logs "{\"pod_name\":\"${POD}\",\"namespace\":\"opendatahub-operator-system\",\"tail_lines\":10}"` | 10 lines of plaintext log output |
52+
53+
> For test 6, replace `dashboard` with whichever component is `Managed` in your DSC.
54+
55+
## Error Scenarios
56+
57+
| Tool | Command | Expected |
58+
|------|---------|----------|
59+
| `describe_resource` | `call_tool describe_resource '{"apiVersion":"v1","kind":"Pod","name":"does-not-exist","namespace":"opendatahub-operator-system"}'` | Not-found error message |
60+
| `component_status` | `call_tool component_status '{"component":"nonexistent"}'` | Component-not-found error message |
61+
| `pod_logs` | `call_tool_text pod_logs '{"pod_name":"does-not-exist","namespace":"opendatahub-operator-system"}'` | Pod-not-found error message |
62+
63+
## Pass Criteria
64+
65+
1. Valid JSON (or plaintext for `pod_logs`) returned without server crash
66+
2. Expected fields present in responses
67+
3. Error cases return descriptive messages, not stack traces
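
Criterion 3 can be checked mechanically. The sketch below is one possible helper and is not part of this commit: `has_stack_trace` and `check_error_output` are hypothetical names, and the patterns assume that a crashing Go binary prints a line starting with `panic:` or `goroutine N [`.

```bash
# Sketch: return 0 if the text contains a line that looks like a Go panic
# or goroutine dump (assumed signatures, not taken from the server's code).
has_stack_trace() {
  printf '%s\n' "$1" | grep -Eq '^(panic:|goroutine [0-9]+ \[)'
}

# Sketch: print PASS/FAIL for the output of an error-scenario command.
check_error_output() {
  if [ -z "$1" ]; then
    echo "FAIL: empty output"
  elif has_stack_trace "$1"; then
    echo "FAIL: stack trace in output"
  else
    echo "PASS"
  fi
}
```

With the Setup helpers loaded, an error scenario could then be run as `check_error_output "$(call_tool_text pod_logs '{"pod_name":"does-not-exist","namespace":"opendatahub-operator-system"}')"`. The check only rules out stack traces; whether the message is descriptive still needs a human read.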

cmd/mcp-server/Makefile

Lines changed: 6 additions & 1 deletion
@@ -6,8 +6,12 @@ mcp-server:
66

77
.PHONY: mcp-server-test
88
mcp-server-test:
9-
cd cmd/mcp-server && go test ./...
9+
cd cmd/mcp-server && go test -count=1 ./...
1010

11+
.PHONY: mcp-server-integration
12+
mcp-server-integration: mcp-server
13+
@echo "See cmd/mcp-server/INTEGRATION_TEST.md for manual integration test steps"
14+
@echo "Requires: KUBECONFIG set, ODH operator deployed, DSCI/DSC created"
1115

1216
.PHONY: diagnose
1317
diagnose:
@@ -23,3 +27,4 @@ ifneq ($(shell echo '$(TEST_NAME)' | grep -qE '[;|&$$`\\!><]' && echo bad),)
2327
$(error TEST_NAME contains shell metacharacters. Only alphanumeric, slash, underscore, hyphen, dot, and space are allowed.)
2428
endif
2529
cd cmd/mcp-server && go run . --one-shot --test-name="$(TEST_NAME)"
30+
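
The guard in the `diagnose` target rejects a fixed set of metacharacters, while its error message describes an allow-list; characters such as `(` or `*` appear to pass the regex even though the message excludes them. A sketch of the allow-list form as a shell function follows. The name `is_safe_test_name` is hypothetical and not part of this commit, and rejecting an empty name is a choice made for the sketch, not something the Makefile is known to do.

```bash
# Sketch: accept only alphanumerics, slash, underscore, hyphen, dot, and space.
is_safe_test_name() {
  local LC_ALL=C   # keep the A-Z / a-z ranges ASCII-only
  case "$1" in
    '') return 1 ;;                      # empty name (sketch's own choice)
    *[!A-Za-z0-9/_.\ -]*) return 1 ;;    # any character outside the allow-list
    *) return 0 ;;
  esac
}
```

A wrapper script could call `is_safe_test_name "$TEST_NAME" || exit 1` before invoking `make diagnose`.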

cmd/mcp-server/README.md

Lines changed: 206 additions & 66 deletions
@@ -1,98 +1,238 @@
1-
# OpenDataHub MCP Health Server
1+
# ODH MCP Server
22

3-
MCP (Model Context Protocol) server that exposes cluster health diagnostic tools for OpenDataHub.
3+
A [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) server that exposes diagnostic tools for OpenDataHub clusters. It communicates over stdio using JSON-RPC and is designed to be called by AI assistants (e.g. Claude Code, VS Code Copilot) or any MCP-compatible client.
44

5-
## Tools
5+
## Build & Run
66

7-
### pod_logs
7+
```bash
8+
# Build the binary
9+
make mcp-server
810

9-
Retrieve recent logs for a specific pod/container.
11+
# Run tests
12+
make mcp-server-test
13+
```
1014

11-
| Parameter | Type | Required | Description |
12-
|-----------|------|----------|-------------|
13-
| pod_name | string | yes | Name of the pod |
14-
| namespace | string | yes | Namespace of the pod |
15-
| container | string | no | Container name. Omit for the default container |
16-
| previous | boolean | no | Return logs from previous container instance. Default: false |
17-
| tail_lines | number | no | Lines from end of log to return. Default: 100 |
18-
| list_containers | boolean | no | Return list of all containers (init, regular, ephemeral) instead of logs. Default: false |
15+
The server requires a valid `KUBECONFIG` (or in-cluster config). Namespace defaults can be overridden via environment variables:
1916

20-
When a container name is invalid, the error response automatically includes the list of available containers.
17+
| Variable | Default | Description |
18+
|----------|---------|-------------|
19+
| `E2E_TEST_OPERATOR_NAMESPACE` | `opendatahub-operator-system` | Namespace where the ODH operator is deployed |
20+
| `E2E_TEST_APPLICATIONS_NAMESPACE` | `opendatahub` | Namespace where ODH components are deployed |
2121

22-
### platform_health
22+
## Tool Reference
2323

24-
Run cluster health checks and return report as JSON.
24+
### platform_health
2525

26-
| Parameter | Type | Required | Description |
27-
|-----------|------|----------|-------------|
28-
| sections | string | no | Comma-separated sections: nodes,deployments,pods,events,quotas,operator,dsci,dsc |
29-
| layer | string | no | Comma-separated layers: infrastructure,workload,operator. Ignored if sections is set |
30-
| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system |
31-
| applications_namespace | string | no | Apps namespace. Default: opendatahub |
32-
| summary | boolean | no | Return compact summary instead of full report. Default: true |
26+
Run cluster health checks and return the full report as JSON. Checks nodes, deployments, pods, events, quotas, operator status, DSCI, and DSC.
27+
28+
| Parameter | Type | Required | Default | Description |
29+
|-----------|------|----------|---------|-------------|
30+
| `sections` | string | no | all | Comma-separated sections: `nodes`, `deployments`, `pods`, `events`, `quotas`, `operator`, `dsci`, `dsc` |
31+
| `layer` | string | no | all | Comma-separated layers: `infrastructure`, `workload`, `operator`. Ignored if `sections` is set |
32+
| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace |
33+
| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace |
34+
35+
```jsonc
36+
// Example call
37+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"platform_health","arguments":{"sections":"nodes,operator"}}}
38+
39+
// Example output (truncated)
40+
{
41+
"nodes": {
42+
"total": 3,
43+
"ready": 3,
44+
"items": [{"name": "node-1", "ready": true, "roles": "control-plane,worker", ...}]
45+
},
46+
"operator": {
47+
"deployment": "opendatahub-operator-controller-manager",
48+
"ready": true,
49+
"replicas": 1,
50+
"readyReplicas": 1
51+
}
52+
}
53+
```
3354

34-
### component_status
55+
### operator_dependencies
3556

36-
Get detailed status of a specific ODH component including managed resources.
57+
Check status of dependent operators (cert-manager, Tempo, OpenTelemetry, Kueue, LWS, etc.).
3758

38-
| Parameter | Type | Required | Description |
39-
|-----------|------|----------|-------------|
40-
| component | string | yes | Component name (e.g. kserve, dashboard, workbenches) |
41-
| applications_namespace | string | no | Apps namespace. Default: opendatahub |
59+
| Parameter | Type | Required | Default | Description |
60+
|-----------|------|----------|---------|-------------|
61+
| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace |
62+
| `name` | string | no | all | Filter to a single dependency by name (e.g. `cert-manager`) |
4263

43-
Response includes `managedResources` listing Services, ConfigMaps, ServiceAccounts, and Secrets owned by the component.
64+
```jsonc
65+
// Example call
66+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"operator_dependencies","arguments":{}}}
4467

45-
### operator_dependencies
68+
// Example output
69+
[
70+
{"name": "cert-manager", "installed": true, "healthy": true, "version": "v1.14.0"},
71+
{"name": "tempo-operator", "installed": false, "healthy": false}
72+
]
73+
```
4674

47-
Check status of dependent operators (cert-manager, tempo, OTel, etc.).
75+
### describe_resource
4876

49-
| Parameter | Type | Required | Description |
50-
|-----------|------|----------|-------------|
51-
| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system |
52-
| name | string | no | Filter to specific dependent by name |
77+
Get any Kubernetes resource by apiVersion/kind/name. Returns the full resource as JSON with sensitive data redacted (Secret `.data`, token fields).
78+
79+
| Parameter | Type | Required | Default | Description |
80+
|-----------|------|----------|---------|-------------|
81+
| `apiVersion` | string | yes | | API version, e.g. `v1`, `apps/v1`, `datasciencecluster.opendatahub.io/v2` |
82+
| `kind` | string | yes | | Resource kind, e.g. `Pod`, `Deployment`, `DSCInitialization` |
83+
| `name` | string | yes | | Resource name |
84+
| `namespace` | string | no | | Namespace. Omit for cluster-scoped resources |
85+
86+
```jsonc
87+
// Example call
88+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"describe_resource","arguments":{
89+
"apiVersion":"dscinitialization.opendatahub.io/v2","kind":"DSCInitialization","name":"default-dsci"
90+
}}}
91+
92+
// Example output (truncated)
93+
{
94+
"apiVersion": "dscinitialization.opendatahub.io/v2",
95+
"kind": "DSCInitialization",
96+
"metadata": {"name": "default-dsci", "creationTimestamp": "2025-01-15T10:00:00Z", ...},
97+
"spec": {"applicationsNamespace": "opendatahub", ...},
98+
"status": {"phase": "Ready", "conditions": [...]}
99+
}
100+
```
53101

54102
### recent_events
55103

56-
Warning/error events in ODH namespaces sorted by last timestamp.
104+
Warning/error events in ODH namespaces, sorted by last timestamp (most recent first). Auto-discovers ODH namespaces from DSCI if not specified.
105+
106+
| Parameter | Type | Required | Default | Description |
107+
|-----------|------|----------|---------|-------------|
108+
| `namespace` | string | no | auto-discover | Comma-separated namespaces to query |
109+
| `since` | string | no | `5m` | Go duration for look-back window (e.g. `5m`, `1h`) |
110+
| `event_type` | string | no | all | Filter by type: `Warning`, `Normal` |
111+
112+
```jsonc
113+
// Example call
114+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"recent_events","arguments":{"since":"1h"}}}
115+
116+
// Example output
117+
[
118+
{
119+
"namespace": "opendatahub",
120+
"name": "dashboard-pod.abc123",
121+
"kind": "Pod",
122+
"type": "Warning",
123+
"reason": "BackOff",
124+
"message": "Back-off restarting failed container",
125+
"count": 5,
126+
"lastTimestamp": "2025-01-15T12:30:00Z"
127+
}
128+
]
129+
```
130+
131+
### classify_failure
57132

58-
| Parameter | Type | Required | Description |
59-
|-----------|------|----------|-------------|
60-
| namespace | string | no | Comma-separated namespaces. Omit to auto-discover from DSCI |
61-
| since | string | no | Go duration for look-back window (e.g. 5m, 1h). Default: 5m |
62-
| event_type | string | no | Filter by type: Warning, Normal. Omit for all |
133+
Run cluster health checks and classify the failure deterministically. Returns a structured classification with category, subcategory, error code, evidence, and confidence.
134+
135+
| Parameter | Type | Required | Default | Description |
136+
|-----------|------|----------|---------|-------------|
137+
| `sections` | string | no | all | Same as `platform_health` |
138+
| `layer` | string | no | all | Same as `platform_health` |
139+
| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace |
140+
| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace |
141+
142+
```jsonc
143+
// Example call
144+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"classify_failure","arguments":{}}}
145+
146+
// Example output
147+
{
148+
"category": "component",
149+
"subcategory": "degraded",
150+
"error_code": "COMP_DEGRADED",
151+
"evidence": "Dashboard deployment has 0/1 ready replicas",
152+
"confidence": 0.9
153+
}
154+
```
63155

64-
Event output includes a `count` field showing how many times the event occurred.
156+
### component_status
65157

66-
### describe_resource
158+
Get detailed status of a specific ODH component: CR conditions, pod statuses, and deployment readiness.
159+
160+
| Parameter | Type | Required | Default | Description |
161+
|-----------|------|----------|---------|-------------|
162+
| `component` | string | yes | | Component name: `kserve`, `dashboard`, `workbenches`, `ray`, `trustyai`, `modelregistry`, `datasciencepipelines`, `trainingoperator`, `feastoperator`, `trainer`, `kueue`, `mlflowoperator`, `sparkoperator`, etc. |
163+
| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace |
164+
165+
```jsonc
166+
// Example call
167+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"component_status","arguments":{"component":"dashboard"}}}
168+
169+
// Example output
170+
{
171+
"component": "dashboard",
172+
"crFound": true,
173+
"conditions": [
174+
{"type": "Ready", "status": "True", "reason": "Ready", "message": ""}
175+
],
176+
"deployments": [
177+
{"name": "odh-dashboard", "replicas": 2, "ready": 2}
178+
],
179+
"pods": [
180+
{"name": "odh-dashboard-abc12", "phase": "Running"},
181+
{"name": "odh-dashboard-def34", "phase": "Running"}
182+
]
183+
}
184+
```
185+
186+
### pod_logs
67187

68-
Get any Kubernetes resource by apiVersion/kind/name. Returns full resource as JSON with sensitive data redacted.
188+
Retrieve recent logs for a specific pod/container. Returns plaintext (not JSON).
189+
190+
| Parameter | Type | Required | Default | Description |
191+
|-----------|------|----------|---------|-------------|
192+
| `pod_name` | string | yes | | Name of the pod |
193+
| `namespace` | string | yes | | Namespace of the pod |
194+
| `container` | string | no | default container | Container name |
195+
| `previous` | boolean | no | `false` | Return logs from the previous container instance |
196+
| `tail_lines` | number | no | `100` | Number of lines from the end of the log |
197+
198+
```jsonc
199+
// Example call
200+
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"pod_logs","arguments":{
201+
"pod_name":"odh-dashboard-abc12","namespace":"opendatahub","tail_lines":10
202+
}}}
203+
```
69204

70-
| Parameter | Type | Required | Description |
71-
|-----------|------|----------|-------------|
72-
| apiVersion | string | yes | API version (e.g. v1, apps/v1) |
73-
| kind | string | yes | Resource kind (e.g. Pod, Deployment) |
74-
| name | string | yes | Resource name |
75-
| namespace | string | no | Namespace (omit for cluster-scoped resources) |
205+
```text
206+
// Example output (plaintext, not JSON)
207+
2025-01-15T12:00:01Z INFO Starting server on :8080
208+
2025-01-15T12:00:02Z INFO Connected to database
209+
2025-01-15T12:00:03Z INFO Health check passed
210+
...
211+
```
76212

77-
### classify_failure
213+
Output is capped at 50KB. If exceeded, a `[truncated: output exceeded 50KB limit]` marker is appended.
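
A client that post-processes logs may want to detect the cap. The following is a minimal sketch with a hypothetical helper name, `is_truncated`; it assumes the marker is the final text of the output once trailing newlines are stripped, as they are by shell command substitution.

```bash
# Sketch: return 0 if the log text ends with the truncation marker.
is_truncated() {
  case "$1" in
    *'[truncated: output exceeded 50KB limit]') return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, after `logs=$(... | ./bin/mcp-server | jq -r '.result.content[0].text')`, a script could run `is_truncated "$logs"` and retry with a smaller `tail_lines` value.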
78214

79-
Run health checks and classify the failure deterministically.
215+
## Client Configuration
80216

81-
| Parameter | Type | Required | Description |
82-
|-----------|------|----------|-------------|
83-
| sections | string | no | Comma-separated sections to check |
84-
| layer | string | no | Comma-separated layers to check |
85-
| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system |
86-
| applications_namespace | string | no | Apps namespace. Default: opendatahub |
217+
**Claude Code:** This repo includes a `.mcp.json` at the project root — no setup needed.
87218

88-
## Running
219+
**Cursor / Claude Desktop:** Add to your MCP config (`.cursor/mcp.json` for Cursor, `claude_desktop_config.json` for Claude Desktop):
89220

90-
```bash
91-
cd cmd/mcp-server && go run .
221+
```json
222+
{
223+
"mcpServers": {
224+
"odh-diagnostics": {
225+
"command": "/absolute/path/to/opendatahub-operator/bin/mcp-server",
226+
"env": {
227+
"KUBECONFIG": "/absolute/path/to/.kube/config"
228+
}
229+
}
230+
}
231+
}
92232
```
93233

94-
## Testing
234+
Build the binary first with `make mcp-server`. The `env` block can be omitted if `KUBECONFIG` is already in your shell environment.
95235

96-
```bash
97-
cd cmd/mcp-server && go test -v ./...
98-
```
236+
## Integration Testing
237+
238+
For manual testing against a live cluster, see [INTEGRATION_TEST.md](INTEGRATION_TEST.md).
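
Every example call in the tool reference uses the same JSON-RPC envelope and varies only the tool name and arguments. As a minimal sketch (the function name `build_request` is hypothetical and not part of the repository), the envelope can be generated with `printf` so the request can be inspected before it is piped to the server. It assumes the arguments are already a valid JSON object and that the tool name contains nothing that needs JSON escaping.

```bash
# Sketch: print a tools/call request for the given tool.
#   $1  tool name (assumed to need no JSON escaping)
#   $2  arguments as a JSON object; defaults to {} when omitted or empty
build_request() {
  local name=$1
  local args=${2:-}
  if [ -z "$args" ]; then
    args='{}'
  fi
  printf '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"%s","arguments":%s}}\n' \
    "$name" "$args"
}
```

Running `build_request platform_health '{"sections":"nodes,operator"}'` prints the same request shown in the `platform_health` example, and `build_request recent_events '{"since":"1h"}' | ./bin/mcp-server` sends one to a locally built binary.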
