|
1 | | -# OpenDataHub MCP Health Server |
| 1 | +# ODH MCP Server |
2 | 2 |
|
3 | | -MCP (Model Context Protocol) server that exposes cluster health diagnostic tools for OpenDataHub. |
| 3 | +A [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) server that exposes diagnostic tools for OpenDataHub clusters. It communicates over stdio using JSON-RPC and is designed to be called by AI assistants (e.g. Claude Code, VS Code Copilot) or any MCP-compatible client. |
4 | 4 |
|
5 | | -## Tools |
| 5 | +## Build & Run |
6 | 6 |
|
7 | | -### pod_logs |
| 7 | +```bash |
| 8 | +# Build the binary |
| 9 | +make mcp-server |
8 | 10 |
|
9 | | -Retrieve recent logs for a specific pod/container. |
| 11 | +# Run tests |
| 12 | +make mcp-server-test |
| 13 | +``` |
10 | 14 |
|
11 | | -| Parameter | Type | Required | Description | |
12 | | -|-----------|------|----------|-------------| |
13 | | -| pod_name | string | yes | Name of the pod | |
14 | | -| namespace | string | yes | Namespace of the pod | |
15 | | -| container | string | no | Container name. Omit for the default container | |
16 | | -| previous | boolean | no | Return logs from previous container instance. Default: false | |
17 | | -| tail_lines | number | no | Lines from end of log to return. Default: 100 | |
18 | | -| list_containers | boolean | no | Return list of all containers (init, regular, ephemeral) instead of logs. Default: false | |
| 15 | +The server requires a valid `KUBECONFIG` (or in-cluster config). Namespace defaults can be overridden via environment variables: |
19 | 16 |
|
20 | | -When a container name is invalid, the error response automatically includes the list of available containers. |
| 17 | +| Variable | Default | Description | |
| 18 | +|----------|---------|-------------| |
| 19 | +| `E2E_TEST_OPERATOR_NAMESPACE` | `opendatahub-operator-system` | Namespace where the ODH operator is deployed | |
| 20 | +| `E2E_TEST_APPLICATIONS_NAMESPACE` | `opendatahub` | Namespace where ODH components are deployed | |
21 | 21 |
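The fallback order for the operator namespace (explicit tool argument, then environment variable, then built-in default) can be sketched as follows. This is an illustration in Python, not the server's code: the function name `resolve_operator_namespace` is hypothetical, and the precedence of an explicit argument over the environment is assumed from the "auto-discover" defaults listed in the tool reference. The applications namespace adds a DSCI lookup that is not modeled here.

```python
import os

# Default taken from the environment-variable table above.
DEFAULT_OPERATOR_NAMESPACE = "opendatahub-operator-system"


def resolve_operator_namespace(explicit=None, env=None):
    """Hypothetical helper: pick the operator namespace.

    Assumed order: explicit argument -> E2E_TEST_OPERATOR_NAMESPACE -> default.
    """
    env = os.environ if env is None else env
    return (
        explicit
        or env.get("E2E_TEST_OPERATOR_NAMESPACE")
        or DEFAULT_OPERATOR_NAMESPACE
    )
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.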
|
22 | | -### platform_health |
| 22 | +## Tool Reference |
23 | 23 |
|
24 | | -Run cluster health checks and return report as JSON. |
| 24 | +### platform_health |
25 | 25 |
|
26 | | -| Parameter | Type | Required | Description | |
27 | | -|-----------|------|----------|-------------| |
28 | | -| sections | string | no | Comma-separated sections: nodes,deployments,pods,events,quotas,operator,dsci,dsc | |
29 | | -| layer | string | no | Comma-separated layers: infrastructure,workload,operator. Ignored if sections is set | |
30 | | -| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system | |
31 | | -| applications_namespace | string | no | Apps namespace. Default: opendatahub | |
32 | | -| summary | boolean | no | Return compact summary instead of full report. Default: true | |
| 26 | +Run cluster health checks and return the full report as JSON. Checks nodes, deployments, pods, events, quotas, operator status, DSCI, and DSC. |
| 27 | + |
| 28 | +| Parameter | Type | Required | Default | Description | |
| 29 | +|-----------|------|----------|---------|-------------| |
| 30 | +| `sections` | string | no | all | Comma-separated sections: `nodes`, `deployments`, `pods`, `events`, `quotas`, `operator`, `dsci`, `dsc` | |
| 31 | +| `layer` | string | no | all | Comma-separated layers: `infrastructure`, `workload`, `operator`. Ignored if `sections` is set | |
| 32 | +| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace | |
| 33 | +| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace | |
| 34 | + |
| 35 | +```jsonc |
| 36 | +// Example call |
| 37 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"platform_health","arguments":{"sections":"nodes,operator"}}} |
| 38 | + |
| 39 | +// Example output (truncated) |
| 40 | +{ |
| 41 | + "nodes": { |
| 42 | + "total": 3, |
| 43 | + "ready": 3, |
| 44 | + "items": [{"name": "node-1", "ready": true, "roles": "control-plane,worker", ...}] |
| 45 | + }, |
| 46 | + "operator": { |
| 47 | + "deployment": "opendatahub-operator-controller-manager", |
| 48 | + "ready": true, |
| 49 | + "replicas": 1, |
| 50 | + "readyReplicas": 1 |
| 51 | + } |
| 52 | +} |
| 53 | +``` |
33 | 54 |
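For clients that are not MCP-aware, the example call above can be produced programmatically. The sketch below, in Python for illustration, only builds the `tools/call` message as a single newline-terminated JSON line; the helper name `build_tool_call` is hypothetical, and a real session would first need the MCP `initialize` handshake, which is not shown.

```python
import json


def build_tool_call(name, arguments=None, request_id=1):
    """Return one newline-terminated JSON-RPC 2.0 'tools/call' message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }
    # Compact separators keep the message on a single line for stdio framing.
    return json.dumps(request, separators=(",", ":")) + "\n"


line = build_tool_call("platform_health", {"sections": "nodes,operator"})
```

The same helper covers the other tools in this reference by changing `name` and `arguments`.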
|
34 | | -### component_status |
| 55 | +### operator_dependencies |
35 | 56 |
|
36 | | -Get detailed status of a specific ODH component including managed resources. |
| 57 | +Check status of dependent operators (cert-manager, Tempo, OpenTelemetry, Kueue, LWS, etc.). |
37 | 58 |
|
38 | | -| Parameter | Type | Required | Description | |
39 | | -|-----------|------|----------|-------------| |
40 | | -| component | string | yes | Component name (e.g. kserve, dashboard, workbenches) | |
41 | | -| applications_namespace | string | no | Apps namespace. Default: opendatahub | |
| 59 | +| Parameter | Type | Required | Default | Description | |
| 60 | +|-----------|------|----------|---------|-------------| |
| 61 | +| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace | |
| 62 | +| `name` | string | no | all | Filter to a single dependency by name (e.g. `cert-manager`) | |
42 | 63 |
|
43 | | -Response includes `managedResources` listing Services, ConfigMaps, ServiceAccounts, and Secrets owned by the component. |
| 64 | +```jsonc |
| 65 | +// Example call |
| 66 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"operator_dependencies","arguments":{}}} |
44 | 67 |
|
45 | | -### operator_dependencies |
| 68 | +// Example output |
| 69 | +[ |
| 70 | + {"name": "cert-manager", "installed": true, "healthy": true, "version": "v1.14.0"}, |
| 71 | + {"name": "tempo-operator", "installed": false, "healthy": false} |
| 72 | +] |
| 73 | +``` |
46 | 74 |
|
47 | | -Check status of dependent operators (cert-manager, tempo, OTel, etc.). |
| 75 | +### describe_resource |
48 | 76 |
|
49 | | -| Parameter | Type | Required | Description | |
50 | | -|-----------|------|----------|-------------| |
51 | | -| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system | |
52 | | -| name | string | no | Filter to specific dependent by name | |
| 77 | +Get any Kubernetes resource by apiVersion/kind/name. Returns the full resource as JSON with sensitive data redacted (Secret `.data`, token fields). |
| 78 | + |
| 79 | +| Parameter | Type | Required | Default | Description | |
| 80 | +|-----------|------|----------|---------|-------------| |
| 81 | +| `apiVersion` | string | yes | | API version, e.g. `v1`, `apps/v1`, `datasciencecluster.opendatahub.io/v2` | |
| 82 | +| `kind` | string | yes | | Resource kind, e.g. `Pod`, `Deployment`, `DSCInitialization` | |
| 83 | +| `name` | string | yes | | Resource name | |
| 84 | +| `namespace` | string | no | | Namespace. Omit for cluster-scoped resources | |
| 85 | + |
| 86 | +```jsonc |
| 87 | +// Example call |
| 88 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"describe_resource","arguments":{ |
| 89 | + "apiVersion":"dscinitialization.opendatahub.io/v2","kind":"DSCInitialization","name":"default-dsci" |
| 90 | +}}} |
| 91 | + |
| 92 | +// Example output (truncated) |
| 93 | +{ |
| 94 | + "apiVersion": "dscinitialization.opendatahub.io/v2", |
| 95 | + "kind": "DSCInitialization", |
| 96 | + "metadata": {"name": "default-dsci", "creationTimestamp": "2025-01-15T10:00:00Z", ...}, |
| 97 | + "spec": {"applicationsNamespace": "opendatahub", ...}, |
| 98 | + "status": {"phase": "Ready", "conditions": [...]} |
| 99 | +} |
| 100 | +``` |
53 | 101 |
|
54 | 102 | ### recent_events |
55 | 103 |
|
56 | | -Warning/error events in ODH namespaces sorted by last timestamp. |
| 104 | +Warning/error events in ODH namespaces, sorted by last timestamp (most recent first). Auto-discovers ODH namespaces from DSCI if `namespace` is not specified. |
| 105 | + |
| 106 | +| Parameter | Type | Required | Default | Description | |
| 107 | +|-----------|------|----------|---------|-------------| |
| 108 | +| `namespace` | string | no | auto-discover | Comma-separated namespaces to query | |
| 109 | +| `since` | string | no | `5m` | Go duration for look-back window (e.g. `5m`, `1h`) | |
| 110 | +| `event_type` | string | no | all | Filter by type: `Warning`, `Normal` | |
| 111 | + |
| 112 | +```jsonc |
| 113 | +// Example call |
| 114 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"recent_events","arguments":{"since":"1h"}}} |
| 115 | + |
| 116 | +// Example output |
| 117 | +[ |
| 118 | + { |
| 119 | + "namespace": "opendatahub", |
| 120 | + "name": "dashboard-pod.abc123", |
| 121 | + "kind": "Pod", |
| 122 | + "type": "Warning", |
| 123 | + "reason": "BackOff", |
| 124 | + "message": "Back-off restarting failed container", |
| 125 | + "count": 5, |
| 126 | + "lastTimestamp": "2025-01-15T12:30:00Z" |
| 127 | + } |
| 128 | +] |
| 129 | +``` |
| 130 | + |
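To make the `since` window and the sort order concrete, here is a small Python sketch of the filtering described above. It is an assumption-laden illustration rather than the server's logic: `parse_duration` handles only whole-number `h`/`m`/`s` units (Go durations also allow fractions and sub-second units), and the names `parse_duration`, `parse_ts`, and `recent` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Parse a subset of Go duration syntax, e.g. '5m', '1h', '1h30m'."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the simplified pattern did not consume completely.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def parse_ts(value):
    """Parse timestamps shaped like the example output's lastTimestamp."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def recent(events, since, now):
    """Keep events inside the look-back window, most recent first."""
    cutoff = now - parse_duration(since)
    kept = [e for e in events if parse_ts(e["lastTimestamp"]) >= cutoff]
    return sorted(kept, key=lambda e: parse_ts(e["lastTimestamp"]), reverse=True)
```

With `since` set to `5m`, an event last seen 30 minutes before `now` is dropped, while one seen a minute ago is kept and listed first.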
| 131 | +### classify_failure |
57 | 132 |
|
58 | | -| Parameter | Type | Required | Description | |
59 | | -|-----------|------|----------|-------------| |
60 | | -| namespace | string | no | Comma-separated namespaces. Omit to auto-discover from DSCI | |
61 | | -| since | string | no | Go duration for look-back window (e.g. 5m, 1h). Default: 5m | |
62 | | -| event_type | string | no | Filter by type: Warning, Normal. Omit for all | |
| 133 | +Run cluster health checks and classify the failure deterministically. Returns a structured classification with category, subcategory, error code, evidence, and confidence. |
| 134 | + |
| 135 | +| Parameter | Type | Required | Default | Description | |
| 136 | +|-----------|------|----------|---------|-------------| |
| 137 | +| `sections` | string | no | all | Same as `platform_health` | |
| 138 | +| `layer` | string | no | all | Same as `platform_health` | |
| 139 | +| `operator_namespace` | string | no | auto-discover (env → `opendatahub-operator-system`) | Operator namespace | |
| 140 | +| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace | |
| 141 | + |
| 142 | +```jsonc |
| 143 | +// Example call |
| 144 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"classify_failure","arguments":{}}} |
| 145 | + |
| 146 | +// Example output |
| 147 | +{ |
| 148 | + "category": "component", |
| 149 | + "subcategory": "degraded", |
| 150 | + "error_code": "COMP_DEGRADED", |
| 151 | + "evidence": "Dashboard deployment has 0/1 ready replicas", |
| 152 | + "confidence": 0.9 |
| 153 | +} |
| 154 | +``` |
63 | 155 |
|
64 | | -Event output includes a `count` field showing how many times the event occurred. |
| 156 | +### component_status |
65 | 157 |
|
66 | | -### describe_resource |
| 158 | +Get detailed status of a specific ODH component: CR conditions, pod statuses, and deployment readiness. |
| 159 | + |
| 160 | +| Parameter | Type | Required | Default | Description | |
| 161 | +|-----------|------|----------|---------|-------------| |
| 162 | +| `component` | string | yes | | Component name: `kserve`, `dashboard`, `workbenches`, `ray`, `trustyai`, `modelregistry`, `datasciencepipelines`, `trainingoperator`, `feastoperator`, `trainer`, `kueue`, `mlflowoperator`, `sparkoperator`, etc. | |
| 163 | +| `applications_namespace` | string | no | auto-discover (DSCI → env → `opendatahub`) | Applications namespace | |
| 164 | + |
| 165 | +```jsonc |
| 166 | +// Example call |
| 167 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"component_status","arguments":{"component":"dashboard"}}} |
| 168 | + |
| 169 | +// Example output |
| 170 | +{ |
| 171 | + "component": "dashboard", |
| 172 | + "crFound": true, |
| 173 | + "conditions": [ |
| 174 | + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""} |
| 175 | + ], |
| 176 | + "deployments": [ |
| 177 | + {"name": "odh-dashboard", "replicas": 2, "ready": 2} |
| 178 | + ], |
| 179 | + "pods": [ |
| 180 | + {"name": "odh-dashboard-abc12", "phase": "Running"}, |
| 181 | + {"name": "odh-dashboard-def34", "phase": "Running"} |
| 182 | + ] |
| 183 | +} |
| 184 | +``` |
| 185 | + |
| 186 | +### pod_logs |
67 | 187 |
|
68 | | -Get any Kubernetes resource by apiVersion/kind/name. Returns full resource as JSON with sensitive data redacted. |
| 188 | +Retrieve recent logs for a specific pod/container. Returns plaintext (not JSON). |
| 189 | + |
| 190 | +| Parameter | Type | Required | Default | Description | |
| 191 | +|-----------|------|----------|---------|-------------| |
| 192 | +| `pod_name` | string | yes | | Name of the pod | |
| 193 | +| `namespace` | string | yes | | Namespace of the pod | |
| 194 | +| `container` | string | no | default container | Container name | |
| 195 | +| `previous` | boolean | no | `false` | Return logs from the previous container instance | |
| 196 | +| `tail_lines` | number | no | `100` | Number of lines from the end of the log | |
| 197 | + |
| 198 | +```jsonc |
| 199 | +// Example call |
| 200 | +{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"pod_logs","arguments":{ |
| 201 | + "pod_name":"odh-dashboard-abc12","namespace":"opendatahub","tail_lines":10 |
| 202 | +}}} |
| 203 | +``` |
69 | 204 |
|
70 | | -| Parameter | Type | Required | Description | |
71 | | -|-----------|------|----------|-------------| |
72 | | -| apiVersion | string | yes | API version (e.g. v1, apps/v1) | |
73 | | -| kind | string | yes | Resource kind (e.g. Pod, Deployment) | |
74 | | -| name | string | yes | Resource name | |
75 | | -| namespace | string | no | Namespace (omit for cluster-scoped resources) | |
| 205 | +```text |
| 206 | +// Example output (plaintext, not JSON) |
| 207 | +2025-01-15T12:00:01Z INFO Starting server on :8080 |
| 208 | +2025-01-15T12:00:02Z INFO Connected to database |
| 209 | +2025-01-15T12:00:03Z INFO Health check passed |
| 210 | +... |
| 211 | +``` |
76 | 212 |
|
77 | | -### classify_failure |
| 213 | +Output is capped at 50KB. If exceeded, a `[truncated: output exceeded 50KB limit]` marker is appended. |
78 | 214 |
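The cap can be pictured with a short Python sketch. Several details are assumptions, flagged in the comments: that "50KB" means 50 * 1024 bytes, that the limit applies to the UTF-8 encoded output, and that the marker goes on its own line. Only the marker text itself comes from this README; `cap_output` is a hypothetical name.

```python
MAX_BYTES = 50 * 1024  # assumption: "50KB" is 50 * 1024 bytes
MARKER = "[truncated: output exceeded 50KB limit]"


def cap_output(text, limit=MAX_BYTES):
    """Return text unchanged if it fits, else a truncated copy plus marker."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    head = data[:limit].decode("utf-8", errors="ignore")
    # Assumption: the marker is appended on a new line.
    return head + "\n" + MARKER
```

A client that needs complete logs should therefore check for the marker and, if present, request a smaller `tail_lines` value.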
|
79 | | -Run health checks and classify the failure deterministically. |
| 215 | +## Client Configuration |
80 | 216 |
|
81 | | -| Parameter | Type | Required | Description | |
82 | | -|-----------|------|----------|-------------| |
83 | | -| sections | string | no | Comma-separated sections to check | |
84 | | -| layer | string | no | Comma-separated layers to check | |
85 | | -| operator_namespace | string | no | Operator namespace. Default: opendatahub-operator-system | |
86 | | -| applications_namespace | string | no | Apps namespace. Default: opendatahub | |
| 217 | +**Claude Code:** This repo includes a `.mcp.json` at the project root — no setup needed. |
87 | 218 |
|
88 | | -## Running |
| 219 | +**Cursor / Claude Desktop:** Add to your MCP config (`.cursor/mcp.json` for Cursor, `claude_desktop_config.json` for Claude Desktop): |
89 | 220 |
|
90 | | -```bash |
91 | | -cd cmd/mcp-server && go run . |
| 221 | +```json |
| 222 | +{ |
| 223 | + "mcpServers": { |
| 224 | + "odh-diagnostics": { |
| 225 | + "command": "/absolute/path/to/opendatahub-operator/bin/mcp-server", |
| 226 | + "env": { |
| 227 | + "KUBECONFIG": "/absolute/path/to/.kube/config" |
| 228 | + } |
| 229 | + } |
| 230 | + } |
| 231 | +} |
92 | 232 | ``` |
93 | 233 |
|
94 | | -## Testing |
| 234 | +Build the binary first with `make mcp-server`. The `env` block can be omitted if `KUBECONFIG` is already in your shell environment. |
95 | 235 |
|
96 | | -```bash |
97 | | -cd cmd/mcp-server && go test -v ./... |
98 | | -``` |
| 236 | +## Integration Testing |
| 237 | + |
| 238 | +For manual testing against a live cluster, see [INTEGRATION_TEST.md](INTEGRATION_TEST.md). |