
# Red Hat OpenShift AI (RHOAI) Observability MCP


An MCP (Model Context Protocol) server that gives AI assistants direct access to Red Hat OpenShift AI observability data. Query Prometheus metrics, Alertmanager alerts, Loki logs, Grafana dashboards, and Kubernetes cluster state to troubleshoot vLLM inference workloads.

## Features

- 17 tools across 6 categories for comprehensive observability
- vLLM-aware metrics (TTFT, TPOT, E2E latency, KV cache, queue depth)
- Composite investigation tools that correlate metrics, logs, and alerts automatically
- Auto-detection of in-cluster vs. external access to OpenShift services
- Built on FastMCP with async backends via httpx
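The vLLM signals listed above map to PromQL expressions over vLLM's Prometheus exporter. The sketch below shows what such queries look like; the `vllm:*` metric names follow vLLM's exporter conventions but can vary between vLLM versions, and the model name `llama-3-8b` is a placeholder.

```python
# Illustrative PromQL builders for the vLLM signals above.
# Metric names (vllm:* family) follow vLLM's Prometheus exporter and
# may differ across vLLM versions -- treat them as examples, not a spec.

def quantile_over_histogram(metric: str, q: float, model: str, window: str = "5m") -> str:
    """Build a histogram_quantile expression for one vLLM latency histogram."""
    return (
        f"histogram_quantile({q}, sum by (le) ("
        f'rate({metric}_bucket{{model_name="{model}"}}[{window}])))'
    )

# Example signal set for a hypothetical model deployment.
VLLM_SIGNALS = {
    "ttft_p95": quantile_over_histogram("vllm:time_to_first_token_seconds", 0.95, "llama-3-8b"),
    "tpot_p95": quantile_over_histogram("vllm:time_per_output_token_seconds", 0.95, "llama-3-8b"),
    "kv_cache_usage": 'vllm:gpu_cache_usage_perc{model_name="llama-3-8b"}',
    "queue_depth": 'vllm:num_requests_waiting{model_name="llama-3-8b"}',
}
```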

## Architecture

```mermaid
graph TD
    A[Claude / AI Assistant] -->|MCP Protocol| B[rhoai-observability-mcp]
    B --> C[Thanos / Prometheus]
    B --> D[Alertmanager]
    B --> E[Loki]
    B --> F[Grafana]
    B --> G[Kubernetes / OpenShift]
```

Backends:

| Backend | Purpose | Source |
| --- | --- | --- |
| Prometheus (Thanos) | Metrics queries (PromQL) | `backends/prometheus.py` |
| Alertmanager | Active alerts and alert groups | `backends/alertmanager.py` |
| Loki | Log queries (LogQL) | `backends/loki.py` |
| Grafana | Dashboard discovery and panel queries | `backends/grafana.py` |
| Kubernetes (OpenShift) | Pods, events, nodes, InferenceServices | `backends/openshift.py` |

## Quick Start

```bash
# Clone and install
git clone https://github.com/amito/rhoai-observability-mcp.git
cd rhoai-observability-mcp
uv pip install -e ".[dev]"

# Configure (see INSTALL.md for all options)
export THANOS_URL=https://thanos-querier.openshift-monitoring.svc:9091
export ALERTMANAGER_URL=https://alertmanager-main.openshift-monitoring.svc:9093
export OPENSHIFT_TOKEN=$(oc whoami -t)

# Run
python -m rhoai_obs_mcp.server
```

See `INSTALL.md` for detailed setup, configuration, and Claude Desktop integration.

## Build & Deploy

### Build the container image

```bash
make build
```

Override the image name or tag:

```bash
make build IMAGE_NAME=quay.io/myorg/rhoai-observability-mcp IMAGE_TAG=v1.0.0
```

### Push to registry

```bash
make push
```

### Deploy to OpenShift

Prerequisites: `oc login` to your cluster and create the target project:

```bash
oc new-project rhoai-obs-mcp
```

Then deploy:

```bash
make deploy
```

This applies the manifests in `deploy/` to the `rhoai-obs-mcp` namespace. To deploy to a different namespace:

```bash
make deploy NAMESPACE=my-namespace
```

### Undeploy

```bash
make undeploy
```

If you deployed to a custom namespace, pass the same value:

```bash
make undeploy NAMESPACE=my-namespace
```

### CI-built images

Container images are automatically built from `main` and published to GHCR:

```
ghcr.io/amito/rhoai-observability-mcp:latest
```

## Tool Reference

### Metrics

| Tool | Description |
| --- | --- |
| `query_prometheus` | Execute a raw PromQL query against Thanos Querier |
| `get_vllm_metrics` | Get a summary of key vLLM metrics (TTFT, TPOT, E2E, cache, queue) for a model |
| `list_metrics` | List available Prometheus metric names, optionally filtered by regex |

### Alerts

| Tool | Description |
| --- | --- |
| `get_alerts` | Get active alerts from Alertmanager, filterable by severity and labels |
| `get_alert_groups` | Get alerts grouped by their routing labels |

### Logs

| Tool | Description |
| --- | --- |
| `query_logs` | Execute a LogQL query against OpenShift LokiStack |
| `get_pod_logs` | Get logs for a specific pod by namespace and name |

### Cluster

| Tool | Description |
| --- | --- |
| `get_pods` | List pods in a namespace with status, restarts, and creation time |
| `get_events` | List Kubernetes events, filterable by resource and reason |
| `get_node_status` | Get node status, capacity, and GPU allocation info |
| `describe_resource` | Get a detailed description of a Kubernetes resource |
| `get_inference_services` | List KServe InferenceService resources |

### Dashboards

| Tool | Description |
| --- | --- |
| `list_dashboards` | List available Grafana dashboards, filterable by tag or title |
| `get_dashboard_panels` | Get panels and their queries from a Grafana dashboard |

### Investigation

| Tool | Description |
| --- | --- |
| `investigate_latency` | Correlate latency metrics, error logs, and alerts for a vLLM model |
| `investigate_gpu` | Correlate GPU utilization, KV cache, queue depth, and pod status |
| `investigate_errors` | Correlate error logs, alerts, and Kubernetes events in a namespace |
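The composite investigation tools fan out to several backends and merge the results into one report. A minimal sketch of that shape, with hypothetical fetcher callables standing in for the real backend methods:

```python
# Sketch of a composite investigation tool: query metrics, logs, and
# alerts concurrently, then return one merged report. The fetcher
# arguments are hypothetical stand-ins for the real backend methods.
import asyncio


async def investigate_latency(model: str, fetch_metrics, fetch_logs, fetch_alerts) -> dict:
    # Fan out to all three backends at once rather than sequentially.
    metrics, logs, alerts = await asyncio.gather(
        fetch_metrics(model), fetch_logs(model), fetch_alerts(model)
    )
    return {
        "model": model,
        "latency_metrics": metrics,
        "error_logs": logs,
        "active_alerts": alerts,
    }
```

Running the three queries concurrently keeps a single investigation call close to the latency of the slowest backend instead of the sum of all three.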


## License

MIT

