## Use Cases & Example Interactions

Below are typical DevOps questions and how the MCP server helps answer them.
In these examples, the user selects an [MCP Prompt Template](#mcp-prompt-templates) that tells the LLM what tools to call and in what order.
The user can also attach [MCP Resources](#mcp-resources) to provide the LLM with additional context about the cluster.
### "Why is my Kafka cluster not ready?"
51
53
52
54
**User prompt**: "The Kafka cluster `my-cluster` in namespace `kafka-prod` has been NotReady for 15 minutes. What's going on?"
53
55
54
-
**MCP flow** (LLM calls tools based on the `diagnose-cluster-issue` prompt template):
56
+
**MCP flow** (user selects the `diagnose-cluster-issue` prompt template, which guides the LLM to call the following tools):
55
57
1.`get_resource_status` - reads Kafka CR status and conditions, finds `NotReady` condition with reason
56
58
2.`get_operator_logs` - reads Strimzi operator logs filtered to the relevant time window and namespace, finds reconciliation errors
57
59
3.`get_pod_status` - checks broker pod statuses, finds one pod in CrashLoopBackOff
58
60
4. LLM correlates findings and decides pod logs are needed
59
61
5.`get_pod_logs` - reads logs from the failing pod (including previous container logs), finds OOM kill
60
62
**Expected output summary**: broker-2 is in CrashLoopBackOff due to OOM kills (container memory limit is 2Gi but the JVM heap requires more), which is blocking the cluster reconciliation.
The suggestion would be to increase the memory limit in the KafkaNodePool resource.
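As later sections note, tool outputs are structured DTOs rather than raw `kubectl` output. Purely as an illustration, the step-1 result of `get_resource_status` could be shaped like the following Java record; the field names are hypothetical, not the proposal's actual schema:

```java
// Hypothetical sketch of a get_resource_status result; field names are
// illustrative only, not the proposal's actual schema.
public record ResourceStatus(
        String kind,       // e.g. "Kafka"
        String namespace,  // e.g. "kafka-prod"
        String name,       // e.g. "my-cluster"
        boolean ready,     // false while the NotReady condition is present
        String reason,     // condition reason reported by Strimzi
        String message) {  // human-readable condition message
}
```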
### "Show me all Kafka clusters across namespaces"
- `strimzi://operator.strimzi.io/v1/namespaces/{namespace}/clusteroperator/{name}/status` - Strimzi operator deployment status, version, and managed namespaces.

The URI hierarchy follows the Kubernetes API structure (`{apiGroup}/{version}/namespaces/{namespace}/{resource}/{name}`) so that the resource paths are familiar to Kubernetes users.
The `v1` in the URI refers to the Strimzi API version (matching `kafka.strimzi.io/v1`).
MCP Resources don't have built-in API versioning, so the version is encoded in the URI path.
Since the MCP server only supports the v1 Strimzi API, versioning in the URI is mainly for consistency with the K8s API structure — if it turns out to be unnecessary, the URIs can be simplified later.
The operator URI uses the same pattern for consistency, even though the operator is a Kubernetes Deployment, not a Strimzi CRD.

Dedicated MCP Resource URIs for other Strimzi CRs (KafkaConnect, KafkaUser, etc.) are out of scope for the initial implementation and can be added later.
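For illustration, resolving the operator URI template above for a concrete instance looks like this; the namespace and operator name are example values only:

```java
// Example values only; the template is the operator status URI defined above.
String operatorStatusUri = String.format(
        "strimzi://operator.strimzi.io/v1/namespaces/%s/clusteroperator/%s/status",
        "kafka-prod", "strimzi-cluster-operator");
// -> strimzi://operator.strimzi.io/v1/namespaces/kafka-prod/clusteroperator/strimzi-cluster-operator/status
```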
The MCP server supports two approaches to diagnostic workflows:

**Client-driven (prompt templates + fine-grained tools)**: The user selects a [prompt template](#mcp-prompt-templates), which the LLM then follows — calling fine-grained tools one by one (`get_resource_status`, `get_pod_logs`, etc.), reasoning about results between calls, and asking the user for clarification when needed.
This works with every MCP client since prompt templates are just text and tools are standard MCP tool calls.

**Server-driven (composite tools with Sampling and Elicitation)**: The composite `diagnose_cluster` tool internally orchestrates multiple steps.
During execution, it uses two MCP features to interact with the client:

- [Sampling](https://modelcontextprotocol.io/specification/2025-11-25/client/sampling): The server sends intermediate data (e.g., CR status + operator logs gathered so far) to the LLM and asks it to analyze the findings and decide whether deeper investigation (e.g., pod logs) is needed — all within a single tool call.
- [Elicitation](https://modelcontextprotocol.io/specification/2025-11-25/client/elicitation): The server asks the user for input mid-execution, for example "Which namespace?" when the query is ambiguous, or "What time range?" before retrieving logs.

This provides a more streamlined experience with fewer round trips, but requires MCP client support for Sampling and Elicitation.
Composite tools gracefully fall back when the client doesn't support these features (e.g., skipping Sampling and returning raw data, or using default values instead of Elicitation).
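To make the fallback behavior concrete, here is a minimal sketch of how the composite tool could branch on client capabilities. All types and helpers below are hypothetical illustrations, not the actual Quarkus MCP Server API:

```java
// All types below are hypothetical sketches for illustration.
interface McpExchange {
    boolean clientSupportsSampling();
    boolean clientSupportsElicitation();
    String sample(String prompt);    // ask the client's LLM mid-call (Sampling)
    String elicit(String question);  // ask the human user mid-call (Elicitation)
}

class DiagnoseClusterTool {

    String diagnoseCluster(McpExchange exchange, String namespace, String cluster) {
        // Elicitation fallback: use a default time range if the client
        // cannot prompt the user.
        String timeRange = exchange.clientSupportsElicitation()
                ? exchange.elicit("What time range should I inspect?")
                : "last 15 minutes";

        String findings = collectResourceStatus(namespace, cluster)
                + "\n" + collectOperatorLogs(namespace, timeRange);

        // Sampling fallback: skip the analysis step and return raw data for
        // the client-side LLM to reason about.
        if (!exchange.clientSupportsSampling()) {
            return findings;
        }
        String verdict = exchange.sample(
                "Do these findings require pod logs to find the root cause?\n" + findings);
        if (verdict.toLowerCase().contains("yes")) {
            findings += "\n" + collectPodLogs(namespace, cluster, timeRange);
        }
        return findings;
    }

    private String collectResourceStatus(String ns, String cluster) { return "..."; }
    private String collectOperatorLogs(String ns, String timeRange) { return "..."; }
    private String collectPodLogs(String ns, String cluster, String timeRange) { return "..."; }
}
```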
- **Quarkus MCP Server**: Provides MCP protocol support and tool definition framework.
- **Quarkus extensions**: Logging (JSON output), Config (environment-based configuration), Health (liveness and readiness endpoints).
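Assuming the annotation-based tool definitions of the Quarkus MCP Server extension (`@Tool`/`@ToolArg`), a tool like `get_resource_status` might be wired up roughly as follows; the service and DTO are hypothetical sketches:

```java
import io.quarkiverse.mcp.server.Tool;
import io.quarkiverse.mcp.server.ToolArg;
import jakarta.inject.Inject;

public class StrimziStatusTools {

    @Inject
    StatusService statusService; // hypothetical service wrapping the Fabric8 client

    // The descriptions are what the LLM sees when deciding which tool to call.
    @Tool(description = "Read the status and conditions of a Strimzi Kafka custom resource")
    ResourceStatus getResourceStatus(
            @ToolArg(description = "Kubernetes namespace of the Kafka CR") String namespace,
            @ToolArg(description = "Name of the Kafka CR") String name) {
        return statusService.lookup(namespace, name);
    }
}
```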
#### Pluggable Architecture

The proposal mentions three pluggable systems: [log providers](#log-handling), [metrics providers](#metrics-strategy), and [guardrail filters](#prompt-injection-protection).
All three follow the same pattern using Quarkus CDI (Contexts and Dependency Injection):

1. Define a Java interface for each provider like `LogProvider`, `MetricsProvider`, `GuardrailFilter`.
2. Implement the default version as a CDI bean, like a `KubernetesLogProvider` that reads pod logs via Fabric8 or a `PodScrapingMetricsProvider` that reads HTTP endpoints.
3. Select the active implementation via configuration using Quarkus `@LookupIfProperty` or `@IfBuildProfile` annotations.
4. Custom implementations can be packaged as separate Maven modules and added to the container image.

For example, a user who wants to use an external log aggregation system would add the provider module to the image and set `mcp.log.provider=custom-provider` in the configuration.
The tool layer stays the same regardless of which provider is active.

Provider interfaces and the default implementations live in the `common` module so they can be reused by other MCP servers.
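A minimal sketch of steps 1-3, assuming the `LogProvider` interface and the `mcp.log.provider` property from the example above (the `kubernetes` property value is an assumption):

```java
import io.quarkus.arc.lookup.LookupIfProperty;
import jakarta.enterprise.context.ApplicationScoped;
import java.util.List;

// LogProvider.java: hypothetical provider interface in the common module.
public interface LogProvider {
    List<String> fetchLogs(String namespace, String pod, int tailLines);
}

// KubernetesLogProvider.java: default bean, only resolvable when the
// (assumed) property mcp.log.provider is set to "kubernetes".
@ApplicationScoped
@LookupIfProperty(name = "mcp.log.provider", stringValue = "kubernetes")
class KubernetesLogProvider implements LogProvider {
    @Override
    public List<String> fetchLogs(String namespace, String pod, int tailLines) {
        // Reads pod logs via the Fabric8 client (see the Log Handling section).
        return List.of();
    }
}
```

The tool layer would then inject a `jakarta.enterprise.inject.Instance<LogProvider>` and resolve whichever bean's lookup condition matches the configured property.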
#### Security & RBAC

**Dedicated Service Account and ClusterRole**:
The MCP protocol supports OAuth 2.1 for authentication and authorization ([spec](https://modelcontextprotocol.io/specification/2025-11-25), [Quarkus OIDC example](https://quarkus.io/blog/secure-mcp-oidc-client/)).
This is out of scope for the initial implementation.
MCP-level auth/authz (e.g., user identity propagation via OIDC, Kubernetes user impersonation) can be proposed separately later, as it will be reused by other MCP servers in StreamsHub-MCP.
This will be especially important when a Kafka MCP is added.
The approach should align with how StreamsHub Console handles authorization.

**Output sanitization**:
- TLS secrets: Only certificate metadata is returned (expiry date, issuer, SANs).
All tool inputs are validated against JSON Schemas with strict type checking.
For example, namespace and resource name inputs are validated against Kubernetes naming conventions (`[a-z0-9]([-a-z0-9]*[a-z0-9])?`).

Tool outputs are structured DTOs, not raw strings, so there is less risk of prompt injection through resource content.
Fields that could contain user-controlled content (labels, annotations, log messages) are sanitized by stripping control characters and limiting length.

Each tool has a maximum response size so it won't fill up the LLM context.

Input validation and output sanitization should be implemented as a pluggable filter chain.
This way the filters can be shared across all MCP modules in the mono-repo, and more advanced guardrails (e.g., integration with external safety systems) can be added later without changing the tool implementations.
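A minimal sketch of the filter-chain idea in plain Java; all names are hypothetical, and in practice the filters would be CDI beans selected via configuration like the other pluggable systems:

```java
import java.util.List;

// Hypothetical guardrail filter applied to user-controlled field values
// before they reach the LLM.
public interface GuardrailFilter {
    String apply(String value);
}

// Strips control characters that could smuggle hidden instructions.
// A real implementation would likely preserve newlines and tabs in log lines.
class ControlCharFilter implements GuardrailFilter {
    @Override
    public String apply(String value) {
        return value.replaceAll("\\p{Cntrl}", "");
    }
}

// Caps the length of fields such as labels and annotations.
class LengthLimitFilter implements GuardrailFilter {
    private final int maxLength;
    LengthLimitFilter(int maxLength) { this.maxLength = maxLength; }
    @Override
    public String apply(String value) {
        return value.length() <= maxLength ? value : value.substring(0, maxLength);
    }
}

// Runs every configured filter in order over a field value.
class GuardrailChain {
    private final List<GuardrailFilter> filters;
    GuardrailChain(List<GuardrailFilter> filters) { this.filters = filters; }
    String sanitize(String value) {
        String result = value;
        for (GuardrailFilter filter : filters) {
            result = filter.apply(result);
        }
        return result;
    }
}
```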
#### Log Handling

Log retrieval needs some care to avoid returning too much data or leaking sensitive information:

- **Request throttling**: Log requests are rate-limited to protect the Kubernetes API server from too much load.
- **Sensitive data redaction**: Log lines are scanned for common sensitive patterns (bearer tokens, passwords, connection strings, API keys) and redacted before being returned.

**Pluggable log provider**: Log retrieval should be implemented behind a provider interface so that different backends can be used.
The default implementation reads logs directly from Kubernetes pod logs via the Fabric8 client (see the sketch below).
Other providers (e.g., for external log aggregation systems) can be added later and selected via configuration.
This keeps the tool layer independent of the log source.

**Limitations**: Direct pod log reading only provides current and previous container logs.
For production use cases that require querying historical logs, a provider for an external log aggregation system would be needed.
This is not part of the initial implementation and should be proposed in a follow-up proposal.
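As a sketch of the default provider's core read path, assuming the Fabric8 pod-log DSL (`tailingLines`, `terminated`) and a configured tail limit:

```java
import io.fabric8.kubernetes.client.KubernetesClient;

// Sketch of the default Kubernetes log provider's core read path.
public class KubernetesLogReader {

    private final KubernetesClient client;
    private final int maxTailLines; // bounded tail, per the response-size limits above

    public KubernetesLogReader(KubernetesClient client, int maxTailLines) {
        this.client = client;
        this.maxTailLines = maxTailLines;
    }

    public String readLog(String namespace, String pod, String container, boolean previous) {
        String raw;
        if (previous) {
            // terminated() requests the previous container's log (like kubectl logs -p),
            // which is how the OOM-killed broker in the example above is inspected.
            raw = client.pods().inNamespace(namespace).withName(pod)
                    .inContainer(container).terminated().getLog();
        } else {
            raw = client.pods().inNamespace(namespace).withName(pod)
                    .inContainer(container).tailingLines(maxTailLines).getLog();
        }
        return redact(raw);
    }

    private String redact(String log) {
        // Placeholder for the sensitive-pattern redaction described above.
        return log;
    }
}
```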
#### Metrics Strategy

**Pluggable metrics provider**: Metrics retrieval should be implemented behind a provider interface, same as [log retrieval](#log-handling).
The default implementation reads metrics directly from broker and controller pod HTTP endpoints, using either the [Strimzi Metrics Reporter](https://github.com/strimzi/metrics-reporter) or the Kafka JMX exporter depending on user configuration.
Other providers (e.g., for Prometheus or other metrics systems) can be added later and selected via configuration.

**Prometheus API integration** (querying an existing Prometheus instance for historical and aggregated metrics) **is explicitly out of scope** for this proposal and can be added as a separate metrics provider later.

**Caveats**:
- Metrics port: The server auto-detects the metrics port from the pod spec container ports (Strimzi Metrics Reporter defaults to 8080, JMX exporter typically 9404); see the sketch after this list.
- Custom metric names: Strimzi and Kafka metric names can be customized by users, so the MCP server will need configurable metric name mappings.
- RBAC: Direct pod scraping needs `get` on `pods/proxy` (included in the ClusterRole above).
- Point-in-time only: Direct pod scraping only gives point-in-time metrics, no historical data.
  For historical data, a Prometheus provider would be needed.
- Scraping load: Direct pod scraping adds load to the broker/controller pods.
  During incidents this could make things worse, so metrics requests should be rate-limited (same as log requests).
  For production use, a centralized metrics system that has already collected the data is preferred.
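A sketch of the port auto-detection from the first caveat, assuming Fabric8 and treating both named `metrics` ports and the well-known defaults as matches; the exact detection rules are a design detail:

```java
import io.fabric8.kubernetes.api.model.ContainerPort;
import io.fabric8.kubernetes.api.model.Pod;
import java.util.Set;

// Sketch: find a likely metrics port in a broker/controller pod spec.
public final class MetricsPortDetector {

    // 8080: Strimzi Metrics Reporter default; 9404: typical JMX exporter port.
    private static final Set<Integer> KNOWN_METRICS_PORTS = Set.of(8080, 9404);

    public static int detect(Pod pod) {
        for (var container : pod.getSpec().getContainers()) {
            if (container.getPorts() == null) {
                continue;
            }
            for (ContainerPort port : container.getPorts()) {
                boolean namedMetrics = "metrics".equalsIgnoreCase(port.getName());
                boolean wellKnown = KNOWN_METRICS_PORTS.contains(port.getContainerPort());
                if (namedMetrics || wellKnown) {
                    return port.getContainerPort();
                }
            }
        }
        throw new IllegalStateException("No metrics port found in pod spec");
    }
}
```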
Metrics are mostly useful during ad-hoc incident investigation, for example "Is the broker under heavy load?", "What's the replication lag?", "Are there under-replicated partitions?".
For ongoing monitoring, existing alerting infrastructure (Prometheus alerts, Grafana) is still the primary tool.

**Future extensibility**: Integrations with external logging and metrics systems are not part of this proposal and should be proposed separately.
The pluggable provider architecture for both logs and metrics makes it possible to add such integrations without rewriting the core.
### Deployment & Delivery
### Option 1: Custom CLI Tool instead of MCP

A custom CLI tool could provide Strimzi-specific commands as a shortcut for kubectl, including structured output formats like JSON or YAML.
LLM agents can use CLI tools, so a CLI could work with LLMs.
However, MCP and CLI serve different use cases:

- MCP keeps the processing server-side, so the client only needs an MCP connection — no local kubectl access, kubeconfig, or Strimzi CLI installation required.
- MCP Prompt Templates provide multistep diagnostic workflows that combine multiple tools and resources into a guided sequence, while CLI help describes individual commands without orchestrating them together.
- MCP Resources allow the server to proactively push updated state to the client, while a CLI requires the LLM to repeatedly poll for changes.

A CLI could be a useful complement to MCP in the future, but for the initial use case (LLM-driven diagnosis) MCP is a better fit.
### Option 2: Use existing generic Kubernetes MCP server
0 commit comments