## Use Cases & Example Interactions

Below are typical DevOps questions and how the MCP server helps answer them.
In these examples, the user selects an [MCP Prompt Template](#mcp-prompt-templates) that tells the LLM what tools to call and in what order.
The user can also attach [MCP Resources](#mcp-resources) to provide the LLM with additional context about the cluster.
### "Why is my Kafka cluster not ready?"
51
53
52
54
**User prompt**: "The Kafka cluster `my-cluster` in namespace `kafka-prod` has been NotReady for 15 minutes. What's going on?"
53
55
54
-
**MCP flow** (LLM calls tools based on the `diagnose-cluster-issue` prompt template):
56
+
**MCP flow** (user selects the `diagnose-cluster-issue` prompt template, which guides the LLM to call the following tools):
55
57
1.`get_resource_status` - reads Kafka CR status and conditions, finds `NotReady` condition with reason
56
58
2.`get_operator_logs` - reads Strimzi operator logs filtered to the relevant time window and namespace, finds reconciliation errors
57
59
3.`get_pod_status` - checks broker pod statuses, finds one pod in CrashLoopBackOff
58
60
4. LLM correlates findings and decides pod logs are needed
59
61
5.`get_pod_logs` - reads logs from the failing pod (including previous container logs), finds OOM kill
60
62
**Expected output summary**: broker-2 is in CrashLoopBackOff due to OOM kills (container memory limit is 2Gi but the JVM heap requires more), which is blocking the cluster reconciliation.
The suggestion would be to increase the memory limit in the KafkaNodePool resource.
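As later sections note, tool outputs are structured DTOs rather than raw `kubectl` output. Purely as an illustration, the step-1 result of `get_resource_status` could be shaped like the following Java record; the field names are hypothetical, not the proposal's actual schema:

```java
// Hypothetical sketch of a get_resource_status result; field names are
// illustrative only, not the proposal's actual schema.
public record ResourceStatus(
        String kind,       // e.g. "Kafka"
        String namespace,  // e.g. "kafka-prod"
        String name,       // e.g. "my-cluster"
        boolean ready,     // false while the NotReady condition is present
        String reason,     // condition reason reported by Strimzi
        String message) {  // human-readable condition message
}
```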
### "Show me all Kafka clusters across namespaces"
- `strimzi://operator.strimzi.io/v1/namespaces/{namespace}/clusteroperator/{name}/status` - Strimzi operator deployment status, version, and managed namespaces.

The URI hierarchy follows the Kubernetes API structure (`{apiGroup}/{version}/namespaces/{namespace}/{resource}/{name}`) so that the resource paths are familiar to Kubernetes users.
The `v1` in the URI refers to the Strimzi API version (matching `kafka.strimzi.io/v1`).
MCP Resources don't have built-in API versioning, so the version is encoded in the URI path.
Since the MCP server only supports the v1 Strimzi API, versioning in the URI is mainly for consistency with the K8s API structure — if it turns out to be unnecessary, the URIs can be simplified later.
The operator URI uses the same pattern for consistency, even though the operator is a Kubernetes Deployment, not a Strimzi CRD.

Dedicated MCP Resource URIs for other Strimzi CRs (KafkaConnect, KafkaUser, etc.) are out of scope for the initial implementation and can be added later.
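For illustration, resolving the operator URI template above for a concrete instance looks like this; the namespace and operator name are example values only:

```java
// Example values only; the template is the operator status URI defined above.
String operatorStatusUri = String.format(
        "strimzi://operator.strimzi.io/v1/namespaces/%s/clusteroperator/%s/status",
        "kafka-prod", "strimzi-cluster-operator");
// -> strimzi://operator.strimzi.io/v1/namespaces/kafka-prod/clusteroperator/strimzi-cluster-operator/status
```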
The MCP server supports two approaches to diagnostic workflows:

**Client-driven (prompt templates + fine-grained tools)**: The user selects a [prompt template](#mcp-prompt-templates), which the LLM then follows — calling fine-grained tools one by one (`get_resource_status`, `get_pod_logs`, etc.), reasoning about results between calls, and asking the user for clarification when needed.
This works with every MCP client since prompt templates are just text and tools are standard MCP tool calls.

**Server-driven (composite tools with Sampling and Elicitation)**: The composite `diagnose_cluster` tool internally orchestrates multiple steps.
During execution, it uses two MCP features to interact with the client:

- [Sampling](https://modelcontextprotocol.io/specification/2025-11-25/client/sampling): The server sends intermediate data (e.g., CR status + operator logs gathered so far) to the LLM and asks it to analyze the findings and decide whether deeper investigation (e.g., pod logs) is needed — all within a single tool call.
- [Elicitation](https://modelcontextprotocol.io/specification/2025-11-25/client/elicitation): The server asks the user for input mid-execution, for example "Which namespace?" when the query is ambiguous, or "What time range?" before retrieving logs.

This provides a more streamlined experience with fewer round trips, but requires MCP client support for Sampling and Elicitation.
Composite tools gracefully fall back when the client doesn't support these features (e.g., skipping Sampling and returning raw data, or using default values instead of Elicitation).
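To make the fallback behavior concrete, here is a minimal sketch of how the composite tool could branch on client capabilities. All types and helpers below are hypothetical illustrations, not the actual Quarkus MCP Server API:

```java
// All types below are hypothetical sketches for illustration.
interface McpExchange {
    boolean clientSupportsSampling();
    boolean clientSupportsElicitation();
    String sample(String prompt);    // ask the client's LLM mid-call (Sampling)
    String elicit(String question);  // ask the human user mid-call (Elicitation)
}

class DiagnoseClusterTool {

    String diagnoseCluster(McpExchange exchange, String namespace, String cluster) {
        // Elicitation fallback: use a default time range if the client
        // cannot prompt the user.
        String timeRange = exchange.clientSupportsElicitation()
                ? exchange.elicit("What time range should I inspect?")
                : "last 15 minutes";

        String findings = collectResourceStatus(namespace, cluster)
                + "\n" + collectOperatorLogs(namespace, timeRange);

        // Sampling fallback: skip the analysis step and return raw data for
        // the client-side LLM to reason about.
        if (!exchange.clientSupportsSampling()) {
            return findings;
        }
        String verdict = exchange.sample(
                "Do these findings require pod logs to find the root cause?\n" + findings);
        if (verdict.toLowerCase().contains("yes")) {
            findings += "\n" + collectPodLogs(namespace, cluster, timeRange);
        }
        return findings;
    }

    private String collectResourceStatus(String ns, String cluster) { return "..."; }
    private String collectOperatorLogs(String ns, String timeRange) { return "..."; }
    private String collectPodLogs(String ns, String cluster, String timeRange) { return "..."; }
}
```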
- **Quarkus MCP Server**: Provides MCP protocol support and tool definition framework.
- **Quarkus extensions**: Logging (JSON output), Config (environment-based configuration), Health (liveness and readiness endpoints).
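Assuming the annotation-based tool definitions of the Quarkus MCP Server extension (`@Tool`/`@ToolArg`), a tool like `get_resource_status` might be wired up roughly as follows; the service and DTO are hypothetical sketches:

```java
import io.quarkiverse.mcp.server.Tool;
import io.quarkiverse.mcp.server.ToolArg;
import jakarta.inject.Inject;

public class StrimziStatusTools {

    @Inject
    StatusService statusService; // hypothetical service wrapping the Fabric8 client

    // The descriptions are what the LLM sees when deciding which tool to call.
    @Tool(description = "Read the status and conditions of a Strimzi Kafka custom resource")
    ResourceStatus getResourceStatus(
            @ToolArg(description = "Kubernetes namespace of the Kafka CR") String namespace,
            @ToolArg(description = "Name of the Kafka CR") String name) {
        return statusService.lookup(namespace, name);
    }
}
```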
#### Pluggable Architecture

The proposal mentions three pluggable systems: [log providers](#log-handling), [metrics providers](#metrics-strategy), and [guardrail filters](#prompt-injection-protection).
All three follow the same pattern using Quarkus CDI (Contexts and Dependency Injection):

1. Define a Java interface for each provider like `LogProvider`, `MetricsProvider`, `GuardrailFilter`.
2. Implement the default version as a CDI bean, like a `KubernetesLogProvider` that reads pod logs via Fabric8 or a `PodScrapingMetricsProvider` that reads HTTP endpoints.
3. Select the active implementation via configuration using Quarkus `@LookupIfProperty` or `@IfBuildProfile` annotations.
4. Custom implementations can be packaged as separate Maven modules and added to the container image.

For example, a user who wants to use an external log aggregation system would add the provider module to the image and set `mcp.log.provider=custom-provider` in the configuration.
The tool layer stays the same regardless of which provider is active.

Provider interfaces and the default implementations live in the `common` module so they can be reused by other MCP servers.
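A minimal sketch of steps 1-3, assuming the `LogProvider` interface and the `mcp.log.provider` property from the example above (the `kubernetes` property value is an assumption):

```java
import io.quarkus.arc.lookup.LookupIfProperty;
import jakarta.enterprise.context.ApplicationScoped;
import java.util.List;

// LogProvider.java: hypothetical provider interface in the common module.
public interface LogProvider {
    List<String> fetchLogs(String namespace, String pod, int tailLines);
}

// KubernetesLogProvider.java: default bean, only resolvable when the
// (assumed) property mcp.log.provider is set to "kubernetes".
@ApplicationScoped
@LookupIfProperty(name = "mcp.log.provider", stringValue = "kubernetes")
class KubernetesLogProvider implements LogProvider {
    @Override
    public List<String> fetchLogs(String namespace, String pod, int tailLines) {
        // Reads pod logs via the Fabric8 client (see the Log Handling section).
        return List.of();
    }
}
```

The tool layer would then inject a `jakarta.enterprise.inject.Instance<LogProvider>` and resolve whichever bean's lookup condition matches the configured property.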
#### Security & RBAC

**Dedicated Service Account and ClusterRole**:
The MCP protocol supports OAuth 2.1 for authentication and authorization ([spec](https://modelcontextprotocol.io/specification/2025-11-25), [Quarkus OIDC example](https://quarkus.io/blog/secure-mcp-oidc-client/)).
This is out of scope for the initial implementation.
MCP-level auth/authz (e.g., user identity propagation via OIDC, Kubernetes user impersonation) can be proposed separately later, as it will be reused by other MCP servers in StreamsHub-MCP.
This will be especially important when a Kafka MCP is added.
The approach should align with how StreamsHub Console handles authorization.

**Output sanitization**:
- TLS secrets: Only certificate metadata is returned (expiry date, issuer, SANs).
All tool inputs are validated against JSON Schemas with strict type checking.
For example, namespace and resource name inputs are validated against Kubernetes naming conventions (`[a-z0-9]([-a-z0-9]*[a-z0-9])?`).

Tool outputs are structured DTOs, not raw strings, so there is less risk of prompt injection through resource content.
Fields that could contain user-controlled content (labels, annotations, log messages) are sanitized by stripping control characters and limiting length.

Each tool has a maximum response size so it won't fill up the LLM context.

Input validation and output sanitization should be implemented as a pluggable filter chain.
This way the filters can be shared across all MCP modules in the mono-repo, and more advanced guardrails (e.g., integration with external safety systems) can be added later without changing the tool implementations.
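A minimal sketch of the filter-chain idea in plain Java; all names are hypothetical, and in practice the filters would be CDI beans selected via configuration like the other pluggable systems:

```java
import java.util.List;

// Hypothetical guardrail filter applied to user-controlled field values
// before they reach the LLM.
public interface GuardrailFilter {
    String apply(String value);
}

// Strips control characters that could smuggle hidden instructions.
// A real implementation would likely preserve newlines and tabs in log lines.
class ControlCharFilter implements GuardrailFilter {
    @Override
    public String apply(String value) {
        return value.replaceAll("\\p{Cntrl}", "");
    }
}

// Caps the length of fields such as labels and annotations.
class LengthLimitFilter implements GuardrailFilter {
    private final int maxLength;
    LengthLimitFilter(int maxLength) { this.maxLength = maxLength; }
    @Override
    public String apply(String value) {
        return value.length() <= maxLength ? value : value.substring(0, maxLength);
    }
}

// Runs every configured filter in order over a field value.
class GuardrailChain {
    private final List<GuardrailFilter> filters;
    GuardrailChain(List<GuardrailFilter> filters) { this.filters = filters; }
    String sanitize(String value) {
        String result = value;
        for (GuardrailFilter filter : filters) {
            result = filter.apply(result);
        }
        return result;
    }
}
```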
#### Log Handling

Log retrieval needs some care to avoid returning too much data or leaking sensitive information:

- **Request throttling**: Log requests are rate-limited to protect the Kubernetes API server from too much load.
- **Sensitive data redaction**: Log lines are scanned for common sensitive patterns (bearer tokens, passwords, connection strings, API keys) and redacted before being returned.

**Pluggable log provider**: Log retrieval should be implemented behind a provider interface so that different backends can be used.
The default implementation reads logs directly from Kubernetes pod logs via the Fabric8 client (see the sketch below).
Other providers (e.g., for external log aggregation systems) can be added later and selected via configuration.
This keeps the tool layer independent of the log source.

**Limitations**: Direct pod log reading only provides current and previous container logs.
For production use cases that require querying historical logs, a provider for an external log aggregation system would be needed.
This is not part of the initial implementation and should be proposed in a follow-up proposal.
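As a sketch of the default provider's core read path, assuming the Fabric8 pod-log DSL (`tailingLines`, `terminated`) and a configured tail limit:

```java
import io.fabric8.kubernetes.client.KubernetesClient;

// Sketch of the default Kubernetes log provider's core read path.
public class KubernetesLogReader {

    private final KubernetesClient client;
    private final int maxTailLines; // bounded tail, per the response-size limits above

    public KubernetesLogReader(KubernetesClient client, int maxTailLines) {
        this.client = client;
        this.maxTailLines = maxTailLines;
    }

    public String readLog(String namespace, String pod, String container, boolean previous) {
        String raw;
        if (previous) {
            // terminated() requests the previous container's log (like kubectl logs -p),
            // which is how the OOM-killed broker in the example above is inspected.
            raw = client.pods().inNamespace(namespace).withName(pod)
                    .inContainer(container).terminated().getLog();
        } else {
            raw = client.pods().inNamespace(namespace).withName(pod)
                    .inContainer(container).tailingLines(maxTailLines).getLog();
        }
        return redact(raw);
    }

    private String redact(String log) {
        // Placeholder for the sensitive-pattern redaction described above.
        return log;
    }
}
```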
#### Metrics Strategy

**Pluggable metrics provider**: Metrics retrieval should be implemented behind a provider interface, same as [log retrieval](#log-handling).
The default implementation reads metrics directly from broker and controller pod HTTP endpoints, using either the [Strimzi Metrics Reporter](https://github.com/strimzi/metrics-reporter) or the Kafka JMX exporter depending on user configuration.
Other providers (e.g., for Prometheus or other metrics systems) can be added later and selected via configuration.

**Prometheus API integration** (querying an existing Prometheus instance for historical and aggregated metrics) **is explicitly out of scope** for this proposal and can be added as a separate metrics provider later.

**Caveats**:
- Metrics port: The server auto-detects the metrics port from the pod spec container ports (Strimzi Metrics Reporter defaults to 8080, JMX exporter typically 9404); see the sketch after this list.
- Custom metric names: Strimzi and Kafka metric names can be customized by users, so the MCP server will need configurable metric name mappings.
- RBAC: Direct pod scraping needs `get` on `pods/proxy` (included in the ClusterRole above).
- Point-in-time only: Direct pod scraping only gives point-in-time metrics, no historical data.
  For historical data, a Prometheus provider would be needed.
- Scraping load: Direct pod scraping adds load to the broker/controller pods.
  During incidents this could make things worse, so metrics requests should be rate-limited (same as log requests).
  For production use, a centralized metrics system that has already collected the data is preferred.
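A sketch of the port auto-detection from the first caveat, assuming Fabric8 and treating both named `metrics` ports and the well-known defaults as matches; the exact detection rules are a design detail:

```java
import io.fabric8.kubernetes.api.model.ContainerPort;
import io.fabric8.kubernetes.api.model.Pod;
import java.util.Set;

// Sketch: find a likely metrics port in a broker/controller pod spec.
public final class MetricsPortDetector {

    // 8080: Strimzi Metrics Reporter default; 9404: typical JMX exporter port.
    private static final Set<Integer> KNOWN_METRICS_PORTS = Set.of(8080, 9404);

    public static int detect(Pod pod) {
        for (var container : pod.getSpec().getContainers()) {
            if (container.getPorts() == null) {
                continue;
            }
            for (ContainerPort port : container.getPorts()) {
                boolean namedMetrics = "metrics".equalsIgnoreCase(port.getName());
                boolean wellKnown = KNOWN_METRICS_PORTS.contains(port.getContainerPort());
                if (namedMetrics || wellKnown) {
                    return port.getContainerPort();
                }
            }
        }
        throw new IllegalStateException("No metrics port found in pod spec");
    }
}
```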
Metrics are mostly useful during ad-hoc incident investigation, for example "Is the broker under heavy load?", "What's the replication lag?", "Are there under-replicated partitions?".
For ongoing monitoring, existing alerting infrastructure (Prometheus alerts, Grafana) is still the primary tool.

**Future extensibility**: Integrations with external logging and metrics systems are not part of this proposal and should be proposed separately.
The pluggable provider architecture for both logs and metrics makes it possible to add such integrations without rewriting the core.
### Deployment & Delivery
### Option 1: Custom CLI Tool instead of MCP

A custom CLI tool could provide Strimzi-specific commands as a shortcut for kubectl, including structured output formats like JSON or YAML.
LLM agents can use CLI tools, so a CLI could work with LLMs.
However, MCP and CLI serve different use cases:

- MCP keeps the processing server-side, so the client only needs an MCP connection — no local kubectl access, kubeconfig, or Strimzi CLI installation required.
- MCP Prompt Templates provide multistep diagnostic workflows that combine multiple tools and resources into a guided sequence, while CLI help describes individual commands without orchestrating them together.
- MCP Resources allow the server to proactively push updated state to the client, while a CLI requires the LLM to repeatedly poll for changes.

A CLI could be a useful complement to MCP in the future, but for the initial use case (LLM-driven diagnosis) MCP is a better fit.
### Option 2: Use existing generic Kubernetes MCP server
0 commit comments