OpenTelemetry Instrumentation
Description
This epic adds OpenTelemetry metrics and distributed tracing across all SBOMScanner components: controller, worker, storage
API server, and MCP server. The goal is end-to-end observability of the scan pipeline, from ScanJob creation through catalog
discovery, SBOM generation, vulnerability scanning, and report aggregation.
Trace context is propagated through NATS JetStream message headers, so a single trace can follow a scan from the controller
reconciler through every worker handler. The OTel SDK dependencies already exist in go.mod; this epic wires them into the
application.
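The propagation described above can be sketched without any dependencies. The example below is a minimal illustration, not the project's implementation: it assumes the W3C `traceparent` header format and uses a local `Header` type that mirrors the `map[string][]string` shape of NATS message headers. `SpanContext`, `Inject`, and `Extract` are hypothetical names; real code would more likely rely on the OTel SDK's propagators with a carrier wrapping the NATS header.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Header mirrors the map[string][]string shape of NATS message headers so the
// sketch runs without the NATS client (assumption: real code uses nats.Header).
type Header map[string][]string

// SpanContext is a hypothetical, minimal stand-in for an OTel span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

const traceparentKey = "traceparent"

// traceparentRe matches version 00 of the W3C traceparent format.
// This sketch does not reject all-zero IDs, which the spec treats as invalid.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// Inject writes the producer's span context into the message header.
func Inject(h Header, sc SpanContext) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	h[traceparentKey] = []string{fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)}
}

// Extract reads the span context back on the consumer side.
// ok is false when the header is missing or malformed.
func Extract(h Header) (sc SpanContext, ok bool) {
	vals := h[traceparentKey]
	if len(vals) == 0 {
		return SpanContext{}, false
	}
	m := traceparentRe.FindStringSubmatch(vals[0])
	if m == nil {
		return SpanContext{}, false
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return SpanContext{}, false
	}
	// Bit 0 of the flags byte is the "sampled" flag.
	return SpanContext{TraceID: m[1], SpanID: m[2], Sampled: flags&1 == 1}, true
}

func main() {
	// Controller side: attach the current span context before publishing.
	h := Header{}
	Inject(h, SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	})

	// Worker side: recover the parent context before starting the handler span.
	parent, ok := Extract(h)
	fmt.Println(h[traceparentKey][0], parent.TraceID, ok)
}
```

Because the worker starts its span with the extracted context as parent, every handler span shares the trace ID that the controller created.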
All services export traces and metrics via OTLP (gRPC), configurable through standard OTel environment variables
(OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, etc.). Instrumentation is opt-in: when no exporter endpoint is
configured, the services behave as before with no overhead.
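One way to implement the opt-in behavior is to check for an OTLP endpoint at startup and skip SDK setup entirely when none is set. The sketch below uses only the standard library; `TelemetryEnabled` is a hypothetical helper, and the decision to also honor the signal-specific endpoint variables is an assumption rather than something this epic specifies.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// endpointVars lists the standard OTLP endpoint variables this sketch checks.
// Assumption: the signal-specific variables should also enable telemetry.
var endpointVars = []string{
	"OTEL_EXPORTER_OTLP_ENDPOINT",
	"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
	"OTEL_EXPORTER_OTLP_METRICS_ENDPOINT",
}

// TelemetryEnabled reports whether any OTLP endpoint is configured.
// getenv is injected so the check can be tested without touching the process environment.
func TelemetryEnabled(getenv func(string) string) bool {
	for _, name := range endpointVars {
		if strings.TrimSpace(getenv(name)) != "" {
			return true
		}
	}
	return false
}

func main() {
	if !TelemetryEnabled(os.Getenv) {
		// No exporter endpoint: leave the default no-op providers in place.
		fmt.Println("telemetry disabled: no OTLP endpoint configured")
		return
	}
	// Here a real service would build the OTLP exporters and register providers.
	fmt.Println("telemetry enabled for service:", os.Getenv("OTEL_SERVICE_NAME"))
}
```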
User Stories
- As an operator, I want distributed traces across the scan pipeline so I can debug slow or failed scans.
  - Given a ScanJob is created, when it progresses through catalog creation, SBOM generation, and vulnerability
    scanning, then the entire flow is captured as a single distributed trace with spans for each stage, linked through NATS
    JetStream message headers.
- As an operator, I want traces for external calls so I can identify bottlenecks.
  - Given the worker processes an image, when it calls a container registry API or invokes the Trivy library, then
    each call is captured as a child span with duration, status, and relevant attributes (registry URL, image reference).
- As an operator, I want the Kubernetes reconcilers and webhooks instrumented so I can monitor controller performance.
  - Given the controller is running, when a reconciler processes a resource or a webhook validates a request, then
    a span is created with the resource kind, name, namespace, and reconciliation outcome.
- As an operator, I want the storage API server instrumented so I can monitor database performance.
  - Given the storage API server is handling requests, when a PostgreSQL query is executed, then a span is created
    with the operation type, table name, and duration.
- As an operator, I want metrics for the scan pipeline so I can build dashboards and alerts.
  - Given OTel metrics export is configured, when scans are running, then the following metrics are available:
    scan duration histograms, images scanned per ScanJob, vulnerabilities found by severity, NATS message processing latency,
    handler error rates, registry API call duration, and Trivy invocation time.
- As an operator, I want NATS JetStream-specific tracing so I can monitor message flow.
  - Given a message is published to the SBOMSCANNER stream, when a worker consumes it, then the consumer span is
    a child of the producer span, and both include attributes following OTel messaging semantic conventions (messaging.system,
    messaging.destination.name, messaging.operation.type, JetStream stream/consumer names, delivery attempt count).
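To make the attribute set in the last story concrete, the sketch below builds the attributes a consumer span might carry. It is illustrative only: `MessageInfo` and `ConsumerSpanAttributes` are hypothetical, the subject and consumer names are invented, `"nats"` as the `messaging.system` value is an assumption, and the three `messaging.nats.*` keys are placeholder names, since this epic does not fix the exact keys for stream, consumer, or delivery count.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// MessageInfo is a hypothetical summary of a consumed JetStream message.
type MessageInfo struct {
	Stream        string // JetStream stream name
	Consumer      string // JetStream consumer name
	Subject       string // subject the message was published to
	DeliveryCount uint64 // number of delivery attempts so far
}

// ConsumerSpanAttributes returns the attributes for a consumer ("process") span.
// The messaging.nats.* keys are placeholders, not established conventions.
func ConsumerSpanAttributes(m MessageInfo) map[string]string {
	return map[string]string{
		"messaging.system":                "nats", // assumed value
		"messaging.destination.name":      m.Subject,
		"messaging.operation.type":        "process",
		"messaging.nats.stream":           m.Stream,
		"messaging.nats.consumer":         m.Consumer,
		"messaging.nats.delivery_attempt": strconv.FormatUint(m.DeliveryCount, 10),
	}
}

func main() {
	attrs := ConsumerSpanAttributes(MessageInfo{
		Stream:        "SBOMSCANNER",
		Consumer:      "worker",                    // hypothetical consumer name
		Subject:       "sbomscanner.generate_sbom", // hypothetical subject
		DeliveryCount: 2,
	})

	// Print in a stable order; Go map iteration order is random.
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, attrs[k])
	}
}
```

A producer span would use the same keys with a different `messaging.operation.type` value and no delivery count, which keeps both sides of the trace queryable by stream and subject.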
Components to Instrument
Controller
* ScanJobReconciler, VulnerabilityReportReconciler, WorkloadScanReconciler, ImageWorkloadScanReconciler
* RegistryScanRunner periodic loop
* Admission webhooks (Registry, ScanJob, WorkloadScanConfiguration)
* NATS JetStream publish (producer spans with context injection)
Worker
* NATS JetStream consume (consumer spans with context extraction)
* CreateCatalogHandler, GenerateSBOMHandler, ScanSBOMHandler
* Container registry client calls (catalog, tags, manifests, layers)
* Trivy library invocations (SBOM generation and vulnerability scan)
Storage API Server
* HTTP request handling (otelhttp middleware)
* PostgreSQL repository operations (Create, Get, Update, Delete, List)
MCP Server
* Tool call execution spans
* Kubernetes client operations