[EPIC]: OTel instrumentation #972

@fabriziosestito

OpenTelemetry Instrumentation

Description

This epic adds OpenTelemetry metrics and distributed tracing across all SBOMScanner components: controller, worker, storage
API server, and MCP server. The goal is end-to-end observability of the scan pipeline, from ScanJob creation through catalog
discovery, SBOM generation, vulnerability scanning, and report aggregation.

Trace context is propagated through NATS JetStream message headers, so a single trace can follow a scan from the controller
reconciler through every worker handler. The OTel SDK dependencies already exist in go.mod; this epic wires them into the
application.
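
As a rough illustration of the propagation described above, the sketch below injects and extracts a W3C `traceparent` value through a NATS-style header map. It uses only the standard library: `Header` mirrors the shape of `nats.Header` (`map[string][]string`), its `Get`/`Set`/`Keys` methods match the method set of OTel's `propagation.TextMapCarrier`, and `SpanContext`, `Inject`, and `Extract` are hypothetical stand-ins. In the real wiring, the configured OTel propagator would presumably do this work through `otel.GetTextMapPropagator()`.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Header mirrors the shape of nats.Header so the sketch builds without the NATS client.
type Header map[string][]string

// Get returns the first value for key, or "" when the key is absent.
func (h Header) Get(key string) string {
	if v := h[key]; len(v) > 0 {
		return v[0]
	}
	return ""
}

// Set replaces any existing values for key.
func (h Header) Set(key, value string) { h[key] = []string{value} }

// Keys lists the header names currently set.
func (h Header) Keys() []string {
	keys := make([]string, 0, len(h))
	for k := range h {
		keys = append(keys, k)
	}
	return keys
}

// SpanContext is a hypothetical, reduced stand-in for an OTel span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// traceparentRe matches version 00 of the W3C traceparent header.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// Inject writes the traceparent header on the producer (controller) side.
func Inject(h Header, sc SpanContext) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	h.Set("traceparent", fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags))
}

// Extract reads the header on the consumer (worker) side.
// It reports false when the header is missing, malformed, or carries all-zero IDs.
func Extract(h Header) (SpanContext, bool) {
	m := traceparentRe.FindStringSubmatch(h.Get("traceparent"))
	if m == nil {
		return SpanContext{}, false
	}
	if strings.Trim(m[1], "0") == "" || strings.Trim(m[2], "0") == "" {
		return SpanContext{}, false
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: m[1], SpanID: m[2], Sampled: flags&1 == 1}, true
}

func main() {
	msgHeader := Header{}
	Inject(msgHeader, SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	})
	fmt.Println(msgHeader.Get("traceparent"))
	if parent, ok := Extract(msgHeader); ok {
		fmt.Println("consumer span parent:", parent.SpanID, "in trace", parent.TraceID)
	}
}
```

One detail worth keeping in mind: this map lookup is case-sensitive, so producer and consumer need to agree on the exact header key.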

All services export traces and metrics via OTLP (gRPC), configurable through standard OTel environment variables
(OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, etc.). Instrumentation is opt-in: when no exporter endpoint is
configured, the services behave as before with no overhead.
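
The opt-in behaviour could be decided by a small check like the one below. This is only a sketch: `telemetryEnabled` is a hypothetical helper, and the choice to also honour the signal-specific endpoint variables and `OTEL_SDK_DISABLED` is an assumption. An explicit check is likely needed because OTLP exporters typically fall back to a default local endpoint when none is set, which would not match the "behave as before" requirement.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// telemetryEnabled reports whether any OTLP endpoint is configured.
// getenv is injected so the check can be exercised without touching the process environment.
func telemetryEnabled(getenv func(string) string) bool {
	// OTEL_SDK_DISABLED=true is the standard kill switch and wins over any endpoint.
	if strings.EqualFold(strings.TrimSpace(getenv("OTEL_SDK_DISABLED")), "true") {
		return false
	}
	endpointVars := []string{
		"OTEL_EXPORTER_OTLP_ENDPOINT",
		"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
		"OTEL_EXPORTER_OTLP_METRICS_ENDPOINT",
	}
	for _, key := range endpointVars {
		if strings.TrimSpace(getenv(key)) != "" {
			return true
		}
	}
	return false
}

func main() {
	if telemetryEnabled(os.Getenv) {
		fmt.Println("telemetry enabled: set up OTLP exporters and providers")
	} else {
		fmt.Println("telemetry disabled: keep the default no-op providers")
	}
}
```

When the check returns false, the services would simply skip provider setup, so the global no-op tracer and meter stay in place.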

User Stories

  • As an operator, I want distributed traces across the scan pipeline so I can debug slow or failed scans.

    • Given a ScanJob is created, when it progresses through catalog creation, SBOM generation, and vulnerability
      scanning, then the entire flow is captured as a single distributed trace with spans for each stage, linked through NATS
      JetStream message headers.
  • As an operator, I want traces for external calls so I can identify bottlenecks.

    • Given the worker processes an image, when it calls a container registry API or invokes the Trivy library, then
      each call is captured as a child span with duration, status, and relevant attributes (registry URL, image reference).
  • As an operator, I want the Kubernetes reconcilers and webhooks instrumented so I can monitor controller performance.

    • Given the controller is running, when a reconciler processes a resource or a webhook validates a request, then
      a span is created with the resource kind, name, namespace, and reconciliation outcome.
  • As an operator, I want the storage API server instrumented so I can monitor database performance.

    • Given the storage API server is handling requests, when a PostgreSQL query is executed, then a span is created
      with the operation type, table name, and duration.
  • As an operator, I want metrics for the scan pipeline so I can build dashboards and alerts.

    • Given OTel metrics export is configured, when scans are running, then the following metrics are available:
      scan duration histograms, images scanned per ScanJob, vulnerabilities found by severity, NATS message processing latency,
      handler error rates, registry API call duration, and Trivy invocation time.
  • As an operator, I want NATS JetStream-specific tracing so I can monitor message flow.

    • Given a message is published to the SBOMSCANNER stream, when a worker consumes it, then the consumer span is
      a child of the producer span, and both include attributes following OTel messaging semantic conventions (messaging.system,
      messaging.destination.name, messaging.operation.type, JetStream stream/consumer names, delivery attempt count).
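
To make the last story more concrete, here is a minimal sketch of the attribute set a producer or consumer span might carry. The `messaging.*` keys come from the story above; everything else is an assumption for illustration: the `"nats"` value for `messaging.system`, the `send`/`process` operation values, the `nats.jetstream.*` key names, and the subject and consumer names in `main`. The `MessageInfo` type and `MessagingAttributes` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// MessageInfo is a hypothetical summary of a JetStream message.
type MessageInfo struct {
	Subject   string
	Stream    string
	Consumer  string // empty on the producer side
	Delivered uint64 // delivery attempt count, consumer side only
}

// MessagingAttributes builds span attributes for a publish ("send") or
// handler ("process") span. The nats.jetstream.* keys are assumed names.
func MessagingAttributes(operation string, m MessageInfo) map[string]string {
	attrs := map[string]string{
		"messaging.system":           "nats",
		"messaging.destination.name": m.Subject,
		"messaging.operation.type":   operation,
		"nats.jetstream.stream":      m.Stream,
	}
	if m.Consumer != "" {
		attrs["nats.jetstream.consumer"] = m.Consumer
		attrs["nats.jetstream.delivery_attempt"] = strconv.FormatUint(m.Delivered, 10)
	}
	return attrs
}

func main() {
	attrs := MessagingAttributes("process", MessageInfo{
		Subject:   "sbomscanner.generate_sbom", // hypothetical subject
		Stream:    "SBOMSCANNER",
		Consumer:  "worker", // hypothetical consumer name
		Delivered: 2,
	})
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order for printing
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, attrs[k])
	}
}
```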

Components to Instrument

Controller
* ScanJobReconciler, VulnerabilityReportReconciler, WorkloadScanReconciler, ImageWorkloadScanReconciler
* RegistryScanRunner periodic loop
* Admission webhooks (Registry, ScanJob, WorkloadScanConfiguration)
* NATS JetStream publish (producer spans with context injection)

Worker
* NATS JetStream consume (consumer spans with context extraction)
* CreateCatalogHandler, GenerateSBOMHandler, ScanSBOMHandler
* Container registry client calls (catalog, tags, manifests, layers)
* Trivy library invocations (SBOM generation and vulnerability scan)

Storage API Server
* HTTP request handling (otelhttp middleware)
* PostgreSQL repository operations (Create, Get, Update, Delete, List)

MCP Server
* Tool call execution spans
* Kubernetes client operations
