@cmacrae cmacrae commented Nov 12, 2025

Summary

Implements unified observability using OpenTelemetry with two modes:

  • metrics flag: Serves metrics on :9091/metrics for Prometheus scraping
  • otel flag: Pushes metrics, traces, and logs to OTLP endpoint

Key Changes

  • Unified metrics: Replaces @hono/prometheus with OpenTelemetry for single canonical metric definitions across both modes. NOTE: Metrics format changes (metric names, labels follow OpenTelemetry semantic conventions)
  • Automatic instrumentation: HTTP and MongoDB operations traced/metered automatically
  • Pino-to-OTel bridge: Logs go to both stdout and OTLP when otel enabled

Configuration

New environment variables (all optional):

  • OTEL_ENDPOINT - OTLP endpoint URL (default: http://localhost)
  • OTEL_SERVICE_NAME - Service name (default: nildb)
  • OTEL_TEAM_NAME - Team name (default: nildb)
  • OTEL_DEPLOYMENT_ENV - Environment (default: local)
  • OTEL_METRICS_EXPORT_INTERVAL_MS - Export interval in ms (default: 60000)
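As a rough illustration of how these variables and their defaults could be resolved, here is a minimal sketch. The names `OtelConfig` and `readOtelConfig` are hypothetical and not taken from this PR, and rejecting invalid intervals is an assumption rather than documented behaviour.

```typescript
// Sketch only: OtelConfig and readOtelConfig are hypothetical names, not the PR's code.
type Env = Record<string, string | undefined>;

interface OtelConfig {
  endpoint: string;
  serviceName: string;
  teamName: string;
  deploymentEnv: string;
  metricsExportIntervalMs: number;
}

function readOtelConfig(env: Env): OtelConfig {
  const rawInterval = env["OTEL_METRICS_EXPORT_INTERVAL_MS"];
  const interval = rawInterval === undefined ? 60000 : Number(rawInterval);
  // Assumption: reject non-numeric or non-positive intervals instead of silently using them.
  if (!Number.isFinite(interval) || interval <= 0) {
    throw new Error(`Invalid OTEL_METRICS_EXPORT_INTERVAL_MS: ${rawInterval}`);
  }
  return {
    // Defaults mirror the list above.
    endpoint: env["OTEL_ENDPOINT"] ?? "http://localhost",
    serviceName: env["OTEL_SERVICE_NAME"] ?? "nildb",
    teamName: env["OTEL_TEAM_NAME"] ?? "nildb",
    deploymentEnv: env["OTEL_DEPLOYMENT_ENV"] ?? "local",
    metricsExportIntervalMs: interval,
  };
}
```

In a Node.js process this would typically be called as `readOtelConfig(process.env)`.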

Local Testing

The local stack now includes an OTel Collector with debug exporter:

# View telemetry
docker compose -f local/docker-compose.yaml logs otel-collector

# Switch modes by changing APP_ENABLED_FEATURES:
# - openapi,metrics,migrations (scrape mode)
# - openapi,otel,migrations (push mode)

github-actions bot commented Nov 12, 2025

Coverage Report

| Status | Category | Percentage | Covered / Total |
|---|---|---|---|
| 🔵 | Lines | 76.88% | 1141 / 1484 |
| 🔵 | Statements | 76.62% | 1154 / 1506 |
| 🔵 | Functions | 81.66% | 570 / 698 |
| 🔵 | Branches | 46.84% | 171 / 365 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
|---|---|---|---|---|---|
| packages/api/src/app.ts | 95% | 50% | 100% | 95% | 29-32 |
| packages/api/src/env.ts | 95% | 88.88% | 100% | 100% | 57 |
| packages/api/src/main.ts | 0% | 0% | 0% | 0% | 27-155 |
| packages/api/src/common/logger.ts | 12.5% | 3.33% | 40% | 12.5% | 24-84, 93-107 |
| packages/api/src/common/otel.ts | 0% | 0% | 0% | 0% | 68-246 |
| packages/api/src/middleware/logger.middleware.ts | 82.35% | 75% | 100% | 82.35% | 41-52 |
| packages/api/src/middleware/metrics.middleware.ts | 100% | 50% | 100% | 100% | |
| packages/api/src/system/system.controllers.ts | 94.59% | 50% | 100% | 94.59% | 117-118 |
| packages/api/src/system/system.router.ts | 100% | 100% | 100% | 100% | |

Generated in workflow #449 for commit 5d22b10 by the Vitest Coverage Report Action

@tim-hm tim-hm left a comment

This is excellent 🙌

@cmacrae cmacrae marked this pull request as ready for review November 13, 2025 16:38

@tim-hm tim-hm left a comment

LGTM!

Add OpenTelemetry packages to production dependencies and configure
external bundling. This enables automatic instrumentation for metrics,
traces, and logs with dual export modes (OTLP push and Prometheus scrape).

Add OTEL_METRICS_EXPORT_INTERVAL_MS environment variable to control
how frequently metrics are exported. Defaults to 60000ms (1 minute),
can be set lower for local testing.

Add OpenTelemetry Collector with debug exporter for local testing.
Replace Jaeger with simpler debug-to-stdout approach. Configure
faster metrics export interval (10s) for local development.

Add Proxy-based bridge that emits Pino logs to both stdout and
OpenTelemetry LoggerProvider when otel feature flag is enabled.
Maintains backward compatibility with stdout-only logging.
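The Proxy-based bridge described above could look roughly like the sketch below. `MiniLogger` and `LogSink` are hypothetical stand-ins for the Pino logger and the OpenTelemetry logger (the record fields only loosely mirror an OTel log record), so this shows the general technique rather than the code in this PR.

```typescript
// Hypothetical minimal shapes; the real code wraps a Pino logger and an OpenTelemetry logger.
type LogFn = (attrs: Record<string, unknown>, msg: string) => void;

interface MiniLogger {
  info: LogFn;
  warn: LogFn;
  error: LogFn;
}

interface LogSink {
  emit(record: { severityText: string; body: string; attributes: Record<string, unknown> }): void;
}

const LEVELS = new Set(["info", "warn", "error"]);

function bridgeLogger(base: MiniLogger, sink: LogSink | undefined): MiniLogger {
  // Without a sink (otel flag disabled) the logger is returned untouched: stdout only.
  if (sink === undefined) return base;
  const activeSink: LogSink = sink;
  return new Proxy(base, {
    get(target, prop, receiver) {
      const original: unknown = Reflect.get(target, prop, receiver);
      if (typeof prop !== "string" || !LEVELS.has(prop) || typeof original !== "function") {
        return original; // non-level properties pass through unchanged
      }
      return (attrs: Record<string, unknown>, msg: string): void => {
        // 1) Keep the existing behaviour (e.g. write to stdout).
        (original as LogFn).call(target, attrs, msg);
        // 2) Mirror the same record to the telemetry sink.
        activeSink.emit({ severityText: prop.toUpperCase(), body: msg, attributes: attrs });
      };
    },
  });
}
```

Intercepting property access keeps the wrapped logger's interface unchanged, so call sites do not need to know whether the bridge is active.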

Add initializeMetricsOnly() for metrics-only mode (PrometheusExporter
serves /metrics endpoint) and initializeOtel() for full mode (OTLP
push only). Both use OpenTelemetry APIs with automatic HTTP
instrumentation, providing DRY metric definitions.

- metrics flag: serves metrics on :9091/metrics for scraping
- otel flag: pushes metrics/traces/logs to OTLP, no /metrics endpoint

Remove getMetrics() function and metrics server logic. Metrics are
now handled entirely by OpenTelemetry providers (PrometheusExporter
for scraping or OTLP for push), not Hono routers.

Automatic trace instrumentation via @opentelemetry/auto-instrumentations-node
handles HTTP request tracing and metrics.

Update main.ts to initialize correct observability mode based on
feature flags:
- metrics flag only: initializeMetricsOnly() for Prometheus scraping
- otel flag: initializeOtel() for full OTLP push (metrics/traces/logs)

Add proper shutdown handlers for both modes.

Document new metrics/otel feature flag behavior, remove Jaeger
references, add OTEL_METRICS_EXPORT_INTERVAL_MS configuration,
and clarify when /metrics endpoint is available.

Create build-deps target that builds workspace dependencies and make
check, test, and build targets depend on it. These workspace packages
must be built before type checking and testing can succeed.

Replace ATTR_DEPLOYMENT_ENVIRONMENT with ATTR_DEPLOYMENT_ENVIRONMENT_NAME
to align with current OpenTelemetry semantic conventions. The deprecated
attribute set 'deployment.environment', while the new one correctly sets
'deployment.environment.name'.

Replace custom shouldEmitTelemetry() function with standard OpenTelemetry
OTEL_SDK_DISABLED environment variable. This provides better separation
of concerns: deployment environment name is now purely a resource
attribute, while OTEL_SDK_DISABLED controls emission.

The 'metrics' and 'otel' feature flags are now mutually exclusive.
The server will exit with a clear error message if both are enabled,
preventing misconfiguration and forcing explicit choice of observability mode.
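A minimal sketch of the mutual-exclusion check, assuming the flags arrive as the comma-separated APP_ENABLED_FEATURES value shown in the PR description. `resolveObservabilityMode` is a hypothetical name; the sketch throws, on the assumption that the caller logs the message and exits.

```typescript
type ObservabilityMode = "none" | "metrics" | "otel";

// Hypothetical helper: maps a comma-separated feature list to one observability mode.
function resolveObservabilityMode(enabledFeatures: string): ObservabilityMode {
  const features = new Set(
    enabledFeatures
      .split(",")
      .map((feature) => feature.trim())
      .filter((feature) => feature.length > 0),
  );
  const metrics = features.has("metrics");
  const otel = features.has("otel");
  if (metrics && otel) {
    // Both modes at once is treated as a configuration error.
    throw new Error(
      "The 'metrics' and 'otel' feature flags are mutually exclusive; enable only one.",
    );
  }
  if (otel) return "otel"; // push metrics, traces, and logs over OTLP
  if (metrics) return "metrics"; // serve metrics for Prometheus scraping
  return "none";
}
```

With the values from the local testing notes, `openapi,metrics,migrations` resolves to scrape mode and `openapi,otel,migrations` to push mode.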

Replace String(value) with JSON.stringify(value) for complex objects
in the Pino-to-OTel logger bridge to prevent "[object Object]" in logs.
Add a try-catch to gracefully handle circular references by falling
back to String(value).
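The serialization rule in this commit can be sketched as follows. `toAttributeString` is a hypothetical name, and passing strings and other primitives through `String` is an assumption about the surrounding code.

```typescript
// Hypothetical helper illustrating the JSON.stringify-with-fallback rule.
function toAttributeString(value: unknown): string {
  if (typeof value === "string") return value;
  // Primitives (numbers, booleans, null, undefined) are fine with String().
  if (value === null || typeof value !== "object") return String(value);
  try {
    // Objects and arrays: JSON keeps their structure readable in the log body.
    return JSON.stringify(value) ?? String(value);
  } catch {
    // JSON.stringify throws on circular references; fall back to String().
    return String(value);
  }
}
```

A plain object such as `{ a: 1 }` serializes to `{"a":1}`, while an object that references itself falls back to `[object Object]` instead of throwing.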
HTTP request logs were inconsistent with the rest of the codebase,
passing only attributes without a message string. This resulted in
empty OTLP body fields. All other logs use the standard Pino pattern
of log.method({ attrs }, "message"). Now HTTP logs follow the same
pattern with messages like "GET /health 200".
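For illustration, the attributes-plus-message pattern could be applied to HTTP request logs like this. `HttpLogAttrs` and `logHttpRequest` are hypothetical names, and the attribute set is an assumption.

```typescript
// Hypothetical attribute shape for an HTTP request log entry.
interface HttpLogAttrs {
  method: string;
  path: string;
  status: number;
}

type LogFn = (attrs: Record<string, unknown>, msg: string) => void;

// Passes both the structured attributes and a human-readable message,
// so the log body is never empty.
function logHttpRequest(log: LogFn, attrs: HttpLogAttrs): void {
  log({ ...attrs }, `${attrs.method} ${attrs.path} ${attrs.status}`);
}
```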
…figuration

- Add envDetector and processDetector to merge resource attributes from environment
- Make createOtelResource, initializeOtel, and initializeMetricsOnly async to support detection
- Update main.ts to await async OpenTelemetry initialization functions
- Document OTEL_RESOURCE_ATTRIBUTES usage and precedence rules

Enables dynamic resource attribute configuration per OpenTelemetry standards:
  OTEL_RESOURCE_ATTRIBUTES=service.instance.id=nildb-r5nw

Environment variables take precedence over programmatically set values, allowing
deployment-specific attributes to override defaults without code changes.
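The precedence rule can be illustrated without the SDK. The sketch below parses an OTEL_RESOURCE_ATTRIBUTES-style string by hand and merges it over programmatic defaults; in this PR the merging is done by the detectors, so `parseResourceAttributes` and `mergeResourceAttributes` are hypothetical helpers that only demonstrate the ordering.

```typescript
type Attributes = Record<string, string>;

// Values may be percent-encoded; keep the raw text if decoding fails.
function decodeValue(value: string): string {
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

// Parses "key1=value1,key2=value2" into an attribute map.
function parseResourceAttributes(raw: string | undefined): Attributes {
  const attrs: Attributes = {};
  if (!raw) return attrs;
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip entries without a key or without "="
    const key = pair.slice(0, idx).trim();
    if (key.length === 0) continue;
    attrs[key] = decodeValue(pair.slice(idx + 1).trim());
  }
  return attrs;
}

// Environment-provided attributes are spread last, so they win over defaults.
function mergeResourceAttributes(defaults: Attributes, raw: string | undefined): Attributes {
  return { ...defaults, ...parseResourceAttributes(raw) };
}
```

Here the value `service.instance.id=nildb-r5nw` would override a programmatic `service.instance.id` while leaving other defaults untouched.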

Integrate @opentelemetry/instrumentation-runtime-node and
@opentelemetry/host-metrics to provide comprehensive metrics for:

- Node.js runtime metrics (heap usage, RSS, GC, event loop lag)
- System metrics (CPU, memory, network)

Both metrics-only and full OTEL modes now include these metrics,
enabling better visibility into Node.js performance and resource usage.

Replace auto-instrumentation HTTP metrics with custom Hono middleware
that records http.server.duration with http.route attribute for
endpoint-level observability.

- Add metrics.middleware.ts with semantic conventions v1.28.0
- Disable @opentelemetry/instrumentation-http to prevent double counting
- Integrate middleware early in stack for complete request coverage
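The shape of such a middleware can be sketched independently of Hono and the OpenTelemetry API. `RequestContext` and `DurationHistogram` below are hypothetical stand-ins for the Hono context and an OpenTelemetry histogram; the attribute names other than `http.route` and the millisecond unit are assumptions.

```typescript
// Hypothetical stand-ins for the Hono context and an OpenTelemetry histogram.
interface RequestContext {
  method: string;
  route: string; // matched route pattern, e.g. "/users/:id", not the raw path
  status: number;
}

interface DurationHistogram {
  record(value: number, attributes: Record<string, string | number>): void;
}

type Next = () => Promise<void>;

function createMetricsMiddleware(
  histogram: DurationHistogram,
  now: () => number = Date.now, // injectable clock, useful for tests
) {
  return async (ctx: RequestContext, next: Next): Promise<void> => {
    const start = now();
    try {
      await next(); // run the rest of the middleware stack and the handler
    } finally {
      // Record even when the handler throws, so failed requests are counted too.
      histogram.record(now() - start, {
        "http.request.method": ctx.method,
        "http.route": ctx.route,
        "http.response.status_code": ctx.status,
      });
    }
  };
}
```

Labelling by route pattern rather than raw path keeps metric cardinality bounded, which is presumably why the commit highlights the `http.route` attribute.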

cmacrae commented Nov 20, 2025

@tim-hm had to rebase from main, so the history since you last reviewed now has stale commit hashes. these are the commits since you last reviewed:

@tim-hm tim-hm left a comment

Nice work 🙌

@cmacrae cmacrae merged commit bdf939c into main Nov 20, 2025
2 checks passed