Status: Draft
Last updated: 2025-10-07
Owner: Visor team
Supersedes: telemetry-platform-rfc.md, opentelemetry-tracing-rfc.md (merged into this single RFC)
Related: execution-statistics-rfc.md, failure-routing-rfc.md
We want SonarCloud‑style visibility and durable, searchable traces for every Visor run. Beyond raw telemetry, the goal is to power dashboards (per repo/org, per check, trends) with no rate limiting in CI, while remaining privacy‑first and operable in both connected and serverless environments.
Goals:
- Open standards: adopt OpenTelemetry (OTel) as the primary API/SDK for tracing and metrics.
- Well‑designed spans covering runs, checks, providers, routing, forEach items, and diagram evaluation.
- Zero sampling in CI (AlwaysOn); rely on batching/queues, not drops or rate limiting.
- Serverless mode: write simplified spans as NDJSON (one span per line) to disk for later ingestion (OTLP JSON optional later).
- Privacy by default: no raw code/AI payloads; hash/redact sensitive strings.
- Minimal overhead: lightweight Mermaid analysis; no rendering.
Non-goals:
- Rendering Mermaid diagrams into images/SVG.
- Replacing existing logs/execution statistics; this augments and unifies them.
- Sending data to third parties without explicit config.
We introduce a unified telemetry platform centered on OpenTelemetry. The platform emits structured spans, metrics, and (optionally) events. It integrates with existing ExecutionStatistics and adds small, cheap Mermaid diagram checks for observability only.
Two deployment modes:
- Connected: OTLP (HTTP/gRPC) exporters to an OTel Collector (Tempo/Jaeger/etc.).
- Serverless: write simplified spans as NDJSON (one span per line) to output/traces/ for later upload. (Optional: OTLP JSON in a future iteration.)
- Use @opentelemetry/sdk-node with the AsyncLocalStorage context manager.
- Auto‑instrumentations: @opentelemetry/auto-instrumentations-node (http, undici, child_process, dns; fs optional).
- Exporters: @opentelemetry/exporter-trace-otlp-http or gRPC; plus a custom FileSpanExporter for serverless.
- Sampler: AlwaysOn in CI (no sampling). In dev, allow ratio sampling.
- BatchSpanProcessor tuned for high throughput; large queues instead of rate limits.
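To make the "large queues instead of rate limits" point concrete, here is a minimal sketch of a bounded queue drained in batches. The class and method names are hypothetical and are not the OpenTelemetry BatchSpanProcessor API; the two constructor arguments correspond to the max_queue_size and max_export_batch_size settings. A batching processor of this kind drops spans once its queue is full, so the "no drops" goal depends on sizing the queue above the peak number of spans produced between exports.

```typescript
// Illustrative bounded queue; names are hypothetical, not the OTel SDK API.
class BoundedBatchQueue<T> {
  private readonly items: T[] = [];
  private readonly maxQueueSize: number;
  private readonly maxBatchSize: number;
  private dropped = 0;

  constructor(maxQueueSize: number, maxBatchSize: number) {
    this.maxQueueSize = maxQueueSize;
    this.maxBatchSize = maxBatchSize;
  }

  // Returns false (and counts a drop) when the queue is already full.
  enqueue(item: T): boolean {
    if (this.items.length >= this.maxQueueSize) {
      this.dropped += 1;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns up to maxBatchSize items, oldest first.
  nextBatch(): T[] {
    return this.items.splice(0, this.maxBatchSize);
  }

  get size(): number {
    return this.items.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

Exposing the drop count makes it possible to assert in CI that a run finished with zero dropped spans, which is one way the acceptance criterion could be checked.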
- Preferred approach: keep our existing logger, but inject OTel context (trace_id, span_id, trace_flags) into every log line.
- Implementation: logger.ts consults @opentelemetry/api trace.getSpan(context.active()) to fetch ids and appends them as structured fields.
- Output: JSONL to stderr (unchanged) with added fields for correlation; pretty mode in local dev.
- Optional: if OTel Logs for Node is sufficiently stable in our env, add an OTLP Log Exporter alongside console JSONL. Otherwise, route logs to Loki/ELK with the correlation fields.
- Guarantees: every console/logger message becomes traceable via trace_id/span_id and shows up next to spans in backends that support exemplars/correlation.
trace: visor.run (root span)
├─ visor.check (per check execution)
│ ├─ visor.provider (ai|command|http|claude-code)
│ ├─ visor.routing (retry/goto/run remediation)
│ ├─ visor.foreach.item (when forEach is active)
│ └─ visor.mermaid (extract/evaluate; no rendering)
└─ auto‑instrumented http/undici/... spans
Naming & Status:
- visor.run: root; OK unless the entire run aborts.
- visor.check: ERROR on execution error or fail_if triggered; issues alone do not imply error.
- visor.provider: ERROR if provider call fails; captures model/provider attributes.
- visor.routing: spans for retry/goto/run_js decisions; ERROR if loop budget exceeded.
- visor.foreach.item: isolates per‑item timing/attributes.
- visor.mermaid: extract/evaluate spans; metrics only, no source persisted.
Span events to record decisions: issues.present, fail_if.triggered, retry.scheduled, goto.target, run_js.result, goto_js.result.
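The status rule for visor.check and its decision events can be sketched as two pure functions. The CheckOutcome shape and its field names are assumptions made for illustration; only the status values and the event names issues.present and fail_if.triggered come from this RFC.

```typescript
// Hypothetical outcome of one check execution; field names are illustrative.
interface CheckOutcome {
  executionError?: Error; // the check itself threw or crashed
  failIfTriggered: boolean; // a fail_if condition evaluated to true
  issueCount: number; // issues reported by the check
}

type SpanStatus = "OK" | "ERROR";

// ERROR on execution error or fail_if; issues alone do not imply error.
function checkSpanStatus(outcome: CheckOutcome): SpanStatus {
  if (outcome.executionError !== undefined) return "ERROR";
  if (outcome.failIfTriggered) return "ERROR";
  return "OK";
}

// Decision events to attach to the visor.check span.
function checkSpanEvents(outcome: CheckOutcome): string[] {
  const events: string[] = [];
  if (outcome.issueCount > 0) events.push("issues.present");
  if (outcome.failIfTriggered) events.push("fail_if.triggered");
  return events;
}
```

Keeping issues out of the status decision means dashboards can separate "the check found problems" from "the check itself failed".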
State change coverage (exhaustive):
- Run: run.started, run.completed (attributes: totals, duration).
- Check lifecycle: check.scheduled, check.started, check.skipped (with skip_reason), check.completed.
- forEach items: foreach.started/foreach.completed (index/total attributes) or dedicated child spans.
- Dependencies: dependency.waiting (dep id), dependency.ready.
- Conditional evaluation: if.evaluated (expression hash, result), fail_if.evaluated (hash, result), fail_if.triggered.
- Routing: retry.scheduled (attempt, backoff), goto.target (step id), run.remediation (step id list).
- Providers: provider.request/provider.response (sizes, durations; no payloads), ai.session.reused.
Resource (standard + custom):
- service.name=visor, service.version, service.namespace=<org>
- deployment.environment=github-actions|ci|local
Run (visor.run):
- visor.run.id, visor.run.mode (cli|github-actions)
- visor.repo.owner, visor.repo.name, visor.pr.number, visor.git.head, visor.git.base
- visor.files.changed_count, visor.diff.additions, visor.diff.deletions
- visor.max_parallelism, visor.fail_fast
Check (visor.check):
- visor.check.id, visor.check.type, visor.check.group, visor.check.schema, visor.check.tags
- visor.check.depends_on, visor.check.skipped, visor.check.skip_reason
- visor.check.foreach.index, visor.check.foreach.total
- visor.issues.total, visor.issues.critical, visor.issues.error, visor.issues.warning, visor.issues.info
Provider (visor.provider):
- visor.provider.type, visor.ai.model, visor.ai.session_reused
Routing (visor.routing):
- visor.routing.action (retry|goto|run), visor.routing.attempt, visor.routing.backoff_ms, visor.routing.goto_target, visor.routing.loop_count, visor.routing.max_loops
Mermaid (visor.mermaid):
- Extract: visor.diagram.blocks_found, visor.diagram.source (content|issue)
- Evaluate: visor.diagram.syntax_ok, visor.diagram.node_count, visor.diagram.edge_count, visor.diagram.components, visor.diagram.density, visor.diagram.isolated_nodes, visor.diagram.references_changed_components, visor.diagram.score
Privacy: do not persist raw code, diagram text, prompts, or responses unless explicitly enabled. Hash file paths and messages when redaction is on.
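A minimal sketch of the hashing step, assuming SHA-256 from Node's built-in crypto module. The function names and the issue shape are hypothetical; hash_files and hash_messages mirror the telemetry.redaction options. Unsalted hashes of short or predictable strings such as common file paths can be guessed by hashing candidates, so a per-org salt may be worth considering.

```typescript
import { createHash } from "node:crypto";

// Mirrors the telemetry.redaction options.
interface RedactionConfig {
  hash_files: boolean;
  hash_messages: boolean;
}

// Hex-encoded SHA-256 digest of a UTF-8 string.
function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

// Hypothetical helper: hashes only the fields whose option is enabled.
function redactIssue(
  issue: { file: string; message: string },
  cfg: RedactionConfig
): { file: string; message: string } {
  return {
    file: cfg.hash_files ? sha256Hex(issue.file) : issue.file,
    message: cfg.hash_messages ? sha256Hex(issue.message) : issue.message,
  };
}
```

Because the hash is deterministic, the same path still groups together in dashboards without the path itself being stored.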
Counters:
- visor.check.issues (attrs: check_id, severity)
- visor.diagram.blocks (attrs: syntax_ok)
Histograms:
- visor.check.duration_ms
- visor.provider.duration_ms
- visor.diagram.evaluate.duration_ms
Gauges:
- visor.run.active_checks
- Extraction via fenced block regex for ```mermaid in rendered outputs and issue messages.
- No local analysis. We emit full Mermaid code as telemetry so downstream services can analyze asynchronously.
- Event shape: diagram.block with attributes { check, origin: 'content'|'issue', code: <full_mermaid> }.
- Redaction: none by default for diagrams (unaltered code is sent) per current requirements.
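The extraction step could look roughly like the following sketch, which finds mermaid fenced blocks with a regex and builds one diagram.block event per block. The function name and the exact regex are assumptions; the event name and attribute keys follow the event shape above. The fence marker is built at runtime only so the sample can sit inside a fenced block itself.

```typescript
type DiagramOrigin = "content" | "issue";

interface DiagramBlockEvent {
  name: "diagram.block";
  attributes: { check: string; origin: DiagramOrigin; code: string };
}

// Three backticks, assembled at runtime.
const FENCE = "`".repeat(3);

// Hypothetical helper: returns one event per mermaid fenced block in `text`.
function extractDiagramBlocks(
  text: string,
  check: string,
  origin: DiagramOrigin
): DiagramBlockEvent[] {
  // Opening fence tagged "mermaid", then everything (lazily) up to the next fence.
  const pattern = new RegExp(FENCE + "mermaid[^\\n]*\\n([\\s\\S]*?)" + FENCE, "g");
  const events: DiagramBlockEvent[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // The code is passed through unaltered, matching the "no redaction" default.
    events.push({ name: "diagram.block", attributes: { check, origin, code: match[1] } });
  }
  return events;
}
```

A regex is enough here because no local analysis happens; anything that needs to understand the diagram runs downstream on the emitted code.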
VisorConfig.telemetry (excerpt):
telemetry:
  enabled: false
  sink: otlp # otlp|file|console
  diagrams:
    evaluate: true # evaluation only; no render
  redaction:
    hash_files: true
    hash_messages: true
  tracing:
    sampler: always_on # in CI; ratio allowed in dev
    batch:
      max_queue_size: 65536
      max_export_batch_size: 8192
      scheduled_delay_ms: 1000
      export_timeout_ms: 10000
  otlp:
    protocol: http # or grpc
    endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
    headers: ${OTEL_EXPORTER_OTLP_HEADERS}
  file_trace:
    dir: output/traces
    format: otlp_json # otlp_json|ndjson

CLI flags: --telemetry, --telemetry-sink, --telemetry-endpoint, --telemetry-diagrams.
ENV: Standard OTel envs + VISOR_TELEMETRY_*.
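As a sketch of how env overrides might be layered over the config file, the function below merges defaults, file config, and environment variables. The variable names VISOR_TELEMETRY_ENABLED and VISOR_TELEMETRY_SINK are assumptions (the RFC only fixes the VISOR_TELEMETRY_* prefix), as are the TelemetrySettings type and the precedence order. OTEL_EXPORTER_OTLP_ENDPOINT is the standard OTel variable.

```typescript
type TelemetrySink = "otlp" | "file" | "console";

// Hypothetical subset of the telemetry block.
interface TelemetrySettings {
  enabled: boolean;
  sink: TelemetrySink;
  endpoint?: string;
}

const TELEMETRY_DEFAULTS: TelemetrySettings = { enabled: false, sink: "otlp" };

function isSink(value: string): value is TelemetrySink {
  return value === "otlp" || value === "file" || value === "console";
}

// Assumed precedence: defaults < config file < environment.
function resolveTelemetry(
  fileConfig: Partial<TelemetrySettings>,
  env: Record<string, string | undefined>
): TelemetrySettings {
  const merged: TelemetrySettings = { ...TELEMETRY_DEFAULTS, ...fileConfig };
  const enabled = env["VISOR_TELEMETRY_ENABLED"]; // assumed variable name
  if (enabled !== undefined) merged.enabled = enabled === "true";
  const sink = env["VISOR_TELEMETRY_SINK"]; // assumed variable name
  if (sink !== undefined && isSink(sink)) merged.sink = sink; // unknown values are ignored
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  if (endpoint) merged.endpoint = endpoint;
  return merged;
}
```

Passing the environment in as a parameter, rather than reading process.env inside the function, keeps the resolution logic easy to unit test.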
- FileSpanExporter writes NDJSON (one span per line) simplified span objects to output/traces/run-<id>-<ts>.ndjson.
- These files are ingestible by an OTel Collector via a filelog receiver + transform pipeline to OTLP.
- Optional OTLP JSON export may be added later if needed.
- Provide scripts/push-traces.ts (planned) to upload stored traces to a collector.
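The NDJSON layout can be illustrated with a pair of pure functions. The SimplifiedSpan fields are assumptions, since this RFC does not fix the exact field set the exporter writes; the numeric timestamp in the file name is also an assumption about the format of <ts>. Writing the resulting string to disk is left to the caller.

```typescript
// Hypothetical simplified span shape; the real field set may differ.
interface SimplifiedSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTimeMs: number;
  endTimeMs: number;
  status: "OK" | "ERROR";
  attributes: Record<string, string | number | boolean>;
}

// File name following the run-<id>-<ts>.ndjson pattern.
function traceFileName(runId: string, ts: number): string {
  return "output/traces/run-" + runId + "-" + ts + ".ndjson";
}

// One JSON object per line; JSON.stringify escapes embedded newlines,
// so each span always occupies exactly one line.
function toNdjson(spans: SimplifiedSpan[]): string {
  return spans.map((span) => JSON.stringify(span) + "\n").join("");
}

// Inverse of toNdjson, e.g. for an upload script or tests.
function parseNdjson(text: string): SimplifiedSpan[] {
  return text
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as SimplifiedSpan);
}
```

Since every line is a complete JSON object, a run that is interrupted mid-write leaves at most one truncated final line, and earlier spans stay readable.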
- Existing ExecutionStatistics (engine) remains the authoritative per‑run aggregate.
- run_completed span records aggregate counters as attributes; metrics are derived as histograms/counters.
- This RFC supersedes earlier execution‑only docs for schema guidance; execution-statistics-rfc.md remains as historical rationale.
- src/telemetry/opentelemetry.ts: NodeSDK bootstrap, exporters, resource.
- src/check-execution-engine.ts: create visor.run and visor.check/routing/foreach.item spans; add events and attributes.
- src/providers/*: wrap provider calls with visor.provider spans; propagate context.
- src/reviewer.ts: after content assembly, scan for Mermaid, emit visor.mermaid spans (attributes only).
- src/pr-analyzer.ts: enrich root span with PR/repo attributes.
- src/github-check-service.ts: optional summary check; include trace id link in details.
- Logger: append trace_id/span_id when telemetry enabled.
// src/logger.ts (sketch)
import { context, trace } from '@opentelemetry/api';

// Returns the ids of the currently active span, or undefined outside any span.
function traceContext() {
  const span = trace.getSpan(context.active());
  const ctx = span?.spanContext();
  return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId, trace_flags: ctx.traceFlags } : undefined;
}

// Writes one JSON log line to stderr, with trace correlation fields when available.
function write(level: string, msg: string, extra?: Record<string, unknown>) {
  const tc = traceContext();
  const payload = { level, msg, ts: new Date().toISOString(), ...(tc || {}), ...(extra || {}) };
  process.stderr.write(JSON.stringify(payload) + '\n');
}

- Provide example Grafana dashboards (Tempo + Prometheus):
  - Run overview: success/failure/skip, duration percentiles, parallelism, issue counts by severity.
  - Check deep dive: duration by check, error hot spots, routing actions.
  - Diagram quality: syntax pass rate, average score.
- SDK bootstrap + FileSpanExporter + minimal spans in engine/providers.
- OTLP exporters and CI config; Grafana/Tempo example dashboards.
- Mermaid evaluation attributes; logs correlation; script for pushing stored traces.
- CI runs emit full traces with AlwaysOn sampling and no drops under typical load (batched exports).
- Serverless mode writes valid NDJSON traces (one simplified span per line) ingestible by an OTel Collector.
- Spans show the expected hierarchy; attributes are present and PII‑safe by default.
- Diagram spans emit evaluation metrics only; no rendering or diagram source persisted.
- Overhead: keep spans coarse (few levels), batch aggressively; disable heavy auto‑instrumentations.
- Data leakage: default redaction and attribute size caps; no raw code/AI payloads.
- Volume: AlwaysOn in CI; rely on queue sizing and backpressure; allow local ratio sampling.
Open questions:
- Default protocol (HTTP vs gRPC) for OTLP in GitHub Actions.
- Injecting traceparent into command providers by default to link downstream tools.
- Which CI environments beyond GitHub Actions to tailor resource attributes for.
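For the traceparent question, a sketch of building the W3C Trace Context header value and handing it to a child process may help frame the discussion. The header layout (version 00, 32-hex trace id, 16-hex parent id, 2-hex flags) follows the W3C specification; passing it through a TRACEPARENT environment variable is an assumption about how command providers would receive it, and the function names are hypothetical.

```typescript
interface SpanIds {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  traceFlags: number; // 1 means sampled
}

// Builds "00-<trace-id>-<parent-id>-<flags>" per W3C Trace Context.
function formatTraceparent(ids: SpanIds): string {
  if (!/^[0-9a-f]{32}$/.test(ids.traceId)) throw new Error("traceId must be 32 lowercase hex chars");
  if (!/^[0-9a-f]{16}$/.test(ids.spanId)) throw new Error("spanId must be 16 lowercase hex chars");
  const flags = ("0" + (ids.traceFlags & 0xff).toString(16)).slice(-2);
  return "00-" + ids.traceId + "-" + ids.spanId + "-" + flags;
}

// Assumed convention: expose the header to child processes as TRACEPARENT.
function withTraceparent(
  env: Record<string, string | undefined>,
  ids: SpanIds
): Record<string, string | undefined> {
  return { ...env, TRACEPARENT: formatTraceparent(ids) };
}
```

Downstream tools that understand the variable can then attach their spans to the same trace; tools that do not simply ignore it.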
Completed
- Single RFC consolidating telemetry + OpenTelemetry tracing.
- Config: telemetry block (enabled, sink=otlp|file|console, otlp/file options) with env overrides.
- Bootstrap: initTelemetry (OTLP HTTP traces, optional OTLP metrics; serverless NDJSON exporter); AlwaysOn in CI (no rate limiting).
- Log correlation: logger adds [trace_id span_id]; console.* patched when telemetry enabled.
- CLI and GitHub Actions: root visor.run span around major phases; full root coverage in CLI; partial in Actions (reviewPR path).
- Spans: visor.check (parent), visor.provider (child), visor.foreach.item (per item).
- Events: check.started/completed, if.evaluated, fail_if.evaluated, check.skipped, retry.scheduled, goto.target, run.remediation, command.exec.completed/error, http.request/response, foreach.started/skipped/completed.
- Metrics: histograms (visor.check.duration_ms, visor.provider.duration_ms, visor.foreach.item.duration_ms), counter (visor.check.issues{severity}), gauge (visor.run.active_checks via UpDownCounter), counter (visor.fail_if.triggered{scope}).
- Providers: command exec child span + W3C context injection; http request/response events with status.
- E2E tests: forEach + transform chains; read and validate resulting JSON; assert spans/events counts and attributes.
- Example Grafana dashboards: overview + diagrams (skeleton) and setup guide.
- Schema/types: telemetry block added to VisorConfig TypeScript types and generator picks it up for JSON Schema.
- Mermaid telemetry: emit diagram.block with full Mermaid code (no local analysis or redaction).
Changed vs. original plan
- Serverless exporter writes NDJSON simplified spans (one span per line) instead of full OTLP JSON. Rationale: easier diffs and ingestion via Collector filelog; OTLP JSON remains optional for later.
- Diagram rendering explicitly out of scope; only telemetry signals are planned. (No rendering implemented.)
Pending / Next
- Wrap the entire GitHub Actions run() with a single root visor.run span (early logs included); ensure consistent repo/PR attributes at start.
- Diagram counters/panels: add metrics counters for diagram blocks to power Grafana panels.
- Redaction controls: enforce hashing/truncation for file paths/messages when enabled in telemetry.redaction.
- OTLP Logs: consider optional exporter when Node OTel Logs is stable; otherwise keep JSONL logs with trace correlation.
- Telemetry summary GitHub Check: optional status check summarizing totals and linking to trace.
- Additional tests: fail_if trigger E2E, Actions path E2E, command/http provider span details.
- Schema: add telemetry block to generated config schema and validation (Ajv), with docs/examples.
- Optional OTLP JSON file exporter alongside NDJSON if needed by downstream tooling.