Analytics sidecar: ClickHouse + Grafana for cross-session event analysis & historical replay #336

@jonathaneoliver

Description

Problem

testing.html renders rich live charts (Chart.js + vis-timeline) over the SSE event stream from go-proxy, but events are ephemeral — once a session ends, the data is gone. We have no way to:

  • Replay a past session through the same visualizer.
  • Run cross-session analytics (e.g. "compare buffer-depth distributions across all sessions with transient-shock failure injection," "p95 startup time over the last 30 days," "which event sequences correlate with rebuffer events").
  • Do ad-hoc SQL exploration over the event corpus.

We want this without adding storage / archival code paths to go-proxy or other existing services. The streaming app stays as it is; analytics is a sidecar.

Proposal

Stand up an analytics tier alongside the existing stack:

  1. ClickHouse (single-node, Docker) — columnar store for events. One wide events table keyed by session_id + ts, typed columns for hot-path fields (event_type, bitrate, buffer_depth, fps, dropped, etc.), a JSON column for the long tail.
  2. SSE→ClickHouse forwarder — small standalone process (Go or Python, ~50–100 lines) that subscribes to /api/sessions/stream and batch-inserts into ClickHouse. Not part of go-proxy. If it dies, live charts keep working; we just lose archival until restart.
  3. Grafana with the official ClickHouse datasource — ad-hoc dashboards, cross-session aggregates, the "Splunk-equivalent" exploration surface.
  4. testing.html historical mode — a ?session=<id>&replay=1 mode that fetches the event array from ClickHouse over HTTP (SELECT … FORMAT JSON) instead of subscribing to SSE. Same Chart.js / vis-timeline renderer; only the feeder swaps. Add a scrubber, hide the live/pause toggle.

Why ClickHouse over alternatives

| Option | Verdict |
| --- | --- |
| Loki + Grafana | Strong on logs, weak on cross-session math. Skip. |
| OpenSearch / ELK | Heavier ops, DSL instead of SQL. Skip. |
| TimescaleDB | Fine, but less efficient for wide event rows + columnar scans. |
| DuckDB + Parquet | Tempting (zero servers) but batch-only; doesn't fit "query yesterday and right now in the same query." |
| ClickHouse | Real-time ingest, columnar SQL at scale, single-node trivial in Docker, first-class Grafana plugin. ✅ |
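Concretely, the wide `events` table from the proposal could start as the sketch below. Column names beyond those listed above (`session_id`, `ts`, `event_type`, `bitrate`, `buffer_depth`, `fps`, `dropped`) are illustrative, and the `TTL` clause is one way to implement the default 30-day retention:

```sql
CREATE TABLE IF NOT EXISTS events
(
    session_id   String,
    ts           DateTime64(3),
    event_type   LowCardinality(String),
    bitrate      UInt32,
    buffer_depth Float32,
    fps          Float32,
    dropped      UInt32,
    payload      String  -- JSON long tail; ClickHouse's native JSON type is another option
)
ENGINE = MergeTree
ORDER BY (session_id, ts)
TTL toDateTime(ts) + INTERVAL 30 DAY;
```

Ordering by `(session_id, ts)` makes both the per-session replay fetch and time-ranged cross-session scans cheap.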

Acceptance criteria

  • ClickHouse service added to docker-compose.yml (and k3s manifests).
  • events schema designed; migrations applied at startup.
  • Forwarder process subscribes to SSE, batch-inserts events, survives reconnects.
  • Grafana service added with ClickHouse datasource provisioned.
  • One starter Grafana dashboard: per-session timeline + cross-session aggregates (e.g. variant distribution, rebuffer count by failure type).
  • testing.html supports replay=1&session=<id>: fetches events from ClickHouse, renders through the existing Chart.js / vis-timeline code, scrubber for time range.
  • No code changes in go-proxy, go-live, or go-upload beyond what's needed to expose the SSE stream (already exposed).
  • Retention policy documented (default 30 days, configurable).
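Two illustrative queries against the criteria above, assuming the column names from the proposal and a `failure_type` field in the JSON payload (both are assumptions, not a fixed schema):

```sql
-- Replay feed for testing.html (?session=<id>&replay=1), via HTTP:
SELECT *
FROM events
WHERE session_id = {session:String}
ORDER BY ts
FORMAT JSON;

-- Starter-dashboard aggregate: rebuffer count by failure type.
SELECT JSONExtractString(payload, 'failure_type') AS failure_type,
       countIf(event_type = 'rebuffer') AS rebuffers
FROM events
GROUP BY failure_type
ORDER BY rebuffers DESC;
```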

Out of scope

  • Replacing the live-mode SSE path. Live stays SSE-direct.
  • Multi-tenant access controls on Grafana.
  • Real-time alerting on event patterns (separate issue if we want it).

Open questions

  • Forwarder language: Go (matches stack) vs Python (faster to prototype). Lean Go.
  • Schema: how aggressively to flatten event payloads vs lean on ClickHouse JSON type.
  • Where forwarder runs: same container as go-proxy (sidecar) vs its own service. Lean own service.
