Deep-dive companion to README.md. Read the README first for the
quickstart and headline story; read this for the multi-node compose
shape, plugin write-back path, schema rationale, and scaling notes.
- Domain model
- Schema and per-table retention
- Multi-node compose
- Plugin write-back path (cross-node)
- Three patterns for feeding a UI panel
- Processing Engine triggers
- Enterprise features used
- Token bootstrap (cluster-wide)
- Plugin conventions and gotchas
- Security notes
- Scaling to production
- Extending the cluster
A data-center Clos fabric: 8 spines, 16 leaves, 48 servers per leaf
rack (768 servers total, modeled only as flow src_ip/dst_ip).
Total ~1024 fabric interfaces, ~128 leaf↔spine BGP sessions, ~5,000
sampled flow records per second, 64 latency probe pairs.
Vocabulary: fabric, spine, leaf, ECMP, BGP, peer, prefix, flap, flow record, sampled, top-N talkers, microburst, ECN-mark, PFC pause, oversubscription, hotspot.
Six tables (see influxdb/schema.md for the detailed reference).
Tables are created explicitly at init time via the configure API
(POST /api/v3/configure/table or the influxdb3 create table CLI),
not via implicit creation from the first write. This means:
- Caches and triggers can reference tables immediately at init time without a sentinel-row workaround.
- Schemas (tag set, field types, retention) are declared up front rather than inferred.
- LVC and DVC reads don't have to filter out a `__init` sentinel row.
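For concreteness, the init-time call reduces to one request per table. The sketch below goes through the HTTP API; the JSON body shape, the field types, and the tag/field names are assumptions for illustration, and init.sh plus influxdb/schema.md remain the authoritative versions.

```python
# Sketch of init-time explicit table creation via POST /api/v3/configure/table.
# The JSON body shape and the tag/field names below are illustrative
# assumptions; init.sh and influxdb/schema.md are authoritative.
import httpx

BASE_URL = "http://nt-ingest-1:8181"
TOKEN = open("/var/lib/influxdb3/.nt-operator-token").read().strip()

def create_table(db: str, table: str, tags: list[str], fields: list[dict]) -> None:
    resp = httpx.post(
        f"{BASE_URL}/api/v3/configure/table",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"db": db, "table": table, "tags": tags, "fields": fields},
        timeout=10.0,
    )
    resp.raise_for_status()

create_table(
    "nt",
    "bgp_sessions",
    tags=["leaf", "spine"],                      # illustrative names only
    fields=[{"name": "state", "type": "utf8"}],  # see influxdb/schema.md for the real schema
)
```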
fabric_health is the only table with a retention period (24 hours),
demonstrating per-table retention. Other tables have no retention in
the demo. In production, typical retention values would be:
- `interface_counters`, `bgp_sessions`, `latency_probes`: 7-30 days
- `flow_records`: 30-90 days (regulatory)
- `fabric_health`, `anomalies`: 365 days (operational history)
Five InfluxDB 3 Enterprise nodes plus a one-shot token-bootstrap and
init container, plus simulator/UI/scenarios. All five InfluxDB nodes
mount the same influxdb-data named volume at /var/lib/influxdb3.
Sharing the disk is what makes the cluster a cluster — every node
sees the same object store and catalog, so writes from one ingest node
are immediately visible from the query and process nodes, and the
catalog (databases, tables, caches, triggers) stays consistent across
all nodes without explicit coordination.
| Node | Mode | Purpose |
|---|---|---|
| `nt-ingest-1`, `nt-ingest-2` | ingest | Accept writes; simulator round-robins per batch |
| `nt-query` | query | Serves UI partials, browser direct fetches, CLI; hosts request plugins |
| `nt-compact` | compact | Background compaction only |
| `nt-process` | process,query | Hosts schedule plugins; queries locally, writes back via httpx through an ingest node |
The process node uses the process,query mode combo so plugin code can
call influxdb3_local.query() against the local engine without
HTTP-hopping to another node for reads. (Setting --plugin-dir
implicitly adds process mode; explicitly setting --mode query
keeps the query engine available.)
This is a new convention introduced by this repo (now codified in
the meta repo's CONVENTIONS.md).
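A minimal sketch of what the mode combo buys plugin code. It assumes the Processing Engine's `process_scheduled_call(influxdb3_local, call_time, args)` entry point and that `influxdb3_local.query()` returns rows as dicts; the SQL and table names are illustrative.

```python
# Why process,query matters: the schedule plugin can read locally.
# Entry-point signature and query() return shape are assumptions based on the
# Processing Engine docs; the SQL and table names are illustrative.
def process_scheduled_call(influxdb3_local, call_time, args=None):
    rows = influxdb3_local.query(
        "SELECT COUNT(*) AS n FROM bgp_sessions "
        "WHERE time > now() - INTERVAL '5 seconds'"
    )
    influxdb3_local.info(f"recent BGP rows: {rows[0]['n'] if rows else 0}")
    # The write half cannot stay local; it goes back out over HTTP to an
    # ingest node, as described in the next section.
```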
A schedule plugin running on a process-only node has no obvious local
ingest target — the engine doesn't accept writes locally on a
non-ingest node. The plugin module-level code therefore loads the
admin token once at import time (from the shared volume) and uses
httpx to POST line protocol back through an ingest node's
/api/v3/write_lp endpoint. The shared plugins/_writeback.py
module factors this out: round-robin over the configured ingest URLs,
one fallback hop on connection error.
Configuration via env vars on the process node (set in
docker-compose.yml):
NT_INGEST_URLS=http://nt-ingest-1:8181,http://nt-ingest-2:8181
NT_DB=nt
NT_TOKEN_FILE=/var/lib/influxdb3/.nt-operator-token
LineBuilder is not used by the schedule plugins in this repo —
the cross-node write-back replaces it.
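Condensed, the helper amounts to something like the sketch below. The real plugins/_writeback.py may differ in detail; what the sketch shows (token read once at import, round-robin over NT_INGEST_URLS, one fallback hop on connection error, POST to /api/v3/write_lp) is the behavior described above.

```python
# plugins/_writeback.py, condensed sketch (the real module may differ):
# admin token loaded once at import time from the shared volume, round-robin
# over NT_INGEST_URLS, one fallback hop on connection error.
import itertools
import os

import httpx

_DB = os.environ.get("NT_DB", "nt")
_URLS = [u.strip() for u in os.environ.get(
    "NT_INGEST_URLS", "http://nt-ingest-1:8181").split(",")]
_TOKEN = open(os.environ["NT_TOKEN_FILE"]).read().strip()   # once, at import
_ring = itertools.cycle(range(len(_URLS)))

def write_lines(lines: list[str]) -> None:
    """POST line protocol through an ingest node's /api/v3/write_lp endpoint."""
    body = "\n".join(lines)
    start = next(_ring)
    # Try the next ingest node in the ring, then one fallback on connect errors.
    for url in (_URLS[start], _URLS[(start + 1) % len(_URLS)]):
        try:
            resp = httpx.post(
                f"{url}/api/v3/write_lp",
                params={"db": _DB},
                headers={"Authorization": f"Bearer {_TOKEN}"},
                content=body,
                timeout=5.0,
            )
            resp.raise_for_status()
            return
        except httpx.ConnectError:
            continue
    raise RuntimeError("all ingest nodes unreachable")
```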
This repo demonstrates all three ways to get data into the dashboard, side-by-side, each with its own latency badge:
| Pattern | Where the call goes | Used for |
|---|---|---|
| SQL via FastAPI (Python proxy) | browser → `nt-ui:8080/partials/...` → `nt-query:8181/api/v3/query_sql` | Banner, KPIs, throughput chart, anomalies |
| SQL from browser (DVC TVF) | browser → `nt-query:8181/api/v3/query_sql` directly | Source-IP typeahead. Sub-ms badge teaches DVC speed. |
| Request plugin from browser (Processing Engine) | browser → `nt-query:8181/api/v3/engine/<name>` directly | Top-N talkers, source-IP detail. Composite payloads. |
When to pick which: SQL through FastAPI when the response is HTML fragments (HTMX swaps); SQL direct from browser when the cache speed is the headline (typeahead); request plugin when the response is a composite shape that joins multiple queries' worth of data.
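For reference, the typeahead's direct fetch reduces to one query_sql call against the DVC TVF. The Python equivalent below assumes the cache sits on flow_records' src_ip column and that format=json returns a list of row objects; the browser issues the same request with fetch() and the token it gets from the template context.

```python
# Equivalent of the typeahead's direct browser fetch, expressed in Python.
# Assumes the DVC exposes a src_ip column and that format=json returns a
# list of row objects; the browser does the same thing with fetch().
import httpx

QUERY_URL = "http://nt-query:8181/api/v3/query_sql"
TOKEN = open("/var/lib/influxdb3/.nt-operator-token").read().strip()

resp = httpx.post(
    QUERY_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "db": "nt",
        "format": "json",
        # DVC read: no scan of flow_records, just the in-memory distinct set.
        "q": "SELECT src_ip FROM distinct_cache('flow_records', 'src_ip_distinct') "
             "WHERE src_ip LIKE '10.1.%' LIMIT 20",
    },
    timeout=5.0,
)
resp.raise_for_status()
print(resp.json())   # e.g. [{"src_ip": "10.1.0.1"}, ...]
```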
| Name | Type | Spec | Where it runs | Effect |
|---|---|---|---|---|
| `fabric_health` | Schedule | `every:5s` | process | Writes one row per layer to `fabric_health` |
| `anomaly_detector` | Schedule | `every:5s` | process | Detects and writes anomalies to `anomalies` |
| `top_talkers` | Request | `request:top_talkers` | query | Top `src_ip` aggregates |
| `src_ip_detail` | Request | `request:src_ip_detail` | query | Composite drill-down for one IP |
The repo uses every:5s exclusively for schedule triggers — short,
regular intervals don't need cron's time-of-day alignment, and every:
is more readable.
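What ensure_triggers() amounts to, expressed as one HTTP call per trigger. The /api/v3/configure/processing_engine_trigger path and the body field names here are assumptions about the configure API; init.sh, which may use the influxdb3 create trigger CLI instead, is the authoritative version.

```python
# Sketch of registering a schedule trigger over HTTP. Endpoint path and body
# field names are assumptions; init.sh is authoritative (it may use the CLI:
# influxdb3 create trigger --trigger-spec "every:5s" ...).
import httpx

PROCESS_URL = "http://nt-process:8181"   # schedule triggers run on the process node
TOKEN = open("/var/lib/influxdb3/.nt-operator-token").read().strip()

resp = httpx.post(
    f"{PROCESS_URL}/api/v3/configure/processing_engine_trigger",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "db": "nt",
        "trigger_name": "fabric_health",
        "plugin_filename": "schedule_fabric_health.py",
        "trigger_specification": "every:5s",
    },
    timeout=10.0,
)
resp.raise_for_status()
```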
| Feature | Where |
|---|---|
| Multi-node ingest | 2 ingest nodes; simulator round-robins |
| Multi-node split (ingest/query/compact/process) | 5-node compose |
| Last Value Cache | bgp_session_last; powers banner BGP up-count |
| Distinct Value Cache | src_ip_distinct; powers typeahead with sub-ms badge |
| Per-table retention | fabric_health 24h. Exclusive to this repo in the portfolio. |
| Schedule trigger via `every:` syntax | Exclusive to this repo in the portfolio. |
| Schedule plugin with cross-node write-back | Both schedule plugins via _writeback.py. New convention. |
| Request trigger | top_talkers + src_ip_detail on query node |
| Custom UI | Three patterns side-by-side |
A single token-bootstrap compose service generates one offline admin
token at first boot, written to the shared volume. All five InfluxDB
nodes start with --admin-token-file pointing at the same path; the
simulator, UI, init, and process node read the same token from the
same volume. License validation also happens once per cluster.
See CONVENTIONS.md in the meta repo. Highlights specific to this repo:
- `LineBuilder` is injected but not used by this repo's schedule plugins (they use httpx).
- 6-field cron OR `every:` interval; this repo uses `every:5s`.
- LVC reads via the `last_cache(table, cache_name)` TVF (sketched below).
- DVC reads via the `distinct_cache(table, cache_name)` TVF.
- Multiple unaliased `COUNT(*)` scalar subqueries don't compose under DataFusion.
- `date_bin()` returns ns-integer strings on the wire.
- Browser-facing endpoints need `INFLUX_PUBLIC_URL`.
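Three of these, illustrated in one place. Table, cache, and column names are illustrative rather than copied from the repo's panels.

```python
# Illustrations of three items above; SQL, table, and column names are
# illustrative rather than copied from the repo's panels.
from datetime import datetime, timezone

# LVC read via the last_cache TVF (assuming the cache is on bgp_sessions):
LVC_SQL = "SELECT * FROM last_cache('bgp_sessions', 'bgp_session_last')"

# Alias each COUNT(*) scalar subquery so DataFusion composes them:
COUNTS_SQL = """
SELECT
  (SELECT COUNT(*) FROM bgp_sessions)   AS bgp_rows,
  (SELECT COUNT(*) FROM latency_probes) AS probe_rows
"""

# date_bin() values arrive as nanosecond-integer strings; convert before charting:
def ns_string_to_iso(ns: str) -> str:
    return datetime.fromtimestamp(int(ns) / 1e9, tz=timezone.utc).isoformat()
```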
Demo simplifications, called out for production users:
- One admin token, shared by all services. Production should issue scoped tokens per service (read-only for UI, write-only for simulator, scoped for plugin write-back).
- The browser sees the admin token (passed in template context for the direct-fetch panels). Production should proxy through the UI backend or use a token-exchange flow.
- No TLS in compose. Production needs TLS between nodes and to clients.
- More ingest: add ingest-3, ingest-4, etc. Simulator's round-robin scales without code change. Production would put them behind a load balancer.
- Multi-query: add a second query node for read scaling. Both serve the same SQL endpoints; the UI can hit either one, or they can sit behind a load balancer.
- Object store: swap `file` for S3/GCS/Azure. No code changes; one env var per node.
- Retention: extend per-table retention to all tables per the production guidance in §2.
- K8s: the compose service shape maps 1:1 to a Helm chart per node-role. Not shipped here per portfolio policy.
To add a new schedule plugin:
- Create `plugins/schedule_<name>.py` following the existing pattern; import `from _writeback import write_lines`.
- Add a trigger registration in `init.sh`'s `ensure_triggers()`.
- Add a unit test under `tests/test_plugins/`.
- `make down && make up`; init.sh registers the new trigger on next boot.
To add a new request plugin:
- Create `plugins/request_<name>.py` (a skeleton is sketched after this list).
- Add a trigger registration with `--trigger-spec request:<name>`.
- Add a unit test.
- The plugin is reachable at `/api/v3/engine/<name>` after restart.
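A skeleton of what such a plugin might look like. The process_request signature and the return contract are assumptions about the Processing Engine's request-plugin convention; check the existing top_talkers and src_ip_detail plugins before copying, and note that all names below are illustrative.

```python
# plugins/request_example.py -- skeleton only. The process_request signature
# and the return contract are assumptions; the repo's existing request
# plugins are the reference. All names are illustrative.
import json

def process_request(influxdb3_local, query_parameters, request_headers,
                    request_body, args=None):
    # Reachable at /api/v3/engine/example once a request:example trigger exists.
    limit = int(query_parameters.get("limit", 10))
    rows = influxdb3_local.query(
        f"SELECT src_ip, COUNT(*) AS flows FROM flow_records "
        f"GROUP BY src_ip ORDER BY flows DESC LIMIT {limit}"
    )
    # Composite payloads would join several queries' worth of rows here.
    return json.dumps({"talkers": rows})
```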