feat(clickhouse): OPS HCL per-node composition + migration codegen#65969
feat(clickhouse): OPS HCL per-node composition + migration codegen#65969orian wants to merge 12 commits into
Conversation
Generate a ClickHouse migration from a declarative-HCL change. The HCL stays the source of truth for the schema; this produces the run_sql_with_exceptions(...) operations the existing infi.clickhouse_orm runner executes. - topology.py: explicit, reviewed map of each OPS object -> (node_roles, replicated, sharded). node_roles is a deliberate choice (the dump hostClusterRole vocabulary differs from NodeRole and migrations target a curated subset, so it can't be mechanically derived); seeded by introspecting ../clickhouse-schema and reconciled against 0273/0274. - gen_migration.py: runs ops/diff.sh, maps each DDL statement to its targeting, derives sharded / is_alter_on_replicated_table from the engine kind, and emits the operations. One HCL change can span multiple roles. Errors on objects missing from topology.py; flags UNSAFE recreations.
|
Reviews (1): Last reviewed commit: "feat(clickhouse): OPS HCL → migration co..." | Re-trigger Greptile |
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Express object placement via composition instead of a side-table (option 2), replacing codegen/topology.py. A node's schema = compose(its layers); node_roles and per-env gating are derived from which compositions surface a change. - Split base/+prod/ into shared/ (all-role objects), roles/ops + roles/ops-prod (OPS-only), keeping env/ overlays. Needed because role-specific variants exist (the DATA custom_metrics aggregator unions an extra metric the OPS node lacks), which a flat object->roles map cannot express. - Add the `nodes` composition manifest: (env, role) -> ordered layer list. - check.sh / diff.sh are now manifest-driven, iterating every (env, role); add per-role goldens (shared managed subset) for data/aux/ai_events/sessions. - gen_migration.py diffs per (env, role) node and derives node_roles (union of surfacing roles), sharded / is_alter_on_replicated_table (engine), and flags env-specific statements for CLOUD_DEPLOYMENT gating. topology.py removed. check.sh green across all (env, role) compositions.
ec7c645 to
f70dd7d
Compare
| # The OPS node is modeled per env (it carries the env differences). The other roles | ||
| # carry only the shared managed subset today, so one representative env (prod-us) is | ||
| # enough to put them in the role set — shared objects are env-identical. Fully modeling | ||
| # the DATA node (events_recent + the DATA-specific custom_metrics aggregator + |
There was a problem hiding this comment.
Generated migrations misroute shared query_log_archive changes to OPS nodes
The new generator derives node_roles from the composition manifest, but the manifest currently models non-OPS roles as shared only and explicitly says events_recent is temporarily treated as OPS-only. That means shared query_log_archive objects like query_log_archive/writable_query_log_archive are now generated with [DATA, ENDPOINTS, AUX, AI_EVENTS, SESSIONS, OPS] instead of the narrower role sets used by existing migrations (for example writable_query_log_archive is local-only in 0276, and historical write/read path migrations target specific roles like DATA or ENDPOINTS). Because this script emits ready-to-commit migrations, a developer changing the OPS HCL can now auto-generate DDL that runs on extra clusters and overwrites infra-managed distributed tables, creating an integrity/availability incident in production ClickHouse.
Prompt To Fix With AI
Do not derive migration `node_roles` solely from the current per-node composition manifest while that manifest is intentionally incomplete for DATA/endpoints/shared objects. Restore an explicit authoritative mapping for migration targeting, or extend the manifest/model so it fully captures which roles should receive each managed object and preserves the existing special cases (for example writable/read query_log_archive objects that are cloud-infra managed or only exist on certain clusters). Add a guard/test comparing generated role targeting for known OPS objects against existing hand-written migrations like 0273/0276 to prevent widening execution scope.Severity: medium | Confidence: 87% | React with 👍 if useful or 👎 if not
Fixed in a4128f52.
Focus the managed roles on OPS and LOGS; comment out the other shared-only roles in the manifest. node_roles for shared objects now derives to [LOGS, OPS] (uncomment a role + regenerate its golden to widen). - Add LOGS to the manifest for all three cloud envs (dev, prod-us, prod-eu) with per-env goldens (the shared managed subset extracted from each env's logs host). - Remove the data/aux/ai_events/sessions goldens (their manifest lines are commented out). - Generator ROLE_ORDER includes logs. check.sh green for OPS (local + 3 cloud envs) and LOGS (3 cloud envs).
Author the LOGS role's full schema (logs/traces/metrics ingestion) from the clickhouse-schema *-logs.hcl dumps, composed per cloud env: - roles/logs/ — the 16 objects identical across dev/prod-us/prod-eu. - env-logs/<env>/ — per-env objects (kafka configs, zoo_paths, distributed cluster names that differ; the traces subsystem on prod-us/prod-eu only; the prod-us-only trace_spans_kafka_metrics path). dev's daniel_ddl_test is left unmanaged. - LOGS composition = shared + roles/logs + env-logs/<env>; goldens regenerated as the full logs node (minus the unmanaged dev table). - engine_map now resolves one composition per role so is_alter/sharded are correct for LOGS tables too. check.sh green for OPS (local + 3 cloud) and LOGS (3 cloud).
| prod-us ops shared roles/ops roles/ops-prod env/prod-us | ||
| prod-eu ops shared roles/ops roles/ops-prod env/prod-eu | ||
|
|
||
| dev logs shared roles/logs env-logs/dev |
There was a problem hiding this comment.
Generator now deploys shared query_log_archive DDL onto LOGS nodes
The latest manifest change adds LOGS to the managed set and gen_migration.py derives node_roles directly from that manifest, so any shared HCL change under shared/qla.hcl will now generate migrations targeting [NodeRole.LOGS, NodeRole.OPS]. Existing migrations show query_log_archive objects are only managed on OPS / data-path roles, while LOGS has its own separate schema. This makes the generator produce ready-to-commit migrations that create or alter query_log_archive tables/views on LOGS clusters where they do not belong, which can break the logs cluster or overwrite unrelated objects there.
Prompt To Fix With AI
Do not infer that every role composing `shared/` should receive every generated migration operation. Separate shared schema fragments by target domain, or add an explicit per-object/per-layer migration-target mapping so query_log_archive and custom_metrics changes are not emitted for LOGS unless there is verified migration history and runtime ownership for those objects on LOGS. Add a regression test that changing `shared/qla.hcl` does not generate `NodeRole.LOGS` operations.Severity: medium | Confidence: 92% | React with 👍 if useful or 👎 if not
Fixed in 7afcae27, 6462b429.
ops/gen-sql.sh emits sql/<env>-<role>.sql for every node in the manifest — the full, dependency-ordered CREATE schema (hclexp diff vs an empty schema). E.g. apply sql/local-ops.sql to a local ClickHouse to create the OPS schema. check.sh now verifies the committed sql/ is in sync with the HCL (regenerates to a temp dir and diffs), so it can't go stale.
…-a-change runbook
…ded) The bin/hclexp wrapper already falls back to the pinned ghcr.io image, but the scripts pass committed-tree / scratch paths under $TMPDIR that weren't mounted, so the docker path broke. Mount the temp dir alongside the repo so diff.sh, gen-sql.sh, check.sh, and the generator all work via Docker with zero setup. Docs now recommend Docker (just have it running) and show the explicit `docker run … ghcr.io/posthog/chschema` command; HCLEXP_BIN is documented as an optional local-binary speedup.
Refreshing a golden is now `ops/gen-golden.sh` (hclexp load per node from the manifest; optional env/role filter) — no hand-typed layer lists. Goldens are regenerated to the resolved-composition form (the desired schema), so check.sh's diff now catches a stale golden (a layer edited without regenerating). Reality fidelity (HCL vs the live cluster) is the post-deploy introspection's job. Docs + check.sh header updated to match.
--auto writes the operations to the next posthog/clickhouse/migrations/NNNN_<name>.py (number derived from max_migration.txt) and bumps max_migration.txt. Without it, output still goes to stdout/--out. Docs updated. Also drop a stray test_table that a prior gen-golden run had captured into the logs goldens, restoring a clean/consistent baseline.
The tooling covers OPS and LOGS (and is general per-node), so the `ops/` nesting was misleading. Move hcl/ops/* up to hcl/ and update path references in check.sh, diff.sh, gen-sql.sh, gen-golden.sh, codegen/gen_migration.py, the CI workflow, and the docs. No schema change.
…bility Adopt chschema #68/#69/#70 (hclexp sha-1871283): the layer loader now resolves file()/heredoc query refs and normalizes them to one canonical form, so a query authored as file("sql/<object>.sql") compares equal to its golden. Move all 34 long inline view/materialized_view `query = "..."` one-liners (some 1800+ chars) out of the layer files into beautified `<layer>/sql/<object>.sql` fragments, referenced via query = file("sql/<object>.sql"). Regenerate golden/ (now readable heredocs) and sql/ (beautified CREATE bodies). Bump the pinned image in bin/hclexp and the CI workflow. No schema change — check.sh is green (every node diffs clean against its golden).
Problem
The OPS ClickHouse schema is declarative HCL, but two things were missing to apply a change safely: a way to know which node roles an object lives on, and a generator that turns an HCL edit into a correctly-targeted migration.
A node's schema is fixed by two axes — environment (dev/prod-us/prod-eu) and node role (OPS/DATA/ENDPOINTS/…). An earlier attempt expressed the role axis as a flat object→roles map, but that can't model role-specific variants: e.g. the
custom_metricsaggregator on a DATA node unions an extracustom_metrics_events_recent_lagthat the OPS node lacks. So role becomes a second composition axis instead.Changes
Per-node composition. A node's schema =
compose(its layers):shared/— objects identical on every role (query_log_archivepath + the invariantcustom_metrics_*sub-views).roles/ops/,roles/ops-prod/— OPS-exclusive objects.env/*/— env overlays.nodes— the composition manifest:(env, role) -> ordered layer list. The single source of truth for placement (replaces the object→roles map).Verification.
check.sh/diff.share now manifest-driven, iterating every(env, role). Added per-role goldens (the shared managed subset) fordata/aux/ai_events/sessions;endpoints/localare validate-only (no captured host).Codegen (
codegen/gen_migration.py). Diffs each(env, role)node (committed vs working) and derives the targeting:node_roles= the roles whose composition surfaced the statement — shared objects → all roles, OPS-only →[OPS]. Derived from composition, no side-table.is_alter_on_replicated_table/sharded= from the engine kind.settings.CLOUD_DEPLOYMENTgating.Known follow-ups
CLOUD_DEPLOYMENTbranches for env-specific statements (currently a# NOTEcomment).max_migration.txt(currently stdout).events_recent+ the DATA-specificcustom_metricsaggregator); until thenevents_recentderives as OPS-only.How did you test this code?
check.shgreen across all(env, role)compositions (OPS per env; data/aux/ai_events/sessions diffing their goldens; endpoints/local validate-only).sharded_query_log_archive(OPS-only) andquery_log_archive(shared): emitted two correctly-targeted operations —[OPS]withis_alter_on_replicated_table=True, andALL_ROLESwith both flagsFalse— derived purely from composition.