feat(clickhouse): OPS HCL per-node composition + migration codegen by orian · Pull Request #65969 · PostHog/posthog

orian · 2026-06-25T05:36:50Z

Problem

The OPS ClickHouse schema is declarative HCL, but two things were missing to apply a change safely: a way to know which node roles an object lives on, and a generator that turns an HCL edit into a correctly-targeted migration.

A node's schema is fixed by two axes — environment (dev/prod-us/prod-eu) and node role (OPS/DATA/ENDPOINTS/…). An earlier attempt expressed the role axis as a flat object→roles map, but that can't model role-specific variants: e.g. the custom_metrics aggregator on a DATA node unions an extra custom_metrics_events_recent_lag that the OPS node lacks. So role becomes a second composition axis instead.

Changes

Per-node composition. A node's schema = compose(its layers):

shared/ — objects identical on every role (query_log_archive path + the invariant custom_metrics_* sub-views).
roles/ops/, roles/ops-prod/ — OPS-exclusive objects.
env/*/ — env overlays.
nodes — the composition manifest: (env, role) -> ordered layer list. The single source of truth for placement (replaces the object→roles map).

Verification. check.sh / diff.sh are now manifest-driven, iterating every (env, role). Added per-role goldens (the shared managed subset) for data/aux/ai_events/sessions; endpoints/local are validate-only (no captured host).

Codegen (codegen/gen_migration.py). Diffs each (env, role) node (committed vs working) and derives the targeting:

node_roles = the roles whose composition surfaced the statement — shared objects → all roles, OPS-only → [OPS]. Derived from composition, no side-table.
is_alter_on_replicated_table / sharded = from the engine kind.
env-specific statements are flagged for settings.CLOUD_DEPLOYMENT gating.

Known follow-ups

Auto-emit CLOUD_DEPLOYMENT branches for env-specific statements (currently a # NOTE comment).
Auto-place output as the next numbered migration + bump max_migration.txt (currently stdout).
Fully model the DATA node (events_recent + the DATA-specific custom_metrics aggregator); until then events_recent derives as OPS-only.

How did you test this code?

check.sh green across all (env, role) compositions (OPS per env; data/aux/ai_events/sessions diffing their goldens; endpoints/local validate-only).
Ran the generator on a change touching sharded_query_log_archive (OPS-only) and query_log_archive (shared): emitted two correctly-targeted operations — [OPS] with is_alter_on_replicated_table=True, and ALL_ROLES with both flags False — derived purely from composition.

Generate a ClickHouse migration from a declarative-HCL change. The HCL stays the source of truth for the schema; this produces the run_sql_with_exceptions(...) operations the existing infi.clickhouse_orm runner executes. - topology.py: explicit, reviewed map of each OPS object -> (node_roles, replicated, sharded). node_roles is a deliberate choice (the dump hostClusterRole vocabulary differs from NodeRole and migrations target a curated subset, so it can't be mechanically derived); seeded by introspecting ../clickhouse-schema and reconciled against 0273/0274. - gen_migration.py: runs ops/diff.sh, maps each DDL statement to its targeting, derives sharded / is_alter_on_replicated_table from the engine kind, and emits the operations. One HCL change can span multiple roles. Errors on objects missing from topology.py; flags UNSAFE recreations.

greptile-apps · 2026-06-25T05:40:56Z

_{Reviews (1): Last reviewed commit: "feat(clickhouse): OPS HCL → migration co..." | Re-trigger Greptile}

tests-posthog · 2026-06-25T05:59:22Z

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

Query snapshots have been automatically updated to match current output
These changes reflect modifications to database queries or schema

Next steps:

Review the query changes to ensure they're intentional
If unexpected, investigate what caused the query to change

Review snapshot changes →

Express object placement via composition instead of a side-table (option 2), replacing codegen/topology.py. A node's schema = compose(its layers); node_roles and per-env gating are derived from which compositions surface a change. - Split base/+prod/ into shared/ (all-role objects), roles/ops + roles/ops-prod (OPS-only), keeping env/ overlays. Needed because role-specific variants exist (the DATA custom_metrics aggregator unions an extra metric the OPS node lacks), which a flat object->roles map cannot express. - Add the `nodes` composition manifest: (env, role) -> ordered layer list. - check.sh / diff.sh are now manifest-driven, iterating every (env, role); add per-role goldens (shared managed subset) for data/aux/ai_events/sessions. - gen_migration.py diffs per (env, role) node and derives node_roles (union of surfacing roles), sharded / is_alter_on_replicated_table (engine), and flags env-specific statements for CLOUD_DEPLOYMENT gating. topology.py removed. check.sh green across all (env, role) compositions.

hex-security-app · 2026-06-25T08:53:23Z

+# The OPS node is modeled per env (it carries the env differences). The other roles
+# carry only the shared managed subset today, so one representative env (prod-us) is
+# enough to put them in the role set — shared objects are env-identical. Fully modeling
+# the DATA node (events_recent + the DATA-specific custom_metrics aggregator +


Generated migrations misroute shared query_log_archive changes to OPS nodes

The new generator derives node_roles from the composition manifest, but the manifest currently models non-OPS roles as shared only and explicitly says events_recent is temporarily treated as OPS-only. That means shared query_log_archive objects like query_log_archive/writable_query_log_archive are now generated with [DATA, ENDPOINTS, AUX, AI_EVENTS, SESSIONS, OPS] instead of the narrower role sets used by existing migrations (for example writable_query_log_archive is local-only in 0276, and historical write/read path migrations target specific roles like DATA or ENDPOINTS). Because this script emits ready-to-commit migrations, a developer changing the OPS HCL can now auto-generate DDL that runs on extra clusters and overwrites infra-managed distributed tables, creating an integrity/availability incident in production ClickHouse.

Prompt To Fix With AI

Do not derive migration `node_roles` solely from the current per-node composition manifest while that manifest is intentionally incomplete for DATA/endpoints/shared objects. Restore an explicit authoritative mapping for migration targeting, or extend the manifest/model so it fully captures which roles should receive each managed object and preserves the existing special cases (for example writable/read query_log_archive objects that are cloud-infra managed or only exist on certain clusters). Add a guard/test comparing generated role targeting for known OPS objects against existing hand-written migrations like 0273/0276 to prevent widening execution scope.

_{Severity: medium | Confidence: 87% | React with 👍 if useful or 👎 if not}

Fixed in a4128f52.

Focus the managed roles on OPS and LOGS; comment out the other shared-only roles in the manifest. node_roles for shared objects now derives to [LOGS, OPS] (uncomment a role + regenerate its golden to widen). - Add LOGS to the manifest for all three cloud envs (dev, prod-us, prod-eu) with per-env goldens (the shared managed subset extracted from each env's logs host). - Remove the data/aux/ai_events/sessions goldens (their manifest lines are commented out). - Generator ROLE_ORDER includes logs. check.sh green for OPS (local + 3 cloud envs) and LOGS (3 cloud envs).

Author the LOGS role's full schema (logs/traces/metrics ingestion) from the clickhouse-schema *-logs.hcl dumps, composed per cloud env: - roles/logs/ — the 16 objects identical across dev/prod-us/prod-eu. - env-logs/<env>/ — per-env objects (kafka configs, zoo_paths, distributed cluster names that differ; the traces subsystem on prod-us/prod-eu only; the prod-us-only trace_spans_kafka_metrics path). dev's daniel_ddl_test is left unmanaged. - LOGS composition = shared + roles/logs + env-logs/<env>; goldens regenerated as the full logs node (minus the unmanaged dev table). - engine_map now resolves one composition per role so is_alter/sharded are correct for LOGS tables too. check.sh green for OPS (local + 3 cloud) and LOGS (3 cloud).

hex-security-app · 2026-06-25T10:00:43Z

+prod-us  ops        shared roles/ops roles/ops-prod env/prod-us
+prod-eu  ops        shared roles/ops roles/ops-prod env/prod-eu
+
+dev      logs       shared roles/logs env-logs/dev


Generator now deploys shared query_log_archive DDL onto LOGS nodes

The latest manifest change adds LOGS to the managed set and gen_migration.py derives node_roles directly from that manifest, so any shared HCL change under shared/qla.hcl will now generate migrations targeting [NodeRole.LOGS, NodeRole.OPS]. Existing migrations show query_log_archive objects are only managed on OPS / data-path roles, while LOGS has its own separate schema. This makes the generator produce ready-to-commit migrations that create or alter query_log_archive tables/views on LOGS clusters where they do not belong, which can break the logs cluster or overwrite unrelated objects there.

Prompt To Fix With AI

Do not infer that every role composing `shared/` should receive every generated migration operation. Separate shared schema fragments by target domain, or add an explicit per-object/per-layer migration-target mapping so query_log_archive and custom_metrics changes are not emitted for LOGS unless there is verified migration history and runtime ownership for those objects on LOGS. Add a regression test that changing `shared/qla.hcl` does not generate `NodeRole.LOGS` operations.

_{Severity: medium | Confidence: 92% | React with 👍 if useful or 👎 if not}

Fixed in 7afcae27, 6462b429.

ops/gen-sql.sh emits sql/<env>-<role>.sql for every node in the manifest — the full, dependency-ordered CREATE schema (hclexp diff vs an empty schema). E.g. apply sql/local-ops.sql to a local ClickHouse to create the OPS schema. check.sh now verifies the committed sql/ is in sync with the HCL (regenerates to a temp dir and diffs), so it can't go stale.

…-a-change runbook

…ded) The bin/hclexp wrapper already falls back to the pinned ghcr.io image, but the scripts pass committed-tree / scratch paths under $TMPDIR that weren't mounted, so the docker path broke. Mount the temp dir alongside the repo so diff.sh, gen-sql.sh, check.sh, and the generator all work via Docker with zero setup. Docs now recommend Docker (just have it running) and show the explicit `docker run … ghcr.io/posthog/chschema` command; HCLEXP_BIN is documented as an optional local-binary speedup.

Refreshing a golden is now `ops/gen-golden.sh` (hclexp load per node from the manifest; optional env/role filter) — no hand-typed layer lists. Goldens are regenerated to the resolved-composition form (the desired schema), so check.sh's diff now catches a stale golden (a layer edited without regenerating). Reality fidelity (HCL vs the live cluster) is the post-deploy introspection's job. Docs + check.sh header updated to match.

--auto writes the operations to the next posthog/clickhouse/migrations/NNNN_<name>.py (number derived from max_migration.txt) and bumps max_migration.txt. Without it, output still goes to stdout/--out. Docs updated. Also drop a stray test_table that a prior gen-golden run had captured into the logs goldens, restoring a clean/consistent baseline.

The tooling covers OPS and LOGS (and is general per-node), so the `ops/` nesting was misleading. Move hcl/ops/* up to hcl/ and update path references in check.sh, diff.sh, gen-sql.sh, gen-golden.sh, codegen/gen_migration.py, the CI workflow, and the docs. No schema change.

…bility Adopt chschema #68/#69/#70 (hclexp sha-1871283): the layer loader now resolves file()/heredoc query refs and normalizes them to one canonical form, so a query authored as file("sql/<object>.sql") compares equal to its golden. Move all 34 long inline view/materialized_view `query = "..."` one-liners (some 1800+ chars) out of the layer files into beautified `<layer>/sql/<object>.sql` fragments, referenced via query = file("sql/<object>.sql"). Regenerate golden/ (now readable heredocs) and sql/ (beautified CREATE bodies). Bump the pinned image in bin/hclexp and the CI workflow. No schema change — check.sh is green (every node diffs clean against its golden).

assign-reviewers-posthog Bot assigned orian Jun 25, 2026

graphite-app Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated

greptile-apps Bot reviewed Jun 25, 2026

View reviewed changes

orian added 2 commits June 25, 2026 10:47

fix(clickhouse): generator hclexp path + parse engine from load stderr

f70dd7d

orian force-pushed the pawel/feat/ops-hcl-migration-codegen branch from ec7c645 to f70dd7d Compare June 25, 2026 08:53

hex-security-app Bot reviewed Jun 25, 2026

View reviewed changes

orian changed the title ~~feat(clickhouse): OPS HCL → migration codegen~~ feat(clickhouse): OPS HCL per-node composition + migration codegen Jun 25, 2026

orian added 2 commits June 25, 2026 11:47

hex-security-app Bot reviewed Jun 25, 2026

View reviewed changes

orian added 7 commits June 25, 2026 12:28

docs(clickhouse): rewrite HCL README for composition model + add make…

4a4f35f

…-a-change runbook

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(clickhouse): OPS HCL per-node composition + migration codegen#65969

feat(clickhouse): OPS HCL per-node composition + migration codegen#65969
orian wants to merge 12 commits into
masterfrom
pawel/feat/ops-hcl-migration-codegen

orian commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tests-posthog Bot commented Jun 25, 2026

Uh oh!

hex-security-app Bot Jun 25, 2026 •

edited

Loading

Uh oh!

hex-security-app Bot Jun 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

orian commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Known follow-ups

How did you test this code?

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tests-posthog Bot commented Jun 25, 2026

Query snapshots: Backend query snapshots updated

Uh oh!

hex-security-app Bot Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hex-security-app Bot Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

orian commented Jun 25, 2026 •

edited

Loading

hex-security-app Bot Jun 25, 2026 •

edited

Loading

hex-security-app Bot Jun 25, 2026 •

edited

Loading