Skip to content

feat(clickhouse): OPS HCL per-node composition + migration codegen#65969

Open
orian wants to merge 12 commits into
masterfrom
pawel/feat/ops-hcl-migration-codegen
Open

feat(clickhouse): OPS HCL per-node composition + migration codegen#65969
orian wants to merge 12 commits into
masterfrom
pawel/feat/ops-hcl-migration-codegen

Conversation

@orian

@orian orian commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

The OPS ClickHouse schema is declarative HCL, but two things were missing to apply a change safely: a way to know which node roles an object lives on, and a generator that turns an HCL edit into a correctly-targeted migration.

A node's schema is fixed by two axes — environment (dev/prod-us/prod-eu) and node role (OPS/DATA/ENDPOINTS/…). An earlier attempt expressed the role axis as a flat object→roles map, but that can't model role-specific variants: e.g. the custom_metrics aggregator on a DATA node unions an extra custom_metrics_events_recent_lag that the OPS node lacks. So role becomes a second composition axis instead.

Changes

Per-node composition. A node's schema = compose(its layers):

  • shared/ — objects identical on every role (query_log_archive path + the invariant custom_metrics_* sub-views).
  • roles/ops/, roles/ops-prod/ — OPS-exclusive objects.
  • env/*/ — env overlays.
  • nodes — the composition manifest: (env, role) -> ordered layer list. The single source of truth for placement (replaces the object→roles map).

Verification. check.sh / diff.sh are now manifest-driven, iterating every (env, role). Added per-role goldens (the shared managed subset) for data/aux/ai_events/sessions; endpoints/local are validate-only (no captured host).

Codegen (codegen/gen_migration.py). Diffs each (env, role) node (committed vs working) and derives the targeting:

  • node_roles = the roles whose composition surfaced the statement — shared objects → all roles, OPS-only → [OPS]. Derived from composition, no side-table.
  • is_alter_on_replicated_table / sharded = from the engine kind.
  • env-specific statements are flagged for settings.CLOUD_DEPLOYMENT gating.

Known follow-ups

  • Auto-emit CLOUD_DEPLOYMENT branches for env-specific statements (currently a # NOTE comment).
  • Auto-place output as the next numbered migration + bump max_migration.txt (currently stdout).
  • Fully model the DATA node (events_recent + the DATA-specific custom_metrics aggregator); until then events_recent derives as OPS-only.

How did you test this code?

  • check.sh green across all (env, role) compositions (OPS per env; data/aux/ai_events/sessions diffing their goldens; endpoints/local validate-only).
  • Ran the generator on a change touching sharded_query_log_archive (OPS-only) and query_log_archive (shared): emitted two correctly-targeted operations — [OPS] with is_alter_on_replicated_table=True, and ALL_ROLES with both flags False — derived purely from composition.

Generate a ClickHouse migration from a declarative-HCL change. The HCL stays the
source of truth for the schema; this produces the run_sql_with_exceptions(...)
operations the existing infi.clickhouse_orm runner executes.

- topology.py: explicit, reviewed map of each OPS object -> (node_roles,
  replicated, sharded). node_roles is a deliberate choice (the dump
  hostClusterRole vocabulary differs from NodeRole and migrations target a
  curated subset, so it can't be mechanically derived); seeded by introspecting
  ../clickhouse-schema and reconciled against 0273/0274.
- gen_migration.py: runs ops/diff.sh, maps each DDL statement to its targeting,
  derives sharded / is_alter_on_replicated_table from the engine kind, and emits
  the operations. One HCL change can span multiple roles. Errors on objects
  missing from topology.py; flags UNSAFE recreations.
Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated
@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Reviews (1): Last reviewed commit: "feat(clickhouse): OPS HCL → migration co..." | Re-trigger Greptile

Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated
Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated
Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated
Comment thread posthog/clickhouse/hcl/ops/codegen/gen_migration.py Outdated
@tests-posthog

tests-posthog Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

orian added 2 commits June 25, 2026 10:47
Express object placement via composition instead of a side-table (option 2),
replacing codegen/topology.py. A node's schema = compose(its layers); node_roles
and per-env gating are derived from which compositions surface a change.

- Split base/+prod/ into shared/ (all-role objects), roles/ops + roles/ops-prod
  (OPS-only), keeping env/ overlays. Needed because role-specific variants exist
  (the DATA custom_metrics aggregator unions an extra metric the OPS node lacks),
  which a flat object->roles map cannot express.
- Add the `nodes` composition manifest: (env, role) -> ordered layer list.
- check.sh / diff.sh are now manifest-driven, iterating every (env, role); add
  per-role goldens (shared managed subset) for data/aux/ai_events/sessions.
- gen_migration.py diffs per (env, role) node and derives node_roles (union of
  surfacing roles), sharded / is_alter_on_replicated_table (engine), and flags
  env-specific statements for CLOUD_DEPLOYMENT gating. topology.py removed.

check.sh green across all (env, role) compositions.
@orian orian force-pushed the pawel/feat/ops-hcl-migration-codegen branch from ec7c645 to f70dd7d Compare June 25, 2026 08:53
Comment thread posthog/clickhouse/hcl/ops/nodes Outdated
# The OPS node is modeled per env (it carries the env differences). The other roles
# carry only the shared managed subset today, so one representative env (prod-us) is
# enough to put them in the role set — shared objects are env-identical. Fully modeling
# the DATA node (events_recent + the DATA-specific custom_metrics aggregator +

@hex-security-app hex-security-app Bot Jun 25, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Generated migrations misroute shared query_log_archive changes to OPS nodes

The new generator derives node_roles from the composition manifest, but the manifest currently models non-OPS roles as shared only and explicitly says events_recent is temporarily treated as OPS-only. That means shared query_log_archive objects like query_log_archive/writable_query_log_archive are now generated with [DATA, ENDPOINTS, AUX, AI_EVENTS, SESSIONS, OPS] instead of the narrower role sets used by existing migrations (for example writable_query_log_archive is local-only in 0276, and historical write/read path migrations target specific roles like DATA or ENDPOINTS). Because this script emits ready-to-commit migrations, a developer changing the OPS HCL can now auto-generate DDL that runs on extra clusters and overwrites infra-managed distributed tables, creating an integrity/availability incident in production ClickHouse.

Prompt To Fix With AI
Do not derive migration `node_roles` solely from the current per-node composition manifest while that manifest is intentionally incomplete for DATA/endpoints/shared objects. Restore an explicit authoritative mapping for migration targeting, or extend the manifest/model so it fully captures which roles should receive each managed object and preserves the existing special cases (for example writable/read query_log_archive objects that are cloud-infra managed or only exist on certain clusters). Add a guard/test comparing generated role targeting for known OPS objects against existing hand-written migrations like 0273/0276 to prevent widening execution scope.

Severity: medium | Confidence: 87% | React with 👍 if useful or 👎 if not

Fixed in a4128f52.

@orian orian changed the title feat(clickhouse): OPS HCL → migration codegen feat(clickhouse): OPS HCL per-node composition + migration codegen Jun 25, 2026
orian added 2 commits June 25, 2026 11:47
Focus the managed roles on OPS and LOGS; comment out the other shared-only
roles in the manifest. node_roles for shared objects now derives to [LOGS, OPS]
(uncomment a role + regenerate its golden to widen).

- Add LOGS to the manifest for all three cloud envs (dev, prod-us, prod-eu) with
  per-env goldens (the shared managed subset extracted from each env's logs host).
- Remove the data/aux/ai_events/sessions goldens (their manifest lines are
  commented out).
- Generator ROLE_ORDER includes logs.

check.sh green for OPS (local + 3 cloud envs) and LOGS (3 cloud envs).
Author the LOGS role's full schema (logs/traces/metrics ingestion) from the
clickhouse-schema *-logs.hcl dumps, composed per cloud env:

- roles/logs/ — the 16 objects identical across dev/prod-us/prod-eu.
- env-logs/<env>/ — per-env objects (kafka configs, zoo_paths, distributed
  cluster names that differ; the traces subsystem on prod-us/prod-eu only; the
  prod-us-only trace_spans_kafka_metrics path). dev's daniel_ddl_test is left
  unmanaged.
- LOGS composition = shared + roles/logs + env-logs/<env>; goldens regenerated
  as the full logs node (minus the unmanaged dev table).
- engine_map now resolves one composition per role so is_alter/sharded are
  correct for LOGS tables too.

check.sh green for OPS (local + 3 cloud) and LOGS (3 cloud).
prod-us ops shared roles/ops roles/ops-prod env/prod-us
prod-eu ops shared roles/ops roles/ops-prod env/prod-eu

dev logs shared roles/logs env-logs/dev

@hex-security-app hex-security-app Bot Jun 25, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Generator now deploys shared query_log_archive DDL onto LOGS nodes

The latest manifest change adds LOGS to the managed set and gen_migration.py derives node_roles directly from that manifest, so any shared HCL change under shared/qla.hcl will now generate migrations targeting [NodeRole.LOGS, NodeRole.OPS]. Existing migrations show query_log_archive objects are only managed on OPS / data-path roles, while LOGS has its own separate schema. This makes the generator produce ready-to-commit migrations that create or alter query_log_archive tables/views on LOGS clusters where they do not belong, which can break the logs cluster or overwrite unrelated objects there.

Prompt To Fix With AI
Do not infer that every role composing `shared/` should receive every generated migration operation. Separate shared schema fragments by target domain, or add an explicit per-object/per-layer migration-target mapping so query_log_archive and custom_metrics changes are not emitted for LOGS unless there is verified migration history and runtime ownership for those objects on LOGS. Add a regression test that changing `shared/qla.hcl` does not generate `NodeRole.LOGS` operations.

Severity: medium | Confidence: 92% | React with 👍 if useful or 👎 if not

Fixed in 7afcae27, 6462b429.

orian added 7 commits June 25, 2026 12:28
ops/gen-sql.sh emits sql/<env>-<role>.sql for every node in the manifest — the
full, dependency-ordered CREATE schema (hclexp diff vs an empty schema). E.g.
apply sql/local-ops.sql to a local ClickHouse to create the OPS schema.

check.sh now verifies the committed sql/ is in sync with the HCL (regenerates to
a temp dir and diffs), so it can't go stale.
…ded)

The bin/hclexp wrapper already falls back to the pinned ghcr.io image, but the
scripts pass committed-tree / scratch paths under $TMPDIR that weren't mounted,
so the docker path broke. Mount the temp dir alongside the repo so diff.sh,
gen-sql.sh, check.sh, and the generator all work via Docker with zero setup.

Docs now recommend Docker (just have it running) and show the explicit
`docker run … ghcr.io/posthog/chschema` command; HCLEXP_BIN is documented as an
optional local-binary speedup.
Refreshing a golden is now `ops/gen-golden.sh` (hclexp load per node from the
manifest; optional env/role filter) — no hand-typed layer lists. Goldens are
regenerated to the resolved-composition form (the desired schema), so check.sh's
diff now catches a stale golden (a layer edited without regenerating). Reality
fidelity (HCL vs the live cluster) is the post-deploy introspection's job.

Docs + check.sh header updated to match.
--auto writes the operations to the next posthog/clickhouse/migrations/NNNN_<name>.py
(number derived from max_migration.txt) and bumps max_migration.txt. Without it,
output still goes to stdout/--out. Docs updated.

Also drop a stray test_table that a prior gen-golden run had captured into the
logs goldens, restoring a clean/consistent baseline.
The tooling covers OPS and LOGS (and is general per-node), so the `ops/` nesting
was misleading. Move hcl/ops/* up to hcl/ and update path references in check.sh,
diff.sh, gen-sql.sh, gen-golden.sh, codegen/gen_migration.py, the CI workflow, and
the docs. No schema change.
…bility

Adopt chschema #68/#69/#70 (hclexp sha-1871283): the layer loader now resolves
file()/heredoc query refs and normalizes them to one canonical form, so a query
authored as file("sql/<object>.sql") compares equal to its golden.

Move all 34 long inline view/materialized_view `query = "..."` one-liners (some
1800+ chars) out of the layer files into beautified `<layer>/sql/<object>.sql`
fragments, referenced via query = file("sql/<object>.sql"). Regenerate golden/
(now readable heredocs) and sql/ (beautified CREATE bodies). Bump the pinned
image in bin/hclexp and the CI workflow. No schema change — check.sh is green
(every node diffs clean against its golden).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant