Replace the single ClickHouse user with a least-privilege user model #6

Open
toffer wants to merge 13 commits into main from toffermann/clickhouse-least-privilege-user-model
toffer commented Jun 26, 2026

Collaborator

What this does

Replaces the single, broadly-privileged otel ClickHouse user with a least-privilege model — one user per access path — enforced declaratively via ClickHouse config-level grants in the ClickHouseInstallation. The stock default superuser stays removed, and each component authenticates as a user scoped to exactly what it does.

Users

| User | Role |
|---|---|
| schema_owner | Runs the schema-migration Job (DDL, telemetry/system-log TTLs, SYSTEM FLUSH LOGS) and is the materialized-view DEFINER. |
| otel | The OTel collector's ingest identity. Can be tightened to INSERT-only on the source tables via clickhouse.otel.restrictGrants. |
| llm_worker | The llm-worker Deployment; reads from and appends to the LLM job queue only. |
| monte_carlo | Monte Carlo's read-only data-source identity and the agent's queue producer: reads everything, writes only the queue + eval scores, never telemetry. |
| readonly_user | Optional SELECT-only user (readonly=2) for human / MCP / JDBC clients (e.g. DataGrip). |
| admin | Optional, gated break-glass superuser, reachable only over loopback (pod-exec) by default. |

A shared "reader bundle" — SELECT on the telemetry DB plus the system / information_schema metadata reads JDBC/MCP clients need — is defined once and shared by monte_carlo and readonly_user.

How it's enforced and verified

  • Privileges come from config-level <grants> in the CHI (declarative, applied by the operator) — no SQL-RBAC bootstrap user.
  • The materialized views are pinned to DEFINER = schema_owner (new sql/0014 migration reparents existing installs in place), so the ingest user needs nothing on the normalized tables and the normalization cascade keeps working once otel is restricted.
  • Init/readiness probes use EXISTS TABLE on a table the pod's user is granted (privilege-robust, no reliance on system-table row filtering).
  • verify-deployment.sh now asserts the whole model on every deploy: each user's grants, the security-critical denials, and that the MVs run as schema_owner.

Key decisions

  • Config <grants>, not SQL RBAC — declarative and in-chart. A throwaway verification spike confirmed config grants actually restrict config-defined users, and that access_management cannot be combined with <grants> — so admin uses GRANT ALL … WITH GRANT OPTION, which already carries the access-management privilege group.
  • MVs pinned to a stable DEFINER — on this ClickHouse version an MV defaults to DEFINER = its creator, so pre-existing MVs are implicitly owned by otel; reparenting them is required (not just hardening) before otel can be locked down.
  • otel lock-down is flag-gated (restrictGrants, default off) so it ships inert and is enabled only after external readers have moved to monte_carlo.
  • admin defaults to loopback-only — the superuser is never on a network path; using it requires pod access.
  • Append-only model confirmed — no caller mutates rows, so only the schema Job needs anything beyond SELECT/INSERT.

Notes

  • Each password-backed user has its own ExternalSecret; schema_owner / llm_worker / monte_carlo are always provisioned, and the README install walkthroughs wire all of them.
  • Public platform docs need a matching update (tracked separately), notably switching the documented agent connection to monte_carlo once that cutover happens.

🤖 Generated with Claude Code


  • Ran /code-review

toffer and others added 10 commits June 26, 2026 12:59
Introduce the schema_owner identity that owns the schema and serves as
the materialized-view DEFINER, and move the migration Job onto it.

- CHI: add schema_owner with config-level grants/query (CREATE DATABASE;
  full DDL/SELECT/INSERT on otel_traces.*; ALTER on system.*; SYSTEM
  FLUSH LOGS) and a configurable networksIp. Deliberately no access
  management.
- sql/0014_reparent_mv_definers.sql (new): ALTER ... MODIFY SQL SECURITY
  DEFINER DEFINER = schema_owner for the three MVs. In-place (no
  rebuild/gap); existing installs carry an implicit DEFINER=otel that
  would break once otel is tightened. New file — the create migrations
  are immutable IF NOT EXISTS (no-op on upgrade).
- schema-job: authenticate as schema_owner (was otel).
- values: add clickhouse.schemaOwner block (secret, networksIp,
  externalSecret).
- _helpers: factor ExternalSecret rendering into
  ao-data-platform.externalSecret (external-secret.yaml now ranges over
  a user list); add the shared ao-data-platform.readerGrants bundle for
  monte_carlo/readonly_user (later phases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Give the llm-worker its own least-privilege identity instead of sharing
otel.

- CHI: add llm_worker with config grants/query — SELECT on
  otel_traces.llm_{batches,inputs,results}, INSERT on
  otel_traces.llm_{batches,results}. Append-only, matching the worker's
  actual access pattern; no DDL/mutations.
- llm-worker Deployment: authenticate as llm_worker (CH_USER + both
  secret refs), and rewrite the init readiness probe from a
  system.databases SELECT to `EXISTS TABLE otel_traces.llm_batches`
  (privilege-robust under tightened grants; the worker holds a grant on
  that table).
- values: add clickhouse.llmWorkerUser block; external-secret: render
  its ExternalSecret.
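
A minimal sketch of what an `EXISTS TABLE` readiness wait could look like in bash; this is not the chart's actual init container. `wait_for_table`, `CH_CLIENT`, `CH_HOST`, the retry count, and the delay are hypothetical names and values. `CH_USER` mirrors the env var mentioned above, and the password is assumed to be supplied through `CLICKHOUSE_PASSWORD`.

```bash
#!/usr/bin/env bash
# Sketch only: helper and variable names are illustrative, not taken from the chart.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"   # hypothetical hook to swap the client binary

# wait_for_table TABLE [ATTEMPTS] [DELAY_SECONDS]
# Polls until EXISTS TABLE prints 1. The password is expected in CLICKHOUSE_PASSWORD.
wait_for_table() {
  local table="$1" attempts="${2:-30}" delay="${3:-2}"
  local i=0 result
  while [ "$i" -lt "$attempts" ]; do
    # EXISTS TABLE prints 1 when the table exists and 0 otherwise.
    result="$("$CH_CLIENT" --host "${CH_HOST:-localhost}" --user "${CH_USER:-llm_worker}" \
      --query "EXISTS TABLE $table" 2>/dev/null)" || result=""
    if [ "$result" = "1" ]; then
      echo "ready: $table"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $table" >&2
  return 1
}
```

Under these assumptions the worker would call `wait_for_table otel_traces.llm_batches`. The probe only touches a table the user already holds a grant on, so it does not depend on what the user can see in system tables.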

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Provision the Monte Carlo identity: the read-only data-source service
account AND the AO agent's LLM-queue producer (the agent's credential
itself is set in the consuming application, not here).

- CHI: add monte_carlo. grants/query = the shared reader bundle
  (ao-data-platform.readerGrants: SELECT on otel_traces.*,
  system.tables/parts/query_log, information_schema.*) plus INSERT on
  llm_inputs/llm_batches/conversation_eval_scores. Deliberately NO
  readonly profile: readonly=2 blocks INSERT, so grants alone enforce
  the read-mostly shape.
- values: add clickhouse.monteCarloUser block; external-secret: render
  its ExternalSecret.

The reader bundle is shared with readonly_user (Phase 6) via the helper
so the metadata-read set has a single source of truth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reduce the always-running collector identity to least privilege, gated
so it can ship dark (an earlier rollout step) and be flipped on later (a
later rollout step).

- CHI: when clickhouse.otel.restrictGrants=true, otel gets grants/query
  = INSERT on otel_traces.otel_traces (+ otel_metrics) only. Default
  false leaves otel with implicit full access (backward-compatible). The
  MV cascade runs as schema_owner (the MV DEFINER), so otel needs
  nothing on the normalized targets.
- Collector init probe: SELECT-on-system.databases -> EXISTS TABLE
  otel_traces.otel_traces (privilege-robust; otel holds a grant on that
  table).
- values: add clickhouse.otel.restrictGrants (default false).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make readonly_user an explicit, self-sufficient reader and stop the
verify script depending on the removed default user.

- CHI: readonly_user/grants/query = the shared reader bundle
  (ao-data-platform.readerGrants, the same as monte_carlo's reads).
  Required because going to explicit config grants removes the implicit
  all-DB read a config user otherwise has, which DataGrip/MCP
  introspection relies on. readonly=2 profile kept (governs SET, not
  table access). networks/ip now configurable via
  readonlyUser.networksIp.
- hack/verify-deployment.sh: route the 9 in-pod read queries through an
  explicit reader credential (prefer readonly_user, fall back to the
  always-present monte_carlo) instead of the bare/default user; keep the
  2 otel auth checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an optional superuser escape hatch for DBA / ops tasks the operator
never performs (SYSTEM, BACKUP, ad-hoc DDL) and which no other user can
do now that the default superuser is removed.

- CHI (gated on adminUser.enabled, default off): admin with
  access_management=1 and GRANT ALL ON *.* WITH GRANT OPTION;
  configurable networks/ip (intended to be limited to trusted networks
  per environment). Not for service-to-service use.
- values: add clickhouse.adminUser block (enabled=false).
- external-secret: render admin's ExternalSecret only when enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README: add a "ClickHouse user model" section (per-user privilege
  table, the shared reader bundle, and the readonly_user/admin gates);
  extend the config table with the new per-user knobs
  (schemaOwner/llmWorkerUser/monteCarloUser/adminUser +
  otel.restrictGrants + readonlyUser.networksIp); update the AWS-prereq
  secrets line for the additional users.
- values: correct the stale otelNetworksIp comment: the deploying
  environment does not override it with a VPC CIDR; reachability is
  enforced at the load balancer and per-caller CH-level scoping is a
  separate change.

Per-user networksIp knobs (default open) were added with each user in
the prior phases. Cross-repo rollout handoff is tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Functional verification on k3d surfaced a ClickHouse startup crash
(CANNOT_LOAD_CONFIG): "Any other access control settings can't be
specified with `grants`" while parsing user 'admin'. ClickHouse rejects
combining the legacy <access_management> setting with config-level
<grants>.

GRANT ALL ON *.* WITH GRANT OPTION already includes the ACCESS
MANAGEMENT privilege group (verified live: admin can CREATE USER / GRANT
/ run SYSTEM), so the setting was redundant. Removed it. (profile +
grants remains valid — that's why readonly_user is unaffected.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chart always provisions otel, schema_owner, llm_worker, and
monte_carlo, and each requires its own ExternalSecret (secretStoreRef.name
and remoteRef.key). The getting-started walkthroughs only wired otel, so
following them verbatim hit a required-value failure at template time.

- Local Development: seed schema_owner/llm_worker/monte_carlo keys in the
  Fake ClusterSecretStore and set their externalSecret.* in the install
  command.
- Deploying to AWS EKS: show an ExternalSecret for each always-provisioned
  user; note readonly_user/admin are off by default and wired the same way.
- Note that verify-deployment.sh runs its ClickHouse data checks as
  readonly_user, so it needs readonlyUser.enabled=true.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a post-deploy check that each ClickHouse user authenticates and holds
the expected grants, and that the security-critical denials are enforced:

- schema_owner holds DDL; llm_worker reads/writes only the queue and is
  denied telemetry reads; monte_carlo reads telemetry and produces to the
  queue but cannot write telemetry; readonly_user is SELECT-only.
- Reports whether otel has been tightened to INSERT-only — its grant shape
  legitimately varies before/after the cutover, so it is not a hard
  assertion.
- Confirms the materialized views run under schema_owner as DEFINER.
- admin is checked only when it is enabled.
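
One way the DEFINER confirmation could be made tolerant of identifier quoting and letter case, as a sketch. `mv_runs_as` is a hypothetical helper, not the script's own function; its first argument is assumed to be the text of a materialized view's `SHOW CREATE TABLE` output, and the user name is assumed to contain no regex metacharacters.

```bash
#!/usr/bin/env bash
# mv_runs_as CREATE_STATEMENT USER   (hypothetical helper)
# Succeeds if the statement names USER as the DEFINER, with or without
# backtick/double-quote identifier quoting, in any letter case.
mv_runs_as() {
  local ddl="$1" user="$2" pattern
  # The trailing group keeps "schema_owner" from matching "schema_owner2".
  pattern='DEFINER[[:space:]]*=[[:space:]]*[`"]?'"$user"'[`"]?([^[:alnum:]_]|$)'
  printf '%s\n' "$ddl" | grep -qiE -- "$pattern"
}
```

The pattern requires an `=` after `DEFINER`, so the `SQL SECURITY DEFINER` clause on its own does not count as a match.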

Queries run as each user over the pod's local port via kubectl exec, so the
checks hold while each user's networks/ip include loopback (the default);
they must move to the Service endpoint once per-caller network scoping lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
toffer requested a review from a team as a code owner June 26, 2026 22:48
…bump to 2.0.0

Code-review fixes for the ClickHouse least-privilege user model (PR #6).

verify-deployment.sh (the user-model verification section):
- Make denial assertions robust: case-insensitive ACCESS_DENIED/not-enough-priv
  grep (F1), tighten readonly_user needle to "GRANT INSERT" + add a runtime write
  denial (F2), convert monte_carlo telemetry-write denial to a runtime INSERT
  attempt (F3), add schema_owner no-access-management denials (F4).
- Assert otel restriction post-cutover via VERIFY_OTEL_RESTRICTED instead of only
  warning (F5); quoting-tolerant MV DEFINER grep (F6).
- Pass passwords via env CLICKHOUSE_PASSWORD, not --password CLI args (F10).
- Full admin needle incl. WITH GRANT OPTION (F13); per-secret missing-secret
  diagnostics (F14); extra grant/denial coverage for the queue users (F17).
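
A rough sketch of a runtime denial assertion along the lines of F1 and F10; it is not the script's actual code. The name `expect_denied` comes from the review findings, but the signature, the `CH_CLIENT` override, and the exact match pattern are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: signature and CH_CLIENT are assumptions, not the real script.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"   # hypothetical hook to swap the client binary

# expect_denied USER PASSWORD QUERY
# Passes only when the query fails AND the failure is a privilege error.
expect_denied() {
  local user="$1" password="$2" query="$3" output
  # The password travels in the environment (CLICKHOUSE_PASSWORD), not in argv.
  if output="$(CLICKHOUSE_PASSWORD="$password" "$CH_CLIENT" --user "$user" --query "$query" 2>&1)"; then
    echo "FAIL: $user was allowed to run: $query"
    return 1
  fi
  # Case-insensitive; accepts either the error name or the message wording.
  if printf '%s\n' "$output" | grep -qiE 'ACCESS_DENIED|not enough privileges'; then
    echo "ok: $user denied"
    return 0
  fi
  echo "FAIL: $user query failed for a different reason: $output"
  return 1
}
```

Separating "query succeeded" from "query failed for some other reason" keeps a connection or authentication error from being counted as a successful denial.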

Chart:
- Tighten schema_owner: enumerate ALTER on the 7 system-log tables the schema Job
  TTLs instead of ALTER ON system.*; comment why CREATE DATABASE stays global (F7).
- Rename the otel user's values keys to the cohesive nested clickhouse.otel.*
  shape, matching every other user — breaking, no aliases (F11).
- Document the readerGrants nindent contract and externalSecret required-message
  values paths (F15); cross-reference comments + aligned user ordering across the
  three user-defining files (F16); move the --- separator into the externalSecret
  helper (F18).

Docs:
- Fix the user-model table (schema_owner reads, state-dependent otel reads,
  otel_metrics runtime-created footnote) and add a 1.x -> 2.0.0 upgrade guide (F9, F12).

Versioning:
- Bump Chart.yaml to 2.0.0 — this is a breaking change to the values contract (F19).

Verified: helm lint clean, full helm template renders (6 ExternalSecrets, valid
multi-doc YAML), bash -n parses. F8 (EXISTS TABLE probe) dropped — confirmed
working in k3d.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
toffer commented Jun 27, 2026

Collaborator Author
Code Review — 19 findings (18 fixed, 1 skipped) · validated in k3d

Reviewed: 2026-06-27 UTC | Reviewers: security, devdocs, architecture, correctness, testing, coordinator | Scope: full PR

| # | Severity | Finding | Status |
|---|---|---|---|
| F1 | ISSUE | expect_denied case-sensitive ACCESS_DENIED grep, fragile across CH versions | ✅ Case-insensitive, broadened pattern |
| F2 | ISSUE | readonly_user forbid needle "INSERT" too loose; no runtime check | ✅ Tightened to GRANT INSERT + runtime write denial |
| F3 | ISSUE | monte_carlo telemetry-write denial was grant-text only | ✅ Now a runtime INSERT attempt |
| F4 | ISSUE | schema_owner "no access management" property unverified | ✅ Added access-mgmt + CREATE USER denials |
| F5 | ISSUE | otel restrict state reported, never asserted | ✅ Asserts via VERIFY_OTEL_RESTRICTED |
| F6 | ISSUE | MV DEFINER grep may miss CH identifier quoting | ✅ Quoting/case-tolerant grep (hedge) |
| F7 | ISSUE | schema_owner ALTER ON system.* + global CREATE DATABASE over-broad | ✅ Enumerated 7 system-log tables; documented why CREATE DATABASE stays global |
| F8 | ISSUE | EXISTS TABLE probe may need system.tables SELECT | ⏭️ Skipped: confirmed non-issue in k3d (works for llm_worker AND INSERT-only otel) |
| F9 | ISSUE | README user-model table misstated schema_owner/otel reads + otel_metrics | ✅ Table corrected + runtime-created footnote |
| F10 | SUGGESTION | Passwords passed as --password CLI args (process list/audit leak) | ✅ Via env CLICKHOUSE_PASSWORD |
| F11 | SUGGESTION | otel values shape inconsistent with other users | ✅ Renamed to nested clickhouse.otel.* (breaking, no aliases) |
| F12 | SUGGESTION | No upgrade guide for existing installs | ✅ Added 1.x→2.0.0 upgrade section |
| F13 | SUGGESTION | admin needle misses WITH GRANT OPTION | ✅ Fixed: split into two substring checks (first attempt false-FAILed in k3d: SHOW GRANTS interpolates TO admin) |
| F14 | SUGGESTION | Undifferentiated missing-secret message | ✅ Names the specific missing user |
| F15 | SUGGESTION | Helper nindent contract / required paths undocumented | ✅ Documented |
| F16 | SUGGESTION | User defs split across 3 files, no cross-ref | ✅ Cross-ref comments + aligned ordering |
| F17 | SUGGESTION | Queue-user grant/denial coverage gaps | ✅ Added checks |
| F18 | NIT | --- separator placement in range loop | ✅ Moved into the ExternalSecret helper |
| F19 | ISSUE | Chart.yaml not bumped for a breaking change | ✅ Bumped to 2.0.0 |

👉 Fix commits: 1afe264 · bc398a1 (k3d-found F13 correction)

k3d validation (2.0.0 chart, full 1.x→2.0.0 upgrade path): helm lint clean; ESO synced all 6 per-user secrets; default superuser removed; 0014 MV-DEFINER reparent confirmed (all 3 MVs DEFINER = schema_owner); all user-model assertions pass pre-cutover and post-cutover (restrictGrants=true + VERIFY_OTEL_RESTRICTED=true); llm-worker + INSERT-only otel collector init EXISTS TABLE probes both pass.

toffer and others added 2 commits June 27, 2026 09:04
k3d testing of the new user-model verification revealed the F13 fix false-FAILed:
the contiguous needle "GRANT ALL ON *.* WITH GRANT OPTION" never matches because
ClickHouse renders SHOW GRANTS as `GRANT ALL ON *.* TO admin WITH GRANT OPTION`
(the grantee is interpolated mid-string) and expect_grant is fixed-string (grep -qiF).

Split into two substring checks — "GRANT ALL ON *.*" and "WITH GRANT OPTION" —
which together verify both the grant-all and the delegation capability.
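
The mismatch is easy to reproduce without a cluster. The sketch below feeds the rendered grant line quoted above to a fixed-string, case-insensitive grep; `has_needle` is a hypothetical stand-in for the script's expect_grant, not its real implementation.

```bash
#!/usr/bin/env bash
# has_needle TEXT NEEDLE   (hypothetical stand-in for expect_grant)
# Fixed-string, case-insensitive substring match, as with grep -qiF.
has_needle() {
  printf '%s\n' "$1" | grep -qiF -- "$2"
}

# The rendering described above: the grantee sits between the two halves.
grants='GRANT ALL ON *.* TO admin WITH GRANT OPTION'

# Contiguous needle: "TO admin" interrupts it, so it cannot match.
if has_needle "$grants" 'GRANT ALL ON *.* WITH GRANT OPTION'; then
  echo "contiguous: match"
else
  echo "contiguous: no match"
fi

# Two independent substrings: both are present.
if has_needle "$grants" 'GRANT ALL ON *.*' && has_needle "$grants" 'WITH GRANT OPTION'; then
  echo "split: match"
else
  echo "split: no match"
fi
```

This prints `contiguous: no match` followed by `split: match`.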

Validated end-to-end in k3d against the 2.0.0 chart (1.x->2.0.0 upgrade path):
all user-model checks pass pre-cutover and post-cutover (restrictGrants=true,
VERIFY_OTEL_RESTRICTED=true), MV DEFINER reparent confirmed, default user removed,
and EXISTS TABLE init probes work for both llm_worker and the INSERT-only otel
collector (confirming F8 is a non-issue).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drops the inconsistent decorative `...User` suffix so every per-user values block
key matches its actual ClickHouse SQL username — self-documenting and removes the
naming judgment call (schemaOwner and monteCarloUser were the same kind of thing
named two different ways).

  clickhouse.llmWorkerUser.*  -> clickhouse.llmWorker.*
  clickhouse.monteCarloUser.* -> clickhouse.monteCarlo.*
  clickhouse.adminUser.*      -> clickhouse.admin.*

otel / schemaOwner / readonlyUser are unchanged — each already matches its SQL
username (otel / schema_owner / readonly_user). Keying off the username is the
invariant; renaming the readonly_user SQL user to drop the asymmetry would touch
grants + the readonly profile for no real gain.

Value-key paths only — SQL usernames and K8s Secret names are unchanged, so
verify-deployment.sh is unaffected. Updated all chart templates, the README
config table + walkthrough, and the 2.0.0 upgrade rename table. Re-validated in
k3d: clean deploy + all user-model checks pass (0 failures).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>