- Status: Accepted
- Date: 2026-04-30
- Deciders: OpenFoundry platform architecture group
- Supersedes: the "Trino as edge BI only" leg of ADR-0009. The internal Flight SQL P2P posture from ADR-0009 is retained.
- Related work:
services/sql-bi-gateway-service/(Apache Arrow Flight SQL gateway, port50133Flight SQL gRPC +50134HTTP side router)services/sql-warehousing-service/(DataFusion compute pool, Flight SQL P2P, port50123)libs/query-engine/(DataFusion +FlightSqlTableProvider)libs/auth-middleware/(JWT decoding + per-tenant quota policy)infra/k8s/platform/manifests/lakekeeper/— lakehouse catalog deployment addressed via per-statement routing- Deletions:
infra/k8s/platform/manifests/trino/,infra/runbooks/trino.md
ADR-0009 removed Trino from the internal SQL fabric (service-to-service SQL travels over Apache Arrow Flight SQL P2P) but kept Trino as the "edge BI only" gateway for external Tableau / Superset / JDBC clients. Six months in, that residual Trino footprint is still expensive:
- A second SQL dialect to support across docs, tests and BI dashboards (Trino SQL alongside DataFusion / Vespa / Postgres dialects).
- A second auth / quota / audit surface. Tenant policies that we
already implement in
libs/auth-middleware(TenantContext,TenantQuotaPolicy) had to be re-encoded inside Trino'saccess-controlplugin chain. Drift between the two surfaces is a recurring source of incidents. - A second deployment to operate (Trino coordinator + workers,
secret rotation, Iceberg catalog properties, Helm values, PDBs) — roughly
10 % of
infra/k8s/purely for an external-facing read path. - The
sql-bi-gateway-servicewas a stub (fn main() {}) that only existed to delegate to Trino. It already had partial scaffolding for saved queries, audit, JWT auth and migrations — all wasted as long as the actual edge surface was Trino.
Two solutions were considered (see problem statement that opened this
ADR): a thin proxy in sql-bi-gateway-service that still forwards to
Trino (Solution A), or a full Apache Arrow Flight SQL server inside the
gateway that subsumes Trino's edge role (Solution B). Solution B is the
"platform-grade" choice: federation is preserved at the external
boundary (Tableau still sees a single catalog tree) while collapsing
the auth / quotas / audit / saved-queries plane onto a single Rust
service we already own.
-
Implement
sql-bi-gateway-serviceas a real Apache Arrow Flight SQL server (arrow-flight::sql::server::FlightSqlService) on port50133, backed by DataFusion (query-engine::QueryContext). The service implements the Flight SQL verbs that Tableau / Superset and the Apache Arrow Flight SQL JDBC driver call during normal operation:do_handshake,get_flight_info_catalogs,get_flight_info_schemas,get_flight_info_tables,get_flight_info_statement,get_flight_info_prepared_statement,do_get_statementanddo_put_statement_update. -
Per-statement backend routing. Every SQL statement is classified by the catalog prefix of its first table reference and routed to:
Catalog prefix Backend (per ADR-0009 table) iceberg.*/ no prefixlocal DataFusion SessionContext, optionally delegated tosql-warehousing-servicetrino.*Trino via TRINO_FLIGHT_SQL_URLvespa.*Vespa via VESPA_FLIGHT_SQL_URLpostgres.*/postgresql.*Postgres via POSTGRES_FLIGHT_SQL_URLBackends that are not configured cause the gateway to reject the statement at planning time with
failed_preconditionand a clear error message — no silent re-routing. -
Auth, quotas and audit on the Flight SQL surface. JWTs are extracted from the gRPC
authorization: Bearer <token>metadata and decoded withauth_middleware::jwt. The resultingTenantContextclamps the row count viaclamp_query_limit. Every executed statement emits a structuredtracingevent under thesql_bi_gateway.audittarget containing the tenant id, tier, user email, backend, remote-vs-local, SQL fingerprint, row count, duration and outcome — never the SQL payload itself. -
Saved queries on the HTTP side router. A small
axumserver on port50134exposes/healthz(Kubernetes probes) and thePOST/GET/DELETE /api/v1/queries/savedCRUD against the per-bounded-context Postgres cluster (migration20260419100003_initial_queries.sql). -
Catalog tree advertised to BI clients. A single catalog
openfoundrywith one schema per backend (iceberg,trino,vespa,postgres) is returned viaGetCatalogs/GetSchemas/GetTablesso Tableau / Superset render a sensible navigator panel even before any backend metadata is wired in. -
Trino is deleted entirely. The following are removed in this ADR's accompanying change-set:
infra/k8s/platform/manifests/trino/(coordinator, PDB, catalog ConfigMaps, values)infra/runbooks/trino.md- All Trino references in
infra/k8s/platform/manifests/flink/maintenance/rewrite-data-files.yaml,infra/k8s/platform/observability/prometheus-rules/lakekeeper.yaml,docs/architecture/runtime-topology.md,docs/architecture/services-and-ports.md,docs/architecture/adr/ADR-0012-data-plane-slos.md,ROADMAP.md.
-
Tableau / Superset use the Apache Arrow Flight SQL JDBC driver (Apache-2.0) pointed at
sql-bi-gateway-service:50133. Connection string template:jdbc:arrow-flight-sql://<host>:50133?useEncryption=falseplus a JWT in the
authorizationheader for production. No client changes are required vs. the Trino driver beyond the URL and the driver JAR.
- Single edge surface, single policy plane. Tenant scoping,
quotas, JWT validation and audit are owned by the same Rust crate
that owns the data-plane bounded contexts; drift between Trino's
access-control plugin and
auth-middlewaredisappears. - Smaller infra footprint. Trino's coordinator + workers (~1 vCPU / 2 GiB minimum, plus PDBs and Helm values) are gone; the gateway is a single Rust binary co-located with its CNPG cluster.
- Single SQL dialect for new work. Internal services already speak DataFusion SQL (per ADR-0009); BI now also reaches the same dialect on the local side, with routed engines (Trino, Vespa, Postgres) only relevant when explicitly using those catalog prefixes.
- Observable, auditable end-to-end. The Flight SQL surface emits one structured audit event per statement, including a deterministic SQL fingerprint that is safe to log.
- Roadmap clear. Column masking, row-level security and a query cache can be implemented at one well-defined plane in Rust.
- We re-implement small pieces of Trino. The Flight SQL
GetCatalogs/GetSchemas/GetTablesplumbing, the routing-by-prefix planner and the row-quota enforcement are all things Trino does well. We accept this scope explicitly: the goal is to own the ~5 verbs Tableau actually uses, not to rebuild a generic federated query engine. Anything more ambitious (full cross-backend joins, cost-based planning) stays inside DataFusion, not the gateway. - Cross-backend joins are limited to the Iceberg/local DataFusion
side: a
JOINbetween, say,trino.of_lineage.runsandvespa.documentsis not supported. Callers that need such joins must materialise into Iceberg viasql-warehousing-servicefirst. The router selects a single backend per statement (the catalog of the firstFROM/JOIN/UPDATEtarget). do_put_statement_updatereturns0for affected-row count. DataFusion does not surface a portable affected-row count; Flight SQL clients treat0as "unknown". This matches Trino JDBC behaviour for non-DML and is acceptable for BI workloads.
- Deploy
sql-bi-gateway-service:50133alongside the existing Trino deployment. - Re-point Tableau / Superset workbooks to the Arrow Flight SQL JDBC driver and the new URL; verify the catalog navigator and the top-N production dashboards.
- Delete the Trino Helm release and the
infra/k8s/platform/manifests/trino/manifests once the cut-over has soaked for one error-budget window (cf. ADR-0012). - Remove Trino references from runbooks and docs (done in this PR).
- Solution A — keep
sql-bi-gateway-serviceas a thin proxy that forwards to Trino. Would have preserved the zero-engineering posture but kept all four costs from §Context (dialect, auth, ops, stub gateway). Rejected. - Promote
sql-warehousing-serviceto also be the edge BI surface. Would conflate the internal compute pool (large analytical workloads, no per-tenant quotas) with the BI edge (small interactive queries, strict per-tenant quotas, audit). The two have different scaling characteristics and different SLOs (cf. ADR-0012); kept separate.