feat: OpenLineage integration and Quarkus management interface by MrIvv · Pull Request #696 · memiiso/debezium-server-iceberg

MrIvv · 2026-05-04T15:19:32Z

Summary

Two independent features bundled in a single PR (each in a dedicated commit) for the iceberg sink:

OpenLineage output dataset emission — emit DatasetMetadata after each successful Iceberg commit, using the standard DebeziumOpenLineageEmitter API
Quarkus management interface — enable a separate HTTP server on port 9000 for health/ready/live endpoints, isolated from the main event loop

OpenLineage integration

In IcebergTableOperator.addToTablePerSchema(), after a successful commit, emit dataset metadata (table name, schema fields, OUTPUT/DATABASE) using DebeziumOpenLineageEmitter.emit(). Falls back to NoOpLineageEmitter when OpenLineage runtime is not configured, so downstream users without OpenLineage continue to work unchanged.

Files:

pom.xml — add debezium-openlineage-api dependency
IcebergTableOperator.java — add emitOpenLineageEvent() after commit
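
The emit-after-commit behaviour described in this section can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `LineageEmitter`, `NoOpEmitter`, `DatasetInfo`, and `commitAndEmit` are hypothetical stand-ins rather than the Debezium OpenLineage API, and `java.util.logging` at `FINE` stands in for a DEBUG-level log call.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: all types here are hypothetical stand-ins, not the Debezium OpenLineage API. */
final class LineageAfterCommitSketch {
    private static final Logger LOG = Logger.getLogger("lineage-sketch");

    /** Simplified view of the metadata emitted per table (name, fields, kind, type). */
    static final class DatasetInfo {
        final String tableName;
        final List<String> fieldNames;
        final String kind; // e.g. "OUTPUT"
        final String type; // e.g. "DATABASE"

        DatasetInfo(String tableName, List<String> fieldNames, String kind, String type) {
            this.tableName = tableName;
            this.fieldNames = fieldNames;
            this.kind = kind;
            this.type = type;
        }
    }

    /** Hypothetical emitter abstraction. */
    interface LineageEmitter {
        void emit(DatasetInfo dataset);
    }

    /** Used when no lineage runtime is configured: does nothing. */
    static final class NoOpEmitter implements LineageEmitter {
        @Override
        public void emit(DatasetInfo dataset) {
            // intentionally empty
        }
    }

    private final LineageEmitter emitter;

    /** Falls back to the no-op emitter when nothing is configured (null). */
    LineageAfterCommitSketch(LineageEmitter configured) {
        this.emitter = configured != null ? configured : new NoOpEmitter();
    }

    /** Runs the commit first; lineage is emitted only if the commit did not throw. */
    void commitAndEmit(Runnable commit, DatasetInfo dataset) {
        commit.run(); // a commit failure propagates and nothing is emitted
        try {
            emitter.emit(dataset);
        } catch (RuntimeException e) {
            // Best effort: a lineage failure must never fail the write path.
            LOG.log(Level.FINE, "Lineage emission failed; commit is unaffected", e);
        }
    }
}
```

The point of the shape is that the commit and the emission have different failure semantics: a commit error surfaces to the caller, while an emitter error is swallowed and logged quietly.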

Quarkus management interface

Add application.properties with quarkus.management.enabled=true as a build-time property. This enables a separate HTTP server on port 9000 for health/ready/live endpoints, using its own thread pool independent from the main Vert.x event loop.

Why this matters in production: without this, health probes share the same event loop that gets blocked for 10-20+ seconds during Iceberg commits and GCS uploads, causing K8s liveness probes to fail and the pod to restart mid-commit.

Files:

application.properties — quarkus.management.enabled=true
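
To illustrate why a separate thread pool helps, here is a small analogy using plain `java.util.concurrent` executors. It is only a sketch: a single-thread executor stands in for the main event loop, a blocked task stands in for a long Iceberg commit, and the class and method names (`ProbeIsolationSketch`, `probeAnswers`, `demo`) are made up for this example. It does not use Vert.x or Quarkus.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Analogy only: plain executors standing in for the event loop and the management server. */
final class ProbeIsolationSketch {

    /** Returns true if a trivial health check submitted to the pool answers within the timeout. */
    static boolean probeAnswers(ExecutorService pool, long timeoutMillis) {
        Future<String> probe = pool.submit(() -> "UP");
        try {
            return "UP".equals(probe.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            probe.cancel(true);
            return false; // the probe never got a thread in time
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns {probe result on the shared pool, probe result on the dedicated pool}. */
    static boolean[] demo() {
        ExecutorService mainLoop = Executors.newSingleThreadExecutor();
        ExecutorService management = Executors.newSingleThreadExecutor();
        CountDownLatch commitStarted = new CountDownLatch(1);
        CountDownLatch commitDone = new CountDownLatch(1);
        try {
            // Simulated long commit: occupies the only "event loop" thread until released.
            mainLoop.execute(() -> {
                commitStarted.countDown();
                try {
                    commitDone.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            commitStarted.await();
            boolean shared = probeAnswers(mainLoop, 200);       // queued behind the commit
            boolean dedicated = probeAnswers(management, 2000); // has its own thread
            return new boolean[] {shared, dedicated};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            commitDone.countDown();
            mainLoop.shutdownNow();
            management.shutdownNow();
        }
    }
}
```

In this sketch the probe sent to the shared pool times out while the one sent to the dedicated pool answers, which mirrors the liveness-probe failure mode described above.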

Backward compatibility

OpenLineage emission is non-blocking: any failure is caught and logged at DEBUG. No regression for users without OpenLineage configured.
The management interface is opt-in via the build-time property. Endpoints on port 9000 don't conflict with the existing port 8080 (where the application binds by default).
No API or configuration changes for existing consumers.

…erator

Emit DatasetMetadata (table name, schema fields, OUTPUT/DATABASE) after each successful Iceberg commit in addToTablePerSchema(). Uses the standard DebeziumOpenLineageEmitter API with graceful fallback to NoOpLineageEmitter when OpenLineage runtime is not configured.

Signed-off-by: ivan.senyk <ivan.senyk94@gmail.com>

…le splitting

This commit implements the streaming snapshot flush pattern for the Iceberg sink. Combined with the parallel incremental snapshot SPI introduced in debezium/debezium#7362, it dramatically reduces commit overhead and memory pressure during snapshot of large tables.

## Streaming snapshot flush

Instead of creating a new Iceberg writer for every batch (5K-20K rows), keep a single writer open per table for the entire snapshot. The writer accumulates data across chunks and produces a single atomic commit at table completion.

Periodic file splitting kicks in when the writer reaches a calibrated row threshold, producing ~512MB Parquet files. After the first split-commit, the threshold is recalibrated from actual file size (bytes-per-row) and clamped by available heap (60% of max heap, divided by worker count, divided by an in-memory factor of ~40x for Parquet decompression).

## Components

- `IcebergSnapshotCompletionHandler`: implements the SPI from debezium-connector-common. Routes per-chunk events to the streaming writer and triggers final commit on `onTableSnapshotFinished()`.
- `BatchCommitCoordinator`: accumulates events from CDC streaming path (legacy fallback when SPI not available).
- `IcebergChangeConsumer.StreamingSnapshotContext`: per-table state holder: open writer, cached schema converter, calibrated split threshold.
- `IcebergTableOperator.writeChunkToWriter()` / `commitWriter()`: write without commit / final atomic commit + `CommitResult` for adaptive calibration.
- `IcebergTableOperator.isSafeTypeChange()`: allows compatible type evolution (timestamptz↔timestamp, decimal↔double, int↔long) for pre-existing tables with legacy schemas.
- `StructEventConverter`: cached schema converter constructor, static `fieldMappingCache` for performance.
- `EventConverter.isSnapshotEvent()`: used to skip equality-delete writes for READ ops.
- Schema evolution + identifier field protection in `IcebergTableOperator.applyFieldAddition()`: protect both new schema's and existing table's identifier fields when key schema is unavailable (e.g. `key.converter.schemas.enable=false`).

## Throughput / memory impact (production, PostgreSQL 16, 116 tables, ~128M rows)

| Metric | Before (per-batch writer) | After (streaming + adaptive split) |
|-------------------------|---------------------------|-------------------------------------|
| Iceberg writers / table | ~1,500 | 1 (with periodic file splits) |
| Iceberg commits / table | ~1,500 | ~6-10 (one per ~512MB Parquet file) |
| Throughput | ~14K rows/min | ~80-120K rows/min |
| Peak memory / worker | ~1.5 GB | ~200-300 MB |

## Build alignment

Pin `kafka-clients:4.2.0` (matches `connect-runtime:4.2.0` from `debezium-bom:3.6.0-SNAPSHOT`; the `debezium-server-bom:3.5.0.Final` would otherwise pull `kafka-clients:4.1.1` which is missing `ConfigDef$ValidList.anyNonDuplicateValues`). Pin `httpclient5:5.4.3` to avoid the 5.4.3+5.5 classpath duplication that caused HEAD-request format issues against some REST catalogs (Lakekeeper).

## Dependencies

This PR depends on debezium/debezium#7362 which introduces the `SnapshotTableCompletionHandler` SPI in `debezium-connector-common`. The CI build will fail until that PR is merged and `debezium-bom:3.6.0-SNAPSHOT` is published.

## Spinoff PRs (already extracted, mergeable independently before this one)

- memiiso#695: Support nested namespaces with dot separator
- memiiso#696: OpenLineage integration and Quarkus management interface
- memiiso#698: Snapshot READ semantics (READ as INSERT, missing __op handling)
- memiiso#699: Critical data loss fix in processTablesInParallel

When those are merged, this PR's diff will shrink to only the streaming flush changes + build alignment.

Signed-off-by: ivan.senyk <ivan.senyk94@gmail.com>
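
The threshold recalibration described in the streaming snapshot flush commit message above can be read as simple arithmetic. The sketch below is one literal reading of that sentence, not the code from that PR: the class name, method name, and the exact way the heap clamp is combined with the ~512MB target are assumptions.

```java
/** One literal reading of the recalibration rule; names and structure are assumptions. */
final class SplitThresholdSketch {
    static final long TARGET_FILE_BYTES = 512L * 1024 * 1024; // ~512MB Parquet files
    static final double HEAP_FRACTION = 0.60;                 // 60% of max heap
    static final double IN_MEMORY_FACTOR = 40.0;              // ~40x in-memory expansion

    /**
     * Returns the row count at which the writer should split next, based on the
     * size and row count of the file produced by the previous split-commit.
     */
    static long recalibrate(long committedFileBytes, long committedRows,
                            long maxHeapBytes, int workers) {
        if (committedFileBytes <= 0 || committedRows <= 0 || maxHeapBytes <= 0 || workers <= 0) {
            throw new IllegalArgumentException("all inputs must be positive");
        }
        double bytesPerRow = (double) committedFileBytes / committedRows;

        // Rows needed to reach the target file size at the observed bytes-per-row.
        long rowsForTargetFile = (long) (TARGET_FILE_BYTES / bytesPerRow);

        // On-disk bytes one worker can buffer: heap share divided by the in-memory factor.
        double onDiskBudgetBytes = maxHeapBytes * HEAP_FRACTION / workers / IN_MEMORY_FACTOR;
        long rowsForHeap = (long) (onDiskBudgetBytes / bytesPerRow);

        // Clamp the target by the heap budget, never going below one row.
        return Math.max(1L, Math.min(rowsForTargetFile, rowsForHeap));
    }
}
```

Under this reading, a first file of 256 MiB holding 1,000,000 rows, with an 8 GiB heap and 4 workers, gives a heap-bound threshold of roughly 120,000 rows, well below the roughly 2,000,000 rows the file-size target alone would allow.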

ismailsimsek · 2026-05-05T11:11:02Z

+# These properties are applied at Maven build time
+# and CANNOT be modified at runtime.


Could we add this setting to the user properties instead? Or define it in the Config class with defaults set to true and 9000? Hard-coding them like this is not a flexible way to add them.

MrIvv · 2026-05-10T14:03:38Z

Hi @ismailsimsek, switched the two quarkus.management.* properties to placeholder syntax with sensible defaults:

quarkus.management.enabled=${iceberg.management.enabled:true}
quarkus.management.port=${iceberg.management.port:9000}

Defaults stay true / 9000 so the dist build keeps the K8s-friendly behaviour it has today, but a consumer who rebuilds debezium-server-iceberg with -Diceberg.management.enabled=false (or sets the property in their own application.properties before mvn package) overrides them.

Note that quarkus.management.enabled is a Quarkus build-time fixed property, so users who consume the pre-built dist zip cannot disable it at runtime via env var or properties — they have to rebuild the debezium-server-iceberg sources with the override property to get a different value baked in. Per-build override is the most flexibility we can give without converting the sink into a Quarkus extension.
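
For readers unfamiliar with the `${name:default}` placeholder form used in the two properties above, the following sketch shows the lookup-with-fallback idea in isolation. It is a deliberately simplified resolver written for illustration, with a hypothetical class name and a plain `Map` as the only override source; it is not how Quarkus or its configuration library implements expression expansion.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration of ${name:default} resolution; not the Quarkus implementation. */
final class PlaceholderSketch {
    // Matches ${name} or ${name:default}; the name may not contain ':' or '}'.
    private static final Pattern EXPR = Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    /** Replaces each placeholder with its override, or with the inline default if none is set. */
    static String resolve(String value, Map<String, String> overrides) {
        Matcher m = EXPR.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String fallback = m.group(2); // null when no ":default" part is present
            String resolved = overrides.containsKey(name) ? overrides.get(name) : fallback;
            if (resolved == null) {
                throw new IllegalArgumentException("No value and no default for " + name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(resolved));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With no override, `${iceberg.management.port:9000}` resolves to `9000`; with an override for `iceberg.management.port`, the override wins. In the real build the override would come from something like the `-D` flag mentioned above rather than a map.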

Add application.properties with quarkus.management.enabled=true as a build-time property. This enables a separate HTTP server on port 9000 for health/ready/live endpoints, using its own thread pool independent from the main Vert.x event loop.

Without this, health probes share the same event loop that gets blocked for 10-20+ seconds during Iceberg commits and GCS uploads, causing K8s to consider the pod unhealthy and restart it.

Signed-off-by: ivan.senyk <ivan.senyk94@gmail.com>

MrIvv force-pushed the pr/openlineage-and-quarkus branch from 3f3afef to 9576770 Compare May 4, 2026 15:22

MrIvv force-pushed the pr/openlineage-and-quarkus branch from 9576770 to e685f2c Compare May 5, 2026 07:03

MrIvv mentioned this pull request May 5, 2026

feat: streaming snapshot flush with persistent writer and adaptive file splitting #693

Open

ismailsimsek reviewed May 5, 2026


MrIvv force-pushed the pr/openlineage-and-quarkus branch from e685f2c to 1b3158e Compare May 10, 2026 14:00

MrIvv force-pushed the pr/openlineage-and-quarkus branch from 1b3158e to ccf7011 Compare May 10, 2026 14:06

ismailsimsek added 2 commits May 20, 2026 12:34

minor: make OpenLineage support configurable, enabled disabled

b25d85f

fix formating

a65cdff

ismailsimsek merged commit 7d6316f into memiiso:master May 20, 2026
6 checks passed

ismailsimsek merged 4 commits into memiiso:master from MrIvv:pr/openlineage-and-quarkus
