Skip to content

feat(datahub-upgrade): reconcile configProperties on existing kafka topics#17778

Merged
lakshay-nasa merged 14 commits into
datahub-project:masterfrom
vejeta:retention-fix
Jun 23, 2026
Merged

feat(datahub-upgrade): reconcile configProperties on existing kafka topics#17778
lakshay-nasa merged 14 commits into
datahub-project:masterfrom
vejeta:retention-fix

Conversation

@vejeta

@vejeta vejeta commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Symptom

GMS pods enter a crash loop with /health returning 503, blocked on bootstrap step 1/12 WaitForSystemUpdateStep. That step waits for a "system update completed" marker on DataHubUpgradeHistory_v1. When the topic runs on broker-default retention.ms (commonly 7 days: e.g. because it was auto-created or pre-existed before being declared in application.yaml causing the marker aging out and GMS never becomes Ready.

GMS consumer log when this happens:

Fetch position offset=63 ... is out of range for partition DataHubUpgradeHistory_v1-0, resetting offset.

Fix

Add opt-in KAFKA_SETUP_RECONCILE_EXISTING_TOPIC_CONFIGS (default false).
So we keep the default behaviour (Useful for deployments with restricted Kafka ALTER_CONFIGS permissions.)

When enabled, CreateKafkaTopicsStep reconciles with application.yml, declared configProperties on topics that already exist via incrementalAlterConfigs.

This is how the behaviour will work when true:

  • Additive: keys absent from application.yaml are never touched.
  • Idempotent: describeConfigs is called first; topics whose declared keys already match the broker are skipped, and no AlterConfigs request is issued in the all-match case.

Today CreateKafkaTopicsStep only sets configs at creation time, so this drift (notably retention.ms on DataHubUpgradeHistory_v1) is never corrected on subsequent runs.

New Log format

Per-topic lines use diff prefixes; both emit AlterConfigOp with OpType.SET:

  • ~ key: old -> new — broker returned a value for the key and it differs from the declared one (common case).
  • + key=new — key was absent from describeConfigs (rare for standard topic configs, since Kafka surfaces inherited defaults).

At the end of this step, in the logs, the summary Successfully applied configProperty changes on N topics (X added, Y changed) makes a no-op rollout trivially distinguishable from an effective one.

Validation

Verified end-to-end on a cluster with the flag enabled — 7 topics declared configProperties, 2 already matched the broker and were skipped per-topic, incrementalAlterConfigs was issued for the remaining 5:

e.g.:

Topic MetadataChangeLog_Timeseries_v1: added=[] changed=[~ max.message.bytes: 16777216 -> 5242880, ~ retention.ms: 604800000 -> 7776000000]
Topic MetadataChangeLog_Versioned_v1: declared properties already match broker - nothing to alter
...
Successfully applied configProperty changes on 5 topics (0 added, 6 changed)

If we rerun the process against the same broker we get no-ops (all 7 took the "already match" branch).

Flag-off path also verified: describeConfigs/incrementalAlterConfigs are not called, behavior is identical to master.

@github-actions github-actions Bot added devops PR or Issue related to DataHub backend & deployment community-contribution PR or Issue raised by member(s) of DataHub Community labels Jun 5, 2026
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Bundle Report

Bundle size has no change ✅

@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.92308% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ub/upgrade/system/kafka/CreateKafkaTopicsStep.java 96.92% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@vejeta vejeta marked this pull request as ready for review June 6, 2026 03:01
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Linear: PFP-4282

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@maggiehays maggiehays added the needs-review Label for PRs that need review from a maintainer. label Jun 6, 2026
vejeta added 2 commits June 8, 2026 14:29
…opics

Add opt-in KAFKA_SETUP_RECONCILE_EXISTING_TOPIC_CONFIGS flag. When enabled,
CreateKafkaTopicsStep applies declared configProperties to topics that
already exist via incrementalAlterConfigs (additive SET — undeclared keys
are not touched). Default false to preserve current behavior on deployments
with restricted Kafka ALTER_CONFIGS permissions.

Closes the gap where broker auto-created topics keep broker-default
retention (notably DataHubUpgradeHistory_v1 which must have
retention.ms=-1 per DataHub deployment requirements).
- Hoist the single AlterConfigOp construction out of the if/else
  branches in diffConfigs; the two branches now only differ in
  which log list they populate, the SET op is shared.
- Drop the one-line TopicConfigDiff.isEmpty() helper; the single
  call site now checks diff.ops().isEmpty() directly.
@david-leifker david-leifker added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Jun 12, 2026
- Add testReconcileEmitsChangedDiffWhenBrokerValueDiffers exercising
  the live!=null && !live.equals(desired) branch of diffConfigs
  (changed.add, ops.add, totalChanged accumulator). Closes the
  4-line patch-coverage gap reported by codecov.
- Register kafka.setup.reconcileExistingTopicConfigs in
  PropertiesCollectorConfigurationTest's NON_SENSITIVE_PROPERTIES
  next to the other kafka.setup.* booleans. The flag is a plain
  feature toggle, not credential material.
@lakshay-nasa lakshay-nasa enabled auto-merge (squash) June 23, 2026 13:06
@lakshay-nasa lakshay-nasa merged commit e99984b into datahub-project:master Jun 23, 2026
68 of 71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PR or Issue raised by member(s) of DataHub Community devops PR or Issue related to DataHub backend & deployment merge-pending-ci A PR that has passed review and should be merged once CI is green. needs-review Label for PRs that need review from a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants