
[SETU-2783] Raise USM catalog snapshot size limit and exporter request timeout #2495

Open
Varun PV (varunpv) wants to merge 3 commits into master from SETU-2783-master



@varunpv Varun PV (varunpv) commented May 19, 2026

Summary

  • Raises confluent.catalog.collector.max.bytes.per.snapshot from the 850 KB default to 50 MB on the kafka_controller usm_agent_telemetry block (KRaft mode hosts the catalog collector on the controller).
  • Sets confluent.telemetry.exporter._usm.client.request.timeout.ms to 60 s on both controller and broker; sets _usm.events.client.request.timeout.ms to 60 s on the controller. The 60 s value aligns the USM HTTP exporter request timeout with the AHC readTimeout default of 60 s.
  • master is KRaft-only, so no ZK-mode broker block here. The 7.9.x backport line (which still supports ZK mode) will get a separate PR that adds the broker-side usm_agent_telemetry_zk_mode block holding the events-timeout and max-bytes keys.
  • All blocks gated by the existing 'usm_agent' in groups check; non-USM deployments are unchanged.
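The summary bullets above can be pictured as the following vars fragment. This is a minimal sketch, not the actual diff: the top-level names `kafka_controller_properties` and `kafka_broker_properties` and the `enabled`/`properties` nesting are assumptions about how `roles/variables/vars/main.yml` is laid out, while the property keys and values are taken from the summary.

```yaml
# Sketch only: the surrounding dict names and nesting are assumptions,
# not copied from the PR diff.
kafka_controller_properties:
  usm_agent_telemetry:
    enabled: "{{ 'usm_agent' in groups }}"
    properties:
      # 50 MB expressed in bytes: 50 * 1024 * 1024 = 52428800
      confluent.catalog.collector.max.bytes.per.snapshot: 52428800
      # 60 s expressed in milliseconds, matching the AHC readTimeout default
      confluent.telemetry.exporter._usm.client.request.timeout.ms: 60000
      # The events-exporter timeout (_usm.events.client.request.timeout.ms in
      # the summary) is also set to 60000 on the controller. Its full key
      # prefix is not spelled out in the PR text, so it is omitted here.

kafka_broker_properties:
  usm_agent_telemetry:
    enabled: "{{ 'usm_agent' in groups }}"
    properties:
      # Brokers only get the exporter request timeout on master.
      confluent.telemetry.exporter._usm.client.request.timeout.ms: 60000
```

Because both blocks are gated on `'usm_agent' in groups`, an inventory without a `usm_agent` group would render none of these keys.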

Jira: SETU-2783. Companion CFK PR: confluentinc/confluent-operator#4312.

Test plan

  • Tests skipped on this PR per scope clarification; molecule scenarios that exercise USM will pick up the new keys when run.

🤖 Generated with Claude Code

…t timeout

Raise the catalog collector's per-snapshot byte cap from the 850 KB
default to 50 MB, and align the USM HTTP exporter request timeout to
the AHC readTimeout default of 60 s.

* Controller `usm_agent_telemetry` block: add all three keys
  (`_usm.client.request.timeout.ms=60000`,
  `_usm.events.client.request.timeout.ms=60000`,
  `catalog.collector.max.bytes.per.snapshot=52428800`).
* Broker `usm_agent_telemetry` block: add `_usm.client.request.timeout.ms=60000`
  unconditionally.
* Broker `usm_agent_telemetry_zk_mode` (new block, gated on
  `not kraft_enabled`): add `_usm.events.client.request.timeout.ms=60000`
  and `catalog.collector.max.bytes.per.snapshot=52428800`. Mirrors CFK's
  broker `IsZookeeperMode` block — the broker hosts the catalog
  collector and events exporter only when running in ZK mode.
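The ZK-mode block described in the last bullet might look roughly like the fragment below. It is a sketch under assumptions: the `enabled` expression is the one quoted in the review thread, but the parent key `kafka_broker_properties` and the `properties` nesting are assumed rather than taken from the diff.

```yaml
# Sketch of the first-commit ZK-mode block (later dropped from this PR).
# Parent key and nesting are assumptions about vars/main.yml.
kafka_broker_properties:
  usm_agent_telemetry_zk_mode:
    # Only active for USM deployments that are not running KRaft.
    enabled: "{{ 'usm_agent' in groups and not kraft_enabled }}"
    properties:
      # 50 MB in bytes; full key name as given in the PR summary.
      confluent.catalog.collector.max.bytes.per.snapshot: 52428800
      # The commit message also adds _usm.events.client.request.timeout.ms=60000
      # here; the full key prefix is not shown in the source, so it is omitted.
```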

Test coverage: new `molecule/verify_usm_catalog_and_timeout.yml` uses
the canonical `confluent.test/check_property.yml` task to assert each
key against rendered `server.properties`. Imported by `scram-rhel`,
`oauth-plain-rhel`, and `oauth-rbac-plain-rhel8` scenarios (already
USM-enabled per their existing `verify_usm_client_metrics.yml` import).
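For orientation, one assertion in such a verify playbook could look like the sketch below. The variable names `file_path`, `property`, and `expected_value`, the `kafka_controller` host group, and the properties file path are all assumptions about the `confluent.test` role and the test hosts; none of them are confirmed by this PR.

```yaml
# Hypothetical excerpt of a verify playbook; names and paths are assumed.
- name: Verify USM catalog snapshot limit on controllers
  hosts: kafka_controller
  gather_facts: false
  tasks:
    - name: Check the per-snapshot byte cap in the rendered properties file
      import_role:
        name: confluent.test
        tasks_from: check_property.yml
      vars:
        file_path: /etc/controller/server.properties  # assumed location
        property: confluent.catalog.collector.max.bytes.per.snapshot
        expected_value: "52428800"
      when: "'usm_agent' in groups"
```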

Gated on `'usm_agent' in groups` everywhere; non-USM deployments are
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 04:58
@varunpv Varun PV (varunpv) requested a review from a team as a code owner May 19, 2026 04:58

Copilot AI left a comment


Pull request overview

This PR updates the cp-ansible USM Kafka configuration to prevent large catalog snapshots from being dropped and to align USM HTTP exporter request timeouts with the AHC default read timeout.

Changes:

  • Set confluent.catalog.collector.max.bytes.per.snapshot to 50 MB for controller (and broker in ZK mode).
  • Set _usm.client.request.timeout.ms / _usm.events.client.request.timeout.ms to 60s for relevant roles.
  • Add a new Molecule verify playbook and wire it into existing USM-focused scenarios.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| roles/variables/vars/main.yml | Adds/adjusts Kafka controller and broker USM-related properties, including a new ZK-only broker block. |
| molecule/verify_usm_catalog_and_timeout.yml | New Molecule verification playbook asserting the new properties in rendered server.properties. |
| molecule/scram-rhel/verify.yml | Imports the new verification playbook into the scenario. |
| molecule/oauth-rbac-plain-rhel8/verify.yml | Imports the new verification playbook into the scenario. |
| molecule/oauth-plain-rhel/verify.yml | Imports the new verification playbook into the scenario. |
Comments suppressed due to low confidence (1)

molecule/verify_usm_catalog_and_timeout.yml:82

  • Same issue as above: use not (kraft_enabled|bool) to avoid incorrect condition evaluation when kraft_enabled is provided as a string via extra vars.
      when:
        - "'usm_agent' in groups"
        - not kraft_enabled
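The suggested fix can be sketched as follows. The task itself is hypothetical and only there to carry the condition. The relevant part is the `bool` filter, which exists in Ansible and converts string inputs such as `"false"` into a real boolean before `not` is applied.

```yaml
# Hypothetical task; only the `when` conditions illustrate the review comment.
- name: Run ZK-mode-only USM checks
  ansible.builtin.debug:
    msg: "ZK-mode USM checks would run here"
  when:
    - "'usm_agent' in groups"
    # With `-e kraft_enabled=false` the value can arrive as the string "false",
    # which is non-empty and therefore truthy; `| bool` normalizes it first.
    - not (kraft_enabled | bool)
```

Without the filter, a string value of `"false"` could make `not kraft_enabled` evaluate to false, so the task would be skipped on a ZK-mode host.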


Comment thread roles/variables/vars/main.yml Outdated
confluent.consumer.group.status.enabled: "true"
confluent.consumer.lag.calculator.empty.lag.retention.ms: 86400000
usm_agent_telemetry_zk_mode:
enabled: "{{ 'usm_agent' in groups and not kraft_enabled }}"
Comment on lines +68 to +71
when:
- "'usm_agent' in groups"
- not kraft_enabled

Varun PV (varunpv) and others added 2 commits May 19, 2026 10:41
master and 8.1.x/8.2.x/8.3.x are KRaft-only; ZK-mode broker keys only
apply on 7.9.x (handled in a separate backport PR). Also dropping the
new molecule verify playbook per scope clarification — the property
plumbing on master is just the controller and the unconditional broker
`_usm.client.request.timeout.ms`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests dropped from this PR per scope clarification. Reverts the
verify_usm_catalog_and_timeout.yml imports added in the first commit
to scram-rhel, oauth-plain-rhel, and oauth-rbac-plain-rhel8.

Net result: this PR is just the vars/main.yml change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mansisinha mansi sinha (mansisinha) left a comment


This PR should be raised against the 8.1.x branch if we want to patch this config in all Kafka versions for USM. After it is merged to 8.1.x, a pint merge has to be performed.


3 participants