Skip to content

[add_kubernetes_metadata] Fixes indefinite block when Close() is called#50897

Open
khushijain21 wants to merge 4 commits into
elastic:mainfrom
khushijain21:fix-error-handling
Open

[add_kubernetes_metadata] Fixes indefinite block when Close() is called#50897
khushijain21 wants to merge 4 commits into
elastic:mainfrom
khushijain21:fix-error-handling

Conversation

@khushijain21
Copy link
Copy Markdown
Contributor

@khushijain21 khushijain21 commented May 25, 2026

Proposed commit message

If wait_for_metadata:false and wait_metadata_timeout:0s is set. It retries indefinitely until connection succeeds.
But if Close() is called on the processor, it never returns as it is blocked on retrying.

This PR fixes that and uses context to cancel any ongoing connection retries.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

Disruptive User Impact

None

@botelastic botelastic Bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • /test : Run the Buildkite pipeline.

@khushijain21 khushijain21 changed the title [add_kubernetes_metadata] Fixes a potential indefinite block when is called [add_kubernetes_metadata] Fixes a potential indefinite block when Close() is called May 25, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 25, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @khushijain21? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@khushijain21 khushijain21 marked this pull request as ready for review May 25, 2026 08:38
@khushijain21 khushijain21 requested a review from a team as a code owner May 25, 2026 08:38
@khushijain21 khushijain21 changed the title [add_kubernetes_metadata] Fixes a potential indefinite block when Close() is called [add_kubernetes_metadata] Fixes indefinite block when Close() is called May 25, 2026
@khushijain21 khushijain21 added the backport-skip Skip notification from the automated backport with mergify label May 25, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5d9da35d-2e18-4721-91db-fe45a1a1d437

📥 Commits

Reviewing files that changed from the base of the PR and between 13fd138 and 9a7390b.

📒 Files selected for processing (1)
  • libbeat/processors/add_kubernetes_metadata/kubernetes_test.go

📝 Walkthrough

Walkthrough

This PR adds context-aware control flow to the Kubernetes processor initialization, enabling graceful cancellation and explicit error propagation. The validation for WaitMetadataRetryPeriod now rejects zero values. The core changes introduce a context-aware isKubernetesAvailableWithTimeout helper that replaces the previous retry/sleep loop, and refactor kubernetesAnnotator.init to accept a context and return an error. The New method now creates a cancelable context and passes it to init; when WaitMetadata is enabled, initialization runs synchronously and returns errors; otherwise, it runs asynchronously. The Close method is updated to cancel the context and wait for any in-flight initialization goroutines via a WaitGroup, preventing goroutine leaks. Tests cover the async init blocking behavior and timeout/cancellation scenarios.

Possibly related PRs

  • elastic/beats#50509: Modifies Kubernetes availability probing and initialization gating toward timeout-based/wait-for-ready behavior with context-aware timeout logic and async readiness coordination.

Suggested labels

Team:Elastic-Agent-Data-Plane

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • 🛠️ Update Documentation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

TL;DR

libbeat Go integration tests failed due to a transient Kafka leader-election window (leaderless partition) in TestKafkaPublish, not because of the PR’s code change. Immediate action: re-run the failed Buildkite job.

Remediation

Investigation details

Root Cause

The failure is in libbeat/outputs/kafka integration tests, where publish completion was signaled as retry instead of ACK during a broker metadata/leadership transition.

  • Assertion site: libbeat/outputs/kafka/kafka_integration_test.go:314
  • ACK requirement helper: libbeat/outputs/kafka/kafka_integration_test.go:649-656
  • Failing expectation from logs: expected batch to be ACKed, got 4

The commit under test (9a7390b345e91175a44548595c1e33b3f1df17cf) modifies only libbeat/processors/add_kubernetes_metadata/kubernetes_test.go, which is unrelated to Kafka output integration behavior.

Evidence

Verification

  • Additional local rerun not performed here because this integration test requires the Kafka integration environment used in CI.

Follow-up

  • If this becomes frequent, consider making TestKafkaPublish wait for topic leader availability (or retry-on-leaderless before ACK assertion) to reduce infrastructure-driven flakes.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

What is this? | From workflow: PR Buildkite Detective

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 25, 2026
@botelastic botelastic Bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 25, 2026
@infra-vault-gh-plugin-prod
Copy link
Copy Markdown

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-skip Skip notification from the automated backport with mergify skip-changelog Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants