[Bug] Bundle is stuck permanently if collection agent fails on one node #73

@ejweber

Description

For reasons outlined in #72, the support bundle collection process could not complete on one node in a cluster. It looks like we wait here indefinitely to receive all expected bundles before proceeding. Because the collection agent on that node failed before checking in, we never finished creating the bundle, and the user had nothing to send to support.

Some suggested resolutions:

  • A timeout mechanism could automatically send on m.ch after some time, even if not all bundles had been received. This would ensure we get something, though we would have to determine what a reasonable timeout should be.
  • Watch for DaemonSet Pod restarts. After some threshold (or maybe just one), stop expecting the corresponding collection agent to send a bundle.
  • The collection agent could survive errors like the one the user experienced and send at least something to the manager. This probably doesn't help us in a network partition, etc.
