Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phil/controller backoffs #2038

Merged
merged 5 commits into from
Apr 3, 2025
Merged

Phil/controller backoffs #2038

merged 5 commits into from
Apr 3, 2025

Conversation

psFried
Copy link
Member

@psFried psFried commented Apr 2, 2025

Description:

Resolves #2032
Adds backoffs to controller-initiated publications and discovers, so that they don't retry them too frequently when the controller is invoked frequently for various different reasons. This ought to dramatically reduce the rate of failed publications and discovers, which are being run due to frequent shard failures, which trigger controller runs.

In all of these, the goal is to still show the controller as being in error status when it's waiting on a backoff, so that failures are still obvious when looking at the overall status.

Also fixes a bug in the runtime, which caused us to fail to set lastSourcePublishedAt in materialization stats documents if no documents were read from the source.


This change is Reviewable

psFried added 5 commits April 2, 2025 12:41
Introduces a backoff for failed periodic publications, to limit the
rate of attempts when the controller gets run very frequently.
Introduces a backoff error after failed auto-discover attempts. This avoids
attempting auto-disocvers too frequently in cases where the controller is being
run frequently for other reasons. Having an error means that controllers with
failed auto-discovers will still indicate this obviously.

Also fixes a bug in auto-discovers, which didn't account for specs that have
auto-discovers disabled completely, and would still attempt to discover them.
Updates the dependency update controller to consistently apply the backoff
after failed publications.  Previously, this would re-try the publication every
time the controller runs, which resulted in frequent attempts when the
controller was being invoked frequently for various other reasons.
Introduces a backoff in the materialization controller when publications to add
sourceCapture bindings fail. This prevents rapid retries when the controller is
invoked frequently.
Fixes a typo in the runtime, which resulted in only setting
`lastSourcePublishedAt` in the stats documents if there were documents loaded
from the destination. This field should be present if any documents were read
from source collections.
@psFried psFried force-pushed the phil/controller-backoffs branch from 6f9abc9 to 45b71e5 Compare April 2, 2025 21:43
@psFried psFried requested a review from jgraettinger April 2, 2025 21:50
Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

update_next_run(status, state, model, control_plane).await?;
has_changes
}
let outcome = match result {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: could use map_err

@psFried psFried merged commit 616e874 into master Apr 3, 2025
3 checks passed
@psFried psFried deleted the phil/controller-backoffs branch April 3, 2025 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: controllers should backoff failed publications
2 participants