
fix: health should ignore max_unavailable [NR-365513]#1051

Merged
paologallinaharbur merged 4 commits into main from fix/health
Feb 10, 2025

Conversation

@paologallinaharbur
Member

Before, we took too many variables into account and relied on wrong assumptions:

  • we do not know whether a rolling update is ongoing
  • we do not have a timeout

Therefore, we always took max_unavailable into account, making it very easy to report false negatives.

The implementation has been simplified to report unhealthy whenever not all the expected pods are ready.

This implies that during a rollout an agent could appear as unhealthy while not all the expected pods are ready yet.

Moreover, please note that, following the APM case:

We also report Healthy if a replica is running an old version of the agent, since we can safely assume it is in the process of being upgraded.
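The simplified rule can be sketched roughly as follows. This is a hedged illustration, not the crate's actual code: the `Health` enum and the `daemon_set_health` function here are hypothetical stand-ins, and only the comparison of ready pods against desired pods reflects the behavior described above (note that max_unavailable does not appear anywhere).

```rust
// Hypothetical sketch: unhealthy if and only if fewer pods are ready
// than desired; max_unavailable is ignored entirely.
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Unhealthy(String),
}

fn daemon_set_health(desired_number_scheduled: i32, number_ready: i32) -> Health {
    if number_ready >= desired_number_scheduled {
        // Pods running an old agent version still count as ready,
        // so an in-progress upgrade can still report Healthy.
        Health::Healthy
    } else {
        Health::Unhealthy(format!(
            "expected {desired_number_scheduled} ready pods, got {number_ready}"
        ))
    }
}

fn main() {
    // All expected pods ready: healthy, even mid-upgrade.
    assert_eq!(daemon_set_health(3, 3), Health::Healthy);
    // One pod short (e.g. during a rollout): unhealthy.
    assert!(matches!(daemon_set_health(3, 2), Health::Unhealthy(_)));
    println!("ok");
}
```

The trade-off, as noted above, is that a rollout temporarily surfaces as unhealthy, in exchange for never masking genuinely missing pods behind max_unavailable.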

@paologallinaharbur paologallinaharbur changed the title fix: health should ignore max_unhealthy [NR-365513] fix: health should ignore max_unavailable [NR-365513] Feb 6, 2025
vjripoll
vjripoll previously approved these changes Feb 6, 2025
Contributor

@vjripoll vjripoll left a comment


🚀

sigilioso
sigilioso previously approved these changes Feb 6, 2025
Contributor

@sigilioso sigilioso left a comment


🚀

// I.e. we also report healthy whenever an instance is running an old version.
pub fn check_health_single_daemon_set(ds: &DaemonSet) -> Result<Health, HealthCheckerError> {
let name = client_utils::get_metadata_name(ds)?;
let status = Self::get_daemon_set_status(name.as_str(), ds)?;
Contributor


Working on the deployment I see this same pattern; should we fail because of this?
Not having a status is an expected situation, right?
I fear we will pollute the logs because of this; perhaps we should report unhealthy but not fail the health checker.

Contributor


I thought the status was supposed to be there 🤔 (probably there are scenarios I'm not aware of). If it is expected, I agree we should not fail (even though failing also reports unhealthy in the end).

Member Author


Have you seen that log often?
By the way, it does fail, but in the end it returns unhealthy without breaking anything:

let health = health_checker.check_health().unwrap_or_else(|err| {
    debug!(agent_id = %agent_id_clone, last_error = %err, "the configured health check failed");
    HealthWithStartTime::from_unhealthy(Unhealthy::from(err), sub_agent_start_time)
});

Contributor

@sigilioso sigilioso left a comment


Two nits 🙂

Contributor

@sigilioso sigilioso left a comment


LGTM

@paologallinaharbur paologallinaharbur merged commit efaa257 into main Feb 10, 2025
26 checks passed
@paologallinaharbur paologallinaharbur deleted the fix/health branch February 10, 2025 11:03