gcs_store, filesystem_store: lightweight check_health probes by amankrx · Pull Request #2361 · TraceMachina/nativelink

amankrx · 2026-05-22T14:01:05Z

Description

The default StoreDriver::check_health performs a full update_oneshot + has + get_part_unchunked roundtrip. Under any meaningful slow-store load (e.g. concurrent multi-GB blob reads saturating the network) that roundtrip queues behind production traffic and easily exceeds the per-indicator budget, causing /status to return 503 even when the pod is otherwise functional. Kubelet treats those as probe failures and eventually restarts the pod, dropping every connected client.

Type of change

Please delete options that aren't relevant.

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added relevant tests and tested the branch in an actual environment.

Checklist

Updated documentation if needed
Tests added/amended
bazel test //... passes locally
PR is contained in a single commit, using git amend see some docs

This change is

palfrey

Please add test code for both of these

palfrey · 2026-05-22T16:25:37Z

+                    format!("GcsStore::check_health: object_exists errored: {e}").into(),
+                )
+            }
+            Err(_) => HealthStatus::new_failed(


This should also warn!

And it should be a timeout, not a Failed!

palfrey · 2026-05-22T16:27:23Z

+                    format!("FilesystemStore::check_health: stat errored: {e}").into(),
+                )
+            }
+            Err(_) => HealthStatus::new_failed(


Should be a timeout, not a Failed!

palfrey · 2026-05-22T16:28:29Z

 use nativelink_store::cas_utils::ZERO_BYTE_DIGESTS;
 use nativelink_store::gcs_client::client::GcsOperations;
-use nativelink_store::gcs_client::mocks::{MockGcsOperations, MockRequest};
+use nativelink_store::gcs_client::mocks::{FailureMode, MockGcsOperations, MockRequest};


It's a good idea, but now causing

error: unnecessary qualification --> nativelink-store/tests/gcs_store_test.rs:136:27 | 136 | .set_failure_mode(nativelink_store::gcs_client::mocks::FailureMode::NotFound) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `-D unused-qualifications` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_qualifications)]`

The default StoreDriver::check_health performs a full update_oneshot + has + get_part_unchunked roundtrip. Under any meaningful slow-store load (e.g. concurrent multi-GB blob reads saturating the network) that roundtrip queues behind production traffic and easily exceeds the per-indicator budget, causing /status to return 503 even when the pod is otherwise functional. Kubelet treats those as probe failures and eventually restarts the pod, dropping every connected client. Override check_health on both registered indicators with a probe that shares no path with production traffic: * GcsStore: a single object_exists() call against a fixed never-existing path. Metadata roundtrip only — independent of body-transfer bandwidth and the upload buffer pool. * FilesystemStore: a stat() of the configured content_path. A single syscall, microseconds on a healthy mount; bounded with a timeout so a hung NFS / EBS mount cannot wedge the indicator. Both paths are bounded by a 2 s ceiling so they remain well inside the HealthServer per-indicator budget.

palfrey requested changes May 22, 2026

View reviewed changes

amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from 11415d3 to a0d5848 Compare May 22, 2026 15:40

amankrx requested a review from palfrey May 22, 2026 15:40

palfrey requested changes May 22, 2026

View reviewed changes

amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from a0d5848 to 1327ec0 Compare May 23, 2026 13:24

amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from 1327ec0 to 7ef50ac Compare May 23, 2026 13:27

MarcusSorealheis added 2 commits May 26, 2026 14:29

Merge branch 'main' into lightweight-check-health-gcs-filesystem

e5ebede

Merge branch 'main' into lightweight-check-health-gcs-filesystem

c44a0b4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gcs_store, filesystem_store: lightweight check_health probes#2361

gcs_store, filesystem_store: lightweight check_health probes#2361
amankrx wants to merge 3 commits into
TraceMachina:mainfrom
amankrx:lightweight-check-health-gcs-filesystem

amankrx commented May 22, 2026 •

edited

Loading

Uh oh!

palfrey left a comment

Uh oh!

Uh oh!

Uh oh!

palfrey May 22, 2026

Uh oh!

palfrey May 22, 2026

Uh oh!

palfrey May 22, 2026

Uh oh!

palfrey May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

amankrx commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

How Has This Been Tested?

Checklist

Uh oh!

palfrey left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

palfrey May 22, 2026

Choose a reason for hiding this comment

Uh oh!

palfrey May 22, 2026

Choose a reason for hiding this comment

Uh oh!

palfrey May 22, 2026

Choose a reason for hiding this comment

Uh oh!

palfrey May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

amankrx commented May 22, 2026 •

edited

Loading