Skip to content

gcs_store, filesystem_store: lightweight check_health probes#2361

Open
amankrx wants to merge 3 commits into
TraceMachina:mainfrom
amankrx:lightweight-check-health-gcs-filesystem
Open

gcs_store, filesystem_store: lightweight check_health probes#2361
amankrx wants to merge 3 commits into
TraceMachina:mainfrom
amankrx:lightweight-check-health-gcs-filesystem

Conversation

@amankrx
Copy link
Copy Markdown
Collaborator

@amankrx amankrx commented May 22, 2026

Description

The default StoreDriver::check_health performs a full update_oneshot + has + get_part_unchunked roundtrip. Under any meaningful slow-store load (e.g. concurrent multi-GB blob reads saturating the network) that roundtrip queues behind production traffic and easily exceeds the per-indicator budget, causing /status to return 503 even when the pod is otherwise functional. Kubelet treats those as probe failures and eventually restarts the pod, dropping every connected client.

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added relevant tests and tested the branch in an actual environment.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link
Copy Markdown
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test code for both of these

@amankrx amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from 11415d3 to a0d5848 Compare May 22, 2026 15:40
@amankrx amankrx requested a review from palfrey May 22, 2026 15:40
Comment thread nativelink-store/src/filesystem_store.rs Outdated
Comment thread nativelink-store/src/filesystem_store.rs Outdated
Comment thread nativelink-store/src/gcs_store.rs Outdated
format!("GcsStore::check_health: object_exists errored: {e}").into(),
)
}
Err(_) => HealthStatus::new_failed(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also warn!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should be a timeout, not a Failed!

format!("FilesystemStore::check_health: stat errored: {e}").into(),
)
}
Err(_) => HealthStatus::new_failed(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a timeout, not a Failed!

use nativelink_store::cas_utils::ZERO_BYTE_DIGESTS;
use nativelink_store::gcs_client::client::GcsOperations;
use nativelink_store::gcs_client::mocks::{MockGcsOperations, MockRequest};
use nativelink_store::gcs_client::mocks::{FailureMode, MockGcsOperations, MockRequest};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea, but now causing

error: unnecessary qualification
   --> nativelink-store/tests/gcs_store_test.rs:136:27
    |
136 |         .set_failure_mode(nativelink_store::gcs_client::mocks::FailureMode::NotFound)
    |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `-D unused-qualifications` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(unused_qualifications)]`

@amankrx amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from a0d5848 to 1327ec0 Compare May 23, 2026 13:24
The default StoreDriver::check_health performs a full update_oneshot +
has + get_part_unchunked roundtrip. Under any meaningful slow-store
load (e.g. concurrent multi-GB blob reads saturating the network) that
roundtrip queues behind production traffic and easily exceeds the
per-indicator budget, causing /status to return 503 even when the
pod is otherwise functional. Kubelet treats those as probe failures
and eventually restarts the pod, dropping every connected client.

Override check_health on both registered indicators with a probe that
shares no path with production traffic:

  * GcsStore: a single object_exists() call against a fixed
    never-existing path. Metadata roundtrip only — independent of
    body-transfer bandwidth and the upload buffer pool.
  * FilesystemStore: a stat() of the configured content_path. A
    single syscall, microseconds on a healthy mount; bounded with a
    timeout so a hung NFS / EBS mount cannot wedge the indicator.

Both paths are bounded by a 2 s ceiling so they remain well inside
the HealthServer per-indicator budget.
@amankrx amankrx force-pushed the lightweight-check-health-gcs-filesystem branch from 1327ec0 to 7ef50ac Compare May 23, 2026 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants