add liveness and readiness probe #565

Open
yash97 wants to merge 2 commits into aws:main from yash97:probes

yash97 commented May 5, 2026

Contributor

Issue #, if available:

Description of changes:

Testing:
Tested the readiness probe by removing the gRPC socket. Sample logs after the socket was removed:

{"level":"error","ts":"2026-05-05T14:36:43.331Z","caller":"healthz/healthz.go:59","msg":"grpc-socket check failed: grpc health check failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/aws-node/npa.sock: connect: no such file or directory\""}
{"level":"info","ts":"2026-05-05T14:36:43.331Z","logger":"controller-runtime.healthz","caller":"healthz/healthz.go:128","msg":"healthz check failed","statuses":[{},{}]}
{"level":"error","ts":"2026-05-05T14:36:53.332Z","caller":"healthz/healthz.go:59","msg":"grpc-socket check failed: grpc health check failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/aws-node/npa.sock: connect: no such file or directory\""}
{"level":"info","ts":"2026-05-05T14:36:53.332Z","logger":"controller-runtime.healthz","caller":"healthz/healthz.go:128","msg":"healthz check failed","statuses":[{},{}]}
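The manual test above can be reproduced locally with a stdlib-only sketch. It creates a throwaway Unix socket, runs a check, deletes the socket file, and runs the check again. The check keeps the PR's `func(*http.Request) error` shape, but it only dials the socket instead of issuing a gRPC Health/Check RPC. The 2-second timeout, the temp socket path, and the helper names (`newSocketDialCheck`, `removalDemo`) are assumptions for illustration and do not come from the PR.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"time"
)

// probeTimeout bounds one check; 2s is an assumed value, not the PR's 10s constant.
const probeTimeout = 2 * time.Second

// newSocketDialCheck (hypothetical) has the same shape as the PR's checker, but it
// only dials the Unix socket; it does NOT issue a gRPC Health/Check RPC.
func newSocketDialCheck(socketPath string) func(*http.Request) error {
	return func(r *http.Request) error {
		// Derive the deadline from the probe request so a cancelled probe stops the dial.
		ctx, cancel := context.WithTimeout(r.Context(), probeTimeout)
		defer cancel()

		var d net.Dialer
		conn, err := d.DialContext(ctx, "unix", socketPath)
		if err != nil {
			return fmt.Errorf("socket dial failed: %w", err)
		}
		return conn.Close()
	}
}

// removalDemo mirrors the manual test: check with the socket present,
// delete the socket file, then check again.
func removalDemo() (before, after error, err error) {
	dir, err := os.MkdirTemp("", "npa-probe")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "npa.sock") // stand-in for /var/run/aws-node/npa.sock
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return nil, nil, err
	}
	defer ln.Close()

	check := newSocketDialCheck(sock)
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)

	before = check(req)
	if err := os.Remove(sock); err != nil {
		return nil, nil, err
	}
	after = check(req) // the listener is still open, but the path is gone
	return before, after, nil
}

func main() {
	before, after, err := removalDemo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("socket present:", before)
	fmt.Println("socket removed:", after)
}
```

With the socket file gone, the dial fails with the same `connect: no such file or directory` cause that appears in the logs above. Tying the context to `r.Context()` rather than `context.Background()` is a choice made in this sketch, not something the PR does.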

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yash97 requested a review from a team as a code owner May 5, 2026 17:52

codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 37.73585% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.79%. Comparing base (8e2bd30) to head (9c85490).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/rpc/healthcheck.go | 0.00% | 29 Missing ⚠️ |
| pkg/ebpf/healthcheck.go | 83.33% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
+ Coverage   27.63%   27.79%   +0.15%     
==========================================
  Files          24       26       +2     
  Lines        3351     3404      +53     
==========================================
+ Hits          926      946      +20     
- Misses       2327     2359      +32     
- Partials       98       99       +1     
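The report's percentages follow from the counts above if partially covered lines count as not covered, that is, coverage = hits / (hits + misses + partials). The sketch below checks that arithmetic. The helper names are hypothetical, and the truncation to two decimals assumes Codecov's default `round: down`, `precision: 2` setting, which is what turns a raw delta of about 0.157 into the reported +0.15%.

```go
package main

import (
	"fmt"
	"math"
)

// coverage returns line coverage in percent, counting partial lines as not covered.
func coverage(hits, misses, partials int) float64 {
	total := hits + misses + partials
	if total == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(total)
}

// roundDown truncates x to prec decimal places (assumed Codecov default rounding).
func roundDown(x float64, prec int) float64 {
	p := math.Pow10(prec)
	return math.Floor(x*p) / p
}

func main() {
	base := coverage(926, 2327, 98) // main: 3351 lines
	head := coverage(946, 2359, 99) // PR head: 3404 lines
	patch := coverage(20, 32, 1)    // 53 patch lines: 29 + 3 missing, 1 partial
	ebpf := coverage(20, 3, 1)      // pkg/ebpf/healthcheck.go: 24 patch lines

	fmt.Printf("base %.2f%%  head %.2f%%  delta %+.2f%%\n",
		roundDown(base, 2), roundDown(head, 2), roundDown(head-base, 2))
	fmt.Printf("patch %.5f%%  pkg/ebpf/healthcheck.go %.2f%%\n", patch, roundDown(ebpf, 2))
}
```

The patch figure also confirms the header: 53 new lines with 33 not covered (32 misses plus 1 partial) leaves 20 hits, and 20/53 ≈ 37.73585%.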
| Flag | Coverage Δ |
|---|---|
| unittest | 27.79% <37.73%> (+0.15%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/ebpf/healthcheck.go | 83.33% <83.33%> (ø) |
| pkg/rpc/healthcheck.go | 0.00% <0.00%> (ø) |


Comment thread main.go

// CNI makes rpc calls to NP agent regardless NP is enabled or not
// need to start rpc always
// todo: add a liveness probe to this gRPC server and remove closing based on this errCh, liveness probe will check and re-start this container

Contributor

Can we simplify the errCh-based shutdown that happens currently, since we are adding liveness and readiness probes?

We no longer need the errCh-based shutdown, because the liveness probe will restart the container if the RPC handler is not running.

Contributor Author

Keeping it as is for now. Once we update the node agent manifest with the liveness probe, I will remove it.
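Only the comment above the RPC startup in main.go is quoted here, so the following is a generic sketch of the errCh pattern under discussion, not the agent's code. The names `startRPC`, `waitForRPCExit`, and `errListenerClosed` are hypothetical. The server loop reports its terminal error on a buffered channel, and main exits when that happens. A liveness probe moves the restart decision from the process to the kubelet.

```go
package main

import (
	"errors"
	"fmt"
)

// errListenerClosed is a hypothetical terminal error from the RPC server loop.
var errListenerClosed = errors.New("listener closed")

// startRPC runs serve in its own goroutine and delivers its terminal error on a
// buffered channel, so the goroutine can exit even if nobody is receiving.
func startRPC(serve func() error) <-chan error {
	errCh := make(chan error, 1)
	go func() { errCh <- serve() }()
	return errCh
}

// waitForRPCExit blocks until the RPC goroutine returns or stop is closed.
// A non-nil result is what makes main exit (and the container restart)
// in the errCh approach.
func waitForRPCExit(errCh <-chan error, stop <-chan struct{}) error {
	select {
	case err := <-errCh:
		if err == nil {
			err = errors.New("exited without error")
		}
		return fmt.Errorf("rpc handler stopped: %w", err)
	case <-stop:
		return nil
	}
}

func main() {
	errCh := startRPC(func() error { return errListenerClosed })
	if err := waitForRPCExit(errCh, make(chan struct{})); err != nil {
		fmt.Println("exiting:", err) // a real main would exit non-zero here
	}
}
```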

Comment thread main.go Outdated

Copilot AI left a comment

Pull request overview

This PR adds Kubernetes-style liveness/readiness probe support to the network policy agent by wiring controller-runtime health checks that validate the agent’s gRPC Unix socket (and, when network policy is enabled, key eBPF prerequisites).

Changes:

  • Add a gRPC Unix-socket health check that performs a gRPC Health/Check RPC against the agent.
  • Add readiness checks for bpffs correctness (mounted + writable) and presence of required pinned BPF maps.
  • Register the new checks with the controller manager and add unit tests for the eBPF readiness checks.
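The eBPF readiness checks in the overview can be sketched as follows. The PR's pkg/ebpf/healthcheck.go is not shown in this thread, so this is a minimal Linux-only illustration and not the PR's implementation. The mount check compares the `statfs` magic against `BPF_FS_MAGIC` (0xcafe4a11). The writability check creating a directory is an assumption, chosen because bpffs accepts `mkdir` but not regular files. The map names (`example_map_a`, `example_map_b`) and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// bpfFSMagic is BPF_FS_MAGIC from <linux/magic.h>.
const bpfFSMagic = 0xcafe4a11

// checkBPFFSMounted reports whether path lives on a bpf filesystem (Linux only).
func checkBPFFSMounted(path string) error {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return fmt.Errorf("statfs %s: %w", path, err)
	}
	// Statfs_t.Type's integer type varies by architecture; compare as uint32.
	if uint32(st.Type) != bpfFSMagic {
		return fmt.Errorf("%s is not bpffs (fs type %#x)", path, uint32(st.Type))
	}
	return nil
}

// checkWritable creates and removes a scratch directory under path
// (an assumed probe; bpffs permits mkdir but not regular files).
func checkWritable(path string) error {
	dir, err := os.MkdirTemp(path, ".readyz-")
	if err != nil {
		return fmt.Errorf("%s not writable: %w", path, err)
	}
	return os.Remove(dir)
}

// checkPinnedMaps verifies that each expected pin exists under base.
// Any stat error counts as missing.
func checkPinnedMaps(base string, names []string) error {
	var missing []string
	for _, n := range names {
		if _, err := os.Stat(filepath.Join(base, n)); err != nil {
			missing = append(missing, n)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing pinned maps under %s: %v", base, missing)
	}
	return nil
}

// demo runs the checks against a scratch directory (never bpffs) that holds
// only one of two hypothetical pins.
func demo() (mountErr, writeErr, pinsErr error, err error) {
	dir, err := os.MkdirTemp("", "bpffs-demo")
	if err != nil {
		return nil, nil, nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "example_map_a"), nil, 0o600); err != nil {
		return nil, nil, nil, err
	}
	return checkBPFFSMounted(dir), checkWritable(dir),
		checkPinnedMaps(dir, []string{"example_map_a", "example_map_b"}), nil
}

func main() {
	// /sys/fs/bpf is the conventional bpffs mount point on Linux.
	fmt.Println("bpffs check on /sys/fs/bpf:", checkBPFFSMounted("/sys/fs/bpf"))
	mountErr, writeErr, pinsErr, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("scratch dir:", mountErr, writeErr, pinsErr)
}
```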

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| pkg/rpc/healthcheck.go | Implements gRPC-socket health check used by healthz/readyz. |
| pkg/ebpf/healthcheck.go | Adds bpffs + global map readiness checks. |
| pkg/ebpf/healthcheck_test.go | Adds unit tests for the new eBPF readiness checks. |
| main.go | Registers new healthz/readyz checks with controller-runtime manager. |
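A registered healthz/readyz endpoint aggregates its named checks. This stdlib-only sketch shows the idea without reproducing controller-runtime's implementation: it runs every check and answers 500 if any check fails, otherwise 200. `grpc-socket` matches the check name registered in main.go. The `bpffs` name and all helper names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// checker has the func(*http.Request) error shape used by the PR's checks.
type checker func(*http.Request) error

// probeHandler runs every named check and answers 500 listing the failures if
// any check fails, otherwise 200 "ok". A simplified stand-in, not controller-runtime.
func probeHandler(checks map[string]checker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		names := make([]string, 0, len(checks))
		for name := range checks {
			names = append(names, name)
		}
		sort.Strings(names) // map order is random; sort for stable output

		var failures []string
		for _, name := range names {
			if err := checks[name](r); err != nil {
				failures = append(failures, fmt.Sprintf("%s: %v", name, err))
			}
		}
		if len(failures) > 0 {
			http.Error(w, strings.Join(failures, "\n"), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "ok")
	})
}

// statusFor issues one GET against a handler built from checks and returns the code.
func statusFor(checks map[string]checker) int {
	rec := httptest.NewRecorder()
	probeHandler(checks).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

// demoStatuses compares an all-passing set of checks with one failing check.
func demoStatuses() (allPass, oneFails int) {
	pass := func(*http.Request) error { return nil }
	fail := func(*http.Request) error { return errors.New("socket missing") }
	allPass = statusFor(map[string]checker{"grpc-socket": pass, "bpffs": pass})
	oneFails = statusFor(map[string]checker{"grpc-socket": fail, "bpffs": pass})
	return allPass, oneFails
}

func main() {
	ok, bad := demoStatuses()
	fmt.Println(ok, bad)
}
```

One failing check is enough to fail the whole endpoint. Registering the socket check as a readiness check therefore takes the pod out of service while the socket is missing, which matches the failing `healthz check failed` entries in the testing logs.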

Comment thread pkg/rpc/healthcheck.go
Comment on lines +28 to +35
const livenessCheckTimeout = 10 * time.Second

// NewGRPCSocketLivenessCheck returns a health check that issues a gRPC
// Health/Check RPC against the NPA Unix socket at socketPath.
func NewGRPCSocketLivenessCheck(socketPath string) func(_ *http.Request) error {
	return func(_ *http.Request) error {
		ctx, cancel := context.WithTimeout(context.Background(), livenessCheckTimeout)
		defer cancel()
Comment thread main.go
Comment on lines +162 to +163
if err := mgr.AddReadyzCheck("grpc-socket", rpc.NewGRPCSocketLivenessCheck(npaSocketPath)); err != nil {
	log.Errorf("unable to set up grpc-socket readiness check %v", err)
Comment thread pkg/rpc/healthcheck.go
Comment on lines +30 to +35
// NewGRPCSocketLivenessCheck returns a health check that issues a gRPC
// Health/Check RPC against the NPA Unix socket at socketPath.
func NewGRPCSocketLivenessCheck(socketPath string) func(_ *http.Request) error {
	return func(_ *http.Request) error {
		ctx, cancel := context.WithTimeout(context.Background(), livenessCheckTimeout)
		defer cancel()

4 participants