Skip to content

Backport of fix: prevent graceful shutdown errors from killing Envoy proxy immediately into release/2.0.x#1075

Merged
hc-github-team-consul-core merged 1 commit intorelease/2.0.xfrom
backport/spulluri/fixing_bugs_dp/lately-tolerant-termite
Apr 29, 2026
Merged

Backport of fix: prevent graceful shutdown errors from killing Envoy proxy immediately into release/2.0.x#1075
hc-github-team-consul-core merged 1 commit intorelease/2.0.xfrom
backport/spulluri/fixing_bugs_dp/lately-tolerant-termite

Conversation

@hc-github-team-consul-core
Copy link
Copy Markdown
Collaborator

Backport

This PR is auto-generated from #1072 to be assessed for backporting due to the inclusion of the label backport/2.0.

The below text is copied from the body of the original PR.


Problem

During graceful shutdown, any error from DumpConfig(), Drain(), or Quit() calls close(errorExitCh). This channel is monitored by the Run() select loop in consul_dataplane.go:

case <-cdp.lifecycleConfig.lifecycleServerExited():
    // Calls Quit() + Kill() immediately

This means a transient error during shutdown (e.g., Envoy not yet ready, network timeout, IPv6 URL issue) bypasses the entire grace period and kills Envoy at 0ms — exactly the symptom seen in TestConnectInject_ProxyLifecycleShutdown acceptance test failures where curl reports Connection refused immediately.

Sequence of events (before fix)

SIGTERM → gracefulShutdown()
  → DumpConfig() fails (any reason)
  → close(errorExitCh)                    ← fires immediately
  → Run() select: lifecycleServerExited()
  → proxy.Quit() + proxy.Kill()           ← Envoy killed at 0ms
  → all connections refused instantly

Root cause

errorExitCh is intended to signal that the lifecycle server itself has crashed (e.g., ListenAndServe fails). It should NOT be closed for transient errors during an already-in-progress graceful shutdown. The shutdown path should log errors and continue — the grace period must be honored regardless.

Fix

Removed close(errorExitCh) from all three error paths in gracefulShutdown():

  • DumpConfig() error
  • Drain() error
  • Quit() error

Errors are still logged as warnings. The shutdown continues through the full grace period as intended.

Testing

Added 3 unit tests in lifecycle_test.go that reproduce the bug deterministically:

Test Verifies
TestGracefulShutdown_DumpConfigError_DoesNotKillProxy DumpConfig() failure does not close errorExitCh
TestGracefulShutdown_DrainError_DoesNotKillProxy Drain() failure does not close errorExitCh
TestGracefulShutdown_QuitError_DoesNotKillProxy Quit() failure does not close errorExitCh

Also fixed pre-existing issues in the test mock:

  • mockProxy methods now return error fieldsDrain(), Quit(), DumpConfig() previously always returned nil, meaning error paths were never exercised
  • Fixed data race — mock counters changed from int to sync/atomic.Int32 to eliminate races flagged by go test -race

Related PRs

All three are needed together to fully fix TestConnectInject_ProxyLifecycleShutdown.


Overview of commits

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto approved Consul Bot automated PR

@hc-github-team-consul-core hc-github-team-consul-core merged commit 289b031 into release/2.0.x Apr 29, 2026
38 checks passed
panman90 added a commit that referenced this pull request Apr 29, 2026
* Backport of fix: prevent graceful shutdown errors from killing Envoy proxy immediately into release/2.0.x (#1075)

backport of commit 03475d3

Co-authored-by: santoshpulluri <santosh.pulluri@hashicorp.com>

* Backport of fix: use net.JoinHostPort for remaining Envoy admin URLs into release/2.0.x (#1071)

backport of commit fa7de22

Co-authored-by: santoshpulluri <santosh.pulluri@hashicorp.com>

---------

Co-authored-by: hc-github-team-consul-core <github-team-consul-core@hashicorp.com>
Co-authored-by: santoshpulluri <santosh.pulluri@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants