Skip to content

fix(active-checks): use HTTP/1.1 for health check probes with automatic version negotiation#176

Merged
locao merged 4 commits intomasterfrom
walker.zhao/FTI-7389
Apr 8, 2026
Merged

fix(active-checks): use HTTP/1.1 for health check probes with automatic version negotiation#176
locao merged 4 commits intomasterfrom
walker.zhao/FTI-7389

Conversation

@findns94
Copy link
Copy Markdown
Contributor

@findns94 findns94 commented Mar 24, 2026

Summary

Upgrade active health check probes from HTTP/1.0 to HTTP/1.1 with automatic version negotiation for backward compatibility.

Problem

Active health checks previously sent HTTP/1.0 requests. Modern upstream servers that only support HTTP/1.1+ respond with 426 Upgrade Required, which is not in the default healthy or unhealthy status lists — causing health checks to silently become no-ops. Users had to work around this by adding 426 to checks.active.healthy.http_statuses.

Solution

  • Default health check probes now use HTTP/1.1 with Connection: close (probes are one-shot)
  • Automatic version negotiation handles backward compatibility:
    • 505 (HTTP Version Not Supported): retries with the other HTTP version
    • 426 (Upgrade Required) on HTTP/1.0: retries with HTTP/1.1
  • Per-target in-memory caching of the working HTTP version minimizes retry overhead (only the first probe per target may trigger a retry)
  • Self-healing: if a server changes its supported HTTP version (e.g., upgrades from HTTP/1.0-only to HTTP/1.1-only), the cache updates automatically on the next 505/426 response
  • Retry triggers respect user configuration — if 505 or 426 is configured as a healthy status, no retry occurs

Behavior by server type

Server Before After
HTTP/1.1+ (most servers) 200 via HTTP/1.0 200 via HTTP/1.1
HTTP/1.1 + HTTP/2 only 426 → silently ignored 200 via HTTP/1.1
HTTP/1.0 only 200 via HTTP/1.0 505 → auto-fallback → 200 via HTTP/1.0 (cached)
Genuinely unhealthy (500) 500 → unhealthy 500 → unhealthy (no retry)

Refactoring

run_single_check was decomposed into focused helpers for maintainability:

  • build_http_headers() — caches serialized user-configured headers
  • establish_connection() — TCP connect + optional TLS handshake
  • probe_http() — sends request, returns status code
  • negotiate_http_version() — retry logic + version caching

Test plan

  • prove -I. -r t/with_resty-events/20-http-version-negotiation.t — 9 tests covering: 505 auto-fallback, 426 auto-upgrade, version caching, permanent failure, normal HTTP/1.1, 426 on HTTP/1.1 (no retry), non-standard server (no retry), genuinely unhealthy (no retry), failed retry reports original status
  • prove -I. -r t/with_worker-events/20-http-version-negotiation.t — mirror tests for worker-events backend
  • prove -I. -r t/with_resty-events/18-req-headers.t — updated to expect HTTP/1.1 format
  • prove -I. -r t/with_worker-events/18-req-headers.t — mirror

Fix FTI-7389

Previously, active health check probes sent HTTP/1.0 requests, which
caused upstream servers that only support HTTP/1.1 to respond with
426 Upgrade Required, incorrectly marking healthy targets as unhealthy.

Switch the default probe request from HTTP/1.0 to HTTP/1.1 with
Connection: close header. HTTP/1.1 is backward-compatible with all
servers that accept HTTP/1.0, so this change requires no configuration
updates from users.

Refactored run_single_check() into focused helpers:
- build_http_headers(): builds and caches serialized header string
- establish_connection(): TCP connect + optional TLS handshake
- probe_http(): sends HTTP/1.1 GET and reports result

FTI-7389

Signed-off-by: Walker Zhao <walker.zhao@konghq.com>
@findns94 findns94 self-assigned this Mar 24, 2026
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 24, 2026

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 24, 2026

Luacheck Report

1 tests   1 ✅  0s ⏱️
1 suites  0 💤
1 files    0 ❌

Results for commit d7244d4.

♻️ This comment has been updated with latest results.

@findns94 findns94 changed the title fix(active-checks): use HTTP/1.1 for health check probes fix(active-checks): use HTTP/1.1 for health check probes with auto-fallback Mar 25, 2026
@findns94 findns94 requested review from bungle, Copilot and windmgc March 25, 2026 09:14
Comment thread lib/resty/healthcheck.lua Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the active HTTP healthcheck probe behavior to default to HTTP/1.1 (with Connection: close) and adds automatic HTTP/1.0/1.1 fallback/upgrade logic, with per-target in-memory caching of the working HTTP version.

Changes:

  • Refactors checker:run_single_check() into helper functions and introduces HTTP version negotiation + caching.
  • Updates existing request-header tests to expect HTTP/1.1 request format and Connection: close.
  • Adds new negotiation test suites (for both worker-events and resty-events) covering 505/426 fallback, caching, and failure cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
lib/resty/healthcheck.lua Implements HTTP/1.1-by-default probing, retry negotiation logic, and caching; refactors probe flow into helpers.
t/with_worker-events/18-req-headers.t Updates expected probe request line/headers to HTTP/1.1 + Connection: close.
t/with_resty-events/18-req-headers.t Same header expectation updates for the resty-events variant.
t/with_worker-events/20-http-version-negotiation.t New test coverage for HTTP version negotiation and caching with worker-events.
t/with_resty-events/20-http-version-negotiation.t New test coverage for HTTP version negotiation and caching with resty-events.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/resty/healthcheck.lua Outdated
…llback

Previously, active health check probes sent HTTP/1.0 requests. Upstream
servers that only support HTTP/1.1 would respond with 426 (Upgrade
Required), which is not in the default healthy/unhealthy status lists,
causing health checks to silently become no-ops for those targets.

Switch the default probe request from HTTP/1.0 to HTTP/1.1 with a
Connection: close header. Add bidirectional version auto-detection in
run_single_check() that automatically negotiates the HTTP version:

- On 505 (HTTP Version Not Supported): retry with the other version
- On 426 (Upgrade Required) while using HTTP/1.0: retry with HTTP/1.1
- On any non-healthy status with no cached version: retry with the
  other version to handle non-standard server implementations

The working HTTP version is cached per-target in memory to avoid
repeated retries. The cache self-heals when servers change their
supported HTTP version.

Refactored run_single_check() into focused helpers:
- build_http_headers(): builds and caches serialized header string
- establish_connection(): TCP connect + optional TLS handshake
- probe_http(): sends HTTP request and returns status code

FTI-7389

Signed-off-by: Walker Zhao <walker.zhao@konghq.com>
@findns94 findns94 force-pushed the walker.zhao/FTI-7389 branch from 50201fa to e862fab Compare March 25, 2026 09:57
@findns94 findns94 changed the title fix(active-checks): use HTTP/1.1 for health check probes with auto-fallback fix(active-checks): use HTTP/1.1 for health check probes with automatic version negotiation Mar 26, 2026
@findns94 findns94 requested a review from Copilot March 26, 2026 08:14
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/resty/healthcheck.lua Outdated
Comment thread lib/resty/healthcheck.lua Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/resty/healthcheck.lua Outdated
Comment thread lib/resty/healthcheck.lua Outdated
Comment on lines +1156 to +1159
-- 3. Any non-healthy + no cache -> try the other (first-probe discovery)
local should_retry = (status == 505)
or (status == 426 and http_version == "1.0")
or (not is_healthy and not has_cached_version)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The should_retry condition retries once on any non-healthy status when there is no cached version (or (not is_healthy and not has_cached_version)). This means first probes for genuinely unhealthy targets (e.g. 500/502/503) will still incur an extra connection, which contradicts the PR description/overhead table that claims 0 extra connections for genuinely unhealthy responses. Either update the PR description to match the implemented behavior, or tighten the retry condition (e.g., don’t retry when the status is already in the configured unhealthy.http_statuses).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@locao locao Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the behavior changed to retry on any non-healthy status?

In my opinion we should only retry on 426 or 505 statuses if user didn't configure those statuses as healthy statuses.

Suggested change
-- 3. Any non-healthy + no cache -> try the other (first-probe discovery)
local should_retry = (status == 505)
or (status == 426 and http_version == "1.0")
or (not is_healthy and not has_cached_version)
-- 3. 426 and 505 are considered unhealthy statuses + no cache -> try the other (first-probe discovery)
local should_retry = (not is_healthy and not has_cached_version) and
((status == 505) or
(status == 426 and http_version == "1.0"))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed has_cached_version and refactor the condition here. Now it will only retry when it is unhealthy and status code is 505 or 426, please review this PR again, thanks!
The commit is d7244d4.

Comment thread lib/resty/healthcheck.lua
…y retry logic

Extract the HTTP version negotiation logic from run_single_check() into
a dedicated negotiate_http_version() local function, reducing
run_single_check() from ~85 lines to ~20 lines with flat control flow.

Remove the unknown-status retry trigger and http_version_locked flag,
keeping only 3 clear retry triggers: 505, 426-on-1.0, and
non-healthy-with-no-cache. Unknown status codes are silently ignored
by report_http_status(), matching the original behavior.

Fix bad status line handling: probe_http() now returns 0 instead of
nil for malformed HTTP responses, preserving the original behavior
where report_http_status() treats 0 as unhealthy. Previously, the
refactor caused these to be silently ignored.

FTI-7389

Signed-off-by: Walker Zhao <walker.zhao@konghq.com>
@findns94 findns94 force-pushed the walker.zhao/FTI-7389 branch from a33e104 to e71b857 Compare March 26, 2026 09:19
…426 status codes

Remove overly broad trigger that retried on any non-healthy status for
uncached targets, which caused unnecessary retry connections for
genuinely unhealthy servers (e.g., 500). Now only standard HTTP version
negotiation codes (505 HTTP Version Not Supported, 426 Upgrade Required)
trigger a retry, gated on `not is_healthy` to respect user-configured
healthy status lists while preserving self-healing for cached targets.

Signed-off-by: Walker Zhao <walker.zhao@konghq.com>
@findns94 findns94 requested review from bungle and locao April 8, 2026 09:21
@locao locao merged commit 792aba2 into master Apr 8, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants