Skip to content

Handle nsx-proxy 403 response during manager recovery#1377

Open
zhengxiexie wants to merge 1 commit intovmware-tanzu:mainfrom
zhengxiexie:topic/zhengxie/main/fix_3638737_2
Open

Handle nsx-proxy 403 response during manager recovery#1377
zhengxiexie wants to merge 1 commit intovmware-tanzu:mainfrom
zhengxiexie:topic/zhengxie/main/fix_3638737_2

Conversation

@zhengxiexie
Copy link
Contributor

@zhengxiexie zhengxiexie commented Feb 11, 2026

✨ What's Changed

NSX Proxy 403 Forbidden Handling

  • Add NsxProxyForbiddenError type for detecting nsx-proxy HTML 403 responses during manager recovery
  • Register NsxProxyForbiddenError as a ground trigger to enable endpoint failover
  • Detect 403 non-JSON responses in InitErrorFromResponse and return the new error type
  • Add truncateBodyForLogging helper for safe body preview in logs

VIP Endpoint Priority

  • Prefer VIP endpoint (endpoints[0]) in selectEndpoint when it is not DOWN
  • Only fallback to individual manager endpoints when VIP is unavailable
  • Add log message when falling back from VIP to manager endpoint

Implementation Details

  • Added NsxProxyForbiddenError struct embedding managerErrorImpl in errors.go
  • Added CreateNsxProxyForbiddenError(host, bodyPreview) constructor
  • Updated groundTriggers to include "NsxProxyForbiddenError" alongside ConnectionError and Timeout
  • In InitErrorFromResponse, when JSON parsing fails for a 403 response with non-empty body, return NsxProxyForbiddenError instead of GeneralManagerError
  • In selectEndpoint, check VIP first; iterate only endpoints[1:] for fallback selection

🎯 Motivation

When NSX manager is recovering, nsx-proxy may return 403 Forbidden with an HTML body instead of JSON. The operator's InitErrorFromResponse treated this as a GeneralManagerError, which is not a ground trigger — so the operator kept retrying the same unhealthy endpoint without failover.

Issue: VPC cleanup hangs during supervisor disable because the operator is stuck on the recovering manager endpoint.

Solution:

  1. Recognize 403 non-JSON responses as NsxProxyForbiddenError (ground trigger) → endpoint marked DOWN → failover
  2. Prefer VIP endpoint in selectEndpoint so traffic routes through VIP when available, falling back to individual managers only when VIP is DOWN

✅ Testing

Unit Tests

  • All existing unit tests pass (make test ✅)
  • Added comprehensive unit test coverage:
    • TestCreateFunc - Validates CreateNsxProxyForbiddenError constructor via reflection
    • TestShouldGroundPoint - Confirms NsxProxyForbiddenError triggers endpoint failover
    • TestInitErrorFromResponse_403NonJSON - 5 test cases:
      • 403 HTML body → NsxProxyForbiddenError (ground trigger ✅, not retriable ✅)
      • 403 plain text body → NsxProxyForbiddenError
      • 403 JSON body → BadXSRFToken (existing behavior preserved)
      • 403 empty body → nil (no error)
      • 500 HTML body → not NsxProxyForbiddenError (only 403 triggers)
    • TestTruncateBodyForLogging - Short, exact-length, and long body truncation

🔄 Backward Compatibility

This change is fully backward compatible:

  • 403 responses with valid JSON body continue to be handled as before (e.g. BadXSRFToken)
  • Only 403 responses with non-JSON body are affected (previously GeneralManagerError, now NsxProxyForbiddenError)
  • selectEndpoint behavior is unchanged when only one endpoint exists; VIP priority only applies to multi-endpoint configurations

@zhengxiexie zhengxiexie force-pushed the topic/zhengxie/main/fix_3638737_2 branch 4 times, most recently from 154027a to cb299f4 Compare February 11, 2026 13:23
@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.66%. Comparing base (0334455) to head (d3c32e1).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1377      +/-   ##
==========================================
+ Coverage   76.65%   76.66%   +0.01%     
==========================================
  Files         151      151              
  Lines       21102    21120      +18     
==========================================
+ Hits        16175    16191      +16     
- Misses       3773     3774       +1     
- Partials     1154     1155       +1     
Flag Coverage Δ
unit-tests 76.66% <100.00%> (+0.01%) ⬆️
Files with missing lines Coverage Δ
pkg/nsx/transport.go 77.64% <100.00%> (-1.37%) ⬇️
pkg/nsx/util/errors.go 91.63% <100.00%> (+0.17%) ⬆️
pkg/nsx/util/utils.go 88.68% <100.00%> (+0.69%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

When NSX manager is recovering, nsx-proxy may return 403 Forbidden with
an HTML body instead of JSON. Previously this was treated as a generic
manager error, causing the operator to keep retrying the same endpoint.

Add NsxProxyForbiddenError as a ground trigger so the endpoint is marked
DOWN and failover to a healthy endpoint occurs. Also prefer VIP endpoint
in selectEndpoint and only fallback to individual managers when VIP is DOWN.

Bug: 3638737
Change-Id: I6007fa3bc80acdb9fa13d317c616b78b3d833fa8
@zhengxiexie zhengxiexie force-pushed the topic/zhengxie/main/fix_3638737_2 branch from cb299f4 to d3c32e1 Compare February 11, 2026 14:02
@zhengxiexie zhengxiexie changed the title Handle nsx-proxy 403 HTML response during manager recovery Handle nsx-proxy 403 response during manager recovery Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants