xDS: RDS unmarshal skips weight=0 clusters causing ADS validation failures #8865

@flyingyang

  • What version of gRPC are you using?
    google.golang.org/grpc v1.75.1

  • What version of Go are you using (go version)?
    go version go1.24.7 linux/amd64

  • What operating system (Linux, Windows, …) and version?
    Linux (Kubernetes containerized environment, Alpine/Debian base images)

  • What did you do?
    I'm implementing Blue-Green deployments using gRPC xDS over ADS (Aggregated Discovery Service) with weighted clusters. When the RDS response contains a weight=0 cluster for complete traffic isolation (a 100:0 or 0:100 split), new gRPC client pods fail to start.

  Configuration:
  // RDS RouteConfiguration sent by control plane
  weighted_clusters {
    clusters {
      name: "peakbench-dev/peakbench-blue"
      weight { value: 100 }
    }
    clusters {
      name: "peakbench-dev/peakbench-green"
      weight { value: 0 }  // Weight=0 for complete isolation
    }
  }

Client behavior:

  1. Client receives RDS with both clusters (blue: 100, green: 0)
  2. Client parses RDS and extracts cluster names for CDS subscription
  3. Client requests CDS resources
  • What did you expect to see?

    According to the Envoy API documentation for WeightedCluster.ClusterWeight (https://javadoc.io/static/io.envoyproxy.controlplane/api/0.1.24/io/envoyproxy/envoy/api/v2/route/WeightedCluster.ClusterWeight.html), the cluster weight is "An integer between 0 and total_weight", so weight=0 is explicitly allowed.

    Expected behavior:

    1. Client should include both clusters in route.WeightedClusters map (including weight=0)
    2. Client should request CDS for: ["peakbench-dev/peakbench-blue", "peakbench-dev/peakbench-green"]
    3. ADS consistency validation should pass
    4. weighted_target load balancer should handle weight=0 correctly (0% traffic to that cluster)
    5. Client connection should succeed with READY state
  • What did you see instead?

    Actual behavior:

    From client debug logs:
    I0129 01:06:50 [OnStreamResponse] RouteConfiguration:
    clusters:{name:"peakbench-dev/peakbench-green" weight:{}} ← weight=0
    clusters:{name:"peakbench-dev/peakbench-blue" weight:{value:100}}

    I0129 01:06:50 [grpc-debug] CDS Request: resourceNames=[peakbench-dev/peakbench-blue] ← Only blue!

    W0129 01:06:50 [go-control-plane] ADS mode: not responding to request
    type.googleapis.com/envoy.config.cluster.v3.Cluster[peakbench-dev/peakbench-blue]:
    "peakbench-dev/peakbench-green" not listed

    Result: Client connection stuck in TRANSIENT_FAILURE state indefinitely.

  • Root cause analysis:

    In xds/internal/xdsclient/xdsresource/unmarshal_rds.go lines 321-323:

  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      if w == 0 {
          continue  // ← PROBLEM: Skips weight=0 clusters
      }
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Never added for weight=0
  }

Because weight=0 clusters are skipped:

  1. They're not added to route.WeightedClusters map
  2. Client doesn't know to request them from CDS
  3. go-control-plane's ADS superset validation fails (cluster in snapshot but not in client request)
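
The three consequences above can be reproduced with a small self-contained sketch. It does not use the real gRPC or go-control-plane types: clusterWeight, cdsNames, and missing are hypothetical helpers, and the superset check is a simplified model of the behavior seen in the go-control-plane log, not its actual code.

```go
package main

import "fmt"

// clusterWeight is a hypothetical, simplified view of one RDS
// weighted_clusters entry.
type clusterWeight struct {
	name   string
	weight uint32
}

// cdsNames returns the cluster names a client would subscribe to.
// With skipZero set, it mimics the `if w == 0 { continue }` branch.
func cdsNames(clusters []clusterWeight, skipZero bool) []string {
	var names []string
	for _, c := range clusters {
		if skipZero && c.weight == 0 {
			continue
		}
		names = append(names, c.name)
	}
	return names
}

// missing lists snapshot resources that the request does not name.
// A non-empty result models the "not listed" warning in ADS mode.
func missing(snapshot, requested []string) []string {
	seen := make(map[string]struct{}, len(requested))
	for _, n := range requested {
		seen[n] = struct{}{}
	}
	var out []string
	for _, n := range snapshot {
		if _, ok := seen[n]; !ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	route := []clusterWeight{
		{name: "peakbench-dev/peakbench-blue", weight: 100},
		{name: "peakbench-dev/peakbench-green", weight: 0},
	}
	snapshot := []string{"peakbench-dev/peakbench-blue", "peakbench-dev/peakbench-green"}

	fmt.Println("skip zero:", missing(snapshot, cdsNames(route, true)))
	fmt.Println("keep zero:", missing(snapshot, cdsNames(route, false)))
}
```

With the skip enabled, the green cluster is reported as missing from the request; with the skip removed, nothing is missing.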

Impact:

  • ❌ Cannot cold-start gRPC clients with 100:0 or 0:100 traffic configurations

  • ✅ Running clients can transition TO/FROM 100:0 smoothly (because they already have the clusters cached)

  • ✅ HTTP/Envoy clients work fine at 100:0 (Envoy correctly handles weight=0)

  • Impact and why this is critical:

    Without this fix, we face a dilemma between cold start and smooth rollback:

    Choice A: Server-side skip weight=0 in RDS (single-cluster approach)
    100:0 → RDS: [blue:100] (only blue cluster)
    0:100 → RDS: [green:100] (only green cluster)

    • ✅ Cold start works (no ADS validation error)
    • ❌ Transitions cause data plane disruption:
      • When rolling back from 0:100 → 100:0, RDS changes from [green] to [blue]
      • weighted_target balancer removes green sub-balancer and adds blue sub-balancer
      • Causes connection draining and reconnection = data plane impact during emergency rollback
      • Fast rollback becomes impossible - defeats the purpose of Blue-Green deployment

    Choice B: Server-side include weight=0 in RDS (dual-cluster approach)
    100:0 → RDS: [blue:100, green:0] (both clusters)
    0:100 → RDS: [blue:0, green:100] (both clusters)

    • ✅ Transitions are smooth (weighted_target keeps all sub-balancers, only weights change)
    • ✅ Fast rollback works - critical for production incidents
    • ❌ Cold start fails (gRPC client skips weight=0 in unmarshal, ADS validation fails)
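
    The difference between Choice A and Choice B during a rollback can be sketched by diffing the child set between two route configurations. The childDiff type and the diffChildren function are hypothetical and only model which children would be added, removed, or reweighted; they are not part of gRPC.

```go
package main

import (
	"fmt"
	"sort"
)

// childDiff describes how the set of weighted children changes
// between two route configurations (hypothetical helper type).
type childDiff struct {
	added, removed, reweighted []string
}

// diffChildren compares cluster-name -> weight maps from the old and
// new configuration.
func diffChildren(oldCfg, newCfg map[string]uint32) childDiff {
	var d childDiff
	for name, w := range newCfg {
		oldW, ok := oldCfg[name]
		switch {
		case !ok:
			d.added = append(d.added, name)
		case oldW != w:
			d.reweighted = append(d.reweighted, name)
		}
	}
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			d.removed = append(d.removed, name)
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(d.added)
	sort.Strings(d.removed)
	sort.Strings(d.reweighted)
	return d
}

func main() {
	// Choice A: rollback from 0:100 to 100:0 with single-cluster RDS.
	a := diffChildren(
		map[string]uint32{"green": 100},
		map[string]uint32{"blue": 100},
	)
	fmt.Printf("A: added=%v removed=%v reweighted=%v\n", a.added, a.removed, a.reweighted)

	// Choice B: same rollback with both clusters always present.
	b := diffChildren(
		map[string]uint32{"blue": 0, "green": 100},
		map[string]uint32{"blue": 100, "green": 0},
	)
	fmt.Printf("B: added=%v removed=%v reweighted=%v\n", b.added, b.removed, b.reweighted)
}
```

    Under Choice A the diff contains one added and one removed child, which corresponds to the sub-balancer churn described above. Under Choice B only weights change.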

Why we need both to work:

  • Smooth rollback (dual-cluster) is critical for production reliability

  • Cold start (weight=0 support) is required for pod restarts, scaling, and new deployments

  • The bug forces us to choose between operational safety and deployment capability

  • Workaround applied:

    We patched the vendored gRPC code to remove the if w == 0 { continue } logic:

  // PATCH: Include weight=0 clusters for ADS consistency
  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      // Don't skip weight=0 - it's valid per xDS spec
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Added even for weight=0
  }

  // PATCH: Validate cluster count instead of totalWeight
  if len(route.WeightedClusters) == 0 {
      return nil, nil, fmt.Errorf("route %+v, action %+v, has no cluster in WeightedCluster action", r, a)
  }

After the patch:

  • ✅ Client requests CDS for both clusters (including weight=0)
  • ✅ ADS validation passes
  • ✅ weighted_target correctly routes 0% traffic to weight=0 cluster
  • ✅ Cold start succeeds at 100:0 and 0:100
  • ✅ Fast rollback works without data plane impact
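
For reference, the patched logic can be exercised in isolation with the sketch below. It uses a hypothetical clusterWeight type and parseWeighted function rather than the real unmarshal_rds.go types, so it only demonstrates the shape of the change: zero-weight clusters are kept, and validation is based on cluster count.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterWeight is a hypothetical, simplified RDS weighted cluster entry.
type clusterWeight struct {
	name   string
	weight uint32
}

// parseWeighted mirrors the patched loop: every cluster is recorded,
// including weight=0, and the route is rejected only if it names no
// cluster at all.
func parseWeighted(clusters []clusterWeight) (map[string]uint32, uint64, error) {
	weighted := make(map[string]uint32, len(clusters))
	var total uint64
	for _, c := range clusters {
		total += uint64(c.weight) // weight=0 contributes nothing to the total
		weighted[c.name] = c.weight
	}
	if len(weighted) == 0 {
		return nil, 0, errors.New("no cluster in WeightedCluster action")
	}
	return weighted, total, nil
}

func main() {
	got, total, err := parseWeighted([]clusterWeight{
		{name: "peakbench-dev/peakbench-blue", weight: 100},
		{name: "peakbench-dev/peakbench-green", weight: 0},
	})
	fmt.Println(len(got), total, err) // 2 100 <nil>

	_, _, err = parseWeighted(nil)
	fmt.Println(err)
}
```

Note that in this sketch a route whose clusters all have weight 0 passes the count check with a total weight of 0; whether such a route should still be rejected is a separate decision that the sketch does not make.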
