Native-sidecar enabled worker nodes become bad hosts #26280

@kevintang2022

Description

When the config flag "native-sidecar=true" is set on worker machines, the coordinator can continue trying to send queries to a worker even after that worker is no longer part of the cluster. The issue is not easy to reproduce, but it can happen when a worker node is moved out of the cluster. The coordinator should ignore the old worker because it is not in the worker tier, but because the worker also announces itself as a coordinatorSidecar, the nodeStatusService check is bypassed and the coordinator does not ignore that worker.

The cause appears to be the logic that filters the relevant nodes in DiscoveryNodeManager.

With this logic, a sidecar-enabled worker node that announces itself is treated by the coordinator as available and part of the cluster. If the worker node is no longer part of the cluster, this can lead to query failures.
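To make the bypass concrete, here is a minimal, self-contained sketch of the kind of filter described above. It is not the actual DiscoveryNodeManager code: the class `SidecarFilterSketch`, the `Announcement` type, the `isAllowed` stand-in for the nodeStatusService check, and the host names are all hypothetical, and the exact shape of the real predicate is an assumption.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only; every name here is hypothetical, not Presto's API.
public class SidecarFilterSketch {
    // Hypothetical stand-in for what a node announces to discovery.
    static final class Announcement {
        final String host;
        final boolean coordinator;
        final boolean coordinatorSidecar;

        Announcement(String host, boolean coordinator, boolean coordinatorSidecar) {
            this.host = host;
            this.coordinator = coordinator;
            this.coordinatorSidecar = coordinatorSidecar;
        }
    }

    // Hypothetical stand-in for the nodeStatusService check:
    // a host is allowed only if it is currently in the worker tier.
    static boolean isAllowed(Set<String> workerTier, Announcement a) {
        return workerTier.contains(a.host);
    }

    // Assumed shape of the filter: nodes announcing a coordinator-like role
    // skip the tier check, so a stale sidecar-enabled worker still passes.
    static boolean passesFilter(Set<String> workerTier, Announcement a) {
        return isAllowed(workerTier, a) || a.coordinator || a.coordinatorSidecar;
    }

    // Hosts the coordinator would still consider part of the cluster.
    static List<String> relevantHosts(Set<String> workerTier, List<Announcement> announcements) {
        return announcements.stream()
                .filter(a -> passesFilter(workerTier, a))
                .map(a -> a.host)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> workerTier = Set.of("worker-1");
        List<Announcement> announcements = List.of(
                new Announcement("worker-1", false, false), // still in the tier
                new Announcement("worker-2", false, true),  // moved out, sidecar enabled
                new Announcement("worker-3", false, false)); // moved out, no sidecar

        // prints [worker-1, worker-2]
        System.out.println(relevantHosts(workerTier, announcements));
    }
}
```

In this sketch the stale plain worker (`worker-3`) is filtered out, while the stale sidecar-enabled worker (`worker-2`) stays in the list. That matches the behavior reported here.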

Because of this issue, the logic in DiscoveryNodeManager should be revisited or tweaked so that this does not happen.

Your Environment

  • Presto version used:
  • Storage (HDFS/S3/GCS..):
  • Data source and connector used:
  • Deployment (Cloud or On-prem):
  • Pastebin link to the complete debug logs:

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Screenshots (if appropriate)

Context

Metadata

Labels

Type

No type

Projects

Status

✅ Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests