Changes from 12 commits
24 commits
e533fa9
[Serve] Add HAProxy support for Ray Serve
eicherseiji Jan 29, 2026
60f7ab1
Add missing DRAINING_MESSAGE constant and BUILD targets for HAProxy t…
eicherseiji Jan 29, 2026
764dd04
Update BUILD files for HAProxy tests
eicherseiji Jan 29, 2026
856d1eb
Add missing NO_ROUTES_MESSAGE and NO_REPLICAS_MESSAGE constants
eicherseiji Jan 30, 2026
9d7a87e
Fix HAProxyManager not being used when RAY_SERVE_ENABLE_HAPROXY=1
eicherseiji Jan 30, 2026
20b6005
Add HAProxy to Dockerfile and _dump_ingress_cache_for_testing method
eicherseiji Jan 30, 2026
338fa19
Install HAProxy in serve CI build image
eicherseiji Jan 30, 2026
35de30c
Build HAProxy 2.8.12 from source in serve CI image
eicherseiji Jan 30, 2026
b9c72fa
Remove redundant pytestmark skipif from test_metrics_haproxy.py
eicherseiji Feb 2, 2026
76f508f
Improve HAProxy Dockerfile setup
eicherseiji Feb 2, 2026
2ec2991
Rename RAY_SERVE_ENABLE_HAPROXY to RAY_SERVE_ENABLE_HA_PROXY
eicherseiji Feb 2, 2026
4918537
Move HAProxy constants after DIRECT_INGRESS
eicherseiji Feb 2, 2026
a8c5c15
Merge branch 'master' into serve-haproxy-port
eicherseiji Feb 2, 2026
4bbf637
Fix Docker BASE_IMAGE error for HAProxy stage
eicherseiji Feb 3, 2026
a3b8e8c
Restore ARG defaults for backwards compatibility
eicherseiji Feb 3, 2026
a4982b1
Add USER root to haproxy-builder stage
eicherseiji Feb 3, 2026
99a101c
Move HAPRoxy build to base-deps
eicherseiji Feb 6, 2026
efb886a
Merge branch 'master' of https://github.com/ray-project/ray into serv…
eicherseiji Feb 6, 2026
abbb2ec
Merge branch 'master' into serve-haproxy-port
eicherseiji Feb 7, 2026
3e61b05
Hoist imports
eicherseiji Feb 7, 2026
31f90f2
Merge branch 'serve-haproxy-port' of https://github.com/eicherseiji/r…
eicherseiji Feb 7, 2026
8cc4bce
Leave HAProxyManager lazy
eicherseiji Feb 7, 2026
58a9d3c
Fix circular import between default_impl and proxy
eicherseiji Feb 7, 2026
d344c87
Add comment
eicherseiji Feb 7, 2026
40 changes: 40 additions & 0 deletions ci/docker/serve.build.Dockerfile
Contributor Author

Identical HAProxy build steps

@@ -11,6 +11,46 @@ SHELL ["/bin/bash", "-ice"]

COPY . .

# Install HAProxy from source
RUN <<EOF
#!/bin/bash
set -euo pipefail

# Install HAProxy dependencies
sudo apt-get update && sudo apt-get install -y \
build-essential \
curl \
libc6-dev \
liblua5.3-dev \
libpcre3-dev \
libssl-dev \
socat \
wget \
zlib1g-dev \
&& sudo rm -rf /var/lib/apt/lists/*

# Create haproxy user and group
sudo groupadd -r haproxy
sudo useradd -r -g haproxy haproxy

# Download and compile HAProxy from official source
HAPROXY_VERSION="2.8.12"
HAPROXY_BUILD_DIR="$(mktemp -d)"
wget -O "${HAPROXY_BUILD_DIR}/haproxy.tar.gz" "https://www.haproxy.org/download/2.8/src/haproxy-${HAPROXY_VERSION}.tar.gz"
tar -xzf "${HAPROXY_BUILD_DIR}/haproxy.tar.gz" -C "${HAPROXY_BUILD_DIR}" --strip-components=1
make -C "${HAPROXY_BUILD_DIR}" TARGET=linux-glibc USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1 USE_LUA=1 USE_PROMEX=1
sudo make -C "${HAPROXY_BUILD_DIR}" install
rm -rf "${HAPROXY_BUILD_DIR}"
Comment on lines +37 to +43
Collaborator

maybe have a script for this rather than duplicating this everywhere?

also, can we get the binaries prebuilt rather than building from source?

Contributor Author

Makes sense. Okay with you if we dedupe/incorporate prebuilt binary in a follow-up PR?


# Create HAProxy directories
sudo mkdir -p /etc/haproxy /run/haproxy /var/log/haproxy
sudo chown -R haproxy:haproxy /run/haproxy

# Allow the ray user to manage HAProxy files without password
echo "ray ALL=(ALL) NOPASSWD: /bin/cp * /etc/haproxy/*, /bin/touch /etc/haproxy/*, /usr/local/sbin/haproxy*" | sudo tee /etc/sudoers.d/haproxy-ray

EOF

RUN <<EOF
#!/bin/bash

55 changes: 55 additions & 0 deletions docker/ray/Dockerfile
Contributor Author

Identical HAProxy build steps

Collaborator

why are we duplicating this here?

Collaborator

if we are moving it into all (non-slim) ray images, you need to:

  • move this code into base-deps
  • remove the duplicated code in base-extra

this dockerfile is strictly preserved for installing ray wheel as the last step of image building, nothing else is allowed

Contributor Author

Thanks, moved to base-deps

@@ -2,8 +2,63 @@

ARG BASE_IMAGE
ARG FULL_BASE_IMAGE=rayproject/ray-deps:nightly"$BASE_IMAGE"

# --- HAProxy Build Stage ---
FROM $BASE_IMAGE AS haproxy-builder

RUN <<EOF
#!/bin/bash
set -euo pipefail

apt-get update -y
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
curl \
libc6-dev \
liblua5.3-dev \
libpcre3-dev \
libssl-dev \
zlib1g-dev
rm -rf /var/lib/apt/lists/*

# Install HAProxy from source
HAPROXY_VERSION="2.8.12"
BUILD_DIR=$(mktemp -d)
curl -sSfL -o "${BUILD_DIR}/haproxy.tar.gz" "https://www.haproxy.org/download/2.8/src/haproxy-${HAPROXY_VERSION}.tar.gz"
tar -xzf "${BUILD_DIR}/haproxy.tar.gz" -C "${BUILD_DIR}" --strip-components=1
make -C "${BUILD_DIR}" TARGET=linux-glibc USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1 USE_LUA=1 USE_PROMEX=1 -j$(nproc)
make -C "${BUILD_DIR}" install SBINDIR=/usr/local/bin
rm -rf "${BUILD_DIR}"

EOF

# --- Return to main image ---
FROM $FULL_BASE_IMAGE

# Switch to root for system-level HAProxy setup
USER root

# Copy HAProxy binary from builder stage
COPY --from=haproxy-builder /usr/local/bin/haproxy /usr/local/bin/haproxy

# Install HAProxy runtime dependency and setup
RUN <<EOF
#!/bin/bash
set -euo pipefail

apt-get update -y
apt-get install -y --no-install-recommends socat liblua5.3-0

mkdir -p /etc/haproxy /run/haproxy /var/log/haproxy
chown -R ray:"$(id -gn ray)" /run/haproxy

rm -rf /var/lib/apt/lists/*

EOF

USER ray

ARG WHEEL_PATH
ARG FIND_LINKS_PATH=".whl"
ARG CONSTRAINTS_FILE="requirements_compiled.txt"
105 changes: 105 additions & 0 deletions python/ray/serve/_private/constants.py
Contributor Author

Identical mod. env var prefix

@@ -569,12 +569,117 @@

# The message to return when the replica is healthy.
HEALTHY_MESSAGE = "success"
NO_ROUTES_MESSAGE = "Route table is not populated yet."
NO_REPLICAS_MESSAGE = "No replicas are available yet."
DRAINING_MESSAGE = "This node is being drained."

# Feature flag to enable a limited form of direct ingress where ingress applications
# listen on port 8000 (HTTP) and 9000 (gRPC). No proxies will be started.
RAY_SERVE_ENABLE_DIRECT_INGRESS = (
os.environ.get("RAY_SERVE_ENABLE_DIRECT_INGRESS", "0") == "1"
)

# Feature flag to use HAProxy.
RAY_SERVE_ENABLE_HA_PROXY = os.environ.get("RAY_SERVE_ENABLE_HA_PROXY", "0") == "1"

# HAProxy configuration defaults
# Maximum number of concurrent connections
RAY_SERVE_HAPROXY_MAXCONN = int(os.environ.get("RAY_SERVE_HAPROXY_MAXCONN", "20000"))

# Number of threads for HAProxy
RAY_SERVE_HAPROXY_NBTHREAD = int(os.environ.get("RAY_SERVE_HAPROXY_NBTHREAD", "4"))

# HAProxy configuration file location
RAY_SERVE_HAPROXY_CONFIG_FILE_LOC = os.environ.get(
"RAY_SERVE_HAPROXY_CONFIG_FILE_LOC", "/tmp/haproxy-serve/haproxy.cfg"
)

# HAProxy admin socket path
RAY_SERVE_HAPROXY_SOCKET_PATH = os.environ.get(
"RAY_SERVE_HAPROXY_SOCKET_PATH", "/tmp/haproxy-serve/admin.sock"
)

# Enable HAProxy optimized configuration (server state persistence, etc.)
# Disabled by default to prevent test suite interference
RAY_SERVE_ENABLE_HAPROXY_OPTIMIZED_CONFIG = (
os.environ.get("RAY_SERVE_ENABLE_HAPROXY_OPTIMIZED_CONFIG", "1") == "1"
)

Comment says "disabled by default" but default is enabled

Medium Severity

RAY_SERVE_ENABLE_HAPROXY_OPTIMIZED_CONFIG defaults to "1" (enabled), but the comment on the preceding line states "Disabled by default to prevent test suite interference." The same contradictory comment appears in HAProxyConfig. If the comment reflects the intended behavior, the default value is wrong and may cause test flakiness from server state persistence across test runs.
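
To make the mismatch concrete, here is a minimal sketch of the same `os.environ.get(name, default) == "1"` pattern. The helper `flag_enabled` and its `env` parameter are hypothetical and exist only for illustration; they are not part of the PR.

```python
from typing import Mapping


def flag_enabled(env: Mapping[str, str], name: str, default: str) -> bool:
    """Mirror the `os.environ.get(name, default) == "1"` pattern (hypothetical helper)."""
    return env.get(name, default) == "1"


NAME = "RAY_SERVE_ENABLE_HAPROXY_OPTIMIZED_CONFIG"

# With the default of "1" used in the diff, an unset variable means enabled.
unset_with_current_default = flag_enabled({}, NAME, default="1")

# A default of "0" would match the "Disabled by default" comment instead.
unset_with_zero_default = flag_enabled({}, NAME, default="0")

# An explicit setting overrides either default.
explicitly_disabled = flag_enabled({NAME: "0"}, NAME, default="1")
```

Either the comment or the default would need to change for the two to agree; which one is intended is not stated in the diff.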



# HAProxy server state base directory
RAY_SERVE_HAPROXY_SERVER_STATE_BASE = os.environ.get(
"RAY_SERVE_HAPROXY_SERVER_STATE_BASE", "/tmp/haproxy-serve"
)

# HAProxy server state path
RAY_SERVE_HAPROXY_SERVER_STATE_FILE = os.environ.get(
"RAY_SERVE_HAPROXY_SERVER_STATE_FILE", "/tmp/haproxy-serve/server-state"
)

# HAProxy hard stop after timeout
RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S = int(
os.environ.get("RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S", "120")
)

# HAProxy metrics export port
RAY_SERVE_HAPROXY_METRICS_PORT = int(
os.environ.get("RAY_SERVE_HAPROXY_METRICS_PORT", "9101")
)

# HAProxy log port
RAY_SERVE_HAPROXY_SYSLOG_PORT = int(
os.environ.get("RAY_SERVE_HAPROXY_SYSLOG_PORT", "514")
)

# HAProxy timeout configurations (in seconds, None = no timeout)
RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S = (
int(os.environ.get("RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S"))
if os.environ.get("RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S")
else None
)

RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S = (
int(os.environ.get("RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S"))
if os.environ.get("RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S")
else None
)

# HAProxy timeout client
RAY_SERVE_HAPROXY_TIMEOUT_CLIENT_S = int(
os.environ.get("RAY_SERVE_HAPROXY_TIMEOUT_CLIENT_S", "3600")
)
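
The two optional timeouts above look up the same environment variable twice. A minimal sketch of an equivalent single-lookup helper follows; `optional_int` and its `env` parameter are hypothetical names, not something the PR defines.

```python
from typing import Mapping, Optional


def optional_int(env: Mapping[str, str], name: str) -> Optional[int]:
    """Return the variable as an int, or None when it is unset or empty.

    Hypothetical helper: it keeps the truthiness check from the diff, so an
    empty string is treated the same as an unset variable (no timeout).
    """
    raw = env.get(name)
    return int(raw) if raw else None


unset = optional_int({}, "RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S")
empty = optional_int(
    {"RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S": ""}, "RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S"
)
thirty = optional_int(
    {"RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S": "30"}, "RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S"
)
```

In real code the mapping passed in would be `os.environ`.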

# Number of consecutive failed server health checks that must occur
# before haproxy marks the server as down.
RAY_SERVE_HAPROXY_HEALTH_CHECK_FALL = int(
os.environ.get("RAY_SERVE_HAPROXY_HEALTH_CHECK_FALL", "2")
)

# Number of consecutive successful server health checks that must occur
# before haproxy marks the server as up.
RAY_SERVE_HAPROXY_HEALTH_CHECK_RISE = int(
os.environ.get("RAY_SERVE_HAPROXY_HEALTH_CHECK_RISE", "2")
)

# Time interval between each haproxy health check attempt. Also the
# timeout of each health check before being considered as failed.
RAY_SERVE_HAPROXY_HEALTH_CHECK_INTER = os.environ.get(
"RAY_SERVE_HAPROXY_HEALTH_CHECK_INTER", "5s"
)

# Time interval between each haproxy health check attempt when the server is in any of the transition states: UP - transitionally DOWN or DOWN - transitionally UP
RAY_SERVE_HAPROXY_HEALTH_CHECK_FASTINTER = os.environ.get(
"RAY_SERVE_HAPROXY_HEALTH_CHECK_FASTINTER", "250ms"
)

# Time interval between each haproxy health check attempt when the server is in the DOWN state
RAY_SERVE_HAPROXY_HEALTH_CHECK_DOWNINTER = os.environ.get(
"RAY_SERVE_HAPROXY_HEALTH_CHECK_DOWNINTER", "250ms"
)
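
For orientation, the health-check constants above correspond to the `check`, `inter`, `fastinter`, `downinter`, `fall`, and `rise` keywords of an HAProxy `server` line. The sketch below shows one way such a line could be rendered from the default values. How the PR's HAProxy manager actually builds its configuration is not shown in this diff, so `render_server_line` and its arguments are assumptions.

```python
def render_server_line(
    name: str,
    host: str,
    port: int,
    inter: str = "5s",
    fastinter: str = "250ms",
    downinter: str = "250ms",
    fall: int = 2,
    rise: int = 2,
) -> str:
    """Render an HAProxy `server` line with health checks (hypothetical helper).

    Defaults mirror the RAY_SERVE_HAPROXY_HEALTH_CHECK_* defaults in the diff.
    """
    return (
        f"server {name} {host}:{port} check "
        f"inter {inter} fastinter {fastinter} downinter {downinter} "
        f"fall {fall} rise {rise}"
    )


line = render_server_line("replica-1", "10.0.0.5", 30000)
```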

# Direct ingress must be enabled if HAProxy is enabled
if RAY_SERVE_ENABLE_HA_PROXY:
    RAY_SERVE_ENABLE_DIRECT_INGRESS = True

RAY_SERVE_DIRECT_INGRESS_MIN_HTTP_PORT = int(
os.environ.get("RAY_SERVE_DIRECT_INGRESS_MIN_HTTP_PORT", "30000")
)
59 changes: 53 additions & 6 deletions python/ray/serve/_private/controller.py
@@ -38,6 +38,7 @@
CONTROL_LOOP_INTERVAL_S,
RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH,
RAY_SERVE_ENABLE_DIRECT_INGRESS,
RAY_SERVE_ENABLE_HA_PROXY,
RAY_SERVE_RPC_LATENCY_WARNING_THRESHOLD_MS,
RECOVERING_LONG_POLL_BROADCAST_TIMEOUT_S,
SERVE_CONTROLLER_NAME,
@@ -48,7 +49,10 @@
from ray.serve._private.controller_health_metrics_tracker import (
ControllerHealthMetricsTracker,
)
from ray.serve._private.default_impl import create_cluster_node_info_cache
from ray.serve._private.default_impl import (
create_cluster_node_info_cache,
get_proxy_actor_class,
)
from ray.serve._private.deployment_info import DeploymentInfo
from ray.serve._private.deployment_state import (
DeploymentReplica,
@@ -187,8 +191,14 @@ async def __init__(
self.cluster_node_info_cache = create_cluster_node_info_cache(self.gcs_client)
self.cluster_node_info_cache.update()

self._ha_proxy_enabled = RAY_SERVE_ENABLE_HA_PROXY
self._direct_ingress_enabled = RAY_SERVE_ENABLE_DIRECT_INGRESS
if self._direct_ingress_enabled:
if self._ha_proxy_enabled:
logger.info(
"HAProxy is enabled in ServeController, replacing Serve proxy "
"with HAProxy."
)
elif self._direct_ingress_enabled:
logger.info(
"Direct ingress is enabled in ServeController, enabling proxy "
"on head node only."
@@ -203,6 +213,7 @@ async def __init__(
cluster_node_info_cache=self.cluster_node_info_cache,
logging_config=self.global_logging_config,
grpc_options=set_proxy_default_grpc_options(grpc_options),
proxy_actor_class=get_proxy_actor_class(),
)
# We modify the HTTP and gRPC options above, so delete them to avoid
del http_options, grpc_options
@@ -275,7 +286,9 @@ async def __init__(
] = []
self._refresh_autoscaling_deployments_cache()

self._last_broadcasted_target_groups: List[TargetGroup] = []
# Initialize to None (not []) to ensure the first broadcast always happens,
Contributor Author

Identical

# even if target_groups is empty (e.g., route_prefix=None deployments).
self._last_broadcasted_target_groups: Optional[List[TargetGroup]] = None

def reconfigure_global_logging_config(self, global_logging_config: LoggingConfig):
if (
@@ -659,6 +672,29 @@ async def run_control_loop_step(
# get all alive replica ids and their node ids.
NodePortManager.prune(self._get_node_id_to_alive_replica_ids())

# HAProxy target group broadcasting
if self._ha_proxy_enabled:
self.broadcast_target_groups_if_changed()

def broadcast_target_groups_if_changed(self) -> None:
Contributor Author

Identical

"""Broadcast target groups over long poll if they have changed.

Keeps an in-memory record of the last target groups that were broadcast
to determine if they have changed.
"""
target_groups: List[TargetGroup] = self.get_target_groups(
from_proxy_manager=True,
)

# Check if target groups have changed by comparing the objects directly
if self._last_broadcasted_target_groups == target_groups:
return

self.long_poll_host.notify_changed(
{LongPollNamespace.TARGET_GROUPS: target_groups}
)
self._last_broadcasted_target_groups = target_groups
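
The reason for initializing the last-broadcast record to `None` rather than `[]` can be sketched in isolation. `ChangeBroadcaster` below is a hypothetical stand-in for the controller and the long poll host, and it uses plain strings in place of `TargetGroup` objects.

```python
from typing import Callable, List, Optional


class ChangeBroadcaster:
    """Notify only when the value differs from the last one sent (hypothetical)."""

    def __init__(self, notify: Callable[[List[str]], None]) -> None:
        self._notify = notify
        # None means "nothing broadcast yet", so even an empty list is sent once.
        self._last: Optional[List[str]] = None

    def broadcast_if_changed(self, current: List[str]) -> bool:
        if self._last == current:
            return False
        self._notify(current)
        # Store a copy so later mutation by the caller cannot hide a change.
        # (The diff stores the list returned by get_target_groups directly.)
        self._last = list(current)
        return True


sent: List[List[str]] = []
broadcaster = ChangeBroadcaster(sent.append)

first = broadcaster.broadcast_if_changed([])        # None != [], so it is sent
second = broadcaster.broadcast_if_changed([])       # unchanged, so it is skipped
third = broadcaster.broadcast_if_changed(["app"])   # changed, so it is sent
```

Had `_last` started as `[]`, the first call with an empty list would have been skipped, which is the case the comment in the diff describes.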

def _create_control_loop_metrics(self):
self.node_update_duration_gauge_s = metrics.Gauge(
"serve_controller_node_update_duration_s",
@@ -1296,9 +1332,16 @@ def get_target_groups(
that have running replicas, we return target groups for direct ingress.
If there are multiple applications with no running replicas, we return
one target group per application with unique route prefix.
5. HAProxy is enabled and the caller is not an internal proxy manager. In
Contributor Author

Identical

this case, we return target groups containing the proxies (e.g. haproxy).
6. HAProxy is enabled and the caller is an internal proxy manager (e.g.
haproxy manager). In this case, we return target groups containing the
ingress replicas and possibly the Serve proxies.
"""
proxy_target_groups = self._get_proxy_target_groups()
if not self._direct_ingress_enabled:
if not self._direct_ingress_enabled or (
self._ha_proxy_enabled and not from_proxy_manager
):
return proxy_target_groups

# Get all applications and their metadata
@@ -1319,6 +1362,10 @@ def get_target_groups(
]

if not apps:
# When HAProxy is enabled and there are no apps, return empty target groups
Contributor Author

identical

# so that all requests fall through to the default_backend (404)
if self._ha_proxy_enabled and from_proxy_manager:
return []
return proxy_target_groups

# Create target groups for each application
@@ -1428,7 +1475,7 @@ def _get_target_groups_for_app_with_no_running_replicas(
TargetGroup(
protocol=RequestProtocol.HTTP,
route_prefix=route_prefix,
targets=http_targets,
targets=[] if self._ha_proxy_enabled else http_targets,
Contributor Author

Identical

app_name=app_name,
)
)
@@ -1437,7 +1484,7 @@ def _get_target_groups_for_app_with_no_running_replicas(
TargetGroup(
protocol=RequestProtocol.GRPC,
route_prefix=route_prefix,
targets=grpc_targets,
targets=[] if self._ha_proxy_enabled else grpc_targets,
Contributor Author

Identical

app_name=app_name,
)
)
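
The early-return condition added to `get_target_groups` in this file (docstring cases 5 and 6) can be read as a small truth table. The function below is a hypothetical restatement of that condition only; it does not reproduce the rest of the method.

```python
def returns_proxy_target_groups(
    direct_ingress_enabled: bool,
    ha_proxy_enabled: bool,
    from_proxy_manager: bool,
) -> bool:
    """True when the early return hands back the proxy target groups.

    Hypothetical restatement of:
        not direct_ingress_enabled or (ha_proxy_enabled and not from_proxy_manager)
    """
    return (not direct_ingress_enabled) or (
        ha_proxy_enabled and not from_proxy_manager
    )


# Direct ingress off: callers always get the proxies.
no_direct_ingress = returns_proxy_target_groups(False, False, False)

# Direct ingress on, HAProxy off: fall through to the ingress replica logic.
direct_ingress_only = returns_proxy_target_groups(True, False, False)

# Case 5: HAProxy on, external caller: gets the proxies (e.g. haproxy).
external_caller = returns_proxy_target_groups(True, True, False)

# Case 6: HAProxy on, caller is the proxy manager: gets the ingress replicas.
proxy_manager_caller = returns_proxy_target_groups(True, True, True)
```

Because constants.py forces direct ingress on whenever HAProxy is enabled, the combination of HAProxy on with direct ingress off should not occur in practice.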
12 changes: 12 additions & 0 deletions python/ray/serve/_private/default_impl.py
Contributor Author

New get_proxy_actor_class is identical

@@ -255,3 +255,15 @@ def get_controller_impl():
)(ServeController)

return controller_impl


def get_proxy_actor_class():
    from ray.serve._private.constants import RAY_SERVE_ENABLE_HA_PROXY
    from ray.serve._private.proxy import ProxyActor

    if RAY_SERVE_ENABLE_HA_PROXY:
        from ray.serve._private.haproxy import HAProxyManager

        return HAProxyManager
    else:
        return ProxyActor
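
The function-level imports above defer loading a module until the function is called, which is a common way to sidestep circular imports and to avoid importing an optional backend that is not in use. A minimal sketch of the same pattern with standard-library modules as stand-ins follows; `get_serializer` is a hypothetical example and has no counterpart in the PR.

```python
import importlib
from types import ModuleType


def get_serializer(use_pickle: bool) -> ModuleType:
    """Pick a backend at call time and import only the one selected.

    Hypothetical example: `pickle` and `json` stand in for the two proxy
    implementations chosen by the feature flag.
    """
    module_name = "pickle" if use_pickle else "json"
    # The import happens here, not at module load, so importing the module
    # that defines get_serializer never pulls in either backend eagerly.
    return importlib.import_module(module_name)


default_backend = get_serializer(use_pickle=False)
alternate_backend = get_serializer(use_pickle=True)
```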