ray serve inference improvements #2042

Open
zulissimeta wants to merge 6 commits into main from prefer_local_routing


@zulissimeta zulissimeta commented Jun 14, 2026

Contributor

Three small changes to the Ray Serve deployment:

  1. By default, local node routing is used for ingress via the HTTP proxy, but not for calls made through deployment handles. This PR tells the handle to prefer local routing so that inference stays on the same node when possible.
  2. Atoms info validation requires a round trip to the model inference server. We can instead cache the validated atoms info objects.
  3. Add timeouts (default 10 min, controlled by the environment variable FAIRCHEM_BATCH_SERVER_TIMEOUT_S) to all of the .request() calls, so that an indefinite hang (e.g. an inference request when a server doesn't have enough resources to start the inference service) eventually raises an error.
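
A minimal sketch of the caching idea in item 2, not the code from this PR. It assumes the info object is a plain dict and that validation is a function that either returns the validated dict or raises. `ValidatedInfoCache`, `_fingerprint`, and `fake_remote_validate` are hypothetical names used only for illustration; the stand-in validator replaces the real round trip to the inference server.

```python
from __future__ import annotations

from typing import Any, Callable, Hashable


def _fingerprint(info: dict[str, Any]) -> Hashable:
    """Build an order-independent, hashable key from an info dict (hypothetical helper)."""
    return tuple(sorted((key, repr(value)) for key, value in info.items()))


class ValidatedInfoCache:
    """Memoize successful validations so repeated inputs skip the server round trip."""

    def __init__(self, validate_remote: Callable[[dict[str, Any]], dict[str, Any]]):
        # ``validate_remote`` stands in for the call to the inference server.
        self._validate_remote = validate_remote
        self._cache: dict[Hashable, dict[str, Any]] = {}

    def validate(self, info: dict[str, Any]) -> dict[str, Any]:
        key = _fingerprint(info)
        if key not in self._cache:
            # Only successes are stored: if validation raises, nothing is cached.
            self._cache[key] = self._validate_remote(info)
        return self._cache[key]


# Usage with a stand-in validator that records every "round trip".
calls: list[dict[str, Any]] = []


def fake_remote_validate(info: dict[str, Any]) -> dict[str, Any]:
    calls.append(info)
    if "charge" not in info:
        raise ValueError("missing 'charge'")
    return dict(info)


cache = ValidatedInfoCache(fake_remote_validate)
cache.validate({"charge": 0, "spin": 1})
cache.validate({"spin": 1, "charge": 0})  # same content, served from the cache
print(len(calls))
```

Failed validations are deliberately not cached in this sketch, so a bad input is re-checked on every call; whether the PR makes the same choice is not stated in the description.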

@zulissimeta zulissimeta requested a review from lbluque June 14, 2026 18:44
@zulissimeta zulissimeta added the enhancement (New feature or request) and minor (Minor version release) labels Jun 14, 2026
@meta-cla meta-cla Bot added the cla signed label Jun 14, 2026
@zulissimeta zulissimeta changed the title from "prefer local routing in ray serve inference" to "ray serve inference improvements (prefer local routing, and avoid atoms info validation roundtrip)" Jun 15, 2026
@zulissimeta zulissimeta changed the title from "ray serve inference improvements (prefer local routing, and avoid atoms info validation roundtrip)" to "ray serve inference improvements" Jun 15, 2026

@lbluque lbluque left a comment

Contributor

Thanks @zulissimeta! lg, just small comments.

handle = serve.get_app_handle(deployment_name)
# ``_prefer_local_routing`` must be set via ``_init()`` before any
# ``.options()`` or ``.remote()`` call initializes the handle.
handle._init(_prefer_local_routing=True)

# users on slow CPUs can lengthen it). ``None`` disables the
# bound entirely so behavior matches the pre-timeout default
# if someone needs it.
self._request_timeout_s = _resolve_batch_server_timeout()
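
The body of `_resolve_batch_server_timeout()` is not shown in this hunk. The following is a rough sketch of what such a resolver could look like, using the variable name and the 10-minute default from the PR description. How the environment variable spells "no bound" is an assumption here (the string `none`), as are the function name `resolve_timeout` and the rejection of non-positive values.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "FAIRCHEM_BATCH_SERVER_TIMEOUT_S"  # name taken from the PR description
DEFAULT_TIMEOUT_S = 600.0  # 10 minutes, per the PR description


def resolve_timeout(environ: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return the request timeout in seconds, or None for no bound."""
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT_S
    if raw.strip().lower() == "none":
        # Assumed spelling for disabling the bound; the PR may use another.
        return None
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{ENV_VAR} must be positive, got {raw!r}")
    return value


print(resolve_timeout({}))
print(resolve_timeout({ENV_VAR: "30"}))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.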

Is it worth also making this an explicit init kwarg that defaults to None, in which case it falls back to the env variable?
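
One way the suggestion could look, as a sketch only: `BatchClient` is a hypothetical class name, and the env lookup is inlined rather than calling the PR's `_resolve_batch_server_timeout()`, whose body is not shown here.

```python
import os
from typing import Optional

ENV_VAR = "FAIRCHEM_BATCH_SERVER_TIMEOUT_S"  # name taken from the PR description


class BatchClient:  # hypothetical class, for illustration only
    def __init__(self, request_timeout_s: Optional[float] = None):
        if request_timeout_s is None:
            # Not passed explicitly: use the env variable, else the 10 min default.
            request_timeout_s = float(os.environ.get(ENV_VAR, "600"))
        self._request_timeout_s = request_timeout_s


print(BatchClient(request_timeout_s=5.0)._request_timeout_s)
```

With this shape, None at the call site means "look it up", so the "None disables the bound" behavior described in the code comment would need a different spelling, for example a dedicated sentinel or an env-only switch.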

