
Unify ray serve #1931

Open

lbluque wants to merge 37 commits into main from unify-ray-serve

Conversation

lbluque (Contributor) commented Mar 26, 2026

Summary

  • Unified batch serving around BatchPredictServer + BatchServerPredictUnit, removing the parallel FAIRChemInferenceServer / FAIRChemInferenceClient / RayServeMLIPUnit architecture (~1,100 lines deleted across 3 files)
  • Added MultiplexedBatchPredictServer — a subclass of BatchPredictServer that uses @serve.multiplexed for on-demand model loading with LRU eviction, preserving the multi-model capability in a cleaner form
  • Added BatchServerPredictUnit.from_deployment_connection_info() classmethod to connect to already-running Ray Serve deployments by name, with optional multiplexed_model_id for multiplexed deployments — replacing the need for RayServeMLIPUnit
  • Moved wait_for_serve_ready() and get_ray_connection_info() into batch_predict_server.py as shared utilities
  • Extracted _init_ray_and_serve() and _build_deployment_options() helpers to share setup logic between setup_batch_predict_server() and the new setup_multiplexed_batch_predict_server()
  • Updated get_slurm_ray_cluster and get_local_ray_cluster to accept a predict_unit parameter and use setup_batch_predict_server() instead of the deleted start_serve()

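The on-demand loading with LRU eviction that MultiplexedBatchPredictServer gets from `@serve.multiplexed` can be sketched without Ray at all. The class and names below are a toy illustration of the eviction policy, not FAIRChem or Ray Serve code:

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy sketch: load models on demand and evict the least-recently-used
    one when capacity is exceeded -- the policy @serve.multiplexed applies
    per replica. Names here are illustrative, not the fairchem API."""

    def __init__(self, loader, max_models=3):
        self._loader = loader          # callable: model_id -> model object
        self._max_models = max_models
        self._models = OrderedDict()   # model_id -> model, ordered by recency

    def get(self, model_id):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            self._models.popitem(last=False)    # evict least recently used
        model = self._loader(model_id)          # "load" the requested model
        self._models[model_id] = model
        return model
```

In the real deployment, the loader would be the (async) model-loading function decorated with `@serve.multiplexed`, and Ray routes each request's `multiplexed_model_id` to a replica that already holds that model when possible.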
What's removed

  • FAIRChemInferenceServer / FAIRChemInferenceClient / RayServeMLIPUnit: 3 files deleted entirely; their functionality is replaced by MultiplexedBatchPredictServer + BatchServerPredictUnit.from_deployment_connection_info()
  • Metadata serialization layer (RayServeTask, _cache_model_metadata, fetch_model_metadata): replaced by get_predict_unit_attribute which returns real objects via Ray serialization
  • SimpleNamespace inference_settings stub: replaced by fetching real InferenceSettings from the server
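The shift from a hand-rolled metadata layer to `get_predict_unit_attribute` can be sketched as follows. `DemoPredictUnit` and `ServeDeployment` are hypothetical stand-ins, and the `InferenceSettings` fields are illustrative only; a `pickle` round-trip stands in for Ray's RPC serialization:

```python
import pickle
from dataclasses import dataclass


@dataclass
class InferenceSettings:
    """Stand-in for fairchem's InferenceSettings (fields illustrative)."""
    tf32: bool = True
    merge_mole: bool = False


class DemoPredictUnit:
    """Hypothetical predict unit holding real settings as an attribute."""
    def __init__(self):
        self.inference_settings = InferenceSettings()


class ServeDeployment:
    """Sketch of the server-side endpoint: return the attribute itself and
    let serialization carry the real object to the client, instead of
    caching and shipping a hand-built metadata dict."""
    def __init__(self, predict_unit):
        self._predict_unit = predict_unit

    def get_predict_unit_attribute(self, name: str):
        return getattr(self._predict_unit, name)


# The client receives a real InferenceSettings, not a SimpleNamespace stub:
server = ServeDeployment(DemoPredictUnit())
payload = pickle.dumps(server.get_predict_unit_attribute("inference_settings"))
settings = pickle.loads(payload)
assert isinstance(settings, InferenceSettings)
```

The design point is that once the transport can serialize arbitrary objects, a per-attribute serialization scheme (the deleted RayServeTask / _cache_model_metadata / fetch_model_metadata) is redundant.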

Test plan

  • pytest tests/core/calculate/test_batcher.py -c packages/fairchem-core/pyproject.toml -vv
  • pytest tests/core/units/mlip_unit/test_predict.py -c packages/fairchem-core/pyproject.toml -vv
  • pytest tests/core/units/mlip_unit/test_inference_serve.py -c packages/fairchem-core/pyproject.toml -vv (requires GPU — covers both single-model and multiplexed server tests)

meta-cla bot added the cla signed label Mar 26, 2026
lbluque requested a review from zulissimeta March 26, 2026 23:33
lbluque marked this pull request as draft March 26, 2026 23:38
lbluque added the minor (Minor version release) and enhancement (New feature or request) labels Mar 30, 2026
lbluque requested a review from rayg1234 March 30, 2026 22:44
lbluque marked this pull request as ready for review April 9, 2026 22:11
lbluque changed the base branch from rayserve_calculator to main April 18, 2026 01:30
