feat: FFT snapshot integration by ShubyM · Pull Request #116 · gke-labs/open-rl

ShubyM · 2026-06-06T00:37:27Z

This PR does two things. First, it makes the snapshot agent's contract depend only on the process id (see #109 for the initial implementation). Second, it refactors what used to be called clock_cycle.py into training_request_processor.py, which better matches what the code actually does. The gateway is responsible for putting tinker-shaped training operations on queues, and the request processor drains those operations and executes them against a concrete worker (see #113 for this split). With this shape we can have two different request processors: LoraTrainingRequestsProcessor and FFTTrainingRequestsProcessor. Both agree on the contract for which operations can come off the queue, but they can differ in how they compose operations with their workers. Namely, for FFT we need to use the snapshot agent to acquire a GPU lock before executing operations.
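To illustrate the shape described above, here is a minimal sketch, not the code in this PR. Both processors drain the same queue of operations, and only the FFT variant takes a GPU lock around each one. The `Operation` type, the `RecordingWorker`, the `execute` method, the shared base class, and the use of an `asyncio.Lock` in place of the snapshot agent are all assumptions made for illustration. The real PID-based snapshot agent protocol is not shown.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Operation:
    # Hypothetical stand-in for a tinker-shaped training operation.
    kind: str
    payload: dict = field(default_factory=dict)


class RecordingWorker:
    # Hypothetical worker that only records which operations it ran.
    def __init__(self) -> None:
        self.executed = []

    async def execute(self, op: Operation) -> None:
        self.executed.append(op.kind)


class TrainingRequestsProcessor:
    # Shared contract: both variants pull the same Operation objects off the queue.
    def __init__(self, queue: asyncio.Queue, worker: RecordingWorker) -> None:
        self.queue = queue
        self.worker = worker

    async def handle(self, op: Operation) -> None:
        raise NotImplementedError

    async def drain(self) -> None:
        # Process everything currently queued, then return.
        while not self.queue.empty():
            op = self.queue.get_nowait()
            await self.handle(op)
            self.queue.task_done()


class LoraTrainingRequestsProcessor(TrainingRequestsProcessor):
    async def handle(self, op: Operation) -> None:
        # LoRA path: run the operation directly on the worker.
        await self.worker.execute(op)


class FFTTrainingRequestsProcessor(TrainingRequestsProcessor):
    def __init__(self, queue, worker, gpu_lock: asyncio.Lock) -> None:
        super().__init__(queue, worker)
        self.gpu_lock = gpu_lock  # assumption: stands in for the snapshot agent
        self.lock_acquisitions = 0

    async def handle(self, op: Operation) -> None:
        # FFT path: hold the GPU lock for the duration of the operation.
        async with self.gpu_lock:
            self.lock_acquisitions += 1
            await self.worker.execute(op)


async def demo():
    ops = ["forward_backward", "optim_step"]  # illustrative operation names

    lora_queue, fft_queue = asyncio.Queue(), asyncio.Queue()
    for kind in ops:
        lora_queue.put_nowait(Operation(kind))
        fft_queue.put_nowait(Operation(kind))

    lora = LoraTrainingRequestsProcessor(lora_queue, RecordingWorker())
    fft = FFTTrainingRequestsProcessor(fft_queue, RecordingWorker(), asyncio.Lock())
    await lora.drain()
    await fft.drain()
    return lora.worker.executed, fft.worker.executed, fft.lock_acquisitions


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both processors execute the same operations in the same order. The only difference in this sketch is that the FFT processor acquires the lock once per operation.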

droot · 2026-06-08T17:43:15Z



-async def clock_cycle_loop(worker: TrainingWorker, model_id: str | None = None) -> None:
+async def clock_cycle_loop(


It might make sense to anchor this method on a class that is explicitly initialized. I think we have accumulated a bunch of config, such as is-fft-enabled, redis-url, and snapshot-agent-lock, that can be initialized at instance creation. That will also make this more testable, since we can inject the worker, the snapshot agent, etc.

The class could be called Trainer or FFTrainer, and it would process FFT requests for a given model-id.

WDYT?
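A rough sketch of what this suggestion could look like, under stated assumptions: the `TrainerConfig` fields, the `acquire`/`release` methods on the snapshot agent, and the worker's `execute` method are hypothetical names chosen for illustration, not the interfaces in this repository. The point is only that config and collaborators are passed in at construction time, so a test can substitute fakes.

```python
import asyncio
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerConfig:
    # Hypothetical bundle of the settings named in the review comment.
    model_id: str
    redis_url: str
    fft_enabled: bool


class FakeSnapshotAgent:
    # Test double; the real agent's interface is an assumption here.
    def __init__(self) -> None:
        self.events = []

    async def acquire(self, pid: int) -> None:
        self.events.append(("acquire", pid))

    async def release(self, pid: int) -> None:
        self.events.append(("release", pid))


class FakeWorker:
    # Test double that records the requests it was asked to run.
    def __init__(self) -> None:
        self.seen = []

    async def execute(self, request: str) -> None:
        self.seen.append(request)


class FFTrainer:
    def __init__(self, config, worker, snapshot_agent, pid=None) -> None:
        # Everything the loop needs is injected once, at instance creation.
        self.config = config
        self.worker = worker
        self.snapshot_agent = snapshot_agent
        self.pid = os.getpid() if pid is None else pid

    async def process(self, request: str) -> None:
        if not self.config.fft_enabled:
            await self.worker.execute(request)
            return
        # FFT path: take the lock keyed on the process id, always release it.
        await self.snapshot_agent.acquire(self.pid)
        try:
            await self.worker.execute(request)
        finally:
            await self.snapshot_agent.release(self.pid)


if __name__ == "__main__":
    config = TrainerConfig("model-a", "redis://localhost:6379", True)
    trainer = FFTrainer(config, FakeWorker(), FakeSnapshotAgent(), pid=1234)
    asyncio.run(trainer.process("forward_backward"))
    print(trainer.worker.seen, trainer.snapshot_agent.events)
```

Because nothing is read from globals or the environment inside `process`, a unit test can assert on the fake agent's recorded events without a GPU or a Redis instance.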

droot · 2026-06-08T21:53:06Z

@ShubyM can you please update the description of this PR? I would also refer to the other PRs (refactors) for completeness. This will make it easier for others to follow.

/cc @chuangw6

ShubyM · 2026-06-08T22:31:43Z

Updated the description and added the refactor we discussed, PTAL @droot

droot · 2026-06-08T21:53:56Z

-def main() -> None:
-  from clock_cycle import main as clock_cycle_main
+def start_request_processing_loop() -> None:
+  import training_requests_processor


Is it possible to get rid of the conditional import?

Removed now!

droot · 2026-06-08T21:55:04Z

  loss_fn_inputs: dict[str, TensorData]
  model_input: list[int]

+  @field_validator("model_input", mode="before")


Please add a comment explaining the "why".

I initially added internal types in this PR but removed them later because I felt they bloated the change. I will add them back in a later PR.

droot · 2026-06-09T00:13:22Z

+    },
+    request_id=model_id,
+  )
+  req_id = await enqueue_worker_launch(command) if is_fft_enabled() else await enqueue(command)


I wonder if we could simply enqueue the "create_model" training request and have the backend encapsulate the logic of whether to launch a worker, keeping the API gateway decoupled from the backend.

I agree this is the shape to target. The weirdness comes from the fact that we have to dynamically spin up a new worker for each FFT run, which is why we have a separate queue for doing so. I will create an issue for this and think more about it.
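One possible reading of the target shape discussed here, as a hedged sketch rather than a proposal for the actual code: the gateway always enqueues the same "create_model" command, and the backend decides whether that command requires launching a worker. The `Backend` class, its `handle` method, and the `op` key in the command dictionary are hypothetical; the real code currently branches in the gateway between `enqueue_worker_launch` and `enqueue`.

```python
import asyncio


async def gateway_create_model(queue: asyncio.Queue, model_id: str) -> str:
    # Gateway side: one code path, with no knowledge of FFT vs. LoRA.
    command = {"op": "create_model", "request_id": model_id}
    await queue.put(command)
    return model_id


class Backend:
    # Hypothetical backend that owns the launch decision.
    def __init__(self, fft_enabled: bool) -> None:
        self.fft_enabled = fft_enabled
        self.launched_workers = []
        self.handled = []

    async def handle(self, command: dict) -> None:
        # Backend side: decide here whether this request needs a fresh worker.
        if command["op"] == "create_model" and self.fft_enabled:
            self.launched_workers.append(command["request_id"])
        self.handled.append(command["request_id"])


async def demo(fft_enabled: bool):
    queue = asyncio.Queue()
    backend = Backend(fft_enabled)
    await gateway_create_model(queue, "model-a")
    await backend.handle(await queue.get())
    return backend.launched_workers, backend.handled


if __name__ == "__main__":
    print(asyncio.run(demo(True)))
    print(asyncio.run(demo(False)))
```

In this sketch the gateway function is identical in both modes, and only the backend's behavior changes with the FFT setting.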

droot · 2026-06-09T00:13:35Z

+    },
+    request_id=model_id,
+  )
+  req_id = await enqueue_worker_launch(command) if is_fft_enabled() else await enqueue(command)


Same comment as above.

droot · 2026-06-09T00:15:40Z

+      print("[WARNING] BASE_MODEL not provided. Cold-start penalty will apply on first request.")
+    is_ready = True
+
+  if not is_fft_enabled():


Do we want to expose a health check for FFT as well?

Because we dynamically launch a worker for each job, I think a health check means something different here than it does for a LoraWorker. I will think more about this.

droot · 2026-06-09T00:20:31Z

The change looks good to me. I have minor nits but nothing blocking. Feel free to merge and address nits in a follow up.

droot reviewed Jun 8, 2026

ShubyM added 2 commits June 8, 2026 15:17

Use PID-only snapshot agent protocol

9a1f51f

Process LoRA and FFT training requests separately

3cff3c3

ShubyM force-pushed the feat/fft-snapshot-integration branch from a02f48a to 3cff3c3 Compare June 8, 2026 22:19

ShubyM requested a review from droot June 8, 2026 22:45

small dispatch cleanup

c18703c

droot approved these changes Jun 9, 2026

ShubyM merged commit bc91e08 into gke-labs:main Jun 9, 2026
11 checks passed


