Skip to content

feat: FFT snapshot integration#116

Merged
ShubyM merged 3 commits into
gke-labs:mainfrom
ShubyM:feat/fft-snapshot-integration
Jun 9, 2026
Merged

feat: FFT snapshot integration#116
ShubyM merged 3 commits into
gke-labs:mainfrom
ShubyM:feat/fft-snapshot-integration

Conversation

@ShubyM

@ShubyM ShubyM commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

This PR does two primary things. First we make the contract for the snapshot agent only dependent on process id (see #109 for initial impl). Second this also refactors what used to be called clock_cycle.py into training_request_processor.py which better matches what the code actually does. The gateway is responsible for putting tinker shaped training operations on queues, and the request processor drains those operations and executes them against concrete worker (see #113 for this split). With this shape we can have two different request processors: LoraTrainingRequestsProcessor and FFTTrainingRequestsProcessor. Both agree on the contract of what operations can come off the queue but can differ in how they compose operations with their workers, namely for FFT we need to use the snapshot agent to acquire a GPU lock before executing operations.

Comment thread src/server/clock_cycle.py Outdated


async def clock_cycle_loop(worker: TrainingWorker, model_id: str | None = None) -> None:
async def clock_cycle_loop(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to have this method anchored on a class that is explicitly initialized. I think we have accumulated bunch of config such as is-fft-enabled, redis-url, snapshot-agent-lock etc. that can be initialized at the instance creation. And that will also make this more testable where we can inject worker, snapshot agent etc.

Class could be called Trainer or FFTrainer that process FFT requests for a given model-id.

WDYT ?

@droot

droot commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

@ShubyM can you pl. update the description of this PR. I would also say refer to other PRs (refactors) for completeness. This will make it easy for others to follow.

/cc @chuangw6

@ShubyM ShubyM force-pushed the feat/fft-snapshot-integration branch from a02f48a to 3cff3c3 Compare June 8, 2026 22:19
@ShubyM

ShubyM commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

@ShubyM can you pl. update the description of this PR. I would also say refer to other PRs (refactors) for completeness. This will make it easy for others to follow.

/cc @chuangw6

Updated the description and added the refactor we discussed, PTAL @droot

@ShubyM ShubyM requested a review from droot June 8, 2026 22:45
def main() -> None:
from clock_cycle import main as clock_cycle_main
def start_request_processing_loop() -> None:
import training_requests_processor

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to get rid of conditional import ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed now!

Comment thread src/server/training/trainer_worker.py Outdated
loss_fn_inputs: dict[str, TensorData]
model_input: list[int]

@field_validator("model_input", mode="before")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pl. add a comment explaining "why"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially added internal types in this PR but chose to remove it later because I felt it bloated the change, will add back in a later PR.

Comment thread src/server/gateway.py
},
request_id=model_id,
)
req_id = await enqueue_worker_launch(command) if is_fft_enabled() else await enqueue(command)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we simply enqueue the "create_model" training request and backend encapsulate the logic of whether to launch a worker or not etc. keeping the API gateway decoupled from the backend.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is the shape to target. The weirdness is coming from the fact that we have to dynamically spin up a new worker for each FFT run which is why we have a separate queue for doing so, will create an issue for this and think more about it

Comment thread src/server/gateway.py
},
request_id=model_id,
)
req_id = await enqueue_worker_launch(command) if is_fft_enabled() else await enqueue(command)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above.

print("[WARNING] BASE_MODEL not provided. Cold-start penalty will apply on first request.")
is_ready = True

if not is_fft_enabled():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to expose healthcheck for fft as well or not ?

@ShubyM ShubyM Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we are dynamically launching workers for each job I think the meaning of a health check is different from that of a LoraWorker, will think more about this

@droot

droot commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

The change looks good to me. I have minor nits but nothing blocking. Feel free to merge and address nits in a follow up.

@ShubyM ShubyM merged commit bc91e08 into gke-labs:main Jun 9, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants