
Merge F3 2.7 improvements to main #4280

Open
nvidianz wants to merge 3 commits into NVIDIA:main from nvidianz:merge-f3-2.7-to-main

Conversation

@nvidianz (Collaborator) commented Mar 9, 2026

Summary

This PR merges a set of F3 (fuel/f3) reliability and performance improvements from the 2.7 branch into main:

  • Send timeout & stall detection: SocketConnection now uses a select-based send loop with a configurable timeout (default 30s). HeartbeatMonitor detects stalled connections and optionally closes them after consecutive stall checks (sfm_send_stall_timeout, sfm_close_stalled_connection, sfm_send_stall_consecutive_checks config vars).
  • Streaming ACK progress timeout: TxTask tracks the last time ACK offset advanced and aborts if no progress is made within a configurable window (streaming_ack_progress_timeout, default 60s), preventing indefinite hangs from a stuck receiver.
  • Stream key collision fix: RxTask now keys its task map on (origin, sid) instead of just sid, preventing stream collisions when multiple origins use the same stream ID.
  • Separate callback thread pool: Blob callbacks are dispatched on a dedicated callback_thread_pool (64 threads) to avoid starving the main stream thread pool.
  • Retry on download timeout: download_object now retries on TIMEOUT responses with exponential backoff (up to max_retries=3 by default), with recovery/failure logging.
  • Memory release after download: CacheableObject.release() nulls base_obj after transaction_done_cb fires, allowing GC to immediately reclaim large objects (e.g. numpy dicts).
  • PASS_THROUGH header support: New MessageHeaderKey.PASS_THROUGH and FOBSContextKey.PASS_THROUGH keys enable lazy tensor downloading at forwarding nodes (foundation for pass-through architecture).
  • Shutdown race fix in ConnManager: stopped=True is set before executor shutdown; frame_mgr_executor.submit() calls are guarded against RuntimeError after shutdown.
  • Download stats logging: _Transaction and download_object now log elapsed time and total bytes on completion.
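To illustrate the first bullet, here is a minimal sketch of a select-based send loop with a stall timeout. The function name and shape are illustrative, not NVFlare's actual SocketConnection API; it only shows the general technique (wait for writability with a deadline, send in chunks, raise on stall).

```python
import select
import socket
import time

def send_with_timeout(sock: socket.socket, data: bytes, timeout: float = 30.0) -> None:
    """Send all bytes; raise TimeoutError if the socket stays unwritable
    past `timeout`. Hypothetical sketch of a select-based send loop."""
    deadline = time.monotonic() + timeout
    view = memoryview(data)
    while view:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"send stalled for {timeout}s")
        # Block until the socket is writable, but never past the deadline
        _, writable, _ = select.select([], [sock], [], remaining)
        if not writable:
            raise TimeoutError(f"send stalled for {timeout}s")
        sent = sock.send(view)
        view = view[sent:]
```

Unlike a plain `sendall()`, this loop regains control on every partial send, so a peer that stops draining its receive buffer cannot block the sender indefinitely.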

Test plan

  • Run existing unit tests: python3 -m pytest tests/unit_test/fuel/f3/ -v
  • Verify no regressions in streaming and cellnet tests
  • Test send-timeout behavior under a slow/blocked receiver
  • Verify stall detection logs warnings and closes connections when configured
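For the stall-detection check in the last bullet, the per-tick decision described in the summary (count consecutive stalls, close only after `sfm_send_stall_consecutive_checks` hits) can be modeled as a small pure function. Defaults and names below follow the PR description's config variables; the helper itself is a hypothetical sketch, not the HeartbeatMonitor code.

```python
def check_stall(stall_sec: float, count: int, stall_timeout: float = 45.0,
                close_on_stall: bool = True, consecutive_checks: int = 3):
    """One heartbeat tick: return (new_count, should_close).

    A tick with stall_sec above the threshold increments the counter;
    the connection is closed only after `consecutive_checks` stalled
    ticks in a row. Any non-stalled tick resets the counter to 0.
    """
    if stall_sec > stall_timeout:
        count += 1
        return count, close_on_stall and count >= consecutive_checks
    return 0, False
```

Requiring consecutive stalled checks (rather than closing on the first) avoids tearing down a connection over one transient slow send.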

🤖 Generated with Claude Code

greptile-apps bot (Contributor) commented Mar 9, 2026

Greptile Summary

This PR backports a suite of F3 reliability and performance improvements from the 2.7 branch to main, covering: send timeout and stall detection at the socket/SFM layer, ACK progress tracking in TxTask, stream key collision fixes in RxTask, a dedicated callback thread pool for blob streaming, retry-with-backoff in download_object, memory release via CacheableObject.release(), the PASS_THROUGH header/context architecture for lazy tensor forwarding, and shutdown race fixes in ConnManager.

Key observations:

  • byte_streamer.py: The nested wait loop changes ack_wait semantics from a per-ACK-wait to a total window-clear budget. The outer while window > self.window_size loop is functionally equivalent to if (the inner loop always exits with window ≤ window_size), which may confuse readers. The ack_waiter.clear() placement can cause unnecessary extra waits up to ack_progress_check_interval (5 s) when an ACK arrives just before clear() is called.
  • download_service.py: total_bytes stats are double-counted on retry because resending the same state causes the producer to re-generate the same chunk, and the byte accumulation runs on every successful response without detecting retries.
  • blob_streamer.py: If blob_cb raises and stream.task is not present on the Stream object, the exception is logged but not propagated to the StreamFuture, leaving callers with no indication of failure.
  • All other changes (socket send timeout, stall detection, stream key deduplication, shutdown guards, PASS_THROUGH architecture) look correct and well-implemented.

Confidence Score: 4/5

  • Safe to merge with minor correctness concerns in the streaming ACK wait loop and retry byte-counting accuracy.
  • The majority of changes are well-structured bug fixes and reliability improvements. The three flagged issues (ack_wait semantics change, double-counted retry bytes, blob_cb error not propagating to future) are either minor logging inaccuracies or edge-case latency issues rather than data-corruption or availability bugs. The stream key collision fix, shutdown race fix, and send-timeout improvements are solid.
  • nvflare/fuel/f3/streaming/byte_streamer.py deserves the most attention — the nested wait loop's semantics change around ack_wait and the ack_waiter.clear() placement should be verified against expected behavior under slow receivers.

Important Files Changed

Filename Overview
nvflare/fuel/f3/streaming/byte_streamer.py Adds ACK progress timeout with nested wait loop; the outer while is effectively an if, ack_wait semantics changed to total window-clear budget, and ack_waiter.clear() placement can cause unnecessary 5s delays.
nvflare/fuel/f3/streaming/download_service.py Adds retry-on-timeout with exponential backoff, download stats logging, and release() for GC; total_bytes is double-counted on retried chunks (minor logging inaccuracy).
nvflare/fuel/f3/streaming/blob_streamer.py Dispatches blob callbacks to dedicated thread pool to avoid starvation; error in blob_cb may not propagate to the future if stream.task is absent.
nvflare/fuel/f3/drivers/socket_conn.py Replaces sendall with a select-based send loop with a configurable timeout; error classification helpers are clean and well-structured.
nvflare/fuel/f3/sfm/conn_manager.py Fixes shutdown race: stopped=True now set before executor shutdown, and all executor submits are guarded with try/except RuntimeError.
nvflare/fuel/f3/sfm/heartbeat_monitor.py Adds stall detection with configurable timeout and close-on-stall policy; stale connection keys are cleaned up correctly after connections are removed.
nvflare/fuel/f3/sfm/sfm_conn.py Adds send_state_lock and send_started_at tracking for stall detection; get_send_stall_seconds() is thread-safe and correctly returns 0 when no send is in progress.
nvflare/fuel/f3/streaming/byte_receiver.py Stream key collision fix: rx_task_map now keyed on (origin, sid) tuple instead of sid alone, preventing stream aliasing when multiple origins use the same SID.
nvflare/fuel/f3/streaming/cacheable.py Adds release() to null base_obj after transaction callback, enabling immediate GC; defensive guard in _get_item handles post-release requests correctly.
nvflare/fuel/f3/cellnet/utils.py Normalizes fobs_ctx to a local copy to avoid mutating caller-owned context; injects PASS_THROUGH key when the message header is present.
nvflare/fuel/f3/streaming/stream_utils.py Adds callback_thread_pool (64 threads) for blob callbacks; stream_shutdown() correctly shuts down callback_thread_pool before stream_thread_pool.
nvflare/fuel/f3/comm_config.py Adds configuration accessors for new timeout/stall-detection variables; all new getters follow the established pattern.
nvflare/fuel/f3/cellnet/defs.py Adds PASS_THROUGH header key with detailed documentation describing the pass-through tensor architecture.
nvflare/fuel/utils/fobs/__init__.py Adds PASS_THROUGH and DOWNLOAD_COMPLETE_CB context keys with clear documentation of their role in the pass-through architecture.

Sequence Diagram

sequenceDiagram
    participant Sender as TxTask (ByteStreamer)
    participant SC as SocketConnection
    participant HB as HeartbeatMonitor
    participant Receiver as RxTask (ByteReceiver)
    participant DS as DownloadService

    Note over Sender,SC: Send with timeout (select-based)
    Sender->>SC: send_frame(buffer)
    SC->>SC: _send_with_timeout(frame, 30s)
    SC-->>Sender: CommError(TIMEOUT) if stalled > 30s

    Note over HB,SC: Stall detection
    loop every 5s tick
        HB->>SC: get_send_stall_seconds()
        alt stall_sec > sfm_send_stall_timeout (45s)
            HB->>HB: stall_counts[conn]++
            opt close_stalled_connection && count >= stall_consecutive_checks
                HB->>SC: conn.close()
            end
        else no stall
            HB->>HB: stall_counts[conn] = 0
            HB->>Receiver: send_heartbeat(PING)
        end
    end

    Note over Sender,Receiver: ACK progress tracking
    Sender->>Receiver: stream chunk (fire_and_forget)
    Receiver-->>Sender: ACK (offset)
    Sender->>Sender: last_ack_progress_ts = now
    alt no ACK progress for ack_progress_timeout (60s)
        Sender->>Sender: stop(StreamError)
    else total wait > ack_wait (300s)
        Sender->>Sender: stop(StreamError)
    end

    Note over Sender,DS: Object download with retry
    Sender->>DS: download_object(ref_id, state)
    DS-->>Sender: ReturnCode.TIMEOUT
    loop retry (max 3, backoff 2/4/8s)
        Sender->>DS: download_object(ref_id, same_state)
        DS-->>Sender: ReturnCode.OK + chunk
    end
    Sender->>Sender: consumer.consume(chunk)
    DS-->>Sender: ProduceRC.EOF
    Sender->>Sender: log elapsed + total_bytes

Last reviewed commit: 0b49ead

Comment on lines +120 to +134
while window > self.window_size:
    now = time.monotonic()
    if now - self.last_ack_progress_ts >= self.ack_progress_timeout:
        self.stop(StreamError(f"{self} ACK made no progress for {self.ack_progress_timeout} seconds"))
        return

    window = self.offset - self.offset_ack
    elapsed = now - wait_start
    if elapsed >= self.ack_wait:
        self.stop(StreamError(f"{self} ACK timeouts after {self.ack_wait} seconds"))
        return

    self.ack_waiter.clear()
    wait_timeout = min(self.ack_progress_check_interval, self.ack_wait - elapsed)
    self.ack_waiter.wait(timeout=wait_timeout)
    window = self.offset - self.offset_ack

ack_wait now acts as a total window-clear budget, not a per-ACK wait

The original code waited up to ack_wait seconds for each individual ACK signal. With the nested loop, wait_start is set once when the outer while window > self.window_size: is first entered, so the elapsed >= self.ack_wait guard is now a total budget for the entire window-clearing period.

If multiple ACKs are needed to drain a large window (e.g., STREAM_WINDOW_SIZE = 16 MB with 1 MB ACKs), the old code permitted up to N * ack_wait aggregate wait. The new code caps the entire clearing sequence at one ack_wait (default 300 s). In practice the ack_progress_timeout (60 s) will usually fire first, but if slow but steady ACKs keep arriving — resetting last_ack_progress_ts repeatedly — only the ack_wait guard prevents an indefinite wait, and it now fires 16× sooner than before for that scenario.

Additionally, the outer while window > self.window_size: loop is functionally equivalent to if window > self.window_size: here: the inner loop always exits with window ≤ window_size (the condition that breaks it), so the outer while re-evaluates to False immediately. Consider replacing the outer while with if and documenting the intent, or resetting wait_start correctly inside the outer loop if the intent is to allow multiple stall periods.
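One way to restore per-ACK-progress semantics is to reset `wait_start` whenever the window actually shrinks, so `ack_wait` again budgets each stall period rather than the whole window-clearing sequence. The sketch below is a standalone model of that idea (hypothetical `wait_for_window` helper; `get_window`/`ack_event` stand in for the TxTask state), not a drop-in patch for byte_streamer.py.

```python
import threading
import time

def wait_for_window(get_window, window_size, ack_event: threading.Event,
                    ack_wait: float = 2.0, check_interval: float = 0.1):
    """Block until the in-flight window drops to window_size or less.

    wait_start resets on every ACK progress, so ack_wait bounds each
    individual stall period, not the total window-clearing time.
    """
    wait_start = time.monotonic()
    last_window = get_window()
    while last_window > window_size:
        elapsed = time.monotonic() - wait_start
        if elapsed >= ack_wait:
            raise TimeoutError(f"no ACK progress for {ack_wait}s")
        ack_event.clear()
        window = get_window()          # re-sample after clear, before wait
        if window < last_window:       # progress: reset the stall budget
            wait_start = time.monotonic()
            last_window = window
            continue
        ack_event.wait(timeout=min(check_interval, ack_wait - elapsed))
        window = get_window()
        if window < last_window:
            wait_start = time.monotonic()
            last_window = window
    return last_window
```

With this shape, slow-but-steady ACKs can take longer than one `ack_wait` in aggregate (matching the old N * ack_wait behavior), while a genuinely stuck receiver still trips the timeout within one `ack_wait` of its last progress.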

Comment on lines +705 to +706
if data is not None:
    total_bytes += sum(len(c) for c in data) if isinstance(data, list) else len(data)

total_bytes is inflated when retries re-produce the same chunk

On a TIMEOUT retry, current_state is kept the same, so the server re-invokes produce() for that state and returns the identical chunk. Both the client (line ~706) and the server (line ~496 in _handle_download) add the chunk length to total_bytes on every successful response, including retried ones.

This means a chunk that required 2 attempts to deliver is counted twice. The elapsed-time / byte-count summary logged at EOF will overstate the actual bytes transferred.

# Only accumulate bytes for new (non-retry) responses to keep total_bytes accurate.
if consecutive_timeouts == 0 and data is not None:
    total_bytes += sum(len(c) for c in data) if isinstance(data, list) else len(data)

The same correction should be applied server-side in _handle_download — though the server has no visibility into client-side retry state, so an exact fix there may require a retry flag in the request payload.
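The client-side fix above can be isolated into a small retry-aware accumulator, which makes the rule ("count bytes only for first-attempt responses") easy to unit-test. This is a hypothetical helper mirroring the suggestion, not code from download_service.py.

```python
def accumulate_bytes(total_bytes: int, data, is_retry: bool) -> int:
    """Add a chunk's length to the running total, skipping retried
    responses so re-delivered chunks are not double-counted."""
    if is_retry or data is None:
        return total_bytes
    if isinstance(data, list):
        return total_bytes + sum(len(c) for c in data)
    return total_bytes + len(data)
```

At the call site, `is_retry` would be `consecutive_timeouts > 0`, i.e. whether this response answered a re-sent state.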

Comment on lines +103 to +110
def _run_blob_cb(self, future: StreamFuture, stream: Stream, args: tuple, kwargs: dict):
    """Run blob_cb on the callback pool; preserve exception handling (log + task.stop) as in ByteReceiver."""
    try:
        self.blob_cb(future, *args, **kwargs)
    except Exception as ex:
        log.error(f"blob_cb threw: {ex}\n{secure_format_traceback()}")
        if hasattr(stream, "task"):
            stream.task.stop(StreamError(f"blob_cb threw: {ex}"))

Stream may not be stopped if blob_cb raises and stream.task is absent

stream.task is not a guaranteed attribute on the Stream base type; hasattr(stream, "task") will be False for any stream that was not enriched with this attribute, causing the task.stop() call to be silently skipped.

When blob_cb raises and the task is not stopped, _read_stream continues to run on stream_thread_pool — pulling chunks from the stream and writing them to blob_task.buffer — even though no consumer is active. The future is eventually resolved via blob_task.future.set_result(result), but any caller await-ing that future will receive the data with no indication that the callback failed.

Consider propagating the error to the future directly so callers can observe it:

def _run_blob_cb(self, future: StreamFuture, stream: Stream, args: tuple, kwargs: dict):
    try:
        self.blob_cb(future, *args, **kwargs)
    except Exception as ex:
        log.error(f"blob_cb threw: {ex}\n{secure_format_traceback()}")
        future.set_exception(StreamError(f"blob_cb threw: {ex}"))
        if hasattr(stream, "task"):
            stream.task.stop(StreamError(f"blob_cb threw: {ex}"))

Comment on lines +131 to +133
self.ack_waiter.clear()
wait_timeout = min(self.ack_progress_check_interval, self.ack_wait - elapsed)
self.ack_waiter.wait(timeout=wait_timeout)

ack_waiter.clear() may discard an already-arrived ACK signal

An ACK that was received after the previous ack_waiter.wait() returned (either by timeout or by signal) will have called ack_waiter.set(). The next iteration of the inner loop then calls ack_waiter.clear(), discarding that signal before reaching ack_waiter.wait(). The result is an unnecessary extra wait of up to ack_progress_check_interval (default 5 s) before the updated window is re-evaluated.

The standard remedy is to clear the event before entering the wait phase, but after sampling offset_ack, so any ACK that arrives after the sample but before the wait is not lost:

self.ack_waiter.clear()
window = self.offset - self.offset_ack   # re-sample after clear, before wait
if window <= self.window_size:
    break
self.ack_waiter.wait(timeout=wait_timeout)
window = self.offset - self.offset_ack

This is a minor latency issue rather than a correctness bug, since offset_ack is updated regardless of the event state.
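The clear-then-sample-then-wait ordering is a generic `threading.Event` pattern, shown below as a standalone sketch (the `wait_for_ack` helper and its parameters are illustrative, not NVFlare's code). Because `clear()` runs before the shared offset is sampled, any `set()` that races in after the sample still wakes the subsequent `wait()`.

```python
import threading

def wait_for_ack(ack_event: threading.Event, get_offset, last_offset: int,
                 timeout: float = 0.2) -> int:
    """One wait iteration: clear BEFORE sampling shared state so no
    signal arriving after the sample is lost."""
    ack_event.clear()
    offset = get_offset()     # sample after clear, before wait
    if offset > last_offset:
        return offset         # progress already visible: skip the wait
    ack_event.wait(timeout=timeout)
    return get_offset()
```

A stale `set()` from before the `clear()` is safely discarded here, because the fresh sample of the offset already reflects whatever progress that signal announced.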

@YuanTingHsieh (Collaborator) commented:

@nvidianz can you list the PRs related to this one?
For example I think part of #4167 #4171 #4172 is here.
Can you get all the changes?
