
Merge F3 2.7 improvements to main #4280

Open
nvidianz wants to merge 3 commits into NVIDIA:main from nvidianz:merge-f3-2.7-to-main

Conversation

@nvidianz (Collaborator) commented Mar 9, 2026

Summary

This PR merges a set of F3 (fuel/f3) reliability and performance improvements from the 2.7 branch into main:

  • Send timeout & stall detection: SocketConnection now uses a select-based send loop with a configurable timeout (default 30s). HeartbeatMonitor detects stalled connections and optionally closes them after consecutive stall checks (sfm_send_stall_timeout, sfm_close_stalled_connection, sfm_send_stall_consecutive_checks config vars).
  • Streaming ACK progress timeout: TxTask tracks the last time ACK offset advanced and aborts if no progress is made within a configurable window (streaming_ack_progress_timeout, default 60s), preventing indefinite hangs from a stuck receiver.
  • Stream key collision fix: RxTask now keys its task map on (origin, sid) instead of just sid, preventing stream collisions when multiple origins use the same stream ID.
  • Separate callback thread pool: Blob callbacks are dispatched on a dedicated callback_thread_pool (64 threads) to avoid starving the main stream thread pool.
  • Retry on download timeout: download_object now retries on TIMEOUT responses with exponential backoff (up to max_retries=3 by default), with recovery/failure logging.
  • Memory release after download: CacheableObject.release() nulls base_obj after transaction_done_cb fires, allowing GC to immediately reclaim large objects (e.g. numpy dicts).
  • PASS_THROUGH header support: New MessageHeaderKey.PASS_THROUGH and FOBSContextKey.PASS_THROUGH keys enable lazy tensor downloading at forwarding nodes (foundation for pass-through architecture).
  • Shutdown race fix in ConnManager: stopped=True is set before executor shutdown; frame_mgr_executor.submit() calls are guarded against RuntimeError after shutdown.
  • Download stats logging: _Transaction and download_object now log elapsed time and total bytes on completion.
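To illustrate the first bullet, here is a minimal sketch of a select-based send loop with a stall timeout. The function name and shape are illustrative, not NVFlare's actual SocketConnection API; it only shows the general technique (wait for writability with a deadline, send in chunks, raise on stall).

```python
import select
import socket
import time

def send_with_timeout(sock: socket.socket, data: bytes, timeout: float = 30.0) -> None:
    """Send all bytes; raise TimeoutError if the socket stays unwritable
    past `timeout`. Hypothetical sketch of a select-based send loop."""
    deadline = time.monotonic() + timeout
    view = memoryview(data)
    while view:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"send stalled for {timeout}s")
        # Block until the socket is writable, but never past the deadline
        _, writable, _ = select.select([], [sock], [], remaining)
        if not writable:
            raise TimeoutError(f"send stalled for {timeout}s")
        sent = sock.send(view)
        view = view[sent:]
```

Unlike a plain `sendall()`, this loop regains control on every partial send, so a peer that stops draining its receive buffer cannot block the sender indefinitely.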

Test plan

  • Run existing unit tests: python3 -m pytest tests/unit_test/fuel/f3/ -v
  • Verify no regressions in streaming and cellnet tests
  • Test send-timeout behavior under a slow/blocked receiver
  • Verify stall detection logs warnings and closes connections when configured
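For the stall-detection check in the last bullet, the per-tick decision described in the summary (count consecutive stalls, close only after `sfm_send_stall_consecutive_checks` hits) can be modeled as a small pure function. Defaults and names below follow the PR description's config variables; the helper itself is a hypothetical sketch, not the HeartbeatMonitor code.

```python
def check_stall(stall_sec: float, count: int, stall_timeout: float = 45.0,
                close_on_stall: bool = True, consecutive_checks: int = 3):
    """One heartbeat tick: return (new_count, should_close).

    A tick with stall_sec above the threshold increments the counter;
    the connection is closed only after `consecutive_checks` stalled
    ticks in a row. Any non-stalled tick resets the counter to 0.
    """
    if stall_sec > stall_timeout:
        count += 1
        return count, close_on_stall and count >= consecutive_checks
    return 0, False
```

Requiring consecutive stalled checks (rather than closing on the first) avoids tearing down a connection over one transient slow send.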

🤖 Generated with Claude Code

greptile-apps bot (Contributor) commented Mar 9, 2026

Greptile Summary

This PR backports a suite of F3 reliability and performance improvements from the 2.7 branch to main, covering: send timeout and stall detection at the socket/SFM layer, ACK progress tracking in TxTask, stream key collision fixes in RxTask, a dedicated callback thread pool for blob streaming, retry-with-backoff in download_object, memory release via CacheableObject.release(), the PASS_THROUGH header/context architecture for lazy tensor forwarding, and shutdown race fixes in ConnManager.

Key observations:

  • byte_streamer.py: The nested wait loop changes ack_wait semantics from a per-ACK-wait to a total window-clear budget. The outer while window > self.window_size loop is functionally equivalent to if (the inner loop always exits with window ≤ window_size), which may confuse readers. The ack_waiter.clear() placement can cause unnecessary extra waits up to ack_progress_check_interval (5 s) when an ACK arrives just before clear() is called.
  • download_service.py: total_bytes stats are double-counted on retry because resending the same state causes the producer to re-generate the same chunk, and the byte accumulation runs on every successful response without detecting retries.
  • blob_streamer.py: If blob_cb raises and stream.task is not present on the Stream object, the exception is logged but not propagated to the StreamFuture, leaving callers with no indication of failure.
  • All other changes (socket send timeout, stall detection, stream key deduplication, shutdown guards, PASS_THROUGH architecture) look correct and well-implemented.

Confidence Score: 4/5

  • Safe to merge with minor correctness concerns in the streaming ACK wait loop and retry byte-counting accuracy.
  • The majority of changes are well-structured bug fixes and reliability improvements. The three flagged issues (ack_wait semantics change, double-counted retry bytes, blob_cb error not propagating to future) are either minor logging inaccuracies or edge-case latency issues rather than data-corruption or availability bugs. The stream key collision fix, shutdown race fix, and send-timeout improvements are solid.
  • nvflare/fuel/f3/streaming/byte_streamer.py deserves the most attention — the nested wait loop's semantics change around ack_wait and the ack_waiter.clear() placement should be verified against expected behavior under slow receivers.

Important Files Changed

Filename Overview
nvflare/fuel/f3/streaming/byte_streamer.py Adds ACK progress timeout with nested wait loop; the outer while is effectively an if, ack_wait semantics changed to total window-clear budget, and ack_waiter.clear() placement can cause unnecessary 5s delays.
nvflare/fuel/f3/streaming/download_service.py Adds retry-on-timeout with exponential backoff, download stats logging, and release() for GC; total_bytes is double-counted on retried chunks (minor logging inaccuracy).
nvflare/fuel/f3/streaming/blob_streamer.py Dispatches blob callbacks to dedicated thread pool to avoid starvation; error in blob_cb may not propagate to the future if stream.task is absent.
nvflare/fuel/f3/drivers/socket_conn.py Replaces sendall with a select-based send loop with a configurable timeout; error classification helpers are clean and well-structured.
nvflare/fuel/f3/sfm/conn_manager.py Fixes shutdown race: stopped=True now set before executor shutdown, and all executor submits are guarded with try/except RuntimeError.
nvflare/fuel/f3/sfm/heartbeat_monitor.py Adds stall detection with configurable timeout and close-on-stall policy; stale connection keys are cleaned up correctly after connections are removed.
nvflare/fuel/f3/sfm/sfm_conn.py Adds send_state_lock and send_started_at tracking for stall detection; get_send_stall_seconds() is thread-safe and correctly returns 0 when no send is in progress.
nvflare/fuel/f3/streaming/byte_receiver.py Stream key collision fix: rx_task_map now keyed on (origin, sid) tuple instead of sid alone, preventing stream aliasing when multiple origins use the same SID.
nvflare/fuel/f3/streaming/cacheable.py Adds release() to null base_obj after transaction callback, enabling immediate GC; defensive guard in _get_item handles post-release requests correctly.
nvflare/fuel/f3/cellnet/utils.py Normalizes fobs_ctx to a local copy to avoid mutating caller-owned context; injects PASS_THROUGH key when the message header is present.
nvflare/fuel/f3/streaming/stream_utils.py Adds callback_thread_pool (64 threads) for blob callbacks; stream_shutdown() correctly shuts down callback_thread_pool before stream_thread_pool.
nvflare/fuel/f3/comm_config.py Adds configuration accessors for new timeout/stall-detection variables; all new getters follow the established pattern.
nvflare/fuel/f3/cellnet/defs.py Adds PASS_THROUGH header key with detailed documentation describing the pass-through tensor architecture.
nvflare/fuel/utils/fobs/__init__.py Adds PASS_THROUGH and DOWNLOAD_COMPLETE_CB context keys with clear documentation of their role in the pass-through architecture.

Sequence Diagram

sequenceDiagram
    participant Sender as TxTask (ByteStreamer)
    participant SC as SocketConnection
    participant HB as HeartbeatMonitor
    participant Receiver as RxTask (ByteReceiver)
    participant DS as DownloadService

    Note over Sender,SC: Send with timeout (select-based)
    Sender->>SC: send_frame(buffer)
    SC->>SC: _send_with_timeout(frame, 30s)
    SC-->>Sender: CommError(TIMEOUT) if stalled > 30s

    Note over HB,SC: Stall detection
    loop every 5s tick
        HB->>SC: get_send_stall_seconds()
        alt stall_sec > sfm_send_stall_timeout (45s)
            HB->>HB: stall_counts[conn]++
            opt close_stalled_connection && count >= stall_consecutive_checks
                HB->>SC: conn.close()
            end
        else no stall
            HB->>HB: stall_counts[conn] = 0
            HB->>Receiver: send_heartbeat(PING)
        end
    end

    Note over Sender,Receiver: ACK progress tracking
    Sender->>Receiver: stream chunk (fire_and_forget)
    Receiver-->>Sender: ACK (offset)
    Sender->>Sender: last_ack_progress_ts = now
    alt no ACK progress for ack_progress_timeout (60s)
        Sender->>Sender: stop(StreamError)
    else total wait > ack_wait (300s)
        Sender->>Sender: stop(StreamError)
    end

    Note over Sender,DS: Object download with retry
    Sender->>DS: download_object(ref_id, state)
    DS-->>Sender: ReturnCode.TIMEOUT
    loop retry (max 3, backoff 2/4/8s)
        Sender->>DS: download_object(ref_id, same_state)
        DS-->>Sender: ReturnCode.OK + chunk
    end
    Sender->>Sender: consumer.consume(chunk)
    DS-->>Sender: ProduceRC.EOF
    Sender->>Sender: log elapsed + total_bytes

Last reviewed commit: 0b49ead

Comment on lines +120 to +134
while window > self.window_size:
    now = time.monotonic()
    if now - self.last_ack_progress_ts >= self.ack_progress_timeout:
        self.stop(StreamError(f"{self} ACK made no progress for {self.ack_progress_timeout} seconds"))
        return

    window = self.offset - self.offset_ack
    elapsed = now - wait_start
    if elapsed >= self.ack_wait:
        self.stop(StreamError(f"{self} ACK timeouts after {self.ack_wait} seconds"))
        return

    self.ack_waiter.clear()
    wait_timeout = min(self.ack_progress_check_interval, self.ack_wait - elapsed)
    self.ack_waiter.wait(timeout=wait_timeout)
    window = self.offset - self.offset_ack

ack_wait now acts as a total window-clear budget, not a per-ACK wait

The original code waited up to ack_wait seconds for each individual ACK signal. With the nested loop, wait_start is set once when the outer while window > self.window_size: is first entered, so the elapsed >= self.ack_wait guard is now a total budget for the entire window-clearing period.

If multiple ACKs are needed to drain a large window (e.g., STREAM_WINDOW_SIZE = 16 MB with 1 MB ACKs), the old code permitted up to N * ack_wait aggregate wait. The new code caps the entire clearing sequence at one ack_wait (default 300 s). In practice the ack_progress_timeout (60 s) will usually fire first, but if slow but steady ACKs keep arriving — resetting last_ack_progress_ts repeatedly — only the ack_wait guard prevents an indefinite wait, and it now fires 16× sooner than before for that scenario.

Additionally, the outer while window > self.window_size: loop is functionally equivalent to if window > self.window_size: here: the inner loop always exits with window ≤ window_size (the condition that breaks it), so the outer while re-evaluates to False immediately. Consider replacing the outer while with if and documenting the intent, or resetting wait_start correctly inside the outer loop if the intent is to allow multiple stall periods.
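One way to restore per-ACK-progress semantics is to reset `wait_start` whenever the window actually shrinks, so `ack_wait` again budgets each stall period rather than the whole window-clearing sequence. The sketch below is a standalone model of that idea (hypothetical `wait_for_window` helper; `get_window`/`ack_event` stand in for the TxTask state), not a drop-in patch for byte_streamer.py.

```python
import threading
import time

def wait_for_window(get_window, window_size, ack_event: threading.Event,
                    ack_wait: float = 2.0, check_interval: float = 0.1):
    """Block until the in-flight window drops to window_size or less.

    wait_start resets on every ACK progress, so ack_wait bounds each
    individual stall period, not the total window-clearing time.
    """
    wait_start = time.monotonic()
    last_window = get_window()
    while last_window > window_size:
        elapsed = time.monotonic() - wait_start
        if elapsed >= ack_wait:
            raise TimeoutError(f"no ACK progress for {ack_wait}s")
        ack_event.clear()
        window = get_window()          # re-sample after clear, before wait
        if window < last_window:       # progress: reset the stall budget
            wait_start = time.monotonic()
            last_window = window
            continue
        ack_event.wait(timeout=min(check_interval, ack_wait - elapsed))
        window = get_window()
        if window < last_window:
            wait_start = time.monotonic()
            last_window = window
    return last_window
```

With this shape, slow-but-steady ACKs can take longer than one `ack_wait` in aggregate (matching the old N * ack_wait behavior), while a genuinely stuck receiver still trips the timeout within one `ack_wait` of its last progress.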

Comment on lines +705 to +706
if data is not None:
    total_bytes += sum(len(c) for c in data) if isinstance(data, list) else len(data)

total_bytes is inflated when retries re-produce the same chunk

On a TIMEOUT retry, current_state is kept the same, so the server re-invokes produce() for that state and returns the identical chunk. Both the client (line ~706) and the server (line ~496 in _handle_download) add the chunk length to total_bytes on every successful response, including retried ones.

This means a chunk that required 2 attempts to deliver is counted twice. The elapsed-time / byte-count summary logged at EOF will overstate the actual bytes transferred.

# Only accumulate bytes for new (non-retry) responses to keep total_bytes accurate.
if consecutive_timeouts == 0 and data is not None:
    total_bytes += sum(len(c) for c in data) if isinstance(data, list) else len(data)

The same correction should be applied server-side in _handle_download — though the server has no visibility into client-side retry state, so an exact fix there may require a retry flag in the request payload.
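The client-side fix above can be isolated into a small retry-aware accumulator, which makes the rule ("count bytes only for first-attempt responses") easy to unit-test. This is a hypothetical helper mirroring the suggestion, not code from download_service.py.

```python
def accumulate_bytes(total_bytes: int, data, is_retry: bool) -> int:
    """Add a chunk's length to the running total, skipping retried
    responses so re-delivered chunks are not double-counted."""
    if is_retry or data is None:
        return total_bytes
    if isinstance(data, list):
        return total_bytes + sum(len(c) for c in data)
    return total_bytes + len(data)
```

At the call site, `is_retry` would be `consecutive_timeouts > 0`, i.e. whether this response answered a re-sent state.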

Comment on lines +103 to +110
def _run_blob_cb(self, future: StreamFuture, stream: Stream, args: tuple, kwargs: dict):
    """Run blob_cb on the callback pool; preserve exception handling (log + task.stop) as in ByteReceiver."""
    try:
        self.blob_cb(future, *args, **kwargs)
    except Exception as ex:
        log.error(f"blob_cb threw: {ex}\n{secure_format_traceback()}")
        if hasattr(stream, "task"):
            stream.task.stop(StreamError(f"blob_cb threw: {ex}"))

Stream may not be stopped if blob_cb raises and stream.task is absent

stream.task is not a guaranteed attribute on the Stream base type; hasattr(stream, "task") will be False for any stream that was not enriched with this attribute, causing the task.stop() call to be silently skipped.

When blob_cb raises and the task is not stopped, _read_stream continues to run on stream_thread_pool — pulling chunks from the stream and writing them to blob_task.buffer — even though no consumer is active. The future is eventually resolved via blob_task.future.set_result(result), but any caller await-ing that future will receive the data with no indication that the callback failed.

Consider propagating the error to the future directly so callers can observe it:

def _run_blob_cb(self, future: StreamFuture, stream: Stream, args: tuple, kwargs: dict):
    try:
        self.blob_cb(future, *args, **kwargs)
    except Exception as ex:
        log.error(f"blob_cb threw: {ex}\n{secure_format_traceback()}")
        future.set_exception(StreamError(f"blob_cb threw: {ex}"))
        if hasattr(stream, "task"):
            stream.task.stop(StreamError(f"blob_cb threw: {ex}"))

Comment on lines +131 to +133
self.ack_waiter.clear()
wait_timeout = min(self.ack_progress_check_interval, self.ack_wait - elapsed)
self.ack_waiter.wait(timeout=wait_timeout)

ack_waiter.clear() may discard an already-arrived ACK signal

An ACK that was received after the previous ack_waiter.wait() returned (either by timeout or by signal) will have called ack_waiter.set(). The next iteration of the inner loop then calls ack_waiter.clear(), discarding that signal before reaching ack_waiter.wait(). The result is an unnecessary extra wait of up to ack_progress_check_interval (default 5 s) before the updated window is re-evaluated.

The standard remedy is to clear the event before entering the wait phase, but after sampling offset_ack, so any ACK that arrives after the sample but before the wait is not lost:

self.ack_waiter.clear()
window = self.offset - self.offset_ack   # re-sample after clear, before wait
if window <= self.window_size:
    break
self.ack_waiter.wait(timeout=wait_timeout)
window = self.offset - self.offset_ack

This is a minor latency issue rather than a correctness bug, since offset_ack is updated regardless of the event state.
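The clear-then-sample-then-wait ordering is a generic `threading.Event` pattern, shown below as a standalone sketch (the `wait_for_ack` helper and its parameters are illustrative, not NVFlare's code). Because `clear()` runs before the shared offset is sampled, any `set()` that races in after the sample still wakes the subsequent `wait()`.

```python
import threading

def wait_for_ack(ack_event: threading.Event, get_offset, last_offset: int,
                 timeout: float = 0.2) -> int:
    """One wait iteration: clear BEFORE sampling shared state so no
    signal arriving after the sample is lost."""
    ack_event.clear()
    offset = get_offset()     # sample after clear, before wait
    if offset > last_offset:
        return offset         # progress already visible: skip the wait
    ack_event.wait(timeout=timeout)
    return get_offset()
```

A stale `set()` from before the `clear()` is safely discarded here, because the fresh sample of the offset already reflects whatever progress that signal announced.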

@YuanTingHsieh (Collaborator) commented:

@nvidianz can you list the PRs related to this one?
For example I think part of #4167 #4171 #4172 is here.
Can you get all the changes?
