Timeout Troubleshooting Guide

This guide covers the most common timeout-related job failures and how to resolve them. For a comprehensive reference of all timeouts, see :ref:`timeouts_programming_guide`.

Symptom: Client fails to receive tasks from server; logs show "timeout" during task fetch.

Common Causes:

  • Large model weights take too long to transfer
  • Network latency exceeds default timeout
  • Tensor streaming timeout exceeds task fetch timeout

Solution: Set get_task_timeout in client config:

recipe.add_client_config({
    "get_task_timeout": 300,  # 5 minutes
})

Applies to: Client API with subprocess launcher (ScriptRunner, ClientAPILauncherExecutor)
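When picking a value, it can help to estimate the raw transfer time of the weights over the slowest expected link and add a generous margin. The helper below is a hypothetical back-of-envelope sketch (suggest_task_timeout is not an NVFlare API):

```python
def suggest_task_timeout(model_mb: float, bandwidth_mbps: float,
                         safety_factor: float = 4.0, floor_s: int = 60) -> int:
    """Back-of-envelope timeout: raw transfer time times a safety margin."""
    transfer_s = model_mb * 8 / bandwidth_mbps  # MB -> megabits -> seconds
    return max(floor_s, int(transfer_s * safety_factor))

# ~1 GB of weights over a 100 Mbps link: 80 s raw transfer
print(suggest_task_timeout(1000, 100))  # -> 320
```

The floor keeps small payloads from producing unrealistically tight timeouts on fast networks.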

Symptom: Job fails before training starts with "external_pre_init_timeout" error.

This timeout controls how long NVFlare waits for your external training script to call flare.init(). When using Client API, NVFlare launches your script as a subprocess and waits for it to connect back.

Common Causes:

  • Large models (LLMs) take time to load before flare.init() is called
  • Heavy library imports (PyTorch, TensorFlow, transformers)
  • Slow disk I/O reading model weights

Solution: Increase external_pre_init_timeout in the executor configuration:

from nvflare.app_common.executors.client_api_launcher_executor import ClientAPILauncherExecutor

executor = ClientAPILauncherExecutor(
    external_pre_init_timeout=600,  # 10 minutes for LLMs
    ...
)
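A practical way to choose this value is to time a standalone dry run of the training script up to the point where flare.init() would be called, then apply a safety factor. The helper below is an illustrative sketch, not part of NVFlare:

```python
def suggest_pre_init_timeout(measured_startup_s: float,
                             safety_factor: float = 3.0,
                             floor_s: int = 60) -> int:
    """Pick external_pre_init_timeout from a timed standalone dry run of the
    script (heavy imports + model load, i.e. everything before flare.init())."""
    return max(floor_s, int(measured_startup_s * safety_factor))

# A dry run that took ~150 s before reaching flare.init():
print(suggest_pre_init_timeout(150))  # -> 450
```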

Symptom: Client marked as dead; logs show "heartbeat timeout" or "client not responding".

Common Causes:

  • Long-running training blocks heartbeat thread
  • Network issues causing missed heartbeats
  • Client overwhelmed with compute

Solution: Adjust heartbeat settings:

# In executor configuration
heartbeat_timeout = 300.0   # 5 minutes
heartbeat_interval = 10.0   # Send every 10 seconds

Rule: heartbeat_interval must be less than heartbeat_timeout.
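The rule above can be checked before submitting a job; the snippet below is a hypothetical validator (not an NVFlare API) that also reports how many consecutive heartbeats may be missed before the timeout fires:

```python
def check_heartbeat_settings(interval_s: float, timeout_s: float) -> int:
    """Enforce interval < timeout; return the number of consecutive
    heartbeats that can be missed before the client is declared dead."""
    if interval_s >= timeout_s:
        raise ValueError("heartbeat_interval must be less than heartbeat_timeout")
    return int(timeout_s // interval_s)

print(check_heartbeat_settings(10.0, 300.0))  # -> 30
```

A headroom of several missed beats (not just one or two) keeps transient network hiccups from killing a healthy client.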

Symptom: Training interrupted before completion; logs show task timeout.

Common Causes:

  • Training round takes longer than expected
  • Data loading is slow
  • Hardware is slower than anticipated

Solution: Set appropriate task timeout in controller:

# ScatterAndGather controller
controller = ScatterAndGather(
    train_timeout=7200,  # 2 hours per round
    wait_time_after_min_received=60,
)

# Or via ModelController
controller = FedAvg(
    num_rounds=100,
    timeout=7200,  # 2 hours per round
)
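One hedged way to pick a per-round timeout is to time the slowest observed round on the slowest site and double it; suggest_train_timeout below is an illustrative helper, not part of NVFlare:

```python
def suggest_train_timeout(slowest_round_s: float,
                          safety_factor: float = 2.0) -> int:
    """train_timeout derived from the slowest observed round on the
    slowest participating site, with a safety margin."""
    return int(slowest_round_s * safety_factor)

# Slowest round took ~1 hour during a trial run:
print(suggest_train_timeout(3600))  # -> 7200
```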

Symptom: Training completes but result submission fails.

Common Causes:

  • Large model results take time to transfer
  • Network congestion

Solution: Set submit_task_result_timeout:

recipe.add_client_config({
    "submit_task_result_timeout": 300,  # 5 minutes
})

Symptom: Model evaluation fails or times out during cross-site validation.

Solution: Adjust evaluation timeouts:

from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe

recipe = NumpyCrossSiteEvalRecipe(
    submit_model_timeout=900,      # 15 min for model submission
    validation_timeout=7200,       # 2 hours for validation
)

======================================================  ========  =============================================
Timeout                                                 Default   When to Increase
======================================================  ========  =============================================
get_task_timeout                                        None      Large models, slow networks, tensor streaming
submit_task_result_timeout                              None      Large result payloads
external_pre_init_timeout (Client API subprocess only)  60-300s   LLMs, heavy imports before flare.init()
heartbeat_timeout                                       60-300s   Long training iterations, slow networks
train_timeout                                           0         Long training rounds
validation_timeout                                      6000s     Large validation datasets
progress_timeout                                        3600s     Complex multi-round workflows
======================================================  ========  =============================================

# Client-side timeouts (applies to all clients)
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})

# Or for specific clients
recipe.add_client_config({
    "get_task_timeout": 600,
}, clients=["site-1", "site-2"])

application.conf (job-level):

get_task_timeout = 300.0
submit_task_result_timeout = 300.0

# Server startup/dead-job safety flags
strict_start_job_reply_check = false
sync_client_jobs_require_previous_report = true

Server-side safety flags guidance (see :ref:`server_startup_dead_job_safety_flags` for full details):

  • strict_start_job_reply_check (default false): keep default for backward-compatible startup behavior; set to true to enforce stricter START_JOB reply checks.
  • sync_client_jobs_require_previous_report (default true): keep enabled to avoid false dead-job reports caused by transient startup or sync races.

comm_config.json (system-level, in startup kit):

{
  "heartbeat_interval": 10,
  "streaming_read_timeout": 600
}
# Shorter timeouts (e.g., small models, quick local tests)
recipe.add_client_config({
    "get_task_timeout": 120,
})

# Moderate timeouts (e.g., typical production jobs)
recipe.add_client_config({
    "get_task_timeout": 600,
    "submit_task_result_timeout": 600,
})

# Long timeouts (e.g., very large models)
recipe.add_client_config({
    "get_task_timeout": 1200,
    "submit_task_result_timeout": 1200,
})
# Longer communication timeouts
recipe.add_client_config({
    "get_task_timeout": 600,
    "submit_task_result_timeout": 600,
})

System-level (comm_config.json in startup kit):

{
  "heartbeat_interval": 15,
  "streaming_read_timeout": 600
}
Debugging tips:

  1. Check logs for "timeout" messages to identify which timeout triggered
  2. Enable debug logging to see detailed timing information
  3. Monitor heartbeat status in the admin console
  4. Start with longer timeouts during development, then tighten them
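The first tip can be scripted: scan the client or server log for lines mentioning a timeout and report where they occur. The log lines below are made-up samples for illustration:

```python
import re

sample_log = """\
2024-05-01 10:00:01 INFO  ClientRunner - task fetched: train
2024-05-01 10:05:01 ERROR Communicator - get_task_timeout exceeded
2024-05-01 10:06:11 ERROR ServerRunner - heartbeat timeout: site-1 not responding
"""

# Surface every line mentioning a timeout, with its line number
hits = [(i, line) for i, line in enumerate(sample_log.splitlines(), 1)
        if re.search(r"timeout", line, re.IGNORECASE)]
for lineno, line in hits:
    print(f"{lineno}: {line}")
```

Running the same scan with a path to a real job log quickly narrows down which of the timeouts in the table above actually fired.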

For timeout hierarchies, relationships, and all available timeout parameters, see the comprehensive :ref:`timeouts_programming_guide`.