This guide covers the most common timeout-related job failures and how to resolve them. For a comprehensive reference of all timeouts, see :ref:`timeouts_programming_guide`.
Symptom: Client fails to receive tasks from server; logs show "timeout" during task fetch.
Common Causes:
- Large model weights take too long to transfer
- Network latency exceeds default timeout
- Tensor streaming timeout exceeds task fetch timeout
Solution: Set get_task_timeout in client config:
recipe.add_client_config({
    "get_task_timeout": 300,  # 5 minutes
})

Applies to: Client API with subprocess launcher (ScriptRunner, ClientAPILauncherExecutor)
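To pick a value, estimate how long the task payload actually takes to transfer and leave headroom. A back-of-envelope sketch (all numbers here are illustrative assumptions about one deployment, not NVFlare defaults):

```python
# Rough sizing for get_task_timeout: payload transfer time plus headroom.
model_size_gb = 13        # e.g. 7B-parameter weights in fp16 (assumed)
bandwidth_mb_s = 100      # effective client<->server throughput (assumed)

transfer_s = model_size_gb * 1024 / bandwidth_mb_s  # ~133 s for this case
get_task_timeout = max(300, int(transfer_s * 2))    # 2x headroom, 5 min floor
print(get_task_timeout)  # 300 here; larger models or slower links need more
```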
Symptom: Job fails before training starts with "external_pre_init_timeout" error.
This timeout controls how long NVFlare waits for your external training script to call flare.init().
When using Client API, NVFlare launches your script as a subprocess and waits for it to connect back.
Common Causes:
- Large models (LLMs) take time to load before flare.init() is called
- Heavy library imports (PyTorch, TensorFlow, transformers)
- Slow disk I/O reading model weights
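Of the causes above, import cost is the easiest to measure, using only the standard library (`json` below is a stand-in for heavy libraries such as torch or transformers):

```python
import importlib
import time

def timed_import(name: str) -> float:
    """Import a module and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start

# 'json' stands in for heavy imports, which can take tens of seconds
# and count against external_pre_init_timeout before flare.init() runs.
elapsed = timed_import("json")
print(f"import took {elapsed:.4f}s")
```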
Solution: Increase external_pre_init_timeout in the executor configuration:
from nvflare.app_common.executors.client_api_launcher_executor import ClientAPILauncherExecutor
executor = ClientAPILauncherExecutor(
    external_pre_init_timeout=600,  # 10 minutes for LLMs
    ...
)

Symptom: Client marked as dead; logs show "heartbeat timeout" or "client not responding".
Common Causes:
- Long-running training blocks heartbeat thread
- Network issues causing missed heartbeats
- Client overwhelmed with compute
Solution: Adjust heartbeat settings:
# In executor configuration
heartbeat_timeout = 300.0 # 5 minutes
heartbeat_interval = 10.0  # Send every 10 seconds

Rule: heartbeat_interval must be less than heartbeat_timeout.
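The rule can be encoded in a small check (an illustrative helper, not an NVFlare API) that also shows how many consecutive missed heartbeats the settings tolerate:

```python
def check_heartbeat_settings(interval: float, timeout: float) -> int:
    """Enforce interval < timeout and return how many consecutive
    heartbeats can be missed before the client is declared dead."""
    if interval >= timeout:
        raise ValueError("heartbeat_interval must be less than heartbeat_timeout")
    return int(timeout // interval)

# With the settings above: 300 / 10 -> up to 30 missed heartbeats of slack.
print(check_heartbeat_settings(10.0, 300.0))  # 30
```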
Symptom: Training interrupted before completion; logs show task timeout.
Common Causes:
- Training round takes longer than expected
- Data loading is slow
- Hardware is slower than anticipated
Solution: Set appropriate task timeout in controller:
# ScatterAndGather controller
controller = ScatterAndGather(
    train_timeout=7200,  # 2 hours per round
    wait_time_after_min_received=60,
)

# Or via ModelController
controller = FedAvg(
    num_rounds=100,
    timeout=7200,  # 2 hours per round
)

Symptom: Training completes but result submission fails.
Common Causes:
- Large model results take time to transfer
- Network congestion
Solution: Set submit_task_result_timeout:
recipe.add_client_config({
    "submit_task_result_timeout": 300,  # 5 minutes
})

Symptom: Model evaluation fails or times out during cross-site validation.
Solution: Adjust evaluation timeouts:
from nvflare.app_common.np.recipes import NumpyCrossSiteEvalRecipe
recipe = NumpyCrossSiteEvalRecipe(
    submit_model_timeout=900,   # 15 min for model submission
    validation_timeout=7200,    # 2 hours for validation
)

| Timeout | Default | When to Increase |
|---|---|---|
| get_task_timeout | None | Large models, slow networks, tensor streaming |
| submit_task_result_timeout | None | Large result payloads |
| external_pre_init_timeout (Client API subprocess only) | 60-300s | LLMs, heavy imports before flare.init() |
| heartbeat_timeout | 60-300s | Long training iterations, slow networks |
| train_timeout | 0 | Long training rounds |
| validation_timeout | 6000s | Large validation datasets |
| progress_timeout | 3600s | Complex multi-round workflows |
# Client-side timeouts (applies to all clients)
recipe.add_client_config({
    "get_task_timeout": 300,
    "submit_task_result_timeout": 300,
})
# Or for specific clients
recipe.add_client_config({
    "get_task_timeout": 600,
}, clients=["site-1", "site-2"])

application.conf (job-level):
get_task_timeout = 300.0
submit_task_result_timeout = 300.0

# Server startup/dead-job safety flags
strict_start_job_reply_check = false
sync_client_jobs_require_previous_report = true
Server-side safety flags guidance (see :ref:`server_startup_dead_job_safety_flags` for full details):
- strict_start_job_reply_check (default false): keep the default for backward-compatible startup behavior; set to true to enforce stricter START_JOB reply checks.
- sync_client_jobs_require_previous_report (default true): keep enabled to avoid false dead-job reports caused by transient startup or sync races.
comm_config.json (system-level, in startup kit):
{
    "heartbeat_interval": 10,
    "streaming_read_timeout": 600
}

recipe.add_client_config({
    "get_task_timeout": 120,
})

recipe.add_client_config({
    "get_task_timeout": 600,
    "submit_task_result_timeout": 600,
})

recipe.add_client_config({
    "get_task_timeout": 1200,
    "submit_task_result_timeout": 1200,
})

# Longer communication timeouts
recipe.add_client_config({
    "get_task_timeout": 600,
    "submit_task_result_timeout": 600,
})

System-level (comm_config.json in startup kit):
{
    "heartbeat_interval": 15,
    "streaming_read_timeout": 600
}

- Check logs for "timeout" messages to identify which timeout triggered
- Enable debug logging to see detailed timing information
- Monitor heartbeat status in admin console
- Start with longer timeouts during development, then optimize
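The first tip is easy to script with only the standard library (the log path and contents below are made up for illustration; point this at your real workspace logs):

```python
import tempfile
from pathlib import Path

# Write a sample log, then scan it for timeout-related lines.
log = Path(tempfile.gettempdir()) / "sample_nvflare_log.txt"
log.write_text("INFO ok\nERROR get_task_timeout exceeded\n")

hits = [
    (lineno, line)
    for lineno, line in enumerate(log.read_text().splitlines(), start=1)
    if "timeout" in line.lower()
]
print(hits)  # [(2, 'ERROR get_task_timeout exceeded')]
```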
For timeout hierarchies, relationships, and all available timeout parameters, see the comprehensive :ref:`timeouts_programming_guide`.