[Bug] @timeout decorator crashes on Windows — uses SIGALRM which is Unix-only #2981

@adity1raut

Description

The @timeout decorator uses signal.SIGALRM and signal.alarm() to enforce step
time limits. Both of these are Unix-only and do not exist on Windows. Any flow
that uses @timeout will raise an AttributeError on Windows at task execution time.

The same Unix-only pattern also exists in the card CLI. The two call sites are:

metaflow/plugins/timeout_decorator.py lines 73–74:

signal.signal(signal.SIGALRM, self._sigalrm_handler)
signal.alarm(self.secs)

metaflow/plugins/cards/card_cli.py lines 164–166:

signal.signal(signal.SIGALRM, raise_timeout)
signal.alarm(time)

There is no platform guard anywhere in these files. The AttributeError is raised at
task execution time (inside task_pre_step), not at import or decoration time,
so Windows users receive no warning until the flow is actually running.


Steps to Reproduce

On a Windows machine:

# my_flow.py
from metaflow import FlowSpec, step, timeout

class MyFlow(FlowSpec):
    @timeout(seconds=30)
    @step
    def start(self):
        import time
        time.sleep(5)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()

python my_flow.py run

Runtime: local, Windows

Where evidence shows up: task logs / parent console

Before (error / log snippet)
AttributeError: module 'signal' has no attribute 'SIGALRM'

  File "metaflow/plugins/timeout_decorator.py", line 73, in task_pre_step
    signal.signal(signal.SIGALRM, self._sigalrm_handler)

The error appears mid-run when the step task starts executing, not at import time.
By then the flow has already initialized and started running, which makes the
failure misleading for users.

After (expected behavior)
# Option A — timeout works via threading.Timer on Windows
Metaflow [1234/start/1 (pid 5678)] Task is starting.
Metaflow [1234/start/1 (pid 5678)] Step timed out after 30 seconds.

# Option B — clear error at decoration time, not mid-run
MetaflowException: @timeout is not supported on Windows.
Please use a cloud compute backend (Kubernetes, Batch) where tasks run on Linux.

Current Behavior

task_pre_step in TimeoutDecorator references signal.SIGALRM and calls
signal.alarm() without any platform check:

# timeout_decorator.py:70–74
if ubf_context != UBF_CONTROL and retry_count <= max_user_code_retries:
    self.step_name = step_name
    signal.signal(signal.SIGALRM, self._sigalrm_handler)   # AttributeError on Windows
    signal.alarm(self.secs)                                 # AttributeError on Windows

There is no sys.platform check, no hasattr(signal, 'SIGALRM') guard, and no
documentation noting that @timeout is Linux/macOS-only.
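
As a minimal sketch of the missing guard, a capability probe can check for the attributes directly instead of comparing sys.platform strings. The helper name sigalrm_supported is hypothetical and is not part of Metaflow:

```python
import signal


def sigalrm_supported():
    """Return True when this platform exposes both SIGALRM and signal.alarm().

    Hypothetical helper for illustration; it does not exist in Metaflow.
    """
    return hasattr(signal, "SIGALRM") and hasattr(signal, "alarm")
```

Probing with hasattr would also cover any other platform whose signal module lacks these names, which a check against "win32" alone would miss.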


Expected Behavior

One of two paths (maintainer's choice):

Option A — Cross-platform implementation using threading.Timer

import signal
import sys
import threading

# These functions take self, so they are meant to live in the decorator class body.
if sys.platform == "win32":
    # SIGALRM is not available on Windows, so use a daemon thread timer instead.
    def _start_timeout(self):
        self._timer = threading.Timer(self.secs, self._timeout_handler)
        self._timer.daemon = True
        self._timer.start()

    def _cancel_timeout(self):
        # Guard against cancel being called before the timer was started.
        if hasattr(self, "_timer"):
            self._timer.cancel()
else:
    def _start_timeout(self):
        signal.signal(signal.SIGALRM, self._sigalrm_handler)
        signal.alarm(self.secs)

    def _cancel_timeout(self):
        # alarm(0) cancels any pending alarm.
        signal.alarm(0)

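threading.Timer runs _timeout_handler on a background thread, and the handler
raises TimeoutException in the main thread using
ctypes.pythonapi.PyThreadState_SetAsyncExc. The exception is delivered only when
the main thread next executes Python bytecode, so a blocking C-level call (such
as time.sleep) is not interrupted until it returns.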
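
The following is a rough, standalone sketch of the Option A mechanism, not Metaflow code. The TimeoutException class, _raise_in_thread, and run_with_timeout are hypothetical names used only for illustration; threading.Timer and PyThreadState_SetAsyncExc are the real standard-library and C API pieces.

```python
import ctypes
import threading


class TimeoutException(Exception):
    """Hypothetical stand-in for the exception raised on timeout."""


def _raise_in_thread(thread_id, exc_type):
    # Ask the interpreter to raise exc_type in the thread with this id.
    # The call returns the number of thread states it modified.
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(exc_type)
    )
    if modified > 1:
        # More than one thread state was touched: undo the request (NULL exc).
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_ulong(thread_id), None)
        raise SystemError("PyThreadState_SetAsyncExc affected multiple threads")
    return modified


def run_with_timeout(func, secs):
    """Run func() in the calling thread, raising TimeoutException after secs."""
    target = threading.get_ident()
    timer = threading.Timer(secs, _raise_in_thread, args=(target, TimeoutException))
    timer.daemon = True
    timer.start()
    try:
        return func()
    finally:
        # Always cancel so a fast func() does not get a late exception.
        timer.cancel()
```

One design caveat: there is a small window in which the timer can fire just as func() returns, so a production version would likely need extra synchronization around the cancel.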

Option B — Fail fast with a clear error at step_init time

def step_init(self, flow, graph, step, decos, environment, flow_datastore, logger):
    import sys
    if sys.platform == "win32":
        raise MetaflowException(
            "@timeout is not supported on Windows because it relies on "
            "POSIX signals (SIGALRM). Use a Linux-based compute backend "
            "(e.g. @kubernetes or @batch) to use @timeout."
        )
    ...

Option B is lower risk and gives a clear, early error message instead of a cryptic
AttributeError mid-run.


Root Cause

signal.SIGALRM is defined in POSIX (Unix/Linux/macOS) but is absent from
the Windows signal module. Python's standard library documents this clearly:

signal.SIGALRM — Not available on Windows.
signal.alarm(time) — Not available on Windows.

The decorator was written assuming a Unix task execution environment, which is
reasonable for cloud compute (Kubernetes pods, Batch containers all run Linux).
However, local execution runs directly on the user's OS, and Metaflow supports
local runs on Windows. The @timeout decorator is particularly likely to be used
in local development/testing, exactly where Windows users will hit this.


Impact

Scenario                                            | Effect
Windows user adds @timeout to any step              | AttributeError mid-run, flow fails
Windows user tries local development with @timeout  | Broken: cannot test locally
No warning at decoration or import time             | Error appears late, confusing to diagnose
card_cli.py also uses SIGALRM                       | Card rendering also broken on Windows

Affected Files

File                                   | Lines           | Issue
metaflow/plugins/timeout_decorator.py  | L73–74, L79     | SIGALRM + alarm() with no platform guard
metaflow/plugins/cards/card_cli.py     | L164–166, L173  | Same pattern in card timeout context manager
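
For the card timeout context manager, one possible shape of a guarded version is sketched below. This is not the actual card_cli.py code; CardTimeout and guarded_timeout are hypothetical names, and the fallback of running without a time limit when SIGALRM is missing is an assumption about the desired behavior.

```python
import signal
from contextlib import contextmanager


class CardTimeout(Exception):
    """Hypothetical exception type for this sketch."""


@contextmanager
def guarded_timeout(secs):
    """Enforce a time limit via SIGALRM where available.

    Yields True if the limit is enforced, False if the platform has no SIGALRM.
    """
    if not hasattr(signal, "SIGALRM"):
        # No SIGALRM (for example on Windows): run the body without a limit.
        yield False
        return

    def _raise_timeout(signum, frame):
        raise CardTimeout("timed out after %d seconds" % secs)

    # Install the handler, remembering the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(secs)
    try:
        yield True
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

A caller can inspect the yielded flag to log a warning when the limit is not being enforced. Like any signal.signal call, this has to run on the main thread.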

Out of Scope

  • Fixing @timeout behavior on cloud backends (Kubernetes, Batch) — those always
    run Linux containers, so SIGALRM works correctly there.
  • Changing the timeout behavior for macOS/Linux — no change needed.

Environment

  • OS: Windows 10 / Windows 11
  • Metaflow version: current master
  • Python: 3.8+
