@natinew77-creator

Summary

Fixes #311

When using Acme distributed experiments with external schedulers like Ray Tune's ASHA scheduler, the scheduler may terminate trials early. However, the Launchpad processes spawned by the experiment are not automatically terminated, leaving orphan processes running.

Problem

As described in #311, when Ray Tune's ASHA scheduler terminates a trial, the mp.Process running the Launchpad program is killed, but the child processes spawned by Launchpad continue running as orphans. This happens because the termination signal is not forwarded to the Launchpad processes.
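The orphaning behaviour can be reproduced without Launchpad or Ray Tune. The following is a minimal POSIX-only sketch, not code from this PR: it uses plain `subprocess` in place of `mp.Process`, and the helper name `child_survives_parent_sigterm` is hypothetical. It starts a parent that spawns a sleeping child, sends SIGTERM to the parent only, and checks whether the child is still alive afterwards.

```python
import os
import signal
import subprocess
import sys
import time

# Source of the "parent": it spawns a sleeping child, reports the child's PID,
# then sleeps. It installs no signal handler, so SIGTERM kills it without
# anything being forwarded to the child.
PARENT_SRC = """
import subprocess, sys, time
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print(child.pid, flush=True)
time.sleep(30)
"""


def child_survives_parent_sigterm() -> bool:
    """Returns True if the child is still alive after the parent gets SIGTERM."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(parent.stdout.readline())

    parent.terminate()  # Sends SIGTERM to the parent only.
    parent.wait()
    parent.stdout.close()
    time.sleep(0.2)

    try:
        os.kill(child_pid, 0)  # Signal 0 only checks that the PID exists.
        alive = True
    except ProcessLookupError:
        alive = False

    if alive:
        os.kill(child_pid, signal.SIGKILL)  # Clean up the orphan.
    return alive


if __name__ == "__main__":
    print("child survived:", child_survives_parent_sigterm())
```
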

Solution

Added two new utilities to acme/utils/lp_utils.py:

1. LaunchpadProgramStopper (Context Manager)

A context manager that registers signal handlers for SIGTERM and SIGINT. When these signals are received, it calls lp.stop() to gracefully terminate all Launchpad processes.

2. launch_with_termination_handler() (Convenience Function)

A wrapper around lp.launch() that automatically uses the LaunchpadProgramStopper context manager.
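The pattern behind both utilities can be sketched independently of Launchpad. The code below is an illustrative approximation under stated assumptions, not the implementation added in this PR: `SignalStopper` and `launch_with_stopper` are hypothetical names, and the stop and launch callables are passed in as parameters so the sketch runs without Launchpad installed.

```python
import signal
from typing import Any, Callable


class SignalStopper:
    """Context manager that calls `stop_fn` once on SIGTERM or SIGINT.

    Hypothetical sketch. signal.signal() may only be called from the main
    thread, so the context must be entered there.
    """

    _SIGNALS = (signal.SIGTERM, signal.SIGINT)

    def __init__(self, stop_fn: Callable[[], None]):
        self._stop_fn = stop_fn
        self._previous = {}
        self.stopped = False

    def _handle(self, signum, frame):
        del signum, frame  # Unused.
        if not self.stopped:  # Avoid calling stop_fn twice.
            self.stopped = True
            self._stop_fn()

    def __enter__(self):
        for sig in self._SIGNALS:
            # signal.signal() returns the handler that was installed before.
            self._previous[sig] = signal.signal(sig, self._handle)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for sig, handler in self._previous.items():
            # A previous handler of None means it was not installed from
            # Python; fall back to the default action in that case.
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
        self._previous.clear()
        return False  # Do not suppress exceptions.


def launch_with_stopper(
    launch_fn: Callable[..., Any],
    stop_fn: Callable[[], None],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Runs `launch_fn` with the termination handlers installed."""
    with SignalStopper(stop_fn):
        return launch_fn(*args, **kwargs)
```

In the Launchpad setting described above, `launch_fn` would presumably be `lp.launch` and `stop_fn` would be `lp.stop`. Restoring the previous handlers on exit keeps the context manager from interfering with signal handling installed elsewhere in the process.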

Usage

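from acme.jax import experiments
from acme.utils import lp_utils
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_function(config):
    # build_experiment_config is assumed to be a user-defined helper that maps
    # a Ray Tune config dict to an Acme experiment config.
    experiment = build_experiment_config(config)
    program = experiments.make_distributed_experiment(
        experiment=experiment, num_actors=1)
    # Use the new utility instead of lp.launch() so that SIGTERM/SIGINT
    # stops the Launchpad processes.
    lp_utils.launch_with_termination_handler(program)

tuner = tune.Tuner(
    train_function,
    tune_config=tune.TuneConfig(scheduler=ASHAScheduler(...)),
)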

Or using the context manager directly:

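import launchpad as lp
from acme.utils import lp_utils

# `program` is the Launchpad program built as in the example above.
with lp_utils.LaunchpadProgramStopper():
    lp.launch(program, lp.LaunchType.LOCAL_MULTI_PROCESSING)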

Testing

  • Verified syntax is valid with python3 -m py_compile
  • Follows the existing signal handling patterns used in acme/utils/signals.py
