This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Trials hang when using a scheduler #253

Open
@dcfidalgo

Description

Hi there!
I first encountered this issue when trying to run PBT on a multi-node DDP setup (4 GPUs per node, with each node being a population member), but I could not reproduce it consistently.
I have now managed to reproduce the same behavior with an ASHA scheduler: as soon as the ASHA scheduler terminates a trial, the subsequent trials simply hang in the RUNNING state and never finish.

== Status ==
Current time: 2023-03-17 10:12:33 (running for 00:00:41.50)
Memory usage on this node: 154.0/250.9 GiB 
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: -1.25
Resources requested: 3.0/4 CPUs, 0/0 GPUs, 0.0/64.44 GiB heap, 0.0/31.61 GiB objects
Result logdir: /dcfidalgo/ray_results/train_func_2023-03-17_10-11-51
Number of trials: 3/3 (1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+------------+--------+------------------+------------+
| Trial name             | status     | loc                 |   val_loss |   iter |   total time (s) |   val_loss |
|------------------------+------------+---------------------+------------+--------+------------------+------------|
| train_func_c1436_00002 | RUNNING    | 10.181.103.72:74356 |          3 |        |                  |            |
| train_func_c1436_00000 | TERMINATED | 10.181.103.72:74356 |          1 |      1 |          6.91809 |          1 |
| train_func_c1436_00001 | TERMINATED | 10.181.103.72:74356 |          2 |      1 |          6.20699 |          2 |
+------------------------+------------+---------------------+------------+--------+------------------+------------+

I was able to trace the issue back to a ray.get call that hangs when trying to fetch self._master_addr here, but I simply cannot figure out what the underlying cause is ...
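To at least make the hang observable, one option is to give the suspect ray.get a timeout so it raises instead of blocking forever. The sketch below is an assumption on my side: DummyWorker and get_node_ip are placeholders, not the actual ray_lightning actor or method.

import socket

import ray
from ray.exceptions import GetTimeoutError


# Placeholder actor standing in for the ray_lightning worker; get_node_ip is
# an assumed name, not the library's actual API.
@ray.remote
class DummyWorker:
    def get_node_ip(self) -> str:
        return socket.gethostbyname(socket.gethostname())


ray.init(ignore_reinit_error=True)
worker = DummyWorker.remote()

try:
    # A timeout turns a silent hang into a GetTimeoutError that can be logged.
    master_addr = ray.get(worker.get_node_ip.remote(), timeout=60)
    print("got master address:", master_addr)
except GetTimeoutError:
    print("ray.get timed out -- the worker actor never responded")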

A minimal script to reproduce the issue:

import torch
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

from ray_lightning import RayStrategy
from ray_lightning.tests.utils import BoringModel, get_trainer
from ray_lightning.tune import TuneReportCallback, get_tune_resources


# BoringModel variant that always logs a fixed val_loss, so the scheduler's decision is deterministic.
class AnotherBoringModel(BoringModel):
    def __init__(self, val_loss: float):
        super().__init__()
        self._val_loss = torch.tensor(val_loss)

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._val_loss)
        return {"x": self._val_loss}


# Limit Ray to 4 CPUs, so only one trial (1 driver CPU + 2 workers) runs at a time.
address_info = ray.init(num_cpus=4)

# Strategy and callbacks are created once at module scope and shared by all trials.
strategy = RayStrategy(num_workers=2, use_gpu=False)
callbacks = [TuneReportCallback(on="validation_end")]


def train_func(config):
    model = AnotherBoringModel(config["val_loss"])
    trainer = get_trainer(
        "./",
        callbacks=callbacks,
        strategy=strategy,
        checkpoint_callback=False,
        max_epochs=1)
    trainer.fit(model)


tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
    # Removing this scheduler makes the hang disappear (see note below).
    scheduler=AsyncHyperBandScheduler(metric="val_loss", mode="min")
)

If you remove the scheduler, the above script terminates without issues.
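One workaround I can think of (an assumption on my side, not verified, and it does not address the root cause): Ray Tune can forcefully clean up trial actors after a grace period via the TUNE_FORCE_TRIAL_CLEANUP_S environment variable, which might unblock the run if a stuck actor from the terminated trial is what keeps the next trial from starting.

import os

# Assumption: a stuck actor from the terminated trial blocks the next trial.
# Force Ray Tune to kill trial actors 10 s after termination is requested.
# This is a workaround sketch, not a fix for the underlying hang.
os.environ["TUNE_FORCE_TRIAL_CLEANUP_S"] = "10"

# ... then call tune.run(...) exactly as in the script above.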

A corresponding conda env:

name: schedulerbug
channels:
  - pytorch
dependencies:
  - python=3.9
  - pytorch==1.11.0
  - cpuonly
  - pip
  - pip:
    - pytorch-lightning==1.6.4
    - ray[tune]==2.3.0
    - git+https://github.com/ray-project/ray_lightning.git@main

Is anyone else experiencing the same issue? Any kind of help would be very much appreciated! 😃
Have a great day!
