
Resource allocation on SLURM cluster #616

Open
@SamTov

Description

Describe the issue:
This is likely a misunderstanding of how to correctly use Dask to deploy cluster jobs. However, the terminology in the documentation suggests what should happen, so I also see this as a kind of bug, because the actual behaviour is so different from what one would expect.

I am trying to train a large number of machine learning models on a SLURM cluster. Each node has 64 cores and 4 GPUs. I want to run each of my models with 1 GPU and 16 cores so that, in theory, I can fit four models on each node and make full use of my resources.

My input script is summarised as follows:

import numpy as np
import optax
import flax.linen as nn
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# The nl namespace and the data generator are assumed to come from ZnNL; only the
# relevant parts of the full script are shown (e.g. ds_size is defined elsewhere).
import znnl as nl
from znnl.data import DecisionBoundaryGenerator


def train(index: int):
    """
    Run the experiment.
    """

    class Network(nn.Module):
        """
        Perceptron network.
        """

        @nn.compact
        def __call__(self, x):
            """
            Call method for the network
            """
            x = nn.Dense(2, use_bias=False)(x)
            return nn.sigmoid(x)
        
    generator = DecisionBoundaryGenerator(ds_size, discriminator="line", one_hot=True)
    model = nl.models.FlaxModel(
        flax_module=Network(), 
        optimizer=optax.adam(0.01), 
        input_shape=(1, 2)
    )
    # Prepare the recorders
    train_recorder = nl.training_recording.JaxRecorder(
        name=f"ce-perceptron/train_recorder_{index}",
        loss=True,
        entropy=True,
        trace=True,
        accuracy=True,
        magnitude_variance=True,
        update_rate=1,
    )
    test_recorder = nl.training_recording.JaxRecorder(
        name=f"ce-perceptron/test_recorder_{index}",
        loss=True,
        accuracy=True,
        update_rate=1,
    )
    train_recorder.instantiate_recorder(data_set=generator.train_ds)
    test_recorder.instantiate_recorder(data_set=generator.test_ds)

    trainer = nl.training_strategies.SimpleTraining(
        model=model,
        loss_fn=nl.loss_functions.CrossEntropyLoss(),
        accuracy_fn=nl.accuracy_functions.LabelAccuracy(),
        recorders=[train_recorder, test_recorder],
    )

    _ = trainer.train_model(
        train_ds=generator.train_ds,
        test_ds=generator.test_ds,
        batch_size=128,
        epochs=5000,
    )

indices = np.linspace(1, 20, 20, dtype=int)

cluster = SLURMCluster(
    cores=16,
    processes=1,
    memory="64GB",
    queue="Anonymised",
    walltime="01:00:00",
    death_timeout="15s",
    worker_extra_args=["--resources GPU=1"],
    log_directory="./ce-perceptron/dask-logs",
    job_script_prologue=["module load devel/cuda/12.1"],
    job_extra_directives=["--gres=gpu:1"]
)

cluster.scale(5)

client = Client(cluster)

results = [client.submit(train, index, resources={"GPU": 1}) for index in indices]
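
For reference, a quick way to check what each worker job actually requests is to print the batch script that dask-jobqueue generates (a minimal sketch, not part of the script above):

# Minimal sketch: print the generated sbatch script to confirm the 16-core / 1-GPU
# request and the --resources flag are passed through as intended.
print(cluster.job_script())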

My expected behaviour is that Dask submits five workers to the queue; each worker takes a network to train for a given index, trains it on 16 cores and 1 GPU, and moves on to the next one when that training is finished. What actually happens is that four workers are submitted to the queue, and only one of them picks up networks and trains them sequentially. The other workers are simply idle.
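
As a minimal diagnostic sketch (assuming the client from the script above), one can list the connected workers and the resources they advertise, to check whether every worker actually registered GPU=1 with the scheduler:

# List each connected worker and the resources it reports to the scheduler.
info = client.scheduler_info()
for address, worker in info["workers"].items():
    print(address, worker.get("resources"))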

I have tried increasing the number of processes, which I understood to mean running several network trainings within a single worker job and splitting its resources between them. This is also not correct: in that case each process grabs its own GPU, even though the worker should theoretically only have access to one, and again only a single worker does any work while the others are left idle.
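
To illustrate what I mean by each process grabbing its own GPU, a hypothetical diagnostic along these lines could be submitted in place of the training task; jax.devices() returns the accelerators visible to the calling process:

import jax

def report_devices(index: int):
    """Hypothetical helper: return the devices visible to the worker process running this task."""
    return [str(device) for device in jax.devices()]

device_futures = [
    client.submit(report_devices, i, resources={"GPU": 1}) for i in range(4)
]
print(client.gather(device_futures))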

I have also tried using map instead of submit. In that case the workers either die, or as many network trainings as possible are crammed onto a single worker.
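
For completeness, the map-based variant I tried looks roughly like this (same resource annotation as above):

# Roughly the map variant: one training task per index, each tagged with GPU=1.
futures = client.map(train, indices, resources={"GPU": 1})
results = client.gather(futures)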

Finally, I have also tried using adapt, which would be preferable for my workflow. However, when I do so, all of my workers keep dying in an endless cycle, with no logs produced.
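
The adaptive variant is, roughly, the same cluster setup with the scale call replaced by something like:

# Roughly the adaptive variant: allow Dask to scale between 0 and 5 workers as tasks arrive.
cluster.adapt(minimum=0, maximum=5)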

Even though I am reasonably familiar with clusters, especially SLURM clusters, I think, as mentioned above, that I am missing something about how the API is supposed to work.
