[QST][Bug?] Can I fit/evaluate many XGBoost models on the same cluster? #8623

Open

@rjzamora

Description of possible bug

When I try to fit/evaluate many xgboost.dask models in parallel, one or more of the fit/evaluation futures hangs forever. The hang only occurs when threads_per_worker=1 (which is the default for GPU-enabled clusters in dask-cuda).
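
For reference, the configuration that hangs is exactly what dask-cuda produces out of the box. A minimal sketch (assuming dask-cuda is installed; not part of the reproducer below):

# Hypothetical GPU analogue of the LocalCluster used in the reproducer:
# dask-cuda's LocalCUDACluster defaults to one thread per worker, i.e.
# the threads_per_worker=1 case that hangs.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()  # threads_per_worker=1 by default
client = Client(cluster)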

Is this a bug, or is the reproducer shown below known to be wrong or dangerous for a specific reason?

Reproducer

from dask.distributed import LocalCluster, Client, as_completed
from dask.delayed import delayed
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split
from xgboost.dask import DaskXGBClassifier


n_workers = 8
threads_per_worker = 1  ###  Using >1 avoids the hang  ###
n_trials = n_workers


if __name__ == "__main__":

    # Start up the dask cluster
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
    )
    client = Client(cluster)

    # Mimic an HPO objective function (fit/predict with xgboost)
    def objective(random_state):
        X_param, y_param = make_classification(
            n_samples=1000,
            n_features=20,
            chunks=100,
            n_informative=4,
            random_state=random_state,
        )

        # Split into train/validation sets
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_param, y_param, random_state=random_state
        )

        classifier = DaskXGBClassifier(
            objective="binary:logistic",
            max_depth=4,
            eta=0.01,
            subsample=0.5,
            min_child_weight=0.5,
        )

        # Fit and predict (a real HPO objective would score y_pred
        # against y_valid and return the metric)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_valid)
        return

    # Submit and compute many delayed jobs at once
    jobs = []
    delayed_objective = delayed(objective)
    for rs in range(n_trials):
        jobs.append(delayed_objective(rs))
    for i, future in enumerate(as_completed(client.compute(jobs))):
        print(f"Job {i} done.")
    print("All jobs done!")

Further background

There is a great blog article from Coiled that demonstrates Optuna-based HPO with XGBoost and Dask. The article states: "the current xgboost.dask implementation takes over the entire Dask cluster, so running many of these at once is problematic." Note that the "problematic" practice described there is exactly what the reproducer above does. Even so, it is not clear to me why one or more workers might hang.
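
For contrast, here is a minimal sketch of the sequential alternative that the article's warning points toward, reusing the objective defined above. It sidesteps the contention by construction, at the cost of trial-level parallelism:

# Sequential alternative (sketch): run trials one at a time from the
# client process, so only one xgboost.dask model occupies the cluster
# at any moment.
for rs in range(n_trials):
    objective(rs)
    print(f"Trial {rs} done.")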

NOTE: I realize this is probably more of an xgboost issue than a distributed issue. However, it seems clear that significant dask/distributed knowledge is needed to pin down the actual problem. Any and all help, advice, or intuition is greatly appreciated!
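
In case it helps with triage: when the hang occurs, the stuck tasks can be inspected from a second Python session using standard distributed introspection calls. A minimal sketch (the scheduler address is a placeholder; use cluster.scheduler_address from the hung run):

# Inspect a hung run from a separate Python session
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder address

# Tasks currently processing on each worker
print(client.processing())

# Python call stacks of the currently-running tasks
print(client.call_stack())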
