Description
When I try to fit/evaluate many xgboost.dask models in parallel, one or more of the fit/evaluation futures hangs forever. The hang only occurs when threads_per_worker=1 (which is the default for GPU-enabled clusters in dask-cuda). Is this a bug, or is the reproducer shown below known to be wrong or dangerous for a specific reason?
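For reference, this is a minimal sketch of the GPU-side setup I have in mind (assuming dask_cuda.LocalCUDACluster, which as far as I can tell starts one worker per visible GPU with a single thread per worker by default); the CPU reproducer below recreates the same one-thread-per-worker layout with a plain LocalCluster:

# Sketch only: the GPU cluster configuration under which the hang appears.
# dask_cuda.LocalCUDACluster starts one worker per visible GPU and, to my
# understanding, defaults to threads_per_worker=1.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # one worker per GPU, one thread per worker
    client = Client(cluster)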
Reproducer
from dask.distributed import LocalCluster, Client, as_completed
from dask.delayed import delayed
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split
from xgboost.dask import DaskXGBClassifier
n_workers = 8
threads_per_worker = 1 ### Using >1 avoids the hang ###
n_trials = n_workers
if __name__ == "__main__":
    # Start up the dask cluster
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
    )
    client = Client(cluster)

    # Mimic an HPO objective function (fit/predict with xgboost)
    def objective(random_state):
        X_param, y_param = make_classification(
            n_samples=1000,
            n_features=20,
            chunks=100,
            n_informative=4,
            random_state=random_state,
        )
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_param, y_param, random_state=random_state
        )
        classifier = DaskXGBClassifier(
            **{
                'objective': 'binary:logistic',
                'max_depth': 4,
                'eta': 0.01,
                'subsample': 0.5,
                'min_child_weight': 0.5,
            }
        )
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_valid)
        return

    # Submit and compute many delayed jobs at once
    jobs = []
    delayed_objective = delayed(objective)
    for rs in range(0, n_trials):
        jobs.append(delayed_objective(rs))
    for i, future in enumerate(as_completed(client.compute(jobs))):
        print(f"Job {i} done.")
    print("All jobs done!")
Further background
There is a great blog article from Coiled that demonstrates Optuna-based HPO with XGBoost and dask. The article states: "the current xgboost.dask implementation takes over the entire Dask cluster, so running many of these at once is problematic." Note that the "problematic" practice described in that article is exactly what the reproducer above is doing. With that said, it is not clear to me why one or more workers might hang.
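For completeness, the non-"problematic" pattern I take the article to be recommending is to drive the trials one at a time from the client process, so that each xgboost.dask fit has the whole cluster to itself. A rough sketch of that driver loop (reusing the objective from the reproducer above; this only sidesteps the hang rather than explaining it):

# Sketch only: run trials sequentially from the client instead of submitting
# the objectives as delayed tasks. Each call to objective() lets the
# xgboost.dask fit spread across all workers, and no worker thread is tied
# up running (and blocking on) another trial.
for rs in range(n_trials):
    objective(rs)
    print(f"Job {rs} done.")
print("All jobs done!")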
NOTE: I realize this is probably more of an xgboost issue than a distributed issue. However, it seems clear that significant dask/distributed knowledge is needed to pin down the actual problem. Any and all help, advice, or intuition is greatly appreciated!