Description
When I try to fit/evaluate many xgboost.dask models in parallel, one or more of the fit/evaluation futures hangs forever. The hang only occurs when threads_per_worker=1 (which is the default for GPU-enabled clusters in dask-cuda). Is this a bug, or is the reproducer shown below known to be wrong or dangerous for a specific reason?
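For reference, this is a minimal sketch of the GPU-side setup I have in mind (assuming dask_cuda.LocalCUDACluster, which as far as I can tell starts one worker per visible GPU with a single thread per worker by default); the CPU reproducer below recreates the same one-thread-per-worker layout with a plain LocalCluster:

# Sketch only: the GPU cluster configuration under which the hang appears.
# dask_cuda.LocalCUDACluster starts one worker per visible GPU and, to my
# understanding, defaults to threads_per_worker=1.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # one worker per GPU, one thread per worker
    client = Client(cluster)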
Reproducer
from dask.distributed import LocalCluster, Client, as_completed
from dask.delayed import delayed
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split
from xgboost.dask import DaskXGBClassifier
n_workers = 8
threads_per_worker = 1 ### Using >1 avoids the hang ###
n_trials = n_workers
if __name__ == "__main__":
    # Start up the dask cluster
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
    )
    client = Client(cluster)

    # Mimic an HPO objective function (fit/predict with xgboost)
    def objective(random_state):
        X_param, y_param = make_classification(
            n_samples=1000,
            n_features=20,
            chunks=100,
            n_informative=4,
            random_state=random_state,
        )
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_param, y_param, random_state=random_state
        )
        classifier = DaskXGBClassifier(
            **{
                'objective': 'binary:logistic',
                'max_depth': 4,
                'eta': 0.01,
                'subsample': 0.5,
                'min_child_weight': 0.5,
            }
        )
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_valid)
        return

    # Submit and compute many delayed jobs at once
    jobs = []
    delayed_objective = delayed(objective)
    for rs in range(0, n_trials):
        jobs.append(delayed_objective(rs))
    for i, future in enumerate(as_completed(client.compute(jobs))):
        print(f"Job {i} done.")
    print("All jobs done!")
Further background
There is a great blog article from Coiled that demonstrates Optuna-based HPO with XGBoost and dask. The article states: "the current xgboost.dask implementation takes over the entire Dask cluster, so running many of these at once is problematic." Note that the "problematic" practice described in that article is exactly what the reproducer above is doing. With that said, it is not clear to me why one or more workers might hang.
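For completeness, the non-"problematic" pattern I take the article to be recommending is to drive the trials one at a time from the client process, so that each xgboost.dask fit has the whole cluster to itself. A rough sketch of that driver loop (reusing the objective from the reproducer above; this only sidesteps the hang rather than explaining it):

# Sketch only: run trials sequentially from the client instead of submitting
# the objectives as delayed tasks. Each call to objective() lets the
# xgboost.dask fit spread across all workers, and no worker thread is tied
# up running (and blocking on) another trial.
for rs in range(n_trials):
    objective(rs)
    print(f"Job {rs} done.")
print("All jobs done!")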
NOTE: I realize this is probably more of an xgboost issue than a distributed issue. However, it seems clear that significant dask/distributed knowledge is needed to pin down the actual problem. Any and all help, advice, or intuition is greatly appreciated!