Documentation: train + RayDMatrix vs XGBoostTrainer + Dataset #309
Description
Could you please provide some clarification on the differences and/or how to choose between using xgboost_ray.train + xgboost_ray.RayDMatrix or ray.train.xgboost.XGBoostTrainer + ray.data.Dataset?
My use case is running Ray Tune on Azure Databricks, which operates on Spark. According to the Databricks docs, one creates a Ray Cluster using the Ray on Spark API, and creates a Ray Dataset from Parquet files.
Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.
Data
According to the README.md, one can create a `RayDMatrix` from either Parquet files or a Ray Dataset:
Lines 450 to 465 in e904925
> ### Data sources
>
> The following data sources can be used with a `RayDMatrix` object.
>
> | Type | Centralized loading | Distributed loading |
> |------------------------------------------------------------------|---------------------|---------------------|
> | Numpy array | Yes | No |
> | Pandas dataframe | Yes | No |
> | Single CSV | Yes | No |
> | Multi CSV | Yes | Yes |
> | Single Parquet | Yes | No |
> | Multi Parquet | Yes | Yes |
> | [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
> | [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
> | [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
> | [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |
So if using `xgboost_ray`, should I:

- Create a Ray `Dataset` from Parquet files, then create a `RayDMatrix` from that `Dataset`, or
- Create the `RayDMatrix` directly from the Parquet files?
Training
Should I use Ray Tune with XGBoostTrainer or with xgboost_ray.train, running on this Ray on Spark Cluster?
I also intend to implement CV with early stopping. Since tune-sklearn is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with ray.tune.stopper.TrialPlateauStopper. But according to #301 we can also use XGBoost's native xgb.callback.EarlyStopping. Which approach would you recommend? Can TrialPlateauStopper be used with xgboost_ray?
Thank you very much for any help you can offer.