This directory contains shell scripts for running larger-scale benchmarks on the Databricks AWS-hosted Spark service using the Databricks CLI. You will need a Databricks AWS account to run them. The benchmarks use datasets synthetically generated with gen_data.py. For convenience, these have been precomputed and are currently stored in the public S3 bucket spark-rapids-ml-bm-datasets-public. The benchmark scripts are currently configured to read the data from there.
-
Install the latest databricks-cli on your local workstation. Note that Databricks has deprecated the legacy Python-based CLI in favor of a self-contained executable. Make sure the new version comes first on your executable PATH.
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh -
-
Generate an access token for your Databricks workspace in the `User Settings` section of the workspace UI.
-
Configure the access token for the Databricks CLI. If you have multiple workspaces, you should use a distinct `profile` name for this one, e.g. `aws`, or else it will overwrite your current `DEFAULT` profile. This profile needs to be supplied on all invocations of the `databricks` CLI via the `--profile` option. For safety, the instructions below assume you are using a new `aws` profile.
export DB_PROFILE=aws
databricks configure --token --profile $DB_PROFILE
# Host: <copy-and-paste databricks workspace url>
# Token: <copy-and-paste access token from UI>
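Before moving on, it can help to confirm that the new profile actually authenticates. The sketch below assumes the self-contained Databricks CLI installed above, whose `current-user me` command reports the user that the configured token belongs to. `check_db_profile` is a hypothetical helper name, not something shipped in this directory.

```shell
# Hypothetical helper (not part of this directory): verifies that the
# profile named by DB_PROFILE can authenticate against the workspace.
check_db_profile() {
    if [ -z "${DB_PROFILE:-}" ]; then
        echo "DB_PROFILE is not set" >&2
        return 1
    fi
    # Asks the workspace which user the stored token belongs to; this
    # fails if the host or token saved in the profile is wrong.
    databricks current-user me --profile "$DB_PROFILE"
}
```

Calling `check_db_profile` right after `databricks configure` should print the user record for the token; a non-zero exit status suggests the host or token needs to be re-entered.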
-
Next, in this directory, run the following to upload the files required to run the benchmarks:
# change below to desired dbfs location WITHOUT DBFS URI for uploading benchmarking related files
export BENCHMARK_HOME=/path/to/benchmark/files/in/dbfs
# need separate directory for cluster init script as databricks requires these to be stored in the workspace and not dbfs
# ex: /Users/<databricks-user-name>/benchmark
export WS_BENCHMARK_HOME=/path/to/benchmark/files/in/workspace
./setup.sh
This will create and copy the files into a DBFS directory at the path specified by `BENCHMARK_HOME` and a cluster init script to the workspace directory specified by `WS_BENCHMARK_HOME`. The script will not overwrite existing files; instead it simply prints the error message returned by databricks. If overwriting is desired, first delete the files and/or directories using `databricks fs rm [-r] <dbfs path>` for the dbfs files and `databricks workspace delete [--recursive] <workspace path>` for the workspace files. Note: Export `BENCHMARK_HOME`, `WS_BENCHMARK_HOME` and `DB_PROFILE` in any new/different shell in which subsequent steps may be run.
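Since these three variables must be exported again in every new shell, a small check before running later steps can catch a forgotten export. The following is a minimal sketch; `check_benchmark_env` is a hypothetical helper that is not part of this directory, and the `dbfs:` prefix test is simply one reading of the "WITHOUT DBFS URI" requirement above.

```shell
# Hypothetical helper (not part of this directory). The body runs in a
# subshell "( ... )", so its working variables do not leak into the
# calling shell and "exit" only ends the function.
check_benchmark_env() (
    status=0
    for name in DB_PROFILE BENCHMARK_HOME WS_BENCHMARK_HOME; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            status=1
        fi
    done
    # BENCHMARK_HOME is expected to be a plain path, without the DBFS URI scheme.
    case "${BENCHMARK_HOME:-}" in
        dbfs:*)
            echo "BENCHMARK_HOME should be a plain path without the dbfs: prefix"
            status=1
            ;;
    esac
    exit "$status"
)
```

For example, `check_benchmark_env && ./setup.sh` would run the upload only when all three variables are present.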
-
The running time of each individual benchmark run can be limited by the `TIME_LIMIT` environment variable. The cpu kmeans benchmark takes over 9000 seconds (i.e., > 2 hours) to complete. If not set, the default is `3600` seconds.
export TIME_LIMIT=3600
-
The benchmarks can be run as
./run_benchmark.sh [cpu|gpu|gpu_etl] [[12.2|13.3|14.3]] >> benchmark_log
The script creates a cpu or gpu cluster using the cluster spec in cpu_cluster_spec, gpu_cluster_spec, or gpu_etl_cluster_spec, depending on the supplied argument. In gpu and gpu_etl modes each algorithm benchmark is run 3 times. The same holds in cpu mode, except for kmeans and the random forest classifier and regressor, which are each run once due to their long running times. gpu_etl mode also uses the spark-rapids gpu accelerated plugin.
An optional databricks runtime version can be supplied as a second argument, with 13.3 being the default if not specified. Runtimes higher than 13.3 are only compatible with cpu and gpu modes (i.e. not gpu_etl) as they are not yet supported by the spark-rapids plugin.
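The argument rules and defaults described above can be summarized as a pre-flight check. This is only a sketch: `check_benchmark_args` is a hypothetical function and is not taken from run_benchmark.sh. The mode names, the runtime versions, the 13.3 default and the 3600 second default come from the text, while rejecting any other runtime value is an assumption.

```shell
# Hypothetical pre-flight check, not taken from run_benchmark.sh.
# The subshell body "( ... )" keeps mode, runtime and limit out of the
# calling shell; "exit" therefore only ends the function.
check_benchmark_args() (
    mode="${1:-}"
    runtime="${2:-13.3}"          # documented default runtime
    limit="${TIME_LIMIT:-3600}"   # documented default time limit in seconds

    case "$mode" in
        cpu|gpu|gpu_etl) ;;
        *) echo "unknown mode: '$mode'"; exit 1 ;;
    esac

    case "$runtime" in
        12.2|13.3) ;;
        14.3)
            # Runtimes above 13.3 are not yet supported by the spark-rapids plugin.
            if [ "$mode" = "gpu_etl" ]; then
                echo "runtime $runtime is not supported in gpu_etl mode"
                exit 1
            fi
            ;;
        # Assumption: only the three versions listed in the usage line are accepted.
        *) echo "unknown runtime: '$runtime'"; exit 1 ;;
    esac

    echo "mode=$mode runtime=$runtime time_limit=${limit}s"
)
```

A possible use is `check_benchmark_args gpu 13.3 && ./run_benchmark.sh gpu 13.3 >> benchmark_log`, which avoids creating a cluster for an unsupported combination such as gpu_etl with 14.3.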
-
The file `benchmark_log` will have the fit/train/transform running times and accuracy scores. A simple convenience script has been provided to extract timing information for each run:
./process_bm_log.sh benchmark_log
-
Cancelling a run: Hit `Ctrl-C` and then cancel the run with the last printed `runid` (check using `tail benchmark_log`) by executing:
databricks jobs cancel-run <runid> --profile $DB_PROFILE
-
The created clusters are configured to terminate after 30 min, but can be manually terminated or deleted via the Databricks UI.
-
Monitor progress periodically in case of a possible hang, to avoid incurring cloud costs in such cases.
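One rough way to spot a possible hang is to check how recently the log file was written. The sketch below uses a hypothetical helper, `log_is_stale`, and relies on the `-mmin` test available in GNU and BSD `find`. Whether a quiet log really indicates a hang depends on the benchmark being run, so the threshold is an assumption to tune.

```shell
# Hypothetical helper (not part of this directory).
# $1: path to the log file, $2: threshold in minutes.
# Succeeds when the file was last modified more than $2 minutes ago.
# A missing file yields no output from find and is reported as not stale.
log_is_stale() {
    [ -n "$(find "$1" -mmin "+$2" 2>/dev/null)" ]
}
```

For instance, `if log_is_stale benchmark_log 30; then echo "no new output for over 30 minutes"; fi` could be run from a second terminal while a benchmark is in progress.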