
[BUG] Does spark-rapids-ml support "unbalanced" configurations? #906


Description

@an-ys

In my latest reply in #853, I mentioned a strange overhead issue affecting some applications.

For KMeans and PCA, Spark seems to assign tasks to the wrong executors in standalone mode. This happens when I use five GPUs (4 GPUs on one node, node A, and 1 GPU on another node, node B), but it doesn't seem to happen when I use four GPUs. For smaller input sizes it happens every time; for other sizes, some runs of the benchmark are scheduled correctly and end up faster than the 4-GPU run. Note that my parquet input is split into ~30 MiB files, but I don't think that should matter much.
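For reference, the setup corresponds roughly to the session configuration below (a sketch; the master URL and exact values are illustrative, and `spark.task.resource.gpu.amount=0.5` matches the 2-concurrent-tasks-per-GPU limit described below):

```python
# Sketch of the standalone-mode session config for the 5-GPU case
# (illustrative values; the workers advertise their GPUs separately via the
# standard standalone worker resource settings):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<master-host>:7077")               # standalone master
    .config("spark.executor.resource.gpu.amount", "1")  # one GPU per executor
    .config("spark.task.resource.gpu.amount", "0.5")    # => 2 concurrent tasks per GPU
    .getOrCreate()
)
```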

When running KMeans on a two-node cluster with 4 GPUs on one node and 1 GPU on the other, Spark tries to run most of the tasks on a single executor (the lone GPU on the second node). It runs some tasks on the 4-GPU node, but uses only one GPU there. As a result, instead of node-local or process-local tasks in the later stages, the application gets tasks at locality level ANY, and they run largely sequentially because the maximum number of concurrent tasks per GPU is 2.
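A minimal sketch of the KMeans run that shows the skew (the data path, column name, and parameters are illustrative, not the exact benchmark code):

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()  # session configured as above

# ~30 MiB parquet files, as noted above
df = spark.read.parquet("/path/to/kmeans_input")

kmeans = KMeans(k=100, maxIter=30).setFeaturesCol("features")
model = kmeans.fit(df)  # most fit tasks end up on the lone GPU on node B
```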

When running PCA on 6 GPUs (either 2 unbalanced nodes or 3 nodes with 2 GPUs each), some stages run all of their tasks on a single node (the one with 2 GPUs) instead of across all 6 executors, as shown in the screenshots below. In the second screenshot, most of the work is done by the two executors on the node whose IP address ends in 101.

[Screenshot: Spark UI, task distribution across the 6 executors]

[Screenshot: Spark UI, most tasks on the two executors of the node whose IP ends in 101]
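The PCA invocation is essentially the following (a sketch; `spark_rapids_ml.feature.PCA` mirrors the Spark ML API, and the paths and column names are illustrative):

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.feature import PCA

spark = SparkSession.builder.getOrCreate()  # session configured as above

df = spark.read.parquet("/path/to/pca_input")

pca = PCA(k=3).setInputCol("features").setOutputCol("pca_features")
model = pca.fit(df)  # some stages run only on the two executors of one node
```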

As mentioned above, there are cases where the tasks are assigned correctly, but most of the time they are not. The picture below shows both outcomes. The slowest configuration is about 2x slower than the fastest: since the maximum number of concurrent tasks is 2, Spark has to wait for the first two tasks to finish before it can assign the last two.

[Screenshot: benchmark timings for runs with correct vs. incorrect task assignment]
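For reference, task placement at these locality levels is governed by Spark's standard locality-wait settings; a sketch of zeroing them out to test whether locality wait explains the skew (illustrative values, not a fix I have verified):

```python
# Standard Spark scheduler settings. With the waits at 0, the scheduler hands
# tasks to any free executor instead of holding them back for a data-local one:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.locality.wait", "0s")
    .config("spark.locality.wait.process", "0s")
    .config("spark.locality.wait.node", "0s")
    .getOrCreate()
)
```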

(Also, the benchmark directory is named spark-rapids-ml-24.08-benchmark, but I have been keeping its files in sync with the repo; the runs above use Spark RAPIDS 25.02 and Spark RAPIDS ML 25.02.)

This does not seem to happen with the other workloads (linear regression, logistic regression, and random forest classifier).
