Description
I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors always get stuck at the barrier when the cuML function is called. The stage progress shows 6+2/8, 4+4/8, and 5+3/8, i.e., 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA respectively. I also tried running a KMeans application that processes a large amount of data, so I do not think the problem is related to the small dataset.
I checked the logs for an executor that successfully ran its task and for an executor that got stuck. The executor that got stuck initialized the cuML context and then waited in the barrier's global sync until the job was cancelled. These logs are from running the LinReg example in the Python directory of this repo. The stuck executors have RUNNING | NODE_LOCAL as the status, while the successful executors have SUCCESS | PROCESS_LOCAL.
I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan-V GPUs.
Successful Executor
23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440
23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443
23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443)
23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446
23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453
23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454
23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466
23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473
23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474
23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479
23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489
23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494
23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494)
23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499
23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507
23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515
23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515)
23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519
23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536
23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536)
23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538
23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537)
23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547
23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560
23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568
23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570
23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571
23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574
23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574)
23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:57306
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=0) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124) for /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 2: UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 0: UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124)
23/09/27 19:43:02 INFO CoarseGrainedExecutorBackend: Got assigned task 577
23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577)
23/09/27 19:43:02 INFO MapOutputTrackerWorker: Updating epoch to 4 and clearing cache
23/09/27 19:43:02 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 13.2 KiB, free 47.8 GiB)
23/09/27 19:43:02 INFO TorrentBroadcast: Reading broadcast variable 4 took 7 ms
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 29.0 KiB, free 47.8 GiB)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 2, fetching them
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 0 (0.0 B) local and 1 (97.0 B) host-local and 0 (0.0 B) push-merged-local and 1 (97.0 B) remote blocks
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<worker_ip_from_non_master_node>:38739 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 21 ms
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<master_ip>:36401 after 1 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 48.927407 ms
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 19.687474 ms
23/09/27 19:43:04 INFO Executor: Finished task 0.0 in stage 6.0 (TID 577). 4021 bytes result sent to driver
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 584
23/09/27 19:43:05 INFO Executor: Running task 6.0 in stage 9.0 (TID 584)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 17 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
23/09/27 19:43:05 INFO CodeGenerator: Code generated in 14.846185 ms
23/09/27 19:43:05 INFO PythonRunner: Times: total = 139, boot = -4325, init = 4464, finish = 0
23/09/27 19:43:05 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6740 bytes result sent to driver
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=3) started
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 2
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=2) started
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for 2
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 2
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 3: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for 3
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 3
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Killed Executor
23/09/27 19:42:59 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
23/09/27 19:43:00 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:00 INFO TransportClientFactory: Successfully created connection to /<master_ip>:32805 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO TorrentBroadcast: Reading broadcast variable 3 took 163 ms
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:50893
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 1
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=1) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=0) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=3) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:57878
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878) for /<master_ip>:57878
23/09/27 19:43:02 INFO CodeGenerator: Code generated in 370.253869 ms
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 834, init = 307, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1057, boot = 845, init = 212, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1121, boot = 840, init = 281, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1152, boot = 852, init = 300, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1129, boot = 824, init = 305, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 829, init = 312, finish = 0
23/09/27 19:43:02 INFO Executor: Finished task 29.0 in stage 4.0 (TID 414). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 13.0 in stage 4.0 (TID 398). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 5.0 in stage 4.0 (TID 390). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 37.0 in stage 4.0 (TID 422). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 45.0 in stage 4.0 (TID 430). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 21.0 in stage 4.0 (TID 406). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878)
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 578
23/09/27 19:43:05 INFO Executor: Running task 2.0 in stage 9.0 (TID 578)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:33961 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 20 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Getting 1 (72.0 B) non-empty blocks including 0 (0.0 B) local and 1 (72.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
23/09/27 19:43:07 INFO CodeGenerator: Code generated in 19.422264 ms
2023-09-27 19:43:08,459 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context
23/09/27 19:43:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0.
23/09/27 19:44:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 60 seconds, current barrier epoch is 0.
23/09/27 19:45:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 120 seconds, current barrier epoch is 0.
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 180 seconds, current barrier epoch is 0.
23/09/27 19:46:31 INFO Executor: Executor is trying to kill task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[accept-connections,5,main]
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:267)
at org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$3(BarrierTaskContext.scala:94)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:94)
at org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321)
23/09/27 19:46:32 INFO Executor: Executor killed task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 INFO RapidsBufferCatalog: Closing storage
23/09/27 19:46:32 INFO UCXShuffleTransport: UCX transport closing
23/09/27 19:46:32 WARN UCX: UCX is shutting down
23/09/27 19:46:32 INFO UCX: De-registering UCX 3 memory buffers.
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
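To make the barrier wait in the killed executor's log easier to read, here is a minimal sketch that pulls the "waiting under the global sync" lines out of an executor log and reports, per task, when the sync started and the longest wait logged. The function name `longest_barrier_waits` and the regular expression are my own, written against the line format shown above; they are not part of Spark or Spark RAPIDS ML.

```python
import re
from datetime import datetime, timezone

# Pattern written against the BarrierTaskContext lines in the log above
# (an assumption about the format, not an official Spark log schema).
WAIT_RE = re.compile(
    r"Task (?P<task>\d+) from Stage (?P<stage>\d+)\(Attempt (?P<attempt>\d+)\) "
    r"waiting under the global sync since (?P<since_ms>\d+), "
    r"has been waiting for (?P<waited_s>\d+) seconds"
)


def longest_barrier_waits(lines):
    """Return {task_id: (sync_start_utc, longest_wait_seconds)}."""
    waits = {}
    for line in lines:
        m = WAIT_RE.search(line)
        if m is None:
            continue  # not a barrier-wait line
        task = int(m["task"])
        # The "since" value is epoch milliseconds.
        since = datetime.fromtimestamp(int(m["since_ms"]) / 1000, tz=timezone.utc)
        waited = int(m["waited_s"])
        # Keep only the longest wait reported for each task.
        if task not in waits or waited > waits[task][1]:
            waits[task] = (since, waited)
    return waits


if __name__ == "__main__":
    sample = [
        "23/09/27 19:44:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) "
        "waiting under the global sync since 1695811392156, has been waiting for "
        "60 seconds, current barrier epoch is 0.",
        "23/09/27 19:46:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) "
        "waiting under the global sync since 1695811392156, has been waiting for "
        "180 seconds, current barrier epoch is 0.",
    ]
    for task, (since, waited) in longest_barrier_waits(sample).items():
        print(f"task {task}: sync started {since.isoformat()}, waited >= {waited}s")
```

For the log above, the epoch value 1695811392156 is 10:43:12 UTC, which matches the 19:43:12 local timestamp on the "has entered the global sync" line if the cluster clock is at UTC+9.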
Here is the spark.conf containing the related options. I tried disabling the options related to UDFs (Scala UDF, UDF compiler, etc.), but it did not help.
`spark.conf`
spark.master spark://master:7077
# Resource-related configs
spark.executor.instances 8
spark.executor.cores 6
spark.executor.memory 80G
spark.driver.memory 80G
spark.executor.memoryOverhead 1G
# Task-related
spark.default.parallelism 192
spark.sql.shuffle.partitions 192
spark.driver.maxResultSize 30G
spark.sql.files.maxPartitionBytes 4096m
# spark.sql.files.maxPartitionBytes 8192m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled true
# GPU-related Configs
spark.executor.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript /usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug STDOUT
spark.rapids.memory.gpu.pool NONE
spark.rapids.memory.pinnedPool.size 20G
spark.rapids.shuffle.multiThreaded.reader.threads 24
spark.rapids.shuffle.multiThreaded.writer.threads 24
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.enabled true
spark.rapids.sql.exec.CollectLimitExec true
spark.rapids.sql.explain all
spark.rapids.sql.expression.ScalaUDF true
spark.rapids.sql.metrics.level DEBUG
spark.rapids.sql.rowBasedUDF.enabled true
spark.rapids.sql.udfCompiler.enabled true
spark.shuffle.manager com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.task.resource.gpu.amount 0.166
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA n
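As a rough sanity check on the resource settings above, the sketch below estimates how many tasks can run concurrently on one executor, and therefore share its single GPU. It assumes `spark.task.cpus` is left at its default of 1 (it is not set in the config above) and that a fractional `spark.task.resource.gpu.amount` of at most 0.5 is turned into floor(1 / amount) slots per GPU, which is my understanding of how Spark 3.5.0 treats fractional task resources. The helper name is hypothetical. I have not confirmed whether this slot sharing is related to the hang.

```python
import math


def concurrent_tasks_per_executor(executor_cores, task_cpus,
                                  executor_gpus, task_gpu_amount):
    """Estimate task slots per executor from CPU and GPU limits.

    Assumes a fractional GPU amount (<= 0.5) yields floor(1 / amount)
    slots per GPU; whole-number amounts divide the executor's GPUs.
    """
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount <= 0.5:
        gpu_slots = executor_gpus * math.floor(1.0 / task_gpu_amount)
    else:
        gpu_slots = executor_gpus // math.ceil(task_gpu_amount)
    return min(cpu_slots, gpu_slots)


if __name__ == "__main__":
    # Values taken from the spark.conf above; task_cpus=1 is an assumed default.
    per_executor = concurrent_tasks_per_executor(
        executor_cores=6, task_cpus=1, executor_gpus=1, task_gpu_amount=0.166
    )
    total = per_executor * 8  # spark.executor.instances
    print(f"{per_executor} concurrent tasks per executor, {total} slots in total")
```

With these settings the estimate is 6 task slots per executor (48 in total), so the 8 tasks of a barrier stage are not forced onto 8 distinct executors or GPUs.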