[QST] Dealing with executors that are stuck in the barrier #453

@an-ys

I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors always get stuck in the barrier when the cuML function is called. The stage progress stalls at 6+2/8, 4+4/8, and 5+3/8, i.e., 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA respectively. I also tried running a KMeans application on a large amount of data, so I do not think the problem is related to the small dataset used in the examples.

I checked the logs of an executor that successfully ran the task and of an executor that got stuck. The executor that got stuck initialized cuML and then entered the barrier, where it kept waiting until the job was cancelled. These logs are from running the LinReg example in the Python directory of this repo. The stuck executors have RUNNING | NODE_LOCAL as the status, while the successful executors have SUCCESS | PROCESS_LOCAL.

I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan-V GPUs.

Successful Executor
23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440
23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443
23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443)
23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446
23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453
23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454
23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466
23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473
23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474
23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479
23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489
23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494
23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494)
23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499
23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507
23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515
23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515)
23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519
23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536
23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536)
23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538
23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537)
23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547
23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560
23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568
23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570
23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571
23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574
23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574)
23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:57306
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=0) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124) for /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 2: UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 0: UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124)
23/09/27 19:43:02 INFO CoarseGrainedExecutorBackend: Got assigned task 577
23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577)
23/09/27 19:43:02 INFO MapOutputTrackerWorker: Updating epoch to 4 and clearing cache
23/09/27 19:43:02 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 13.2 KiB, free 47.8 GiB)
23/09/27 19:43:02 INFO TorrentBroadcast: Reading broadcast variable 4 took 7 ms
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 29.0 KiB, free 47.8 GiB)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 2, fetching them
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 0 (0.0 B) local and 1 (97.0 B) host-local and 0 (0.0 B) push-merged-local and 1 (97.0 B) remote blocks
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<worker_ip_from_non_master_node>:38739 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 21 ms
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<master_ip>:36401 after 1 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 48.927407 ms
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 19.687474 ms
23/09/27 19:43:04 INFO Executor: Finished task 0.0 in stage 6.0 (TID 577). 4021 bytes result sent to driver
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 584
23/09/27 19:43:05 INFO Executor: Running task 6.0 in stage 9.0 (TID 584)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 17 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
23/09/27 19:43:05 INFO CodeGenerator: Code generated in 14.846185 ms
23/09/27 19:43:05 INFO PythonRunner: Times: total = 139, boot = -4325, init = 4464, finish = 0
23/09/27 19:43:05 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6740 bytes result sent to driver
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=3) started
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 2
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=2) started
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for 2
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 2
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
	at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 3: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for 3
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 3
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
	at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Killed Executor
23/09/27 19:42:59 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
23/09/27 19:43:00 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:00 INFO TransportClientFactory: Successfully created connection to /<master_ip>:32805 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO TorrentBroadcast: Reading broadcast variable 3 took 163 ms
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:50893
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 1
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=1) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=0) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=3) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:57878
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878) for /<master_ip>:57878
23/09/27 19:43:02 INFO CodeGenerator: Code generated in 370.253869 ms
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 834, init = 307, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1057, boot = 845, init = 212, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1121, boot = 840, init = 281, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1152, boot = 852, init = 300, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1129, boot = 824, init = 305, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 829, init = 312, finish = 0
23/09/27 19:43:02 INFO Executor: Finished task 29.0 in stage 4.0 (TID 414). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 13.0 in stage 4.0 (TID 398). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 5.0 in stage 4.0 (TID 390). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 37.0 in stage 4.0 (TID 422). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 45.0 in stage 4.0 (TID 430). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 21.0 in stage 4.0 (TID 406). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878)
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 578
23/09/27 19:43:05 INFO Executor: Running task 2.0 in stage 9.0 (TID 578)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:33961 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 20 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Getting 1 (72.0 B) non-empty blocks including 0 (0.0 B) local and 1 (72.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
23/09/27 19:43:07 INFO CodeGenerator: Code generated in 19.422264 ms
2023-09-27 19:43:08,459 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context
23/09/27 19:43:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0.
23/09/27 19:44:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 60 seconds, current barrier epoch is 0.
23/09/27 19:45:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 120 seconds, current barrier epoch is 0.
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 180 seconds, current barrier epoch is 0.
23/09/27 19:46:31 INFO Executor: Executor is trying to kill task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[accept-connections,5,main]
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:267)
	at org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$3(BarrierTaskContext.scala:94)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:94)
	at org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321)
23/09/27 19:46:32 INFO Executor: Executor killed task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 INFO RapidsBufferCatalog: Closing storage
23/09/27 19:46:32 INFO UCXShuffleTransport: UCX transport closing
23/09/27 19:46:32 WARN UCX: UCX is shutting down
23/09/27 19:46:32 INFO UCX: De-registering UCX 3 memory buffers.
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
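
To make the difference between the two logs easier to spot, here is a rough sketch of how an executor log could be scanned for tasks that entered the barrier and never finished. The regular expressions are written against the `BarrierTaskContext` and `Executor` lines quoted above; the function name `stuck_barrier_tasks` is hypothetical, and other Spark versions may word these messages differently.

```python
import re
from typing import Dict, Iterable

# Patterns follow the log lines quoted in this issue (Spark 3.5.0); this is an assumption
# about the message format, not a stable interface.
ENTERED = re.compile(
    r"BarrierTaskContext: Task (\d+) from Stage \d+\(Attempt \d+\) has entered the global sync"
)
WAITING = re.compile(
    r"BarrierTaskContext: Task (\d+) from Stage \d+\(Attempt \d+\) waiting under the global sync "
    r"since \d+, has been waiting for (\d+) seconds"
)
FINISHED = re.compile(r"Executor: Finished task [\d.]+ in stage [\d.]+ \(TID (\d+)\)")


def stuck_barrier_tasks(log_lines: Iterable[str]) -> Dict[int, int]:
    """Map each TID that entered the barrier but never logged 'Finished task'
    to the longest wait (in seconds) reported for it."""
    waits: Dict[int, int] = {}
    finished = set()
    for line in log_lines:
        match = ENTERED.search(line)
        if match:
            waits.setdefault(int(match.group(1)), 0)
            continue
        match = WAITING.search(line)
        if match:
            tid = int(match.group(1))
            waits[tid] = max(waits.get(tid, 0), int(match.group(2)))
            continue
        match = FINISHED.search(line)
        if match:
            finished.add(int(match.group(1)))
    # Keep only the barrier tasks that never completed on this executor.
    return {tid: wait for tid, wait in waits.items() if tid not in finished}
```

Fed the lines of an executor's stderr file, this returns an empty dict for an executor with no pending barrier task, and the TID with its longest reported wait for an executor like the killed one above.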

Here is the spark.conf containing the related options. I tried disabling the options related to UDFs (Scala UDF, UDF compiler, etc.), but it made no difference.

`spark.conf`
spark.master	spark://master:7077

# Resource-related configs
spark.executor.instances	8
spark.executor.cores	6
spark.executor.memory	80G
spark.driver.memory	80G
spark.executor.memoryOverhead	1G

# Task-related
spark.default.parallelism	192
spark.sql.shuffle.partitions	192
spark.driver.maxResultSize	30G
spark.sql.files.maxPartitionBytes	4096m
# spark.sql.files.maxPartitionBytes	8192m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled	true

# GPU-related Configs
spark.executor.resource.gpu.amount	1
spark.executor.resource.gpu.discoveryScript	/usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin	com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin

spark.plugins	com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug	STDOUT
spark.rapids.memory.gpu.pool	NONE
spark.rapids.memory.pinnedPool.size	20G
spark.rapids.shuffle.multiThreaded.reader.threads	24
spark.rapids.shuffle.multiThreaded.writer.threads	24
spark.rapids.sql.concurrentGpuTasks	2
spark.rapids.sql.enabled	true
spark.rapids.sql.exec.CollectLimitExec	true
spark.rapids.sql.explain	all
spark.rapids.sql.expression.ScalaUDF	true
spark.rapids.sql.metrics.level	DEBUG
spark.rapids.sql.rowBasedUDF.enabled true
spark.rapids.sql.udfCompiler.enabled	true
spark.shuffle.manager	com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.task.resource.gpu.amount	0.166
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA n
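
For reference, the number of concurrent task slots these settings imply can be estimated with a small calculation. This is only a sketch under two assumptions that are not stated in the conf above: `spark.task.cpus` is left at its default of 1, and Spark rounds `1 / amount` down when the per-task GPU amount is fractional. The helper name `task_slots_per_executor` is hypothetical.

```python
import math


def task_slots_per_executor(executor_cores: int, task_cpus: int,
                            executor_gpus: int, task_gpu_amount: float) -> int:
    """Estimate how many tasks one executor can run at the same time."""
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount < 1:
        # Assumption: a fractional amount lets floor(1 / amount) tasks share each GPU.
        gpu_slots = executor_gpus * math.floor(1.0 / task_gpu_amount)
    else:
        gpu_slots = executor_gpus // int(task_gpu_amount)
    # The scarcer resource limits the number of concurrent tasks.
    return min(cpu_slots, gpu_slots)


# Values from the conf above: 6 cores, 1 GPU per executor, 0.166 GPU per task, 8 executors.
slots = task_slots_per_executor(executor_cores=6, task_cpus=1,
                                executor_gpus=1, task_gpu_amount=0.166)
print(slots)      # 6 tasks per executor
print(slots * 8)  # 48 task slots across the 8 executors
```

Under those assumptions each executor can run up to 6 tasks at once on its single GPU, so this configuration does not by itself limit an executor to one task per GPU.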
