[BUG] XGBoost-Examples agaricus-gpu.ipynb failed to bootstrap the communication group. #512

@pxLi

Describe the bug
On branch-25.04, the examples_GitHub-notebook job failed twice in a row (runs 999 and 1000).

[2025-03-19T21:39:54.065Z] [21:39:47] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:47]: Poll error condition:Operation now in progress code:115 system error:Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:47]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] [21:39:47] WARNING: /workspace/src/collective/socket.cc:149: Retrying connection to 100.102.147.248 for the 1 time.
[2025-03-19T21:39:54.065Z] [21:39:49] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.cc:160|21:39:49]: connect failed. system error:Software caused connection abort
[2025-03-19T21:39:54.065Z] [21:39:49] WARNING: /workspace/src/collective/socket.cc:149: Retrying connection to 100.102.147.248 for the 2 time.
[2025-03-19T21:39:54.065Z] [21:39:53] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:53]: Poll error condition:Operation now in progress code:115 system error:Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:53]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] 25/03/19 21:39:53 ERROR Executor: Exception in task 0.0 in stage 3.3 (TID 5)
[2025-03-19T21:39:54.065Z] ml.dmlc.xgboost4j.java.XGBoostError: [21:39:53] /workspace/src/collective/result.cc:78: 
[2025-03-19T21:39:54.065Z] - [comm.cc:220|21:39:53]: Failed to bootstrap the communication group.
[2025-03-19T21:39:54.065Z] - [comm.cc:239|21:39:53]: Bootstrap failed.
[2025-03-19T21:39:54.065Z] - [comm.cc:41|21:39:53]: Failed to connect to the tracker.
[2025-03-19T21:39:54.065Z] - [socket.cc:189|21:39:53]: Failed to connect to 100.102.147.248:52303
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:53]: Poll error condition:Operation now in progress code:115 Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:53]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] Stack trace:
[2025-03-19T21:39:54.065Z]   [bt] (0) /tmp/libxgboost4j6794688438011251026.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fce9443958c]
[2025-03-19T21:39:54.065Z]   [bt] (1) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x81) [0x7fce944f5ea1]
[2025-03-19T21:39:54.065Z]   [bt] (2) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::RabitComm::RabitComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::chrono::duration<long, std::ratio<1l, 1l> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, xgboost::StringView)+0x543) [0x7fce944c89d3]
[2025-03-19T21:39:54.065Z]   [bt] (3) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::CommGroup::Create(xgboost::Json)+0x1759) [0x7fce944d6ae9]
[2025-03-19T21:39:54.065Z]   [bt] (4) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::GlobalCommGroupInit(xgboost::Json)+0x60) [0x7fce944d81b0]
[2025-03-19T21:39:54.065Z]   [bt] (5) /tmp/libxgboost4j6794688438011251026.so(XGCommunicatorInit+0x64) [0x7fce94488af4]
[2025-03-19T21:39:54.065Z]   [bt] (6) /tmp/libxgboost4j6794688438011251026.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_CommunicatorInit+0xf8) [0x7fce951c7b88]
[2025-03-19T21:39:54.065Z]   [bt] (7) [0x7fd899017da7]
[2025-03-19T21:39:54.065Z] 
[2025-03-19T21:39:54.065Z] 
[2025-03-19T21:39:54.065Z] 	at ml.dmlc.xgboost4j.java.Communicator.checkCall(Communicator.java:57)
[2025-03-19T21:39:54.065Z] 	at ml.dmlc.xgboost4j.java.Communicator.init(Communicator.java:71)
[2025-03-19T21:39:54.065Z] 	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$2(XGBoost.scala:255)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2025-03-19T21:39:54.065Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2025-03-19T21:39:54.065Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-03-19T21:39:54.065Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-03-19T21:39:54.065Z] 	at java.lang.Thread.run(Thread.java:750)
[2025-03-19T21:39:54.065Z] 25/03/19 21:39:53 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
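
The log shows every attempt to reach the tracker at 100.102.147.248:52303 ending in "Connection refused". One way to narrow this down is to probe that address from the executor host while the job is running, to see whether anything is listening there at all. The following is only a minimal sketch: `probe_tracker` is a hypothetical helper, not part of XGBoost or Spark, and the retry count, timeout, and delay are arbitrary assumptions rather than the values XGBoost uses internally.

```python
import socket
import time


def probe_tracker(host: str, port: int, attempts: int = 3,
                  timeout: float = 2.0, delay: float = 1.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical diagnostic helper; it only checks reachability and does
    not speak the tracker protocol.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"attempt {attempt}/{attempts} to {host}:{port} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return False


# Example call, using the address from the log above (the port may differ
# between runs, so substitute the one printed by the failing run):
# probe_tracker("100.102.147.248", 52303)
```

If the probe is also refused from the executor host, the tracker process is probably not listening on that address or port (or has already exited). If the probe succeeds, the failure is more likely a timing issue between tracker startup and worker connection.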

Steps/Code to reproduce bug

SPARK_CONF_DIR=/home/jenkins/agent/workspace/examples_GitHub-notebook/GPU \
rapids-examples/test-notebook.sh notebook-examples/examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb

Expected behavior
The notebook should run to completion without XGBoost failing to bootstrap the communication group.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue
