Describe the bug
branch-25.04 failed twice in a row in the examples_GitHub-notebook job (runs 999 and 1000). In both runs the XGBoost worker could not connect to the tracker:
[2025-03-19T21:39:54.065Z] [21:39:47] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:47]: Poll error condition:Operation now in progress code:115 system error:Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:47]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] [21:39:47] WARNING: /workspace/src/collective/socket.cc:149: Retrying connection to 100.102.147.248 for the 1 time.
[2025-03-19T21:39:54.065Z] [21:39:49] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.cc:160|21:39:49]: connect failed. system error:Software caused connection abort
[2025-03-19T21:39:54.065Z] [21:39:49] WARNING: /workspace/src/collective/socket.cc:149: Retrying connection to 100.102.147.248 for the 2 time.
[2025-03-19T21:39:54.065Z] [21:39:53] WARNING: /workspace/src/collective/socket.cc:143: Failed to connect to:100.102.147.248:52303 Error:
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:53]: Poll error condition:Operation now in progress code:115 system error:Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:53]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] 25/03/19 21:39:53 ERROR Executor: Exception in task 0.0 in stage 3.3 (TID 5)
[2025-03-19T21:39:54.065Z] ml.dmlc.xgboost4j.java.XGBoostError: [21:39:53] /workspace/src/collective/result.cc:78:
[2025-03-19T21:39:54.065Z] - [comm.cc:220|21:39:53]: Failed to bootstrap the communication group.
[2025-03-19T21:39:54.065Z] - [comm.cc:239|21:39:53]: Bootstrap failed.
[2025-03-19T21:39:54.065Z] - [comm.cc:41|21:39:53]: Failed to connect to the tracker.
[2025-03-19T21:39:54.065Z] - [socket.cc:189|21:39:53]: Failed to connect to 100.102.147.248:52303
[2025-03-19T21:39:54.065Z] - [socket.h:79|21:39:53]: Poll error condition:Operation now in progress code:115 Operation now in progress
[2025-03-19T21:39:54.065Z] - [socket.h:348|21:39:53]: Socket error. Connection refused
[2025-03-19T21:39:54.065Z] Stack trace:
[2025-03-19T21:39:54.065Z] [bt] (0) /tmp/libxgboost4j6794688438011251026.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fce9443958c]
[2025-03-19T21:39:54.065Z] [bt] (1) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x81) [0x7fce944f5ea1]
[2025-03-19T21:39:54.065Z] [bt] (2) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::RabitComm::RabitComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::chrono::duration<long, std::ratio<1l, 1l> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, xgboost::StringView)+0x543) [0x7fce944c89d3]
[2025-03-19T21:39:54.065Z] [bt] (3) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::CommGroup::Create(xgboost::Json)+0x1759) [0x7fce944d6ae9]
[2025-03-19T21:39:54.065Z] [bt] (4) /tmp/libxgboost4j6794688438011251026.so(xgboost::collective::GlobalCommGroupInit(xgboost::Json)+0x60) [0x7fce944d81b0]
[2025-03-19T21:39:54.065Z] [bt] (5) /tmp/libxgboost4j6794688438011251026.so(XGCommunicatorInit+0x64) [0x7fce94488af4]
[2025-03-19T21:39:54.065Z] [bt] (6) /tmp/libxgboost4j6794688438011251026.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_CommunicatorInit+0xf8) [0x7fce951c7b88]
[2025-03-19T21:39:54.065Z] [bt] (7) [0x7fd899017da7]
[2025-03-19T21:39:54.065Z]
[2025-03-19T21:39:54.065Z]
[2025-03-19T21:39:54.065Z] at ml.dmlc.xgboost4j.java.Communicator.checkCall(Communicator.java:57)
[2025-03-19T21:39:54.065Z] at ml.dmlc.xgboost4j.java.Communicator.init(Communicator.java:71)
[2025-03-19T21:39:54.065Z] at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$2(XGBoost.scala:255)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-03-19T21:39:54.065Z] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-03-19T21:39:54.065Z] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2025-03-19T21:39:54.065Z] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2025-03-19T21:39:54.065Z] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2025-03-19T21:39:54.065Z] at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2025-03-19T21:39:54.065Z] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2025-03-19T21:39:54.065Z] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2025-03-19T21:39:54.065Z] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2025-03-19T21:39:54.065Z] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-03-19T21:39:54.065Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-03-19T21:39:54.065Z] at java.lang.Thread.run(Thread.java:750)
[2025-03-19T21:39:54.065Z] 25/03/19 21:39:53 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
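The log shows the worker retrying the tracker address and getting "Connection refused" each time. To separate a network or firewall problem from a tracker that never started listening, one option is to probe the same address from an executor host while the job is running. The snippet below is only a diagnostic sketch: `probe_tracker` is a hypothetical helper, not part of XGBoost or the test scripts, and it assumes the tracker accepts plain TCP connections on the host and port printed in the log.

```python
import socket
import time


def probe_tracker(host: str, port: int, retries: int = 3,
                  timeout: float = 2.0, backoff: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within `retries` attempts.

    Hypothetical helper for debugging; it only opens and closes a socket and
    does not speak the XGBoost tracker protocol.
    """
    for attempt in range(1, retries + 1):
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError,
            # timeout) when the peer cannot be reached.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as err:
            print(f"attempt {attempt}/{retries} to {host}:{port} failed: {err}")
            if attempt < retries:
                # Linear backoff between attempts.
                time.sleep(backoff * attempt)
    return False
```

For example, `probe_tracker("100.102.147.248", 52303)` uses the address from the log above. The tracker port is chosen per run, so the value has to be taken from the failing job. An immediate "Connection refused" would suggest that nothing is listening on that port, whereas a timeout would point more toward routing or firewall rules.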
Steps/Code to reproduce bug
SPARK_CONF_DIR=/home/jenkins/agent/workspace/examples_GitHub-notebook/GPU \
rapids-examples/test-notebook.sh notebook-examples/examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb
Expected behavior
The notebook should run to completion: the XGBoost workers connect to the tracker and training finishes without communicator errors.
Environment details (please complete the following information)
- Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
- Spark configuration settings related to the issue