Description
Failure in example notebook:
examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb
[2025-04-01T21:43:12.641Z]
[Stage 3:> (0 + 1) / 1]
25/04/01 21:43:12 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 2) (100.104.100.217 executor 0): ml.dmlc.xgboost4j.java.XGBoostError: [21:43:12] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
[2025-04-01T21:43:12.641Z] - Free memory: 5.875MB
[2025-04-01T21:43:12.641Z] - Requested memory: 4.33984MB
[2025-04-01T21:43:12.641Z]
Full StackTrace
[2025-04-01T21:43:31.859Z] ml.dmlc.xgboost4j.java.XGBoostError: [21:43:12] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
[2025-04-01T21:43:31.859Z] - Free memory: 5.875MB
[2025-04-01T21:43:31.859Z] - Requested memory: 4.33984MB
[2025-04-01T21:43:31.859Z]
[2025-04-01T21:43:31.859Z] Stack trace:
[2025-04-01T21:43:31.859Z] [bt] (0) /tmp/libxgboost4j5960392861403856956.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7f1eefb2ecfc]
[2025-04-01T21:43:31.859Z] [bt] (1) /tmp/libxgboost4j5960392861403856956.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x493) [0x7f1ef0312143]
[2025-04-01T21:43:31.859Z] [bt] (2) /tmp/libxgboost4j5960392861403856956.so(+0x440247) [0x7f1eef9ed247]
[2025-04-01T21:43:31.859Z] [bt] (3) /tmp/libxgboost4j5960392861403856956.so(void xgboost::common::ProcessSlidingWindow<xgboost::data::CudfAdapterBatch>(xgboost::Context const*, xgboost::data::CudfAdapterBatch const&, xgboost::MetaInfo const&, unsigned long, unsigned long, unsigned long, float, xgboost::common::SketchContainer*, int)+0x9e9) [0x7f1ef053bae9]
[2025-04-01T21:43:31.859Z] [bt] (4) /tmp/libxgboost4j5960392861403856956.so(void xgboost::common::AdapterDeviceSketch<xgboost::data::CudfAdapterBatch>(xgboost::Context const*, xgboost::data::CudfAdapterBatch, int, xgboost::MetaInfo const&, float, xgboost::common::SketchContainer*, unsigned long)+0x263) [0x7f1ef053c973]
[2025-04-01T21:43:31.859Z] [bt] (5) /tmp/libxgboost4j5960392861403856956.so(+0xf3da2d) [0x7f1ef04eaa2d]
[2025-04-01T21:43:31.859Z] [bt] (6) /tmp/libxgboost4j5960392861403856956.so(xgboost::data::cuda_impl::MakeSketches(xgboost::Context const*, xgboost::data::DataIterProxy<void (void*), int (void*)>, xgboost::data::DMatrixProxy, std::shared_ptr<xgboost::DMatrix>, xgboost::BatchParam const&, float, std::shared_ptr<xgboost::common::HistogramCuts>, xgboost::MetaInfo const&, long, xgboost::data::ExternalDataInfo*)+0x9d2) [0x7f1ef04eb642]
[2025-04-01T21:43:31.860Z] [bt] (7) /tmp/libxgboost4j5960392861403856956.so(xgboost::data::IterativeDMatrix::InitFromCUDA(xgboost::Context const*, xgboost::BatchParam const&, long, void*, float, std::shared_ptr<xgboost::DMatrix>)+0x598) [0x7f1ef04cfa58]
[2025-04-01T21:43:31.860Z] [bt] (8) /tmp/libxgboost4j5960392861403856956.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x763) [0x7f1eefe81b93]
[2025-04-01T21:43:31.860Z]
[2025-04-01T21:43:31.860Z]
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:49)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:107)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:57)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin.ml$dmlc$xgboost4j$scala$spark$GpuXGBoostPlugin$$buildQuantileDMatrix$1(GpuXGBoostPlugin.scala:160)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$1.next(GpuXGBoostPlugin.scala:171)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$1.next(GpuXGBoostPlugin.scala:168)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$4(XGBoost.scala:259)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.Utils$.withResource(Utils.scala:120)
[2025-04-01T21:43:31.860Z] at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$2(XGBoost.scala:258)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2025-04-01T21:43:31.860Z] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2025-04-01T21:43:31.860Z] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2025-04-01T21:43:31.860Z] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2025-04-01T21:43:31.860Z] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2025-04-01T21:43:31.860Z] at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2025-04-01T21:43:31.860Z] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2025-04-01T21:43:31.860Z] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2025-04-01T21:43:31.860Z] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2025-04-01T21:43:31.860Z] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-04-01T21:43:31.860Z] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-04-01T21:43:31.860Z] at java.lang.Thread.run(Thread.java:750)
cc: @wbo4958