[FEA] state machine is overly conservative with treating threads as blocked/throwing split and retries #4046
Description
Is your feature request related to a problem? Please describe.
This issue is intended to describe a problem and not necessarily propose a specific solution, since I think there are already different proposals in place for redesigning the state machine. This issue is specifically for documenting the problematic behavior.
The observed problem is that we treat a thread as BUFN (blocked until further notice) if the java thread is blocked (or waiting). BUFN is considered "more severe" than just regular blocked, implying that we have rolled back and then paused (the design doc shows a thread going from blocked to bufn_throw to bufn_wait and then to bufn, see https://github.com/NVIDIA/spark-rapids-jni/blob/main/docs/memory_management.md). We do this to be conservative: when a java thread is blocked we don't know exactly what is blocking it, so we don't know whether it is blocked in a way that is recoverable or not. We therefore treat it as unrecoverable to be "safe", in the sense of trying to allocate less memory.
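To make the conservative classification concrete, here is a minimal hypothetical sketch (illustrative names only, not the actual spark-rapids-jni code): any Java-level BLOCKED/WAITING thread is escalated to BUFN because the state machine cannot distinguish a recoverable wait (e.g. on IO) from an unrecoverable one.

```java
// Hypothetical sketch of the conservative classification described above.
// The enum and method names are illustrative; they do not exist in
// spark-rapids-jni.
public class BufnSketch {
    enum RapidsThreadState { RUNNING, BUFN }

    static RapidsThreadState classify(Thread.State javaState) {
        switch (javaState) {
            case BLOCKED:
            case WAITING:
            case TIMED_WAITING:
                // Conservative: we don't know *why* the thread is parked,
                // so treat it as blocked-until-further-notice.
                return RapidsThreadState.BUFN;
            default:
                return RapidsThreadState.RUNNING;
        }
    }

    public static void main(String[] args) {
        // A thread parked in FutureTask.get (just waiting on IO) reports
        // WAITING, so it is escalated to BUFN even though it can progress.
        System.out.println(classify(Thread.State.WAITING));
        System.out.println(classify(Thread.State.RUNNABLE));
    }
}
```

The false positive described in this issue is exactly the first case: a WAITING thread that is in fact able to make progress.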
However, you can have a false positive scenario where a thread is actually able to make progress but we treat it as BUFN, and in the extreme case we will start telling threads to split their inputs and retry if we see all threads as potentially BUFN. But splitting is not always possible (we have retry blocks that by their nature cannot split), so we might end up failing a task that could otherwise succeed.
Typically we want code paths to be able to split when possible, but that is outside the scope of this issue. This issue specifically covers the behavior of treating blocked java threads as BUFN.
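The escalation path when every task looks BUFN can be sketched as follows. This is a hypothetical illustration (all names invented, not the real state machine code): once all tasks appear BUFN, one thread is told to split and retry, and if its retry block cannot split, the task fails.

```java
import java.util.List;

// Hypothetical sketch of the all-BUFN escalation described above.
// Names are illustrative; they do not exist in spark-rapids-jni.
public class DeadlockEscalationSketch {
    static String escalate(List<Boolean> taskIsBufn, boolean canSplit) {
        boolean allBufn = taskIsBufn.stream().allMatch(b -> b);
        if (!allBufn) {
            // Some thread still looks runnable, so it may free memory.
            return "KEEP_WAITING";
        }
        // All tasks look stuck: throw a split-and-retry at one of them.
        // If its retry block cannot split, the task fails outright.
        return canSplit ? "THREAD_SPLIT_THROW" : "TASK_FAILED";
    }

    public static void main(String[] args) {
        // False positive: all threads report WAITING (e.g. on IO), so all
        // are treated as BUFN; a non-splittable retry block then fails a
        // task that might otherwise have succeeded.
        System.out.println(escalate(List.of(true, true, true), false));
    }
}
```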
Here is an example from an executor log: we treat the thread as effectively BUFN because the java thread is WAITING, likely because it is just waiting on IO:
```
17:00:57.124363,DETAIL,137824149960256,-1,-1,UNKNOWN,,deadlock state is reached with all_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), blocked_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), bufn_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), threads: {137804622321216,137804623369792,137804624418368,137804625466944,137804626515520,137804627564096,137813858825792,137813859874368,137813860922944,137824093333056,137824146814528,137824152057408,137826779989568,137826781038144,137826785232448,137826803062336} (16)
17:00:57.124384,DETAIL,137824149960256,-1,-1,UNKNOWN,,all_bufn state is reached with all_task_ids size: 16
17:00:57.124387,TRANSITION,137824149960256,137826785232448,12,THREAD_BUFN,THREAD_SPLIT_THROW,
```
...
```
Thread: Executor task launch worker for task 3.0 in stage 1.0 (TID 4) - State: WAITING - Thread ID: 85
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$8(GpuMultiFileReader.scala:1348)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$8$adapted(GpuMultiFileReader.scala:1347)
  ...
```
So TID 4 is running according to the state machine, but because the java thread is WAITING we conclude it is actually BUFN and then throw a split and retry, which does not make sense.
Describe alternatives you've considered
As mentioned, this problem can also be worked around by making sure everything is splittable, which we would also like to do, but that is a separate work stream.