[FEA] state machine is overly conservative with treating threads as blocked/throwing split and retries #4046
Description
Is your feature request related to a problem? Please describe.
This issue is intended to describe a problem and not necessarily propose a specific solution, since I think there are already different proposals in place for redesigning the state machine. This issue is specifically for documenting the problematic behavior.
The observed problem is that we treat a thread as BUFN (blocked until further notice) if the java thread is blocked (or waiting). BUFN is considered "more severe" than just regular blocked, implying that we have rolled back and then paused (the design doc shows a thread going from blocked to bufn_throw to bufn_wait and then to bufn, see https://github.com/NVIDIA/spark-rapids-jni/blob/main/docs/memory_management.md). We do this to be conservative: when a java thread is blocked we don't know exactly what is blocking it, so we don't know whether it is blocked in a way that is recoverable or not. We therefore treat it as unrecoverable to be "safe", in the sense of trying to allocate less memory.
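To make the conservative classification concrete, here is a minimal hypothetical sketch (illustrative names only, not the actual spark-rapids-jni code): any Java-level BLOCKED/WAITING thread is escalated to BUFN because the state machine cannot distinguish a recoverable wait (e.g. on IO) from an unrecoverable one.

```java
// Hypothetical sketch of the conservative classification described above.
// The enum and method names are illustrative; they do not exist in
// spark-rapids-jni.
public class BufnSketch {
    enum RapidsThreadState { RUNNING, BUFN }

    static RapidsThreadState classify(Thread.State javaState) {
        switch (javaState) {
            case BLOCKED:
            case WAITING:
            case TIMED_WAITING:
                // Conservative: we don't know *why* the thread is parked,
                // so treat it as blocked-until-further-notice.
                return RapidsThreadState.BUFN;
            default:
                return RapidsThreadState.RUNNING;
        }
    }

    public static void main(String[] args) {
        // A thread parked in FutureTask.get (just waiting on IO) reports
        // WAITING, so it is escalated to BUFN even though it can progress.
        System.out.println(classify(Thread.State.WAITING));
        System.out.println(classify(Thread.State.RUNNABLE));
    }
}
```

The false positive described in this issue is exactly the first case: a WAITING thread that is in fact able to make progress.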
However, you can have a false positive scenario where a thread is actually able to make progress but we treat it as BUFN, and in the extreme case we will start telling threads to split their inputs and retry if we see all threads as potentially BUFN. But splitting is not always possible (we have retry blocks that by their nature cannot split), so we might end up failing a task that could otherwise succeed.
Typically we want code paths to be able to split when possible, but that is outside the scope of this issue. This issue specifically covers the behavior of treating blocked java threads as BUFN.
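The escalation path when every task looks BUFN can be sketched as follows. This is a hypothetical illustration (all names invented, not the real state machine code): once all tasks appear BUFN, one thread is told to split and retry, and if its retry block cannot split, the task fails.

```java
import java.util.List;

// Hypothetical sketch of the all-BUFN escalation described above.
// Names are illustrative; they do not exist in spark-rapids-jni.
public class DeadlockEscalationSketch {
    static String escalate(List<Boolean> taskIsBufn, boolean canSplit) {
        boolean allBufn = taskIsBufn.stream().allMatch(b -> b);
        if (!allBufn) {
            // Some thread still looks runnable, so it may free memory.
            return "KEEP_WAITING";
        }
        // All tasks look stuck: throw a split-and-retry at one of them.
        // If its retry block cannot split, the task fails outright.
        return canSplit ? "THREAD_SPLIT_THROW" : "TASK_FAILED";
    }

    public static void main(String[] args) {
        // False positive: all threads report WAITING (e.g. on IO), so all
        // are treated as BUFN; a non-splittable retry block then fails a
        // task that might otherwise have succeeded.
        System.out.println(escalate(List.of(true, true, true), false));
    }
}
```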
Here is an example from an executor log: we treat the thread as effectively BUFN because the java thread is WAITING, likely because it is just waiting on IO:
```
17:00:57.124363,DETAIL,137824149960256,-1,-1,UNKNOWN,,deadlock state is reached with all_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), blocked_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), bufn_task_ids: {4,12,20,124,116,108,100,92,84,76,68,60,52,44,36,28} (16), threads: {137804622321216,137804623369792,137804624418368,137804625466944,137804626515520,137804627564096,137813858825792,137813859874368,137813860922944,137824093333056,137824146814528,137824152057408,137826779989568,137826781038144,137826785232448,137826803062336} (16)
17:00:57.124384,DETAIL,137824149960256,-1,-1,UNKNOWN,,all_bufn state is reached with all_task_ids size: 16
17:00:57.124387,TRANSITION,137824149960256,137826785232448,12,THREAD_BUFN,THREAD_SPLIT_THROW,
```
...
```
Thread: Executor task launch worker for task 3.0 in stage 1.0 (TID 4) - State: WAITING - Thread ID: 85
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$8(GpuMultiFileReader.scala:1348)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$8$adapted(GpuMultiFileReader.scala:1347)
  ...
```
So TID 4 is running according to the state machine, but because the java thread is WAITING we conclude it is actually BUFN and then throw a split and retry, which does not make sense.
Describe alternatives you've considered
As mentioned, this problem can also be worked around by making sure everything is splittable, which we would also like to do, but that is a separate work stream.