Backend
VL (Velox)
Bug description
When running TPC-DS or heavy scan workloads on HDFS with IOThreads > 0 and SplitPreloadPerDriver > 0, the JVM process can crash with SIGSEGV inside jni_NewStringUTF during hdfsGetPathInfo(). The crashing thread is a CPUThreadPoolN thread used for async split preloading.
Expected behavior: HDFS file operations should work reliably on IOThreadPool threads across consecutive preload tasks.
Actual behavior: After a certain number of tasks, the IOThreadPool thread crashes with SIGSEGV when calling hdfsGetPathInfo() via libhdfs.so.
Root cause
libhdfs.so caches JNIEnv* in an ELF thread-local (__thread) variable after the first AttachCurrentThread on each thread. The cached env is returned on all subsequent calls without re-validation (confirmed by disassembly of libhdfs.so's getJNIEnv function).
Gluten's JniColumnarBatchIterator::~JniColumnarBatchIterator() (JniCommon.cc) and JavaInputStreamAdaptor::Close() (JniWrapper.cc) call vm_->DetachCurrentThread() after JNI cleanup. This invalidates the JNIEnv* and frees the backing JavaThread object in the JVM. But libhdfs's TLS cache still holds the old pointer. On the next HDFS call, libhdfs's getJNIEnv() returns the stale pointer, and the JVM crashes when it tries to transition the freed thread state.
Detailed mechanism
libhdfs getJNIEnv fast path (from disassembly):
1. __tls_get_addr() → get &(__thread hdfsTls*)
2. if (tls_ptr != NULL) → return tls_ptr->env // NO RE-VALIDATION
3. else → slow path: AttachCurrentThread, cache env
After DetachCurrentThread:
- JVM frees the JavaThread object, reclaims the memory at the env address
- libhdfs __thread TLS still holds the stale hdfsTls* → stale env
- Next HDFS call → getJNIEnv() fast path returns stale env
- jni_NewStringUTF(stale_env, ...) → computes JavaThread* = env - 0x200 → freed memory
- JVM reads *(JavaThread + 0x290): gets garbage (not the magic alive marker 0xdeab)
- JVM calls block_if_vm_exited(), sets JavaThread* = NULL
- transition_from_native(NULL, ...) → SIGSEGV at address 0x278
Evidence from core dump
Core dump: core.CPUThreadPool21.1770392 (from TPC-DS benchmark on YARN)
Registers at crash frame (ThreadStateTransition::transition_from_native):
RDI = 0x0 ← JavaThread* is NULL (set by block_if_vm_exited)
R12 = 0x7f3003a52200 ← stale JNIEnv* from libhdfs TLS cache
Memory at stale env (0x7f3003a52200):
0x7f3003a52200: 0x0000000000000000 0x0000000000000000 ← JNI function table is NULL
0x7f3003a52210: 0x0000001200000112 0x0000000000000000 ← JVM method resolution data (reused memory)
Call chain (resolved from libvelox.so symbol table via nm):
CPUThreadPool21 (preload task)
→ SplitReader::createReader() [libvelox.so + 0x6173914]
→ HdfsFileSystem::openFileForRead() [libvelox.so + 0x3787216]
→ HdfsReadFile::HdfsReadFile() [libvelox.so + 0x378AB36, constructor]
→ driver_->GetPathInfo()
→ hdfsGetPathInfo() [libhdfs.so]
→ getJNIEnv() → returns stale env
→ jni_NewStringUTF(stale_env, path) → SIGSEGV
How DetachCurrentThread gets called on CPUThreadPool threads
The two call sites:
JniColumnarBatchIterator::~JniColumnarBatchIterator() — cpp/core/jni/JniCommon.cc
JavaInputStreamAdaptor::Close() — cpp/core/jni/JniWrapper.cc
These objects are held via shared_ptr chains rooted in the Velox Task. When a task is terminated (e.g., by memory arbitration or WholeStageResultIterator::~WholeStageResultIterator() calling task_->requestCancel()), Task::terminate() calls driver->closeByTask() → closeOperators(), which destroys DataSource objects and drops the last shared_ptr references. If this cleanup runs on a CPUThreadPool thread (e.g., triggered by a memory pressure callback during a preload task), the destructor calls DetachCurrentThread on that thread.
Sequence:
- CPUThreadPool21 runs preload task A → libhdfs attaches thread, caches env in TLS
- Object cleanup on the same thread → destructor calls DetachCurrentThread → env invalidated, but libhdfs TLS still holds it
- CPUThreadPool21 runs preload task B → hdfsGetPathInfo() → stale env → SIGSEGV
Gluten version
main branch
Spark version
Spark-3.5.x
Spark configurations
No response
System information
No response
Relevant logs
core dump back trace:
Core: core.CPUThreadPool21.1770392
#10 ThreadStateTransition::transition_from_native(JavaThread*, JavaThreadState) [libjvm.so]
RDI=0x0 (NULL JavaThread*), R12=0x7f3003a52200 (stale JNIEnv*)
#11 jni_NewStringUTF [libjvm.so]
#12 newJavaStr (env=0x7f3003a52200, path="/.../catalog_sales/...parquet") [libhdfs.so]
#13 constructNewObjectOfPath [libhdfs.so]
#14 hdfsGetPathInfo [libhdfs.so]
#15 HdfsReadFile::HdfsReadFile() [libvelox.so + 0x378AB36]
#16 HdfsFileSystem::openFileForRead() [libvelox.so + 0x3787216]