JNI classloader bug: lance-jni dispatcher FindClass fails on user-installed JARs (Databricks)

> **Status (2026-05-20):** root cause is a JNI classloader bug in `lance-jni`'s dispatcher.
> Upstream issue: https://github.com/lance-format/lance/issues/6577 (still open).
>
> The classloader fix at https://github.com/lance-format/lance/pull/6549 was confirmed working
> on DBR 18.2 LTS (validated end-to-end with a cross-compiled `liblance_jni.so` built from that
> branch; see the comment on #6549), but the author **closed the PR on 2026-05-20**
> with a "deduplicate with #6633" note. PR #6633 (merged 2026-05-12) only swaps
> `attach_current_thread_permanently()` → `attach_current_thread_as_daemon()`; it does not
> touch `find_class`. The classloader bug therefore remains unfixed on `main`.
>
> Workaround: pin `lance-spark-bundle-*:0.4.0` (stable, lance-core 4.0.0). See the
> bundle-version matrix below.

## Summary

`format("lance").save()` crashes the executor with **exit code 50** on Databricks Runtime when the lance-jni native dispatcher tries to resolve `org.lance.ipc.AsyncScanner` via JNI. The class is in the user-installed bundle JAR, but the dispatcher's native thread resolves classes through the JVM system classloader (which only sees `/databricks/jars/*`), so the lookup fails.

The setup follows the official Databricks instructions (https://lance.org/integrations/spark/vendors/databricks/): published Maven coordinate, no shading, no relocations, no init scripts. The crash reproduces against `org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4` on DBR 18.2 LTS.

## Root cause (from executor stderr, captured via `cluster_log_conf`)

```
thread 'lance-jni-dispatcher' (2688) panicked at src/dispatcher.rs:44:22:
AsyncScanner class not found: JavaException

ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread-91:
  java.lang.NoClassDefFoundError: org/lance/ipc/AsyncScanner
Caused by: java.lang.ClassNotFoundException: org.lance.ipc.AsyncScanner
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
```

`org.lance.ipc.AsyncScanner` IS present in the published bundle JAR — `unzip -l lance-spark-bundle-4.1_2.13-0.4.0-beta.4.jar | grep AsyncScanner` confirms `org/lance/ipc/AsyncScanner.class` is shipped. The JAR is on the cluster classpath after the standard cluster Libraries → Maven install.

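The failure is the standard JNI classloader gotcha. When `FindClass` is called from a thread that was attached natively (here, by Rust) and has no Java frames on its stack, it resolves through the system classloader (`AppClassLoader`), which on Databricks only sees `/databricks/jars/*` (Spark itself + DBR-managed jars), not user-installed libraries from the cluster Libraries UI. When `FindClass` is called inside a native method invoked from Java, it instead uses the classloader of the class that declared that native method, which here would be the loader that loaded the bundle JAR, so the same lookup succeeds there.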

The Rust dispatcher's `FindClass(env, "org/lance/ipc/AsyncScanner")` returns null with a Java exception pending, which the Rust side sees as a `JavaException` error. Its `unwrap()`/`expect()` then panics, the failure surfaces as an uncaught exception on the executor, and `SparkUncaughtExceptionHandler` calls `System.exit(50)`.

This is documented JNI behavior (https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#wp1539). It affects any JNI library that looks up application classes from natively attached threads without first resolving the `jclass` refs in a Java-called context.
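The visibility asymmetry can be reproduced in plain Java, without JNI or Databricks. The sketch below is an illustration only: `demo.UserOnly` and `UserLibLoader` are hypothetical names. The class is generated in memory so that only the child loader can define it, which stands in for a class that lives in a user-installed JAR. Looking it up through the system classloader mirrors what `FindClass` does on a natively attached thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Illustration only: demo.UserOnly and UserLibLoader are hypothetical names. */
class ClassLoaderVisibilityDemo {

    /** Builds a minimal class file for an empty public class (no fields or methods). */
    static byte[] emptyClassBytes(String internalName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                           // magic
            out.writeShort(0);                                  // minor version
            out.writeShort(52);                                 // major version (Java 8)
            out.writeShort(5);                                  // constant pool count (entries + 1)
            out.writeByte(7); out.writeShort(2);                // #1 Class -> #2
            out.writeByte(1); out.writeUTF(internalName);       // #2 Utf8: this class
            out.writeByte(7); out.writeShort(4);                // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #4 Utf8: superclass
            out.writeShort(0x0021);                             // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                                  // this_class
            out.writeShort(3);                                  // super_class
            out.writeShort(0);                                  // interfaces
            out.writeShort(0);                                  // fields
            out.writeShort(0);                                  // methods
            out.writeShort(0);                                  // attributes
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for the loader that holds user-installed JARs. */
    static final class UserLibLoader extends ClassLoader {
        UserLibLoader() {
            super(ClassLoader.getSystemClassLoader());
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            if (!"demo.UserOnly".equals(name)) {
                throw new ClassNotFoundException(name);
            }
            byte[] bytes = emptyClassBytes("demo/UserOnly");
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Returns true if the given loader can resolve the named class. */
    static boolean visibleTo(ClassLoader loader, String name) {
        try {
            Class.forName(name, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader system = ClassLoader.getSystemClassLoader();
        ClassLoader user = new UserLibLoader();
        System.out.println("system loader sees demo.UserOnly: " + visibleTo(system, "demo.UserOnly"));
        System.out.println("user loader sees demo.UserOnly:   " + visibleTo(user, "demo.UserOnly"));
    }
}
```

The child loader delegates to its parent first and only then defines the class itself, so the parent never learns about `demo.UserOnly`. This is the same one-way visibility that separates the loader holding the bundle JAR from `AppClassLoader`.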

## Suggested fix (upstream lance-jni)

Two canonical approaches:

1. **Cache global `jclass` refs on the first Java → Rust call.** When lance-jni first crosses Java → Rust, `FindClass` runs inside a native method whose declaring class was loaded from the bundle JAR, so it resolves through a loader that sees user libraries. Capture `env->NewGlobalRef(env->FindClass("org/lance/ipc/AsyncScanner"))` at that point, and use the cached global ref from the dispatcher thread thereafter. This is what many production JNI libraries (e.g. RocksDB, the netty-tcnative bindings, Arrow's own JNI) do.

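2. **Resolve via an explicit classloader.** The dispatcher already has a `JNIEnv`, so it can call `Class.forName(name, true, loader)` (or `loader.loadClass(name)`) instead of `FindClass`. The loader must be one that sees user libraries: on a natively attached thread, `Thread.currentThread().getContextClassLoader()` typically defaults to the system classloader, so the loader should be captured on a Java-originated call and handed to the dispatcher (or set as the dispatcher thread's context classloader). Slower than (1) but no init-order constraints. Useful as a fallback when (1) misses a class.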

The panic site at `dispatcher.rs:44:22` is the natural place to fix it — the panic message itself names the symptom.
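The two approaches can be sketched in plain Java to show the control flow. This is a sketch under stated assumptions, not the lance-jni implementation: `ClassRefs`, `UserLibLoader`, and `demo.UserOnly` are hypothetical names, and the real fix would live in Rust using the JNI equivalents (`FindClass` plus `NewGlobalRef`). The cache plays the role of the global `jclass` refs, and the captured loader plays the role of the fallback in (2).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustration only: demo.UserOnly, UserLibLoader and ClassRefs are hypothetical names. */
class CachedClassRefsDemo {

    /** Builds a minimal class file for an empty public class (no fields or methods). */
    static byte[] emptyClassBytes(String internalName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE); out.writeShort(0); out.writeShort(52); // magic, minor, major
            out.writeShort(5);                                    // constant pool count
            out.writeByte(7); out.writeShort(2);                  // #1 Class -> #2
            out.writeByte(1); out.writeUTF(internalName);         // #2 Utf8: this class
            out.writeByte(7); out.writeShort(4);                  // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");   // #4 Utf8: superclass
            out.writeShort(0x0021); out.writeShort(1); out.writeShort(3); // flags, this, super
            for (int i = 0; i < 4; i++) out.writeShort(0);        // no interfaces/fields/methods/attrs
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for the loader that holds user-installed JARs. */
    static final class UserLibLoader extends ClassLoader {
        UserLibLoader() { super(ClassLoader.getSystemClassLoader()); }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            if (!"demo.UserOnly".equals(name)) throw new ClassNotFoundException(name);
            byte[] bytes = emptyClassBytes("demo/UserOnly");
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Holds resolved classes (approach 1) and a captured loader (approach 2). */
    static final class ClassRefs {
        private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();
        private volatile ClassLoader captured;

        /** Approach 2: remember a loader that can see user libraries. */
        void capture(ClassLoader loader) { captured = loader; }

        /** Approach 1: resolve eagerly while the right loader is at hand. */
        boolean prime(ClassLoader loader, String name) {
            capture(loader);
            return lookup(name) != null;
        }

        /** Cache first, then the captured loader, else the system loader. Null if not found. */
        Class<?> lookup(String name) {
            Class<?> hit = cache.get(name);
            if (hit != null) return hit;
            ClassLoader loader = captured != null ? captured : ClassLoader.getSystemClassLoader();
            try {
                Class<?> resolved = Class.forName(name, false, loader);
                cache.put(name, resolved);
                return resolved;
            } catch (ClassNotFoundException e) {
                return null;
            }
        }
    }

    /** Runs the lookup on a separate thread pinned to the system loader, like the dispatcher. */
    static boolean resolvesOnDispatcher(ClassRefs refs, String name) {
        boolean[] ok = new boolean[1];
        Thread t = new Thread(() -> ok[0] = refs.lookup(name) != null, "demo-dispatcher");
        t.setContextClassLoader(ClassLoader.getSystemClassLoader());
        t.start();
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return ok[0];
    }

    public static void main(String[] args) {
        ClassRefs refs = new ClassRefs();
        System.out.println("before priming: " + resolvesOnDispatcher(refs, "demo.UserOnly"));
        refs.prime(new UserLibLoader(), "demo.UserOnly");
        System.out.println("after priming:  " + resolvesOnDispatcher(refs, "demo.UserOnly"));
    }
}
```

The dispatcher thread never consults its own context classloader here. It only reads state that was set up on a thread with the right visibility, which is the property both suggested fixes rely on.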

## Environment

- Databricks Runtime: `18.2.x-scala2.13` (Spark 4.1, Java 21, Arrow Java 18.x)
- Cluster: `Standard_D4s_v5` × 2 workers, default access mode (no isolation shared)
- Library: cluster Libraries → Maven, no other classpath manipulation
- Coord: `org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4`

The same exit-50 reproduces on DBR 16.4 LTS / 17.3 LTS with the matching bundle for those Spark versions; the underlying classloader behavior is identical across DBR versions.

## Reproducer

Create a fresh DBR 18.2 cluster (defaults, 2 workers). Install `org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4` via cluster Libraries → Maven. Restart. Run:

```scala
import org.apache.spark.sql.{RowFactory, SparkSession}
import org.apache.spark.sql.types._
import scala.util.Random
import scala.jdk.CollectionConverters._ // Scala 2.13 replacement for the deprecated JavaConverters

val spark = SparkSession.builder().getOrCreate()

def randomVec(rng: Random, dim: Int): Array[Float] = {
  val v = new Array[Float](dim); var i = 0
  while (i < dim) { v(i) = rng.nextFloat(); i += 1 }
  v
}

val Dim = 128
val N = 10000
val schema = new StructType(Array(
  StructField("rid", IntegerType, nullable = false),
  StructField(
    "rvec",
    ArrayType(FloatType, containsNull = false),
    nullable = false,
    new MetadataBuilder().putLong("arrow.fixed-size-list.size", Dim.toLong).build())))

val rng = new Random(1337)
val rows = (0 until N).map(i => RowFactory.create(Integer.valueOf(i), randomVec(rng, Dim)))
val df = spark.createDataFrame(rows.asJava, schema).repartition(2)

df.write.format("lance").save(s"/local_disk0/tmp/lance-repro-${System.currentTimeMillis}")
```

Outcome:

```
[TASK_FAILED_EXECUTOR_LOSS] Task failed due to executor loss:
ExecutorLostFailure (executor 4 exited caused by one of the running tasks)
Reason: Command exited with code 50, uncaught exception
SQLSTATE: XX000
```

Capturing the actual diagnostic requires `cluster_log_conf={dbfs:{destination:dbfs:/cluster-logs/...}}` at cluster create time, then waiting ~3 min after the failure for DBR's log delivery agent to flush executor stderr to DBFS. Without that conf, executor stderr is gone the moment the cluster terminates and the user only sees the generic Spark "executor exited code 50" message.

## Bundle version matrix

The crash reproduces with these published bundles (all use lance-jni from lance-core 5.x / 6.x):

| bundle | underlying lance-core | DBR 18.2 result |
|---|---|---|
| `lance-spark-bundle-4.0_2.13:0.4.0-beta.{1..4}` | 5.x / 6.x | crash |
| `lance-spark-bundle-4.1_2.13:0.4.0-beta.4` | 6.0.0-beta.4 | crash |

These work on DBR (no AsyncScanner classload from this codepath, presumably):

| bundle | underlying lance-core | DBR 18.2 result |
|---|---|---|
| `lance-spark-bundle-4.0_2.13:0.3.0-beta.3` | 4.1.0-beta.2 | works |
| `lance-spark-bundle-*:0.4.0` (stable) | 4.0.0 | works |

So: `lance-jni` introduced an `AsyncScanner` JNI lookup somewhere between lance-core 4.x and 5.x. A `git log lance-jni/src/dispatcher.rs` since 4.x should locate the change — that's where the `FindClass` call needs the classloader fix.

## Workaround

Pin to `lance-spark-bundle-*:0.4.0` (or `0.3.0-beta.3`) on Databricks. This forces lance-core 4.x and the dispatcher path that doesn't hit AsyncScanner.

## Asks

1. Is there an in-flight fix for the lance-jni dispatcher's `FindClass` to use cached global refs?
2. The official Databricks doc (https://lance.org/integrations/spark/vendors/databricks/) currently recommends the latest bundle, which has this bug — would a pin to `:0.4.0` until the fix lands be reasonable?

I have full executor logs (`exec4-stderr.log` showing the panic, `driver-stdout.log` for context) and can share them on request.


