Status (2026-05-20): root cause is a JNI classloader bug in lance-jni's dispatcher.
Upstream issue: lance-format/lance#6577 (still open).
The classloader fix in lance-format/lance#6549 was tested and works
on DBR 18.2 LTS (validated end-to-end with a cross-compiled liblance_jni.so built from that
branch; see the comment on #6549), but the author closed the PR on 2026-05-20
with a "deduplicate with #6633" note. PR #6633 (merged 2026-05-12) only replaces
attach_current_thread_permanently() with attach_current_thread_as_daemon(); it does not
touch find_class. The classloader bug therefore remains unfixed on main.
Workaround: pin lance-spark-bundle-*:0.4.0 (stable, lance-core 4.0.0). See the
bundle-version matrix below.
Summary
format("lance").save() crashes the executor with exit code 50 on Databricks Runtime when the lance-jni native dispatcher tries to resolve org.lance.ipc.AsyncScanner via JNI — the class is in the user-installed bundle JAR, but the dispatcher's native thread uses the JVM system classloader (which only sees /databricks/jars/*) and the lookup fails.
The official Databricks setup from https://lance.org/integrations/spark/vendors/databricks/ is followed — published Maven coord, no shading, no relocations, no init scripts. Reproduces against org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4 on DBR 18.2 LTS.
Root cause (from executor stderr, captured via cluster_log_conf)
thread 'lance-jni-dispatcher' (2688) panicked at src/dispatcher.rs:44:22:
AsyncScanner class not found: JavaException
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread-91:
java.lang.NoClassDefFoundError: org/lance/ipc/AsyncScanner
Caused by: java.lang.ClassNotFoundException: org.lance.ipc.AsyncScanner
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
org.lance.ipc.AsyncScanner IS present in the published bundle JAR — unzip -l lance-spark-bundle-4.1_2.13-0.4.0-beta.4.jar | grep AsyncScanner confirms org/lance/ipc/AsyncScanner.class is shipped. The JAR is on the cluster classpath after the standard cluster Libraries → Maven install.
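The failure is the standard JNI classloader gotcha: when FindClass is called from a thread that native code (here Rust) attached to the JVM, there is no calling Java class, so the lookup goes through the system class loader (AppClassLoader). On Databricks that loader only sees /databricks/jars/* (Spark itself + DBR-managed jars), not user-installed libraries from the cluster Libraries UI. FindClass called inside a native method on a Java application thread uses the loader of the class that declared the native method, so it finds the class fine; the attached native thread doesn't.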
The Rust dispatcher's FindClass(env, "org/lance/ipc/AsyncScanner") call returns null with a pending NoClassDefFoundError, the Rust side's unwrap()/expect() panics on the resulting JavaException error, and the executor's SparkUncaughtExceptionHandler then calls System.exit(50).
This is documented JNI behavior (https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#wp1539). It bites every JNI library that doesn't pre-resolve jclass refs from a Java-thread context.
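The visibility rule behind this can be illustrated in plain Java without any JNI. The sketch below is an illustration only: the class name LoaderVisibility and the method visibleFrom are hypothetical and not part of lance or Databricks. It asks two loaders for the same class: the loader that defined it, and that loader's parent. Loaders delegate upward only, so the parent cannot see the class, which is the same relationship the system class loader has to a user-installed bundle JAR on Databricks.

```java
/** Illustration only: whether a class resolves depends on which loader is asked. */
final class LoaderVisibility {
    private LoaderVisibility() {}

    /** Returns true if the given loader (or one of its parents) can resolve binaryName. */
    static boolean visibleFrom(String binaryName, ClassLoader loader) {
        try {
            // initialize=false: only probe visibility, do not run static initializers.
            Class.forName(binaryName, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String self = LoaderVisibility.class.getName();
        ClassLoader own = LoaderVisibility.class.getClassLoader();
        ClassLoader parent = own.getParent();
        // The defining loader resolves the class; its parent never delegates downward.
        System.out.println("via defining loader: " + visibleFrom(self, own));
        System.out.println("via parent loader:   " + visibleFrom(self, parent));
    }
}
```

Under a normal launch from the classpath, the first probe should succeed and the second should fail. In the Databricks case the roles are shifted one level: the loader holding the bundle JAR plays the part of the defining loader, and AppClassLoader plays the parent.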
Suggested fix (upstream lance-jni)
Two canonical approaches:
1. Cache global jclass refs on a Java-originated call. When lance-jni first crosses Java → Rust, the call runs on a Java application thread inside a native method, so FindClass resolves through the class loader that loaded the bundle and sees user libraries. Capture env->NewGlobalRef(env->FindClass("org/lance/ipc/AsyncScanner")) at that point, and use the cached global ref from the dispatcher thread thereafter. This is what most production JNI libraries (e.g. RocksDB, the netty-tcnative bindings, Arrow's own JNI) do.
2. Resolve via Thread.currentThread().getContextClassLoader().loadClass(...). The dispatcher already has a JNIEnv; call Thread.currentThread(), fetch its context classloader, then call Class.forName(name, true, contextClassLoader) with the dotted binary name (org.lance.ipc.AsyncScanner). Slower than (1) but no init-order constraints. Useful as a fallback when (1) misses a class.
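The two approaches can be sketched at the Java level as follows. This is a minimal sketch under stated assumptions, not lance-jni code: ClassRefCache, prime and resolve are hypothetical names, java.util.ArrayList stands in for org.lance.ipc.AsyncScanner, and the "blind" loader only simulates a thread whose loader cannot see the class. In the real fix the cache would hold JNI global refs on the Rust side; here strong Class references play that role.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper mirroring the two approaches; not lance-jni code. */
final class ClassRefCache {
    // Strong references stand in for JNI global refs: they stay valid on any thread.
    private static final ConcurrentHashMap<String, Class<?>> CACHE = new ConcurrentHashMap<>();

    private ClassRefCache() {}

    /** Approach 1: resolve through a loader known to see the class, then cache it. */
    static Class<?> prime(String binaryName, ClassLoader loader) {
        Class<?> cls = load(binaryName, loader);
        if (cls != null) CACHE.put(binaryName, cls);
        return cls;
    }

    /** Dispatcher-side lookup: cache first, then approach 2 as a fallback. Null if unresolved. */
    static Class<?> resolve(String binaryName) {
        Class<?> cached = CACHE.get(binaryName);
        if (cached != null) return cached;
        return load(binaryName, Thread.currentThread().getContextClassLoader());
    }

    private static Class<?> load(String binaryName, ClassLoader loader) {
        try {
            return Class.forName(binaryName, true, loader);
        } catch (ClassNotFoundException e) {
            return null; // mirrors FindClass returning null
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String name = "java.util.ArrayList"; // stand-in for org.lance.ipc.AsyncScanner
        prime(name, ClassRefCache.class.getClassLoader());

        // A loader that resolves nothing simulates the dispatcher thread's limited view.
        ClassLoader blind = new ClassLoader(null) {
            @Override
            protected Class<?> loadClass(String n, boolean r) throws ClassNotFoundException {
                throw new ClassNotFoundException(n);
            }
        };
        Thread dispatcher = new Thread(() -> {
            System.out.println("cached:   " + resolve(name));
            System.out.println("uncached: " + resolve("java.util.LinkedList"));
        });
        dispatcher.setContextClassLoader(blind);
        dispatcher.start();
        dispatcher.join();
    }
}
```

The primed class resolves on the simulated dispatcher thread even though that thread's own loader cannot see it, while the unprimed name comes back null. One caveat for approach 2: it only helps if the thread's context classloader can actually see the bundle, which the source does not confirm for a natively attached thread, so capturing a known-good loader on the first Java-originated call may be the safer variant.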
The panic site at dispatcher.rs:44:22 is the natural place to fix it — the panic message itself names the symptom.
Environment
- Databricks Runtime: 18.2.x-scala2.13 (Spark 4.1, Java 21, Arrow Java 18.x)
- Cluster: Standard_D4s_v5 × 2 workers, default access mode (no isolation shared)
- Library: cluster Libraries → Maven, no other classpath manipulation
- Coord: org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4
The same exit-50 reproduces on DBR 16.4 LTS / 17.3 LTS with the matching bundle for those Spark versions; the underlying classloader behavior is identical across DBR versions.
Reproducer
Create a fresh DBR 18.2 cluster (defaults, 2 workers). Install org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4 via cluster Libraries → Maven. Restart. Run:
import org.apache.spark.sql.{RowFactory, SparkSession}
import org.apache.spark.sql.types._
import scala.util.Random
import scala.collection.JavaConverters._

val spark = SparkSession.builder().getOrCreate()

// Build a dense random vector of the given dimension.
def randomVec(rng: Random, dim: Int): Array[Float] = {
  val v = new Array[Float](dim); var i = 0
  while (i < dim) { v(i) = rng.nextFloat(); i += 1 }
  v
}

val Dim = 128
val N = 10000

// rvec is tagged with the fixed-size-list size via column metadata.
val schema = new StructType(Array(
  StructField("rid", IntegerType, nullable = false),
  StructField(
    "rvec",
    ArrayType(FloatType, containsNull = false),
    nullable = false,
    new MetadataBuilder().putLong("arrow.fixed-size-list.size", Dim.toLong).build())))

val rng = new Random(1337)
val rows = (0 until N).map(i => RowFactory.create(Integer.valueOf(i), randomVec(rng, Dim)))
val df = spark.createDataFrame(rows.asJava, schema).repartition(2)

// This write is what crashes the executor.
df.write.format("lance").save(s"/local_disk0/tmp/lance-repro-${System.currentTimeMillis}")
Outcome:
[TASK_FAILED_EXECUTOR_LOSS] Task failed due to executor loss:
ExecutorLostFailure (executor 4 exited caused by one of the running tasks)
Reason: Command exited with code 50, uncaught exception
SQLSTATE: XX000
Capturing the actual diagnostic requires cluster_log_conf={dbfs:{destination:dbfs:/cluster-logs/...}} at cluster create time, then waiting ~3 min after the failure for DBR's log delivery agent to flush executor stderr to DBFS. Without that conf, executor stderr is gone the moment the cluster terminates and the user only sees the generic Spark "executor exited code 50" message.
Bundle version matrix
The crash reproduces with these published bundles (all use lance-jni from lance-core 5.x / 6.x):
| bundle | underlying lance-core | DBR 18.2 result |
|---|---|---|
| lance-spark-bundle-4.0_2.13:0.4.0-beta.{1..4} | 5.x / 6.x | crash |
| lance-spark-bundle-4.1_2.13:0.4.0-beta.4 | 6.0.0-beta.4 | crash |
These work on DBR (no AsyncScanner classload from this codepath, presumably):
| bundle | underlying lance-core | DBR 18.2 result |
|---|---|---|
| lance-spark-bundle-4.0_2.13:0.3.0-beta.3 | 4.1.0-beta.2 | works |
| lance-spark-bundle-*:0.4.0 (stable) | 4.0.0 | works |
So: lance-jni introduced an AsyncScanner JNI lookup somewhere between lance-core 4.x and 5.x. A git log lance-jni/src/dispatcher.rs since 4.x should locate the change — that's where the FindClass call needs the classloader fix.
Workaround
Pin to lance-spark-bundle-*:0.4.0 (or 0.3.0-beta.3) on Databricks. This forces lance-core 4.x and the dispatcher path that doesn't hit AsyncScanner.
Asks
- Is there an in-flight fix for the lance-jni dispatcher's FindClass to use cached global refs?
- The official Databricks doc (https://lance.org/integrations/spark/vendors/databricks/) currently recommends the latest bundle, which has this bug; would a pin to :0.4.0 until the fix lands be reasonable?
I have full executor logs (exec4-stderr.log showing the panic, driver-stdout.log for context) and can share them on request.