
JNI classloader bug: lance-jni dispatcher FindClass fails on user-installed JARs (Databricks) #538

@sezruby

Status (2026-05-20): root cause is a JNI classloader bug in lance-jni's dispatcher.
Upstream issue: lance-format/lance#6577 (still open).

The classloader fix at lance-format/lance#6549 was tested working
on DBR 18.2 LTS (validated end-to-end with a cross-compiled liblance_jni.so from that
branch — see the comment on #6549) but the PR was closed by the author on 2026-05-20
with a "deduplicate with #6633" note. PR #6633 (merged 2026-05-12) only swaps
attach_current_thread_permanently() → attach_current_thread_as_daemon(); it does not
touch find_class. The classloader bug therefore remains unfixed on main.

Workaround: pin lance-spark-bundle-*:0.4.0 (stable, lance-core 4.0.0). See the
bundle-version matrix below.

Summary

format("lance").save() crashes the executor with exit code 50 on Databricks Runtime when the lance-jni native dispatcher tries to resolve org.lance.ipc.AsyncScanner via JNI — the class is in the user-installed bundle JAR, but the dispatcher's native thread uses the JVM system classloader (which only sees /databricks/jars/*) and the lookup fails.

The official Databricks setup from https://lance.org/integrations/spark/vendors/databricks/ is followed — published Maven coord, no shading, no relocations, no init scripts. Reproduces against org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4 on DBR 18.2 LTS.

Root cause (from executor stderr, captured via cluster_log_conf)

thread 'lance-jni-dispatcher' (2688) panicked at src/dispatcher.rs:44:22:
AsyncScanner class not found: JavaException

ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread-91:
  java.lang.NoClassDefFoundError: org/lance/ipc/AsyncScanner
Caused by: java.lang.ClassNotFoundException: org.lance.ipc.AsyncScanner
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)

org.lance.ipc.AsyncScanner IS present in the published bundle JAR — unzip -l lance-spark-bundle-4.1_2.13-0.4.0-beta.4.jar | grep AsyncScanner confirms org/lance/ipc/AsyncScanner.class is shipped. The JAR is on the cluster classpath after the standard cluster Libraries → Maven install.

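The failure is the standard JNI classloader gotcha: on a thread attached from native code (here, by Rust) with no Java frame on the stack, FindClass resolves through the JVM's system classloader (AppClassLoader), which on Databricks only sees /databricks/jars/* (Spark itself + DBR-managed jars), not user-installed libraries from the cluster Libraries UI. When FindClass is called from inside a native method invoked by Java code, it resolves through the classloader of the class that declared that method and finds the class fine; the natively attached dispatcher thread has no such frame, so it doesn't.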
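The visibility gap can be illustrated in plain Java, without JNI. This is a minimal sketch under stated assumptions: LoaderVisibilityDemo and UserInstalledClass are hypothetical names, the loader that defined the demo stands in for the loader that sees user-installed JARs, and the JDK platform classloader stands in for the restricted loader the dispatcher thread ends up with. It is an analogy for the lookup behavior, not a reproduction of the Databricks loader hierarchy.

```java
public class LoaderVisibilityDemo {
    /** Hypothetical stand-in for a class that ships only in a user-installed JAR. */
    public static class UserInstalledClass {}

    /** Returns true if the given loader can resolve className, false on ClassNotFoundException. */
    public static boolean canSee(ClassLoader loader, String className) {
        try {
            // initialize=false: only visibility is tested, static initializers are not run.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = UserInstalledClass.class.getName();
        // The loader that defined this demo plays the loader that sees user JARs.
        ClassLoader userLoader = LoaderVisibilityDemo.class.getClassLoader();
        // The platform loader sees only JDK classes and plays the restricted loader.
        ClassLoader restricted = ClassLoader.getPlatformClassLoader();

        System.out.println("user loader sees class:       " + canSee(userLoader, name));
        System.out.println("restricted loader sees class: " + canSee(restricted, name));
    }
}
```

The same class name resolves through one loader and raises ClassNotFoundException through the other, which is the shape of the failure in the stack trace above: the class exists, but the loader consulted on that thread cannot reach it.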

The chain is: the Rust dispatcher's FindClass(env, "org/lance/ipc/AsyncScanner") returns null; the Rust side's unwrap()/expect() panics; the panic surfaces through JNI as JavaException; and the executor's SparkUncaughtExceptionHandler then calls System.exit(50).
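The last step of that chain, an uncaught error on a non-main thread reaching a handler that picks the process exit code, can be sketched as follows. This is an illustration only: UncaughtExitDemo and its handler are hypothetical, the code 50 is taken from the executor message in this issue, and the handler records the code instead of calling System.exit so the demo can run to completion.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class UncaughtExitDemo {
    // Exit code reported in this issue for an uncaught exception on an executor.
    static final int UNCAUGHT_EXCEPTION_EXIT_CODE = 50;

    /** Runs body on a new thread; returns the exit code the handler would request, or 0. */
    public static int exitCodeFor(Runnable body) {
        AtomicInteger requested = new AtomicInteger(0);
        Thread t = new Thread(body, "thread-stand-in");
        // Hypothetical handler: records the code rather than terminating the JVM.
        t.setUncaughtExceptionHandler(
            (thread, error) -> requested.set(UNCAUGHT_EXCEPTION_EXIT_CODE));
        t.start();
        try {
            t.join(); // the handler runs on the dying thread before join returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return requested.get();
    }

    public static void main(String[] args) {
        int code = exitCodeFor(() -> {
            throw new NoClassDefFoundError("org/lance/ipc/AsyncScanner");
        });
        System.out.println("handler would exit with code " + code);
        System.out.println("clean thread requests code " + exitCodeFor(() -> {}));
    }
}
```

Because the handler fires for any uncaught Throwable, the generic "exited with code 50" message carries no hint of the classloader problem; only the executor stderr does.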

This is documented JNI behavior (https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#wp1539). It bites every JNI library that doesn't pre-resolve jclass refs from a Java-thread context.

Suggested fix (upstream lance-jni)

Two canonical approaches:

  1. Cache global jclass refs while a Java frame is on the stack. When lance-jni first crosses Java → Rust, FindClass runs inside a native method and resolves through the classloader of the class that declared that method, which does see user libraries. Capture env->NewGlobalRef(env->FindClass("org/lance/ipc/AsyncScanner")) at that point (C++ JNI syntax), and use the cached global ref from the dispatcher thread thereafter. This is what most production JNI libraries (e.g. RocksDB, the netty-tcnative bindings, Arrow's own JNI) do.

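  2. Resolve via Thread.currentThread().getContextClassLoader().loadClass(...). The dispatcher already has a JNIEnv; call Thread.currentThread(), fetch its context classloader, then Class.forName(name, true, contextClassLoader). A natively attached thread has no parent Java thread to inherit a context classloader from, so this only helps if a loader that sees user libraries is captured on a Java thread first and installed on the dispatcher thread (for example with Thread.setContextClassLoader). Slower than (1) but with looser init-order constraints. Useful as a fallback when (1) misses a class.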

The panic site at dispatcher.rs:44:22 is the natural place to fix it — the panic message itself names the symptom.
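The two approaches can be sketched in plain Java as an analogy for what the Rust side would do through JNI. Everything here is an assumption made for illustration: CachedClassLookup, ScannerStandIn, cacheFrom, resolve and resolveOnWorker are hypothetical names, a static Class reference plays the role of a JNI global ref, and a worker thread whose context classloader is the JDK platform loader plays the dispatcher thread. It is not lance-jni's code or API.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CachedClassLookup {
    /** Hypothetical stand-in for org.lance.ipc.AsyncScanner. */
    public static class ScannerStandIn {}

    // Approach 1: a strong reference cached once, analogous to a JNI global ref.
    private static final AtomicReference<Class<?>> CACHED = new AtomicReference<>();

    /** Call on a thread whose loader can see the class; returns false if it cannot. */
    public static boolean cacheFrom(ClassLoader visibleLoader, String name) {
        try {
            CACHED.set(Class.forName(name, false, visibleLoader));
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Cached reference first (approach 1), context classloader as fallback (approach 2). */
    public static Class<?> resolve(String name) throws ClassNotFoundException {
        Class<?> cached = CACHED.get();
        if (cached != null && cached.getName().equals(name)) {
            return cached;
        }
        return Class.forName(name, false, Thread.currentThread().getContextClassLoader());
    }

    /** Runs resolve() on a fresh thread with context loader ctx; returns null if not found. */
    public static Class<?> resolveOnWorker(String name, ClassLoader ctx) {
        AtomicReference<Class<?>> result = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                result.set(resolve(name));
            } catch (ClassNotFoundException e) {
                result.set(null);
            }
        }, "dispatcher-stand-in");
        worker.setContextClassLoader(ctx);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        String name = ScannerStandIn.class.getName();
        ClassLoader restricted = ClassLoader.getPlatformClassLoader();
        ClassLoader visible = CachedClassLookup.class.getClassLoader();

        System.out.println("no cache, restricted ctx found: "
            + (resolveOnWorker(name, restricted) != null));
        System.out.println("no cache, captured ctx found:   "
            + (resolveOnWorker(name, visible) != null));
        cacheFrom(visible, name);
        System.out.println("cached, restricted ctx found:   "
            + (resolveOnWorker(name, restricted) != null));
    }
}
```

Once the reference is cached, the worker's own loader no longer matters, which is why approach 1 is the more robust of the two. The fallback only succeeds when a suitable loader has been installed on the worker thread beforehand.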

Environment

  • Databricks Runtime: 18.2.x-scala2.13 (Spark 4.1, Java 21, Arrow Java 18.x)
  • Cluster: Standard_D4s_v5 × 2 workers, default access mode (no isolation shared)
  • Library: cluster Libraries → Maven, no other classpath manipulation
  • Coord: org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4

The same exit-50 reproduces on DBR 16.4 LTS / 17.3 LTS with the matching bundle for those Spark versions; the underlying classloader behavior is identical across DBR versions.

Reproducer

Create a fresh DBR 18.2 cluster (defaults, 2 workers). Install org.lance:lance-spark-bundle-4.1_2.13:0.4.0-beta.4 via cluster Libraries → Maven. Restart. Run:

import org.apache.spark.sql.{RowFactory, SparkSession}
import org.apache.spark.sql.types._
import scala.util.Random
import scala.jdk.CollectionConverters._ // Scala 2.13 replacement for the deprecated JavaConverters

val spark = SparkSession.builder().getOrCreate()

def randomVec(rng: Random, dim: Int): Array[Float] = {
  val v = new Array[Float](dim); var i = 0
  while (i < dim) { v(i) = rng.nextFloat(); i += 1 }
  v
}

val Dim = 128
val N = 10000
val schema = new StructType(Array(
  StructField("rid", IntegerType, nullable = false),
  StructField(
    "rvec",
    ArrayType(FloatType, containsNull = false),
    nullable = false,
    new MetadataBuilder().putLong("arrow.fixed-size-list.size", Dim.toLong).build())))

val rng = new Random(1337)
val rows = (0 until N).map(i => RowFactory.create(Integer.valueOf(i), randomVec(rng, Dim)))
val df = spark.createDataFrame(rows.asJava, schema).repartition(2)

df.write.format("lance").save(s"/local_disk0/tmp/lance-repro-${System.currentTimeMillis}")

Outcome:

[TASK_FAILED_EXECUTOR_LOSS] Task failed due to executor loss:
ExecutorLostFailure (executor 4 exited caused by one of the running tasks)
Reason: Command exited with code 50, uncaught exception
SQLSTATE: XX000

Capturing the actual diagnostic requires cluster_log_conf={dbfs:{destination:dbfs:/cluster-logs/...}} at cluster create time, then waiting ~3 min after the failure for DBR's log delivery agent to flush executor stderr to DBFS. Without that conf, executor stderr is gone the moment the cluster terminates and the user only sees the generic Spark "executor exited code 50" message.

Bundle version matrix

The crash reproduces with these published bundles (all use lance-jni from lance-core 5.x / 6.x):

| bundle | underlying lance-core | DBR 18.2 result |
| --- | --- | --- |
| lance-spark-bundle-4.0_2.13:0.4.0-beta.{1..4} | 5.x / 6.x | crash |
| lance-spark-bundle-4.1_2.13:0.4.0-beta.4 | 6.0.0-beta.4 | crash |

These work on DBR 18.2, presumably because this codepath does not load AsyncScanner:

| bundle | underlying lance-core | DBR 18.2 result |
| --- | --- | --- |
| lance-spark-bundle-4.0_2.13:0.3.0-beta.3 | 4.1.0-beta.2 | works |
| lance-spark-bundle-*:0.4.0 (stable) | 4.0.0 | works |

So: lance-jni introduced an AsyncScanner JNI lookup somewhere between lance-core 4.x and 5.x. A git log lance-jni/src/dispatcher.rs since 4.x should locate the change — that's where the FindClass call needs the classloader fix.

Workaround

Pin to lance-spark-bundle-*:0.4.0 (or 0.3.0-beta.3) on Databricks. This forces lance-core 4.x and the dispatcher path that doesn't hit AsyncScanner.

Asks

  1. Is there an in-flight fix for the lance-jni dispatcher's FindClass to use cached global refs?
  2. The official Databricks doc (https://lance.org/integrations/spark/vendors/databricks/) currently recommends the latest bundle, which has this bug — would a pin to :0.4.0 until the fix lands be reasonable?

I have full executor logs (exec4-stderr.log showing the panic, driver-stdout.log for context) and can share them on request.
