[Bug]: HNSW Rust loader SIGSEGVs (exit 139) on truncated or graph-corrupt persisted segments instead of raising — two on-disk corruption shapes (1.5.8, macOS ARM64) #7238

@fuzemobi

Summary

Two distinct shapes of on-disk corruption in persisted HNSW vector segments cause chromadb_rust_bindings to crash the whole process with SIGSEGV (KERN_INVALID_ADDRESS at 0x84, exit 139) instead of raising a Python exception. The crash fires on the first collection.count() (and any other call that loads the vector segment), in the Rust segment-loader worker threads. PersistentClient(...) and list_collections() succeed, so the failure is undiagnosable from Python without a crash reporter / PYTHONFAULTHANDLER.
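For anyone hitting the same silent exit, here is a minimal sketch of capturing the Python-side frames at crash time with the standard-library faulthandler module, which has the same effect as setting PYTHONFAULTHANDLER=1. The persist path and collection name in the commented lines are placeholders, not values from this report.

```python
import faulthandler

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python traceback of
# every thread to stderr before the process dies.
faulthandler.enable(all_threads=True)

# Placeholder usage: any call that loads the vector segment goes after this point.
# import chromadb
# client = chromadb.PersistentClient(path="/path/to/persist_dir")
# client.get_collection("my_collection").count()
```

This only shows where Python was when the native code faulted. It does not turn the crash into an exception.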

Related (same "segfault instead of error" class, different corruption shapes): #6949, #7069, #6984.

Environment

  • ChromaDB: 1.5.8 (chromadb_rust_bindings.abi3.so)
  • Python: 3.14.5 (Homebrew, pipx venv)
  • OS: macOS 26.3.1 ARM64 (Apple Silicon, Mac16,5)
  • Embedding function: DefaultEmbeddingFunction (all-MiniLM-L6-v2, 384 dims)
  • PRAGMA integrity_check on chroma.sqlite3: ok (sqlite metadata fully intact in both cases)

Corruption shape 1 — truncated segment (interrupted flush)

Vector segment directory left in a partially-flushed state, most likely by the writing process being killed mid-flush:

data_level0.bin   167600 bytes   (~100 vectors present)
header.bin           100 bytes
length.bin           400 bytes
link_lists.bin         0 bytes   <-- empty
index_metadata.pickle              <-- MISSING entirely

The segment also had no row in max_seq_id, so the 627 WAL entries for its collection were never purged. Loading this segment → SIGSEGV.

Corruption shape 2 — corrupt graph in a fully-present segment

A 392 MB segment with all five files present and plausible sizes (data_level0.bin 385 MB, index_metadata.pickle 22 MB, length.bin 920 KB ≈ 230K vectors, link_lists.bin 1.9 MB). Loading it crashes with a byte read at the near-NULL address 0x84:

Exception Type:    EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0000000000000084
far: 0x0000000000000084  esr: 0x92000006 (Data Abort) byte read Translation fault

Thread 23 Crashed:
0   chromadb_rust_bindings.abi3.so  0x11a908000 + 25958564
1   chromadb_rust_bindings.abi3.so  0x11a908000 + 25943260
2   chromadb_rust_bindings.abi3.so  0x11a908000 + 25931232
3   chromadb_rust_bindings.abi3.so  0x11a908000 + 9018588
4   chromadb_rust_bindings.abi3.so  0x11a908000 + 10211600
...
15  chromadb_rust_bindings.abi3.so  0x11a908000 + 24364112
16  libsystem_pthread.dylib         _pthread_start + 136

Several sibling worker threads were in the same loader code path concurrently (parallel segment load); two of them show the same faulting frames. Binary UUID: 0140fcd8-8f81-3a8b-80fa-0fec930e00d6.

Python-side stack at crash (faulthandler):

File "chromadb/api/rust.py", line 397 in _count
File "chromadb/api/models/Collection.py", line 55 in count

Reproduction

Shape 1 reproduces by simulating an interrupted flush on any persisted collection:

import os

import chromadb

# Run after creating and persisting a collection with a few hundred docs.
seg = "/path/to/persist_dir/<vector-segment-uuid>"

# Simulate an interrupted flush: empty link_lists.bin and drop the pickle.
open(os.path.join(seg, "link_lists.bin"), "wb").close()   # truncate to 0 bytes
os.remove(os.path.join(seg, "index_metadata.pickle"))

client = chromadb.PersistentClient(path="/path/to/persist_dir")
col = client.get_collection("my_collection")
col.count()   # SIGSEGV, exit 139; no Python exception is raised
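Because the crash takes down the whole interpreter, it can help to run the probing call in a child process so the caller survives and can report what happened. The sketch below is one way to do that with only the standard library. The `probe` helper and the snippet passed to it are hypothetical names for illustration, not part of chromadb.

```python
import signal
import subprocess
import sys

def probe(snippet: str, timeout: float = 120.0) -> str:
    """Run `snippet` in a child interpreter and describe how it exited."""
    proc = subprocess.run(
        # -X faulthandler makes the child print its Python stack on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by that signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    if proc.returncode != 0:
        return f"exited with {proc.returncode}"
    return "ok: " + proc.stdout.strip()

# Placeholder snippet; path and collection name are not from a real setup.
COUNT_SNIPPET = (
    "import chromadb;"
    "c = chromadb.PersistentClient(path='/path/to/persist_dir');"
    "print(c.get_collection('my_collection').count())"
)
```

Calling `probe(COUNT_SNIPPET)` on an affected persist directory should report a SIGSEGV kill rather than ending the calling process. Note that subprocess reports the signal as return code -11, while a shell shows the same event as exit 139 (128 + 11).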

Expected behavior

Malformed/truncated HNSW segment files should produce a catchable Python exception (e.g. InternalError: vector segment <uuid> failed to load: link_lists.bin truncated), ideally with guidance to rebuild. A validation pass over the five segment files (sizes/consistency against header.bin + pickle) before the graph walk would catch both shapes cheaply.
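As a rough illustration of the cheapest part of such a validation pass, the sketch below checks from Python that the five files named in this report exist and are non-empty. It is deliberately format-agnostic: it does not parse header.bin or the pickle, and treating any zero-byte file as a problem is an assumption based on the shape 1 listing. It would flag shape 1 but not shape 2, which needs consistency checks inside the loader itself. `check_segment_dir` is a hypothetical helper, not a chromadb API.

```python
import os

# File names as listed in this report for a persisted vector segment.
SEGMENT_FILES = (
    "header.bin",
    "data_level0.bin",
    "length.bin",
    "link_lists.bin",
    "index_metadata.pickle",
)

def check_segment_dir(seg_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious."""
    problems = []
    for name in SEGMENT_FILES:
        path = os.path.join(seg_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name}: missing")
        elif os.path.getsize(path) == 0:
            # Assumption: a zero-byte file signals an interrupted flush.
            problems.append(f"{name}: empty (0 bytes)")
    return problems
```

Running this over each segment directory before opening the client gives a way to raise a normal Python error instead of reaching the native loader.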

Workaround that recovered both collections (no data loss)

  1. Shape 1: quarantine (move out) the corrupt segment dir → loader starts fresh and replays the un-purged WAL → collection self-heals.
  2. Shape 2 (WAL already purged): export ids/documents/metadatas directly from chroma.sqlite3 (the embeddings and embedding_metadata tables; the document text is stored under the chroma:document key), quarantine the segment dir, delete_collection + create_collection, re-upsert (re-embed).
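The quarantine step used in both workarounds can be sketched as a move rather than a delete, so the corrupt files stay available for debugging. `quarantine_segment` and the `quarantine` folder next to the persist directory are hypothetical choices for illustration. The sketch assumes no process has the persist directory open while it runs.

```python
import shutil
import time
from pathlib import Path

def quarantine_segment(persist_dir: str, segment_uuid: str) -> Path:
    """Move a vector segment directory out of the persist dir; return its new path."""
    src = Path(persist_dir) / segment_uuid
    if not src.is_dir():
        raise FileNotFoundError(f"no segment directory at {src}")

    # Hypothetical location: a sibling folder outside the persist dir.
    dest_root = Path(persist_dir).parent / "quarantine"
    dest_root.mkdir(exist_ok=True)

    # Timestamp suffix so repeated quarantines of the same uuid do not collide.
    dest = dest_root / f"{segment_uuid}.{time.strftime('%Y%m%dT%H%M%S')}"
    shutil.move(str(src), str(dest))
    return dest
```

For shape 2, the export from chroma.sqlite3 should be done before moving anything.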
