[Feature Request]: HNSW index pruning #2594

Open
@tazarov

Description

Describe the problem

HNSW index growth is unbounded (by design, since graph removals are expensive). Chroma currently uses the default configuration for allow_replace_deleted, which is set to False.

A frequently changing index (e.g., many adds/deletes, and possibly updates) causes the HNSW index to grow, and the growth is not limited to the storage footprint: the whole index, including deleted IDs, is also loaded into memory.
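For reference, a minimal hnswlib-level sketch of what this flag does (the parameters and random data below are illustrative, not Chroma's actual configuration): marking an element deleted only flags it, and deleted slots are reused only when the index was created with allow_replace_deleted=True and new items are added with replace_deleted=True.

import numpy as np
import hnswlib

dim, n = 384, 500
index = hnswlib.Index(space="l2", dim=dim)
# allow_replace_deleted must be set when the index is created; Chroma
# currently leaves it at the hnswlib default of False.
index.init_index(max_elements=n, ef_construction=200, M=16,
                 allow_replace_deleted=True)

index.add_items(np.random.rand(n, dim).astype(np.float32), ids=np.arange(n))
for label in range(n):
    index.mark_deleted(label)  # flags the element; nothing is reclaimed

# With replace_deleted=True the new batch reuses the deleted slots, so the
# element count (and the persisted index size) stays at n instead of 2n.
index.add_items(np.random.rand(n, dim).astype(np.float32),
                ids=np.arange(n, 2 * n), replace_deleted=True)
print(index.get_current_count())  # 500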

Describe the proposed solution

Three options with varying difficulty:

  • Enable allow_replace_deleted (low complexity, and backward compatible if we add it to the default collection config) - note: the performance of this approach should be measured (see the hnswlib sketch above)
  • Rebuild the index (medium complexity) - add a new CLI command to rebuild the binary index by copying all active vectors and labels to a new index - note: slow and memory-consuming (at most 2x), but might (to be tested) offer better performance (a collection-level workaround is sketched after this list)
  • Add pruning functionality to the HNSW lib directly (high complexity)
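As a user-level illustration of the rebuild option (a workaround sketch, not the proposed CLI command; rebuild_collection and the hnsw_leak1_rebuilt name are hypothetical, while the client and hnsw_leak1 come from the reproduction below): copy all active records into a fresh collection, then drop the old one together with its bloated binary index.

import chromadb

client = chromadb.PersistentClient("hnsw_leak")


def rebuild_collection(name: str, rebuilt_name: str) -> None:
    """Copy all active records of `name` into a fresh collection (hypothetical helper)."""
    old = client.get_collection(name)
    new = client.get_or_create_collection(rebuilt_name)
    batch, offset = 500, 0
    while True:
        records = old.get(include=["embeddings", "documents", "metadatas"],
                          limit=batch, offset=offset)
        if not records["ids"]:
            break
        # get() returns metadatas as None entries when none were stored;
        # pass None in that case so add() validation does not complain.
        metadatas = records["metadatas"]
        new.add(ids=records["ids"],
                embeddings=records["embeddings"],
                documents=records["documents"],
                metadatas=metadatas if any(m for m in metadatas) else None)
        offset += batch
    # Deleting the old collection removes its oversized HNSW index from disk.
    client.delete_collection(name)


rebuild_collection("hnsw_leak1", "hnsw_leak1_rebuilt")

Note that the data ends up under a new collection name; reclaiming the original name requires a second copy (or deleting the original first and copying back from the rebuilt one).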

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

import os
import uuid

import psutil
import chromadb

client = chromadb.PersistentClient("hnsw_leak")


def get_col():
    return client.get_or_create_collection("hnsw_leak1")


def add_items(ids: list, documents: list):
    col = get_col()
    col.add(ids=ids, documents=documents)


def delete_items(ids: list):
    col = get_col()
    col.delete(ids=ids)


def query_items(query: str):
    col = get_col()
    return col.query(query_texts=[query], n_results=10)


def get_process_memory_usage() -> float:
    """Resident set size of the current process in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


# Repeatedly add, query, and delete batches of 500 documents; process memory
# and the on-disk index keep growing even though the collection ends up empty.
for i in range(100):
    items = {
        "ids": [f"{uuid.uuid4()}" for _ in range(500)],
        "documents": [f"this document is about {uuid.uuid4()}" for _ in range(500)]
    }
    add_items(ids=items["ids"], documents=items["documents"])
    print("After add", get_process_memory_usage())
    query_items("this document is about 4")
    print("After query", get_process_memory_usage())
    delete_items(items["ids"])
    print("After delete", get_process_memory_usage())

# Re-add one batch so the collection holds 500 embeddings again before
# comparing the index size on disk.
add_items(ids=items["ids"], documents=items["documents"])

At this point, measuring the size of the HNSW index dir reveals that it is much larger than that of a similar collection with only 500 added embeddings.
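A quick way to check this (a sketch; the UUID-named subdirectory under the persist dir is collection-specific, so substitute your own):

import os


def dir_size_mb(path: str) -> float:
    """Total size of all files under path, in MiB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 / 1024


print(dir_size_mb("hnsw_leak/83d9c2ed-5d41-46db-95fb-ad5b1bdcb780/"))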

import os

import hnswlib
import psutil


def get_process_memory_usage() -> float:
    """Resident set size of the current process in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


print("Before loading", get_process_memory_usage())
index = hnswlib.Index(space='l2', dim=384)
# is_persistent_index is specific to Chroma's hnswlib fork (chroma-hnswlib);
# the path is the collection's persist directory from the run above.
index.load_index("hnsw_leak/83d9c2ed-5d41-46db-95fb-ad5b1bdcb780/",
                 is_persistent_index=True, max_elements=501)
print("After loading", get_process_memory_usage())
