[Feature Request]: HNSW index pruning #2594

Open
@tazarov

Description

Describe the problem

HNSW index growth is unbounded (by design, since graph removals are expensive). Chroma currently uses the default configuration for allow_replace_deleted, which is set to False.

A frequently changing index (e.g., many adds/deletes, and possibly updates) causes the HNSW index to grow, and the growth is not limited to the storage footprint: the whole index, including deleted IDs, is also loaded into memory.
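For reference, a minimal hnswlib-level sketch of what this flag does (the parameters and random data below are illustrative, not Chroma's actual configuration): marking an element deleted only flags it, and deleted slots are reused only when the index was created with allow_replace_deleted=True and new items are added with replace_deleted=True.

import numpy as np
import hnswlib

dim, n = 384, 500
index = hnswlib.Index(space="l2", dim=dim)
# allow_replace_deleted must be set when the index is created; Chroma
# currently leaves it at the hnswlib default of False.
index.init_index(max_elements=n, ef_construction=200, M=16,
                 allow_replace_deleted=True)

index.add_items(np.random.rand(n, dim).astype(np.float32), ids=np.arange(n))
for label in range(n):
    index.mark_deleted(label)  # flags the element; nothing is reclaimed

# With replace_deleted=True the new batch reuses the deleted slots, so the
# element count (and the persisted index size) stays at n instead of 2n.
index.add_items(np.random.rand(n, dim).astype(np.float32),
                ids=np.arange(n, 2 * n), replace_deleted=True)
print(index.get_current_count())  # 500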

Describe the proposed solution

Three options with varying difficulty:

  • Enable allow_replace_deleted (low complexity, and backward compatible if we add it to the default collection config) - note: the performance of this approach should be measured (see the hnswlib sketch above)
  • Rebuild the index (medium complexity) - add a new CLI command to rebuild the binary index by copying all active vectors and labels to a new index - note: slow and memory-consuming (at most 2x), but might (to be tested) offer better performance (a collection-level workaround is sketched after this list)
  • Add pruning functionality to the HNSW lib directly (high complexity)
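As a user-level illustration of the rebuild option (a workaround sketch, not the proposed CLI command; rebuild_collection and the hnsw_leak1_rebuilt name are hypothetical, while the client and hnsw_leak1 come from the reproduction below): copy all active records into a fresh collection, then drop the old one together with its bloated binary index.

import chromadb

client = chromadb.PersistentClient("hnsw_leak")


def rebuild_collection(name: str, rebuilt_name: str) -> None:
    """Copy all active records of `name` into a fresh collection (hypothetical helper)."""
    old = client.get_collection(name)
    new = client.get_or_create_collection(rebuilt_name)
    batch, offset = 500, 0
    while True:
        records = old.get(include=["embeddings", "documents", "metadatas"],
                          limit=batch, offset=offset)
        if not records["ids"]:
            break
        # get() returns metadatas as None entries when none were stored;
        # pass None in that case so add() validation does not complain.
        metadatas = records["metadatas"]
        new.add(ids=records["ids"],
                embeddings=records["embeddings"],
                documents=records["documents"],
                metadatas=metadatas if any(m for m in metadatas) else None)
        offset += batch
    # Deleting the old collection removes its oversized HNSW index from disk.
    client.delete_collection(name)


rebuild_collection("hnsw_leak1", "hnsw_leak1_rebuilt")

Note that the data ends up under a new collection name; reclaiming the original name requires a second copy (or deleting the original first and copying back from the rebuilt one).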

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

import os
import uuid

import psutil
import chromadb

client = chromadb.PersistentClient("hnsw_leak")


def get_col():
    return client.get_or_create_collection("hnsw_leak1")


def add_items(ids: list, documents: list):
    col = get_col()
    col.add(ids=ids, documents=documents)


def delete_items(ids: list):
    col = get_col()
    col.delete(ids=ids)


def query_items(query: str):
    col = get_col()
    return col.query(query_texts=[query], n_results=10)


def get_process_memory_usage() -> float:
    """Resident set size of the current process in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


# Repeatedly add, query, and delete batches of 500 documents; process memory
# and the on-disk index keep growing even though the collection ends up empty.
for i in range(100):
    items = {
        "ids": [f"{uuid.uuid4()}" for _ in range(500)],
        "documents": [f"this document is about {uuid.uuid4()}" for _ in range(500)]
    }
    add_items(ids=items["ids"], documents=items["documents"])
    print("After add", get_process_memory_usage())
    query_items("this document is about 4")
    print("After query", get_process_memory_usage())
    delete_items(items["ids"])
    print("After delete", get_process_memory_usage())

# Re-add one batch so the collection holds 500 embeddings again before
# comparing the index size on disk.
add_items(ids=items["ids"], documents=items["documents"])

At this point, measuring the size of the HNSW index dir reveals that it is much larger than that of a similar collection with only 500 added embeddings.
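A quick way to check this (a sketch; the UUID-named subdirectory under the persist dir is collection-specific, so substitute your own):

import os


def dir_size_mb(path: str) -> float:
    """Total size of all files under path, in MiB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 / 1024


print(dir_size_mb("hnsw_leak/83d9c2ed-5d41-46db-95fb-ad5b1bdcb780/"))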

import os

import hnswlib
import psutil


def get_process_memory_usage() -> float:
    """Resident set size of the current process in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


print("Before loading", get_process_memory_usage())
index = hnswlib.Index(space='l2', dim=384)
# is_persistent_index is specific to Chroma's hnswlib fork (chroma-hnswlib);
# the path is the collection's persist directory from the run above.
index.load_index("hnsw_leak/83d9c2ed-5d41-46db-95fb-ad5b1bdcb780/",
                 is_persistent_index=True, max_elements=501)
print("After loading", get_process_memory_usage())
