I'm attempting to cluster ~600k short texts (reviews). The process runs fine until it logs that it's assigning noise points to clusters; up to that point it spends close to an hour embedding and clustering.
I successfully clustered a much smaller sample of this data (1,000 items).
I'm running lilac in Docker, using the latest tag, which currently appears to be lilacai/lilac:0.3.5. The host has a GeForce RTX 3090.
Here are the logs:
jinaai/jina-embeddings-v2-small-en using device: cuda:0
[local/reviews][1 shards] map "cluster_documents" to "('review__cluster',)"
Computing embeddings: 100%|██████████| 646314/646314 [35:16<00:00, 305.37it/s]
Computing embeddings took 2122.877s.
/usr/local/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
UMAP: Reducing dim from 512 to 5 of 646314 vectors took 1435.081s.
HDBSCAN: Clustering took 151.289s.
237724 noise points (36.8%) will be assigned to nearest cluster.
After this point the process appears to freeze, and the UI and server become very slow; the main thread appears to be very busy.
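For context on what that "noise points will be assigned to nearest cluster" step involves: I don't know how lilac implements it internally, but a hypothetical, vectorized sketch of the same idea (using scikit-learn rather than anything from lilac) looks like this. A tree-based nearest-neighbor query over ~237k noise points should finish in seconds, so if the real implementation does something closer to a per-point Python loop, that could plausibly be where the time goes.

```python
# Hypothetical sketch, NOT lilac's actual code: replace HDBSCAN noise
# labels (-1) with the label of the nearest clustered point, using a
# single batched nearest-neighbor query instead of a per-point loop.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_noise_to_nearest_cluster(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return a copy of `labels` where every -1 entry is replaced by the
    label of the nearest non-noise point in `points`."""
    labels = labels.copy()
    noise_mask = labels == -1
    # Nothing to do if there is no noise, or no clusters to assign to.
    if not noise_mask.any() or noise_mask.all():
        return labels
    # Index only the clustered points, then query all noise points at once.
    nn = NearestNeighbors(n_neighbors=1).fit(points[~noise_mask])
    _, idx = nn.kneighbors(points[noise_mask])
    labels[noise_mask] = labels[~noise_mask][idx[:, 0]]
    return labels
```

This is just to illustrate the expected cost of the step, not a claim about lilac's code path.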

Looking at the code, I initially suspected it was skipping label assignment here:
https://github.com/lilacai/lilac/blob/8e7418d533e6fba64ef1854e4112e66c035321bf/lilac/data/clustering.py#L386
if num_noisy > 0 and num_noisy < len(clusterer.labels_):
But plugging in the logged counts, num_noisy (237724) is both greater than 0 and less than len(clusterer.labels_) (646314), so the condition is actually true and the nearest-cluster reassignment branch does run. My read-through of the code falls apart after that point, and I can't tell where the process is spending its time.