---
title: "Distance-based data exploration"
short_description: "Efficient visualization and clustering of high-dimensional data with Qdrant"
description: "Explore your data from a new angle with Qdrant's tools for dimensionality reduction, clustering, and visualization."
social_preview_image: /articles_data/distance-based-exploration/social-preview.jpg
preview_dir: /articles_data/distance-based-exploration/preview
weight: -250
author: Andrey Vasnetsov
date: 2025-03-11T12:00:00+03:00
draft: false
keywords:
  - clustering
  - dimensionality reduction
  - visualization
category: data-exploration
---

## Hidden Structure

When working with large collections of documents, images, or other arrays of unstructured data, it often becomes useful to understand the big picture.
Examining data points individually is not always the best way to grasp the structure of the data.

{{< figure src="/articles_data/distance-based-exploration/no-context-data.png" alt="Data visualization" caption="Data points without context, pretty much useless" >}}

Just as numbers in a table gain meaning when plotted on a chart, visualizing the distances (similar/dissimilar) between unstructured data items can reveal hidden structures and patterns.

{{< figure src="/articles_data/distance-based-exploration/data-on-chart.png" alt="Data visualization" caption="Visualized chart, very intuitive" >}}

There are many tools for investigating data similarity, and Qdrant's [1.12 release](https://qdrant.tech/blog/qdrant-1.12.x/) made it much easier to start this investigation. With the new [Distance Matrix API](/documentation/concepts/explore/#distance-matrix), Qdrant handles the most computationally expensive part of the process: calculating the distances between data points.

In many implementations, the distance matrix calculation is part of the clustering or visualization process itself, requiring either brute-force computation or building a temporary index. With Qdrant, the data is already indexed, so the distance matrix can be computed relatively cheaply.

In this article, we will explore several methods of data exploration built on top of the Distance Matrix API.

## Dimensionality Reduction

Initially, we might want to visualize an entire dataset, or at least a large portion of it, at a glance. However, high-dimensional data cannot be visualized directly. We must apply dimensionality reduction techniques to convert the data into a lower-dimensional representation while preserving its important properties.

In this article, we will use [UMAP](https://github.com/lmcinnes/umap) as our dimensionality reduction algorithm.

Here is a **very** simplified but intuitive explanation of UMAP:

1. *Randomly generate points in 2D space*: Assign a random 2D point to each high-dimensional point.
2. *Compute the distance matrix for the high-dimensional points*: Calculate distances between all pairs of points.
3. *Compute the distance matrix for the 2D points*: Proceed as in step 2.
4. *Match both distance matrices*: Adjust the 2D points to minimize the differences.

{{< figure src="/articles_data/distance-based-exploration/umap.png" alt="UMAP" caption="Canonical example of UMAP results, [source](https://github.com/lmcinnes/umap?tab=readme-ov-file#performance-and-examples)" >}}

UMAP only needs the relative distances between high-dimensional points; their actual coordinates are not essential. So if we already have the distance matrix, step 2 can be skipped entirely.

Let's use Qdrant to calculate the distance matrix and apply UMAP.
We will use the [Midjourney Styles dataset](https://midlibrary.io/), one of the default datasets that is perfect for experimenting with Qdrant.

Use this command to download and import the dataset into Qdrant:

```http
PUT /collections/midlib/snapshots/recover
{
  "location": "http://snapshots.qdrant.io/midlib.snapshot"
}
```
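
If you prefer to stay in Python, the same snapshot recovery can be triggered through the client. This is a sketch assuming a Qdrant instance running locally; `recover_snapshot` is the Python client's counterpart of the HTTP call above:

```python
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")

# Restore the `midlib` collection from the public snapshot
client.recover_snapshot(
    collection_name="midlib",
    location="http://snapshots.qdrant.io/midlib.snapshot",
)
```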

<details>
<summary>We also need to prepare our Python environment:</summary>

```bash
pip install umap-learn seaborn matplotlib qdrant-client
```

Import the necessary libraries:

```python
# Used to talk to Qdrant
from qdrant_client import QdrantClient
# Package with the original UMAP implementation
from umap import UMAP
# Sparse matrix implementation from SciPy
from scipy.sparse import csr_matrix
# For visualization
import seaborn as sns
```

Establish a connection to Qdrant:

```python
client = QdrantClient("http://localhost:6333")
```

</details>

After this is done, we can compute the distance matrix:

```python
# Request the distance matrix from Qdrant.
# The `_offsets` suffix defines the format of the output matrix.
result = client.search_matrix_offsets(
    collection_name="midlib",
    sample=1000,  # Select a subset of the data, as the whole dataset might be too large
    limit=20,     # For performance reasons, limit the number of closest neighbors to consider
)

# Convert the distance matrix to a Python-native sparse format
matrix = csr_matrix(
    (result.scores, (result.offsets_row, result.offsets_col))
)

# Make the matrix symmetric, as UMAP expects.
# A distance matrix is always symmetric, but Qdrant only computes half of it.
matrix = matrix + matrix.T
```
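
The `_offsets` format maps naturally onto SciPy's sparse constructors. If you prefer explicit pairs, the client also exposes a pairs-based variant; a hedged sketch, assuming each returned pair carries the two point IDs and their score:

```python
# Same request, but the response is a flat list of scored pairs
pairs_result = client.search_matrix_pairs(
    collection_name="midlib",
    sample=1000,
    limit=20,
)

# Inspect a few pairs: two point IDs and the score between them
for pair in pairs_result.pairs[:3]:
    print(pair.a, pair.b, pair.score)
```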

Now we can apply UMAP to the distance matrix:

```python
umap = UMAP(
    metric="precomputed",  # We provide a ready-made distance matrix
    n_components=2,        # Output dimensionality
    n_neighbors=20,        # Same as the `limit` in `search_matrix_offsets`
)

vectors_2d = umap.fit_transform(matrix)
```

That's all we need to obtain a 2D representation of the data.
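
To actually see it, a minimal sketch using the seaborn import from above is enough:

```python
# Plot the 2D projection; each point is one item from the sample
sns.scatterplot(x=vectors_2d[:, 0], y=vectors_2d[:, 1])
```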

{{< figure src="/articles_data/distance-based-exploration/umap-midlib.png" alt="UMAP on Midlib" caption="UMAP applied to the Midlib dataset" >}}

<aside role="status">An interactive version of this plot is available in the <a href="https://qdrant.tech/documentation/web-ui/">Qdrant Web UI</a>!</aside>

UMAP isn't the only algorithm compatible with our Distance Matrix API. For example, `scikit-learn` also offers (see the sketch after this list):

- [Isomap](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html) - Non-linear dimensionality reduction through Isometric Mapping.
- [SpectralEmbedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html) - Forms an affinity matrix given by the specified function and applies spectral decomposition to the corresponding graph Laplacian.
- [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) - A well-known algorithm for dimensionality reduction.
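
As a sketch, here is how `SpectralEmbedding` could consume the same sparse matrix. Its `precomputed_nearest_neighbors` affinity interprets the input as a sparse graph of precomputed distances; the parameter values here are assumptions chosen to match the earlier request:

```python
from sklearn.manifold import SpectralEmbedding

embedding = SpectralEmbedding(
    n_components=2,
    affinity="precomputed_nearest_neighbors",
    n_neighbors=20,  # should not exceed the `limit` used in `search_matrix_offsets`
)
spectral_2d = embedding.fit_transform(matrix)
```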

## Clustering

Another approach to understanding the structure of the data is clustering: grouping similar items together.

*Note that there is no universally best clustering criterion or algorithm.*

{{< figure src="/articles_data/distance-based-exploration/clustering.png" alt="Clustering" caption="Clustering example, [source](https://scikit-learn.org/)" width="80%" >}}

Many clustering algorithms accept a precomputed distance matrix as input, so we can reuse the distance matrix we calculated before.

Let's consider a simple example of clustering the Midlib dataset with the **KMeans algorithm**.

From the [scikit-learn.cluster documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) we know that the `fit()` method of KMeans accepts as input:

> `X : {array-like, sparse matrix} of shape (n_samples, n_features)`:
> Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

So we can reuse `matrix` from the previous example (KMeans will treat each row of the distance matrix as a feature vector, grouping points with similar distance profiles):

```python
from sklearn.cluster import KMeans

# Initialize KMeans with 10 clusters
kmeans = KMeans(n_clusters=10)

# Assign each sample the index of the cluster it belongs to
cluster_labels = kmeans.fit_predict(matrix)
```

With this simple code, we have clustered the data into 10 clusters, while the main CPU-intensive part of the process was done by Qdrant.

{{< figure src="/articles_data/distance-based-exploration/clustering-midlib.png" alt="Clustering on Midlib" caption="Clustering applied to the Midlib dataset" >}}

<details>
<summary>How to plot this chart</summary>

```python
sns.scatterplot(
    # Coordinates obtained from UMAP
    x=vectors_2d[:, 0], y=vectors_2d[:, 1],
    # Color data points by cluster
    hue=cluster_labels,
    palette=sns.color_palette("pastel", 10),
    legend="full",
)
```

</details>
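
KMeans is not the only option. For instance, `DBSCAN` from the same package can consume the sparse matrix directly when `metric="precomputed"`. A hedged sketch, assuming the scores behave as distances for your collection's metric; the `eps` value is a made-up starting point:

```python
from sklearn.cluster import DBSCAN

# Density-based clustering on the precomputed sparse distance matrix.
# Entries absent from the sparse matrix are never considered neighbors,
# and `eps` will need tuning for your data.
dbscan = DBSCAN(eps=0.5, metric="precomputed")
dbscan_labels = dbscan.fit_predict(matrix)
```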

## Graphs

Clustering and dimensionality reduction both aim to provide a more transparent overview of the data.
However, they share a common characteristic: they require a training step before the results can be visualized.

This also implies that introducing new data points requires re-running the training step, which may be computationally expensive.

Graphs offer an alternative approach to data exploration, enabling direct, interactive visualization of relationships between data points.
In a graph representation, each data point is a node, and similarities between data points are represented as edges connecting the nodes.

Such a graph can be rendered in real time using [force-directed layout](https://en.wikipedia.org/wiki/Force-directed_graph_drawing) algorithms, which minimize the system's energy by repositioning nodes dynamically: the more similar the data points are, the stronger the edges between them.
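
To get a feel for this outside the Web UI, here is a quick sketch with `networkx`, whose `spring_layout` is a force-directed algorithm. It assumes the sparse `matrix` from earlier and treats its non-zero entries as weighted edges; this is only an illustration, not how the Web UI renders its graphs:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Nodes are data points; non-zero matrix entries become weighted edges
graph = nx.from_scipy_sparse_array(matrix)

# The force-directed (spring) layout pulls connected nodes together
positions = nx.spring_layout(graph, seed=42)
nx.draw(graph, positions, node_size=10, width=0.2)
plt.show()
```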

Adding new data points to the graph is as straightforward as inserting new nodes and edges, without the need to re-run any training steps.

In practice, rendering a graph for an entire dataset at once may be computationally expensive and overwhelming for the user. Therefore, let's explore a few strategies to address this issue.

### Expanding from a single node

This is the simplest approach: we start with a single node and expand the graph by adding the most similar nodes to it.
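
In the Web UI this happens interactively, but the idea can be sketched with the Python client. The snippet below is a hedged illustration: `seed_id` is a hypothetical point ID, and passing an existing point's ID as `query` makes Qdrant search by that point's stored vector:

```python
seed_id = 0  # hypothetical starting point
frontier = [seed_id]
visited = {seed_id}
edges = []

for _ in range(2):  # expand the graph two hops from the seed
    next_frontier = []
    for point_id in frontier:
        neighbors = client.query_points(
            collection_name="midlib",
            query=point_id,  # search by the stored vector of this point
            limit=5,
        ).points
        for neighbor in neighbors:
            edges.append((point_id, neighbor.id))
            if neighbor.id not in visited:
                visited.add(neighbor.id)
                next_frontier.append(neighbor.id)
    frontier = next_frontier
```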

{{< figure src="/articles_data/distance-based-exploration/graph.gif" alt="Graph" caption="Graph representation of the data" >}}

<aside role="status">An interactive version of this plot is available in the <a href="https://qdrant.tech/documentation/web-ui/">Qdrant Web UI</a>!</aside>

### Sampling from a collection

Expanding a single node works well if you want to explore the neighbors of a single point, but what if you want to explore the whole dataset?
If your dataset is small enough, you can render the relations for all the data points at once. But that is a rare case in practice.

Instead, we can sample a subset of the data and render the graph for this subset.
This way, we can get a good overview of the data without overwhelming the user with too much information.

Let's try to do so in [Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool), sampling 100 nodes and considering the 5 nearest neighbors of each:

```json
{
  "limit": 5,
  "sample": 100
}
```

{{< figure src="/articles_data/distance-based-exploration/graph-sampled.png" alt="Graph" caption="Graph representation of the data ([Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool))">}}

This graph captures some high-level structure of the data, but as you might have noticed, it is quite noisy.
This is because the differences in similarities are relatively small, and they can be overwhelmed by the stretches and compressions of the force-directed layout algorithm.

To make the graph more readable, let's concentrate on the most important similarities and build a so-called [Minimum/Maximum Spanning Tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree):

```json
{
  "limit": 5,
  "sample": 100,
  "tree": true
}
```

{{< figure src="/articles_data/distance-based-exploration/spanning-tree.png" alt="Graph" caption="Spanning tree of the graph ([Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool))" width="80%" >}}

This algorithm keeps only the most important edges and removes the rest, while keeping the graph connected.
By doing so, we can reveal clusters of the data and the most important relations between them.
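
The same idea can be sketched outside the UI: SciPy can extract a minimum spanning tree directly from our sparse matrix. A hedged sketch; whether you want the minimum or maximum tree depends on whether your scores behave as distances or similarities:

```python
from scipy.sparse.csgraph import minimum_spanning_tree

# Keep the lightest set of edges that still connects all the nodes
mst = minimum_spanning_tree(matrix)
```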

In some sense, this is similar to hierarchical clustering, but with the ability to interactively explore the data.
Another analogy might be a dynamically constructed mind map.

<!--
We can talk about building graphs for search responses as well, but it would require experiments
and this article is stale already. Maybe later we can either extend this or create a new article.

**Using search response**

ToDo
-->

## Conclusion

Vector similarity goes beyond looking up nearest neighbors: it is a powerful tool for data exploration.
Many algorithms can construct human-readable representations of your data, and Qdrant makes using them easy.

Several data exploration instruments are available in the Qdrant Web UI ([Visualization and Graph Exploration Tools](https://qdrant.tech/articles/web-ui-gsoc/)), and for more advanced use cases, you can use the Distance Matrix API directly.

Try it with your data and see what hidden structures you can reveal!