---
title: "Distance-based data exploration"
short_description: "Efficient visualization and clustering of high-dimensional data with Qdrant"
description: "Explore your data from a new angle with Qdrant's tools for dimensionality reduction, clustering, and visualization."
social_preview_image: /articles_data/distance-based-exploration/social-preview.jpg
preview_dir: /articles_data/distance-based-exploration/preview
weight: -250
author: Andrey Vasnetsov
date: 2025-03-11T12:00:00+03:00
draft: false
keywords:
  - clustering
  - dimensionality reduction
  - visualization
category: data-exploration
---

## Hidden Structure

When working with large collections of documents, images, or other unstructured data, it often becomes useful to understand the big picture.
Examining data points individually is not always the best way to grasp the structure of the data.

{{< figure src="/articles_data/distance-based-exploration/no-context-data.png" alt="Data visualization" caption="Data points without context, pretty much useless" >}}

Just as numbers in a table take on meaning when plotted on a chart, visualizing the distances (similar vs. dissimilar) between unstructured data items can reveal hidden structures and patterns.

{{< figure src="/articles_data/distance-based-exploration/data-on-chart.png" alt="Data visualization" caption="Visualized as a chart, very intuitive" >}}

There are many tools for investigating data similarity, and Qdrant's [1.12 release](https://qdrant.tech/blog/qdrant-1.12.x/) made it much easier to start such an investigation. With the new [Distance Matrix API](/documentation/concepts/explore/#distance-matrix), Qdrant takes care of the most computationally expensive part of the process: calculating the distances between data points.

In many implementations, computing the distance matrix is part of the clustering or visualization process itself, requiring either brute-force computation or building a temporary index. With Qdrant, the data is already indexed, so the distance matrix can be computed relatively cheaply.
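
Under the hood this is a single request. Roughly, the REST call looks like the sketch below (the exact request schema is described in the Distance Matrix API documentation linked above; the Python examples later in this article use the client wrapper for the same endpoint):

```http
POST /collections/{collection_name}/points/search/matrix/offsets
{
  "sample": 1000,
  "limit": 20
}
```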

In this article, we will explore several methods for data exploration using the Distance Matrix API.

## Dimensionality Reduction

Initially, we might want to visualize an entire dataset, or at least a large portion of it, at a glance. However, high-dimensional data cannot be directly visualized. We must apply dimensionality reduction techniques to convert data into a lower-dimensional representation while preserving important data properties.

In this article, we will use [UMAP](https://github.com/lmcinnes/umap) as our dimensionality reduction algorithm.

Here is a **very** simplified but intuitive explanation of UMAP:

1. *Randomly generate points in 2D space*: Assign a random 2D point to each high-dimensional point.
2. *Compute the distance matrix for the high-dimensional points*: Calculate distances between all pairs of points.
3. *Compute the distance matrix for the 2D points*: Do the same as in step 2, but for the 2D points.
4. *Match both distance matrices*: Adjust the 2D points to minimize the difference between the two matrices.

{{< figure src="/articles_data/distance-based-exploration/umap.png" alt="UMAP" caption="Canonical example of UMAP results, [source](https://github.com/lmcinnes/umap?tab=readme-ov-file#performance-and-examples)" >}}

UMAP preserves the relative distances between high-dimensional points; the actual coordinates are not essential. If we already have the distance matrix, step 2 can be skipped entirely.
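
To make step 4 more concrete, here is a tiny NumPy sketch of the idea: start from random 2D points and repeatedly nudge them so their pairwise distances move closer to the high-dimensional ones. This toy version is closer to classical multidimensional scaling than to real UMAP (which works on a fuzzy neighbor graph rather than raw distances), but the intuition is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(200, 64))  # stand-in for real embeddings
points_2d = rng.normal(size=(200, 2))  # step 1: random 2D points

# Step 2: distance matrix of the high-dimensional points
target = np.linalg.norm(high_dim[:, None] - high_dim[None, :], axis=-1)

for _ in range(300):
    diff = points_2d[:, None] - points_2d[None, :]
    dist_2d = np.linalg.norm(diff, axis=-1) + 1e-9  # step 3: 2D distances
    # Step 4: move each point to reduce the mismatch between the two matrices
    coeff = (dist_2d - target) / dist_2d
    grad = (coeff[:, :, None] * diff).sum(axis=1)
    points_2d -= 0.01 * grad / len(points_2d)
```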

Let's use Qdrant to calculate the distance matrix and apply UMAP to it.
We will use one of the default datasets, which is perfect for experimenting in Qdrant: the [Midjourney Styles dataset](https://midlibrary.io/).

Use this command to download and import the dataset into Qdrant:

```http
PUT /collections/midlib/snapshots/recover
{
  "location": "http://snapshots.qdrant.io/midlib.snapshot"
}
```
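
If you prefer to stay in Python, a rough equivalent of the HTTP call above uses the snapshot recovery method of the client (the `client` object is created in the environment setup below):

```python
client.recover_snapshot(
    collection_name="midlib",
    location="http://snapshots.qdrant.io/midlib.snapshot",
)
```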

<details>
<summary>We also need to prepare our Python environment:</summary>

```bash
pip install umap-learn seaborn matplotlib qdrant-client
```

Import the necessary libraries:

```python
# Used to talk to Qdrant
from qdrant_client import QdrantClient
# Package with the original UMAP implementation
from umap import UMAP
# Sparse matrix support from SciPy
from scipy.sparse import csr_matrix
# For visualization
import seaborn as sns
```

Establish a connection to Qdrant:

```python
client = QdrantClient("http://localhost:6333")
```

</details>

After this is done, we can compute the distance matrix:

```python
# Request the distance matrix from Qdrant.
# The `_offsets` suffix defines the format of the output matrix.
result = client.search_matrix_offsets(
    collection_name="midlib",
    sample=1000,  # Select a subset of the data, as the whole dataset might be too large
    limit=20,  # For performance reasons, limit the number of closest neighbors to consider
)

# Convert the distance matrix to a SciPy sparse matrix
matrix = csr_matrix(
    (result.scores, (result.offsets_row, result.offsets_col))
)

# Make the matrix symmetric, as UMAP expects.
# A distance matrix is always symmetric, but Qdrant only computes half of it.
matrix = matrix + matrix.T
```

Now we can apply UMAP to the distance matrix:

```python
umap = UMAP(
    metric="precomputed",  # We provide a ready-made distance matrix
    n_components=2,  # Output dimensionality
    n_neighbors=20,  # Same as the `limit` in search_matrix_offsets
)

vectors_2d = umap.fit_transform(matrix)
```

That's all it takes to get a 2D representation of the data.
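
To take a quick look at the result yourself, a minimal scatter plot with the seaborn import from above is enough (this is just one way to draw it; the chart below may use additional styling):

```python
import matplotlib.pyplot as plt

sns.scatterplot(x=vectors_2d[:, 0], y=vectors_2d[:, 1], s=10)
plt.show()
```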

{{< figure src="/articles_data/distance-based-exploration/umap-midlib.png" alt="UMAP on Midlib" caption="UMAP applied to Midlib dataset" >}}

<aside role="status">An interactive version of this plot is available in the <a href="https://qdrant.tech/documentation/web-ui/">Qdrant Web UI</a>!</aside>

UMAP isn't the only algorithm compatible with our Distance Matrix API. For example, `scikit-learn` also offers:

- [Isomap](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html) - non-linear dimensionality reduction through isometric mapping.
- [SpectralEmbedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html) - forms an affinity matrix from the specified function and applies spectral decomposition to the corresponding graph Laplacian.
- [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) - a well-known algorithm for dimensionality reduction.

## Clustering

Another approach to understanding the structure of the data is clustering: grouping similar items together.

*Note that there is no universally best clustering criterion or algorithm.*

{{< figure src="/articles_data/distance-based-exploration/clustering.png" alt="Clustering" caption="Clustering example, [source](https://scikit-learn.org/)" width="80%" >}}

Many clustering algorithms accept a precomputed distance matrix as input, so we can use the same distance matrix we calculated before.

Let's consider a simple example of clustering the Midlib dataset with the **KMeans algorithm**.

From the [scikit-learn.cluster documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) we know that the `fit()` method of the KMeans algorithm accepts the following input:

> `X : {array-like, sparse matrix} of shape (n_samples, n_features)`:
> Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

So we can reuse the `matrix` from the previous example:

```python
from sklearn.cluster import KMeans

# Initialize KMeans with 10 clusters
kmeans = KMeans(n_clusters=10)

# Get the index of the cluster each sample belongs to
cluster_labels = kmeans.fit_predict(matrix)
```

With this simple code, we have clustered the data into 10 clusters, while the main CPU-intensive part of the process was done by Qdrant.

{{< figure src="/articles_data/distance-based-exploration/clustering-midlib.png" alt="Clustering on Midlib" caption="Clustering applied to Midlib dataset" >}}

<details>
<summary>How to plot this chart</summary>

```python
sns.scatterplot(
    # Coordinates obtained from UMAP
    x=vectors_2d[:, 0], y=vectors_2d[:, 1],
    # Color datapoints by cluster
    hue=cluster_labels,
    palette=sns.color_palette("pastel", 10),
    legend="full",
)
```
</details>
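
The number of clusters is a parameter you have to pick yourself. One rough way to choose it (a sketch, not a recipe) is to compare the KMeans inertia for a few values of `n_clusters` and look for an "elbow" where improvements start to flatten out:

```python
# Compare how tightly samples fit their clusters for different cluster counts
for k in (5, 10, 20, 40):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(matrix)
    print(k, km.inertia_)
```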

## Graphs

Clustering and dimensionality reduction both aim to provide a more transparent overview of the data.
However, they share a common characteristic: they require a training step before the results can be visualized.

This also means that introducing new data points requires re-running the training step, which may be computationally expensive.

Graphs offer an alternative approach to data exploration, enabling direct, interactive visualization of relationships between data points.
In a graph representation, each data point is a node, and similarities between data points are represented as edges connecting the nodes.

Such a graph can be rendered in real time using [force-directed layout](https://en.wikipedia.org/wiki/Force-directed_graph_drawing) algorithms, which minimize the system's energy by repositioning nodes dynamically: the more similar two data points are, the stronger the edge between them.
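
The Qdrant Web UI does this for you, but if you want to experiment with such a layout yourself, here is a rough sketch using the pairs format of the Distance Matrix API together with `networkx` (the library choice is our assumption here; any graph library with a spring layout would work):

```python
import networkx as nx
import matplotlib.pyplot as plt

# The `_pairs` format returns edges: two point ids plus a similarity score
result = client.search_matrix_pairs(
    collection_name="midlib",
    sample=100,
    limit=5,
)

graph = nx.Graph()
for pair in result.pairs:
    graph.add_edge(pair.a, pair.b, weight=pair.score)

# Force-directed (spring) layout: heavier edges pull nodes closer together
positions = nx.spring_layout(graph, weight="weight")
nx.draw(graph, positions, node_size=20)
plt.show()
```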

Adding new data points to the graph is as straightforward as inserting new nodes and edges without the need to re-run any training steps.

In practice, rendering a graph for an entire dataset at once may be computationally expensive and overwhelming for the user. Therefore, let's explore a few strategies to address this issue.

### Expanding from a single node

This is the simplest approach: start with a single node and expand the graph by adding its most similar neighbors.

{{< figure src="/articles_data/distance-based-exploration/graph.gif" alt="Graph" caption="Graph representation of the data" >}}

<aside role="status">An interactive version of this plot is available in the <a href="https://qdrant.tech/documentation/web-ui/">Qdrant Web UI</a>!</aside>

### Sampling from a collection

Expanding a single node works well if you want to explore the neighbors of one point, but what if you want to explore the whole dataset?
If your dataset is small enough, you can render the relations for all data points at once, but in practice that is rarely the case.

Instead, we can sample a subset of the data and render the graph for this subset.
This way, we can get a good overview of the data without overwhelming the user with too much information.

Let's try to do so in [Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool):

```json
{
  "limit": 5, # node neighbors to consider
  "sample": 100 # nodes
}
```

{{< figure src="/articles_data/distance-based-exploration/graph-sampled.png" alt="Graph" caption="Graph representation of the data ([Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool))">}}

This graph captures some high-level structure of the data, but as you might have noticed, it is quite noisy.
This is because the differences in similarities are relatively small, and they can be overwhelmed by the stretches and compressions of the force-directed layout algorithm.

To make the graph more readable, let's concentrate on the most important similarities and build a so-called [Minimum/Maximum Spanning Tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree).

```json
{
  "limit": 5,
  "sample": 100,
  "tree": true
}
```

{{< figure src="/articles_data/distance-based-exploration/spanning-tree.png" alt="Graph" caption="Spanning tree of the graph ([Qdrant's Graph Exploration Tool](https://qdrant.tech/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool))" width="80%" >}}

This algorithm will only keep the most important edges and remove the rest while keeping the graph connected.
By doing so, we can reveal clusters of the data and the most important relations between them.
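
If you built the `networkx` graph from the earlier sketch, the same trick takes one extra line; since our scores are similarities, we keep the *strongest* edges (again, an illustrative sketch using the graph built above):

```python
# Keep only the strongest edges while preserving connectivity
tree = nx.maximum_spanning_tree(graph, weight="weight")

positions = nx.spring_layout(tree, weight="weight")
nx.draw(tree, positions, node_size=20)
plt.show()
```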

In some sense, this is similar to hierarchical clustering, but with the ability to interactively explore the data.
Another analogy might be a dynamically constructed mind map.

## Conclusion

Vector similarity goes beyond looking up the nearest neighbors: it is also a powerful tool for data exploration.
Many algorithms can construct human-readable representations of the data, and Qdrant makes using them easy.

Several data exploration instruments are available in the Qdrant Web UI ([Visualization and Graph Exploration Tools](https://qdrant.tech/articles/web-ui-gsoc/)), and for more advanced use cases, you can use the Distance Matrix API directly.

Try it with your data and see what hidden structures you can reveal!