Commit e91bdaa

committed
Updated doc
1 parent 68b660f commit e91bdaa

2 files changed

Lines changed: 94 additions & 1 deletion

landing/public/content/docs/embedding.md

Lines changed: 94 additions & 1 deletion
@@ -1,4 +1,97 @@
11
# Visual embedding workflow
2+
SmooSense uses [LanceDB](https://lancedb.com/) as its storage engine for embeddings and vector indexes.
3+
4+
LanceDB is built on Lance, a columnar storage format designed for AI and vector-search workloads.
5+
It offers **near-zero cost at idle** while being able to **scale up rapidly under spiky or bursty query loads**,
6+
making it well suited for interactive and exploratory AI use cases.
7+
8+
## Compute or ingest embeddings
9+
To work with embeddings, install SmooSense with the embedding feature:
10+
11+
```bash
12+
uv tool install -U "smoosense[emb]"
13+
```
14+
15+
### From images
16+
Run `sense-images ./images/*.jpg`. This runs a Python script that computes OpenAI CLIP and Facebook DINOv2
17+
embeddings, creates a Lance table, builds a vector index, and opens the table in your web browser.
18+
19+
### From videos
20+
Run `sense-videos ./videos/**/*.mp4`. This runs a Python script that computes an OpenAI CLIP embedding for the first
21+
frame of each video, creates a Lance table, builds a vector index, and opens the table in your web browser.
22+
23+
### From Parquet files
24+
We also provide a CLI tool to convert Parquet files to Lance.
25+
It also detects columns that contain equal-length float/double arrays, converts them to PyArrow `FixedSizeListArray`, and builds a vector index in the Lance table.
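
The detection step described above can be illustrated with a minimal sketch. It uses plain Python lists in place of Arrow columns, and the function names `fixed_vector_dim` and `should_build_index` are hypothetical, not part of SmooSense. The `dim > 10` threshold comes from the tool's help text; treating a null row as disqualifying is an assumption.

```python
from typing import Optional, Sequence

# From the help text: a vector index is built for embeddings with dim > 10.
MIN_INDEX_DIM = 10


def fixed_vector_dim(column: Sequence[Optional[Sequence[float]]]) -> Optional[int]:
    """Return the length shared by every row, or None if rows differ in length."""
    dim: Optional[int] = None
    for row in column:
        if row is None:
            return None  # assumption: a null row disqualifies the column
        if not all(isinstance(x, float) for x in row):
            return None  # only float arrays are candidates
        if dim is None:
            dim = len(row)
        elif len(row) != dim:
            return None  # variable-length column, not a fixed-size vector
    return dim


def should_build_index(column: Sequence[Optional[Sequence[float]]]) -> bool:
    """True when the column looks like an embedding worth indexing."""
    dim = fixed_vector_dim(column)
    return dim is not None and dim > MIN_INDEX_DIM
```

A column that passes this check is the kind that could then be stored as a fixed-size list, where every row occupies the same number of values.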
26+
27+
```bash
28+
parquet-to-lance --help
29+
30+
Usage: parquet-to-lance [OPTIONS] PARQUET_PATH LANCE_PATH
31+
32+
Convert a Parquet file to Lance format.
33+
34+
PARQUET_PATH: Input Parquet file
35+
LANCE_PATH: Output Lance table path
36+
• Parent directory = database
37+
• Basename = table name
38+
• Example: /db/my_table.lance → db=/db, table=my_table
39+
40+
Features:
41+
✦ Converts float[]/double[] to fixed-size arrays
42+
✦ Builds vector index for embeddings (dim > 10)
43+
✦ Appends as new version if table exists
44+
45+
Examples:
46+
parquet-to-lance data.parquet ./my_db/my_table.lance
47+
parquet-to-lance emb.parquet /data/lance_db/embeddings
48+
```
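
The `LANCE_PATH` convention in the help text (parent directory is the database, basename is the table name) can be sketched as follows. The helper name `split_lance_path` is hypothetical, and stripping a trailing `.lance` suffix is inferred from the `/db/my_table.lance` example rather than from the tool's source.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_lance_path(lance_path: str) -> Tuple[str, str]:
    """Split a Lance table path into (database directory, table name)."""
    path = PurePosixPath(lance_path)
    table = path.name
    # The help example maps my_table.lance to table=my_table, so drop the suffix.
    if table.endswith(".lance"):
        table = table[: -len(".lance")]
    return str(path.parent), table
```

For the help text's own example, `split_lance_path("/db/my_table.lance")` yields the database `/db` and the table `my_table`.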
49+
50+
## Similarity search with embedding
51+
SmooSense integrates vector search with the [Lance index](https://docs.lancedb.com/indexing/vector-index).
52+
When a vector index is found, you can run vector search with a single click.
53+
54+
![](/images/emb/emb-search.jpg)
55+
56+
Try it yourself: [link](https://demo.smoosense.ai/Table?tablePath=s3%3A%2F%2Fsmoosense-demo%2Fembedding%2Fphotos%2Fimages_table.lance)
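
To make concrete what a vector index speeds up, here is a minimal brute-force sketch of similarity search in plain Python. It is not how SmooSense or Lance performs the search: a vector index typically returns approximate neighbors without scanning every row. The choice of cosine similarity and the names `cosine_similarity` and `top_k` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Score every row against the query and return the k best (row index, score)."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The exhaustive scan costs time proportional to the number of rows per query, which is the cost an index is meant to avoid on large tables.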
57+
58+
## Interactive UMAP visualization
59+
[UMAP](https://umap-learn.readthedocs.io/) (Uniform Manifold Approximation and Projection) reduces high-dimensional embeddings
60+
to 2D coordinates for visualization while preserving the structure of your data.
61+
62+
SmooSense computes UMAP projections on-the-fly from your embedding columns and renders them as interactive scatter plots.
63+
64+
### Features
65+
- **Hover preview**: Hover over any point to see the image, audio, or video preview
66+
- **Lasso selection**: Draw a lasso to select multiple points and view them in a gallery
67+
- **Color by category**: Use a categorical column to color points by group (creates separate traces with legend)
68+
- **Color by value**: Use a numerical column to color points by continuous value (uses color scale)
69+
- **SQL filtering**: Filter data with SQL conditions before computing UMAP
70+
- **Adjustable parameters**: Fine-tune `n_neighbors` and `min_dist` to control the projection
71+
72+
### Parameters
73+
| Parameter | Range | Description |
74+
|-----------|-------|-------------|
75+
| `n_neighbors` | 2-100 | Controls local vs global structure. Low values (2-15) create tight clusters; high values (50-100) preserve global relationships |
76+
| `min_dist` | 0-1 | Controls point density. Low values (0-0.1) pack points tightly; high values (0.5-1) spread them out |
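
The ranges in the table can be checked before running a projection. The sketch below is an assumption about how one might validate the two parameters, not SmooSense's own code; the helper name `umap_params` is hypothetical. The defaults of 15 and 0.1 are the umap-learn defaults, and `n_components=2` matches the 2D projection described above.

```python
from typing import Dict, Union

# Ranges taken from the parameter table.
N_NEIGHBORS_RANGE = (2, 100)
MIN_DIST_RANGE = (0.0, 1.0)


def umap_params(n_neighbors: int = 15, min_dist: float = 0.1) -> Dict[str, Union[int, float]]:
    """Validate the two tunable parameters and return keyword arguments for UMAP."""
    lo_n, hi_n = N_NEIGHBORS_RANGE
    if not lo_n <= n_neighbors <= hi_n:
        raise ValueError(f"n_neighbors must be in [{lo_n}, {hi_n}], got {n_neighbors}")
    lo_d, hi_d = MIN_DIST_RANGE
    if not lo_d <= min_dist <= hi_d:
        raise ValueError(f"min_dist must be in [{lo_d}, {hi_d}], got {min_dist}")
    return {"n_neighbors": n_neighbors, "min_dist": min_dist, "n_components": 2}
```

With umap-learn installed, the returned dictionary could be passed as `umap.UMAP(**umap_params(30, 0.2)).fit_transform(X)`, where `X` is an array of embeddings with one row per item.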
77+
78+
### Performance
79+
- UMAP computation runs in parallel using all CPU cores
80+
- For large datasets (>1,000 rows), SmooSense automatically samples to keep visualization responsive
81+
- Results include runtime and sampling info in the status bar
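
The sampling behavior mentioned above could look roughly like the following sketch. Uniform random sampling with a fixed seed is an assumption, since the source does not say how SmooSense picks rows, and the name `sample_rows` is hypothetical. The 1,000-row threshold is taken from the list above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Threshold from the performance notes: datasets above this size are sampled.
MAX_ROWS = 1000


def sample_rows(rows: Sequence[T], max_rows: int = MAX_ROWS, seed: int = 0) -> List[T]:
    """Return all rows if small enough, else a reproducible uniform sample."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the projection stable across runs
    return rng.sample(rows, max_rows)
```

Sampling without replacement keeps each selected row unique, so every plotted point still maps back to exactly one row of the table.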
82+
83+
### Try UMAP visualization yourself
84+
Explore image embeddings with UMAP visualization. Note that this demo only shows the interactive visualization.
85+
For full functionality, run SmooSense on your own computer.
86+
87+
```tabs
88+
--- Images
89+
![demo](/example/emb-images)
90+
91+
--- Audio
92+
![demo](/example/emb-audio)
93+
```
94+
295
396
## Balance map
497
People turn to semantic balance analysis using embeddings when they need to understand whether their dataset is fair,
@@ -29,7 +122,7 @@ maxColumns=2 height=300px
29122
/images/emb/example-imbalance.jpg | Example of imbalance. When ratios differ, the color shifts toward the dominant group, making imbalances immediately visible.
30123
```
31124
32-
### Try yourself
125+
### Try BalanceMap yourself
33126
Zoom in and drag around, and you can easily find a blue cluster where all the data is in the train fold, with no test or validation data at all.
34127
35128
![demo](/Table?tablePath=s3://smoosense-demo/datasets/COCO2017/images-emb-2d.parquet&activeTab=Plot&activePlotTab=BalanceMap&columnForGalleryVisual=coco_url&columnForGalleryCaption=fold&bubblePlotXColumn=emb_x&bubblePlotYColumn=emb_y&bubblePlotBreakdownColumn=fold)