|
1 | 1 | # Visual embedding workflow |
| 2 | +SmooSense uses [LanceDB](https://lancedb.com/) as its storage engine for embeddings and vector indexes. |
| 3 | + |
| 4 | +LanceDB is a vector database built on Lance, a columnar storage format designed for AI and vector-search workloads.
| 5 | +It offers **near-zero cost at idle** while being able to **scale up rapidly under spiky or bursty query loads**, |
| 6 | +making it well suited for interactive and exploratory AI use cases. |
| 7 | + |
| 8 | +## Compute or ingest embedding |
| 9 | +To work with embeddings, install SmooSense with the `emb` extra:
| 10 | + |
| 11 | +```bash |
| 12 | +uv tool install -U "smoosense[emb]" |
| 13 | +``` |
| 14 | + |
| 15 | +### From images |
| 16 | +Run `sense-images ./images/*.jpg`. This runs a Python script that computes OpenAI CLIP and Facebook DINOv2
| 17 | +embeddings, creates a Lance table, builds a vector index, and opens the table in your web browser.
| 18 | + |
| 19 | +### From videos |
| 20 | +Run `sense-videos ./videos/**/*.mp4`. This runs a Python script that computes an OpenAI CLIP embedding for the first
| 21 | +frame of each video, creates a Lance table, builds a vector index, and opens the table in your web browser.
| 22 | + |
| 23 | +### From parquet files |
| 24 | +We also provide a CLI tool to convert Parquet files to Lance.
| 25 | +It also detects columns that hold equal-size float/double arrays, converts them to pyarrow `FixedSizeListArray`, and builds a vector index in the Lance table.
| 26 | + |
| 27 | +```bash |
| 28 | +parquet-to-lance --help |
| 29 | + |
| 30 | +Usage: parquet-to-lance [OPTIONS] PARQUET_PATH LANCE_PATH |
| 31 | + |
| 32 | + Convert a Parquet file to Lance format. |
| 33 | + |
| 34 | + PARQUET_PATH: Input Parquet file |
| 35 | + LANCE_PATH: Output Lance table path |
| 36 | + • Parent directory = database |
| 37 | + • Basename = table name |
| 38 | + • Example: /db/my_table.lance → db=/db, table=my_table |
| 39 | + |
| 40 | + Features: |
| 41 | + ✦ Converts float[]/double[] to fixed-size arrays |
| 42 | + ✦ Builds vector index for embeddings (dim > 10) |
| 43 | + ✦ Appends as new version if table exists |
| 44 | + |
| 45 | + Examples: |
| 46 | + parquet-to-lance data.parquet ./my_db/my_table.lance |
| 47 | + parquet-to-lance emb.parquet /data/lance_db/embeddings |
| 48 | +``` |
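
The path convention and the embedding-column rule from the help text above can be sketched in plain Python. This is an illustration under assumptions, not the tool's actual code: `split_lance_path` and `embedding_dim` are hypothetical helper names, and the `dim > 10` threshold is taken from the help text.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence


def split_lance_path(lance_path: str) -> tuple[str, str]:
    """Split LANCE_PATH into (database directory, table name).

    Hypothetical helper: parent directory = database, basename = table name,
    with an optional ".lance" suffix removed from the table name.
    """
    path = PurePosixPath(lance_path)
    table = path.name
    if table.endswith(".lance"):
        table = table[: -len(".lance")]
    return str(path.parent), table


def embedding_dim(column: Sequence[Sequence[float]], min_dim: int = 10) -> Optional[int]:
    """Return the shared row length if the column looks like an embedding.

    Hypothetical check: every row is a list of floats, all rows have the same
    length, and that length is greater than min_dim. Otherwise return None.
    """
    if not column:
        return None
    dim = len(column[0])
    same_size = all(len(row) == dim for row in column)
    all_floats = all(isinstance(x, float) for row in column for x in row)
    return dim if same_size and all_floats and dim > min_dim else None
```

For example, `split_lance_path("/db/my_table.lance")` yields the database `/db` and the table `my_table`, matching the example in the help text.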
| 49 | +
|
| 50 | +## Similarity search with embedding |
| 51 | +SmooSense integrates vector search with the [Lance index](https://docs.lancedb.com/indexing/vector-index).
| 52 | +When a vector index is found, you can run vector search with a single click. |
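
Conceptually, a vector search returns the rows whose embeddings are closest to a query embedding. The sketch below is a brute-force version using cosine similarity; it is not how SmooSense or the Lance index works internally (an index exists precisely to avoid scanning every row), and `cosine_similarity` and `top_k` are hypothetical names chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, rows, k=3):
    """Return the ids of the k rows most similar to the query embedding.

    `rows` maps a row id to its embedding vector.
    """
    scored = [(cosine_similarity(query, emb), row_id) for row_id, emb in rows.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [row_id for _, row_id in scored[:k]]


rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = top_k([1.0, 0.0], rows, k=2)
```

Here `nearest` contains `"a"` first (identical direction to the query) and `"c"` second.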
| 53 | +
|
| 54 | + |
| 55 | +
|
| 56 | +Try it yourself: [link](https://demo.smoosense.ai/Table?tablePath=s3%3A%2F%2Fsmoosense-demo%2Fembedding%2Fphotos%2Fimages_table.lance)
| 57 | +
|
| 58 | +## Interactive UMAP visualization |
| 59 | +[UMAP](https://umap-learn.readthedocs.io/) (Uniform Manifold Approximation and Projection) reduces high-dimensional embeddings |
| 60 | +to 2D coordinates for visualization while preserving the structure of your data. |
| 61 | +
|
| 62 | +SmooSense computes UMAP projections on-the-fly from your embedding columns and renders them as interactive scatter plots. |
| 63 | +
|
| 64 | +### Features |
| 65 | +- **Hover preview**: Hover over any point to see the image, audio, or video preview |
| 66 | +- **Lasso selection**: Draw a lasso to select multiple points and view them in a gallery |
| 67 | +- **Color by category**: Use a categorical column to color points by group (creates separate traces with legend) |
| 68 | +- **Color by value**: Use a numerical column to color points by continuous value (uses color scale) |
| 69 | +- **SQL filtering**: Filter data with SQL conditions before computing UMAP |
| 70 | +- **Adjustable parameters**: Fine-tune `n_neighbors` and `min_dist` to control the projection |
| 71 | +
|
| 72 | +### Parameters |
| 73 | +| Parameter | Range | Description | |
| 74 | +|-----------|-------|-------------| |
| 75 | +| `n_neighbors` | 2-100 | Controls local vs global structure. Low values (2-15) create tight clusters; high values (50-100) preserve global relationships | |
| 76 | +| `min_dist` | 0-1 | Controls point density. Low values (0-0.1) pack points tightly; high values (0.5-1) spread them out | |
| 77 | +
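
To experiment with the same two parameters outside the UI, a small helper can keep values inside the ranges listed in the table. This is a sketch under assumptions: `umap_params` is a hypothetical name, the defaults (15 and 0.1) are umap-learn's own defaults, and clamping out-of-range values is a choice made here, not documented SmooSense behavior.

```python
def umap_params(n_neighbors: int = 15, min_dist: float = 0.1) -> dict:
    """Clamp both parameters to the ranges in the table and return keyword arguments.

    The returned dict targets a 2D projection (n_components=2).
    """
    return {
        "n_neighbors": max(2, min(100, int(n_neighbors))),
        "min_dist": max(0.0, min(1.0, float(min_dist))),
        "n_components": 2,
    }
```

With umap-learn installed, the result can be passed on as `umap.UMAP(**umap_params(30, 0.05)).fit_transform(embeddings)`, where `embeddings` is your own array of vectors.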
|
| 78 | +### Performance |
| 79 | +- UMAP computation runs in parallel using all CPU cores |
| 80 | +- For large datasets (>1,000 rows), SmooSense automatically samples to keep visualization responsive |
| 81 | +- Results include runtime and sampling info in the status bar |
| 82 | +
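
The sampling step for large datasets can be pictured as follows. This is only a sketch: the text above gives the threshold (more than 1,000 rows) but not the sampling method or the sample size, so uniform random sampling down to the threshold is an assumption, and `sample_rows` is a hypothetical name.

```python
import random


def sample_rows(row_ids, max_rows=1000, seed=0):
    """Return (rows, was_sampled).

    Datasets at or below max_rows are returned unchanged; larger ones are
    reduced to max_rows by uniform random sampling without replacement.
    A fixed seed keeps the sample reproducible between runs.
    """
    rows = list(row_ids)
    if len(rows) <= max_rows:
        return rows, False
    rng = random.Random(seed)
    return rng.sample(rows, max_rows), True
```

Returning the `was_sampled` flag alongside the rows makes it easy to report sampling info next to the plot.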
|
| 83 | +### Try UMAP visualization yourself |
| 84 | +Explore image embeddings with UMAP visualization. Note that this demo only shows the interactive visualization. |
| 85 | +For full functionality, run SmooSense on your own computer.
| 86 | +
|
| 87 | +```tabs |
| 88 | +--- Images |
| 89 | + |
| 90 | +
|
| 91 | +--- Audio |
| 92 | + |
| 93 | +``` |
| 94 | +
|
2 | 95 |
|
3 | 96 | ## Balance map |
4 | 97 | People turn to semantic balance analysis using embeddings when they need to understand whether their dataset is fair, |
@@ -29,7 +122,7 @@ maxColumns=2 height=300px |
29 | 122 | /images/emb/example-imbalance.jpg | Example of imbalance. When ratios differ, the color shifts toward the dominant group, making imbalances immediately visible. |
30 | 123 | ``` |
31 | 124 |
|
32 | | -### Try yourself |
| 125 | +### Try BalanceMap yourself |
33 | 126 | Zoom in and drag around to find a blue cluster where all the data is in the train fold, with no test or validation data at all.
34 | 127 |
|
35 | 128 |  |
|