SAEs (Sparse Autoencoders) disentangle learned representations into a sparse set of human-understandable concepts. This is achieved by decomposing the representations into a sparse linear combination of learned dictionary vectors. An SAE consists of a linear encoder followed by a non-linearity and a linear decoder.
The formula for an SAE is given by:

$$z = \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d z + b_d$$

where $x$ is the input representation, $z$ is its sparse code, and $\hat{x}$ is the reconstruction. The columns of the decoder matrix $W_d$ form the learned dictionary of concept vectors.
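A minimal PyTorch sketch of such an SAE (illustrative, not the paper's exact implementation; `d_model` and `d_dict` are placeholder dimensions):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: linear encoder + ReLU, linear decoder."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # W_e, b_e
        self.decoder = nn.Linear(d_dict, d_model)  # W_d, b_d

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse code z = ReLU(W_e x + b_e)
        x_hat = self.decoder(z)          # reconstruction x_hat = W_d z + b_d
        return x_hat, z

# Training typically minimizes reconstruction error plus an L1 sparsity penalty on z, e.g.:
# loss = ((x_hat - x) ** 2).mean() + l1_coeff * z.abs().mean()
```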
To assign human-readable names to these concepts, each vector from the learned dictionary is matched against text embeddings. When using CLIP (Contrastive Language–Image Pretraining) feature extractors, the dictionary vectors from the SAE decoder can be directly mapped to text by finding the most similar text embeddings in CLIP space, without the need for large language models, resulting in semantically meaningful and human-interpretable concepts.
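A sketch of this naming step, assuming a pre-computed matrix of CLIP text embeddings for a vocabulary of candidate words (all function and variable names here are illustrative):

```python
import torch
import torch.nn.functional as F

def name_concepts(decoder_weight, text_embeddings, vocabulary):
    """Assign each dictionary vector the vocabulary word whose CLIP text
    embedding is most similar by cosine similarity.

    decoder_weight:  (d_model, d_dict) -- columns are dictionary vectors
    text_embeddings: (n_words, d_model) -- CLIP text embeddings
    vocabulary:      list of n_words strings
    """
    dict_vectors = F.normalize(decoder_weight.T, dim=-1)  # (d_dict, d_model)
    text_embs = F.normalize(text_embeddings, dim=-1)      # (n_words, d_model)
    sims = dict_vectors @ text_embs.T                     # (d_dict, n_words)
    best = sims.argmax(dim=-1)                            # closest word per concept
    return [vocabulary[i] for i in best.tolist()]
```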
Description based on the paper "Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery".
- CLIP $\rightarrow$ cosine similarity
- SAE $\rightarrow$ Manhattan distance
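A sketch of the two scoring functions under these choices, with the Manhattan distance negated so that higher scores always mean more similar (names are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_scores(query_embs, ref_embs):
    """Cosine similarity between CLIP embeddings: (n_q, d) x (n_r, d) -> (n_q, n_r)."""
    q = F.normalize(query_embs, dim=-1)
    r = F.normalize(ref_embs, dim=-1)
    return q @ r.T

def sae_scores(query_codes, ref_codes):
    """Negative Manhattan (L1) distance between sparse codes, so higher = more similar."""
    return -torch.cdist(query_codes, ref_codes, p=1)
```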
- Description: Precision@k, the precision over the top-k retrieved items.
- Calculation: Measures what fraction of the first k results is relevant (i.e., belongs to the set of relevant items for the query).
- Description: Recall@k, the recall over the top-k retrieved items.
- Calculation: Measures what fraction of all relevant items for the query appears among the first k results.
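A minimal sketch of both metrics for a single query, assuming `retrieved` is the ranked list of reference ids and `relevant` is the set of ground-truth matches (illustrative names):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)
```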
- Description: mAP, the mean of the Average Precision (AP) across all queries.
- Calculation: Average Precision for a query is the average of the precision values computed at the rank of each relevant result in the ranking; mAP is the mean of these per-query Average Precisions.
- Description: microAP, a metric that evaluates all query-reference pairs jointly.
- Calculation: Sorts all pairs by confidence (similarity score) and computes the area under the precision-recall curve for the entire set of pairs.
Note: microAP is better when evaluating global ranking performance, while mAP treats each query independently.
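A sketch of mAP and microAP under the definitions above (illustrative names; a library routine such as `sklearn.metrics.average_precision_score` could be substituted for the pooled step):

```python
def average_precision(retrieved, relevant):
    """AP for one query: mean of precision@rank taken at each relevant result."""
    hits, precisions = 0, []
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(all_retrieved, all_relevant):
    """mAP: mean of per-query APs."""
    aps = [average_precision(r, g) for r, g in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)

def micro_average_precision(pairs):
    """microAP: pool all (score, is_match) pairs across queries, sort by score,
    and compute AP over the pooled ranking (area under the global PR curve).
    Assumes `pairs` covers every ground-truth match; otherwise the denominator
    should be the total number of ground-truth pairs."""
    pairs = sorted(pairs, key=lambda p: p[0], reverse=True)
    total_positives = sum(is_match for _, is_match in pairs)
    hits, ap = 0, 0.0
    for rank, (_, is_match) in enumerate(pairs, start=1):
        if is_match:
            hits += 1
            ap += hits / rank
    return ap / total_positives if total_positives else 0.0
```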
Note that the microAP, Precision, and Recall metrics were computed on the whole dataset (50k query and 1M reference images), while the other measurements (RAM usage and calculation time) were run on the first 500 query images due to computational limitations. The proportions are preserved: 104 of the selected query images had a ground-truth reference (roughly 20%, as in the full dataset).
| Metric | CLIP | SAE |
|---|---|---|
| Precision | 0.1047 | 0.3004 |
| Recall | 0.075 | 0.075 |
| microAP | 0.0399 | 0.0647 |
| RAM usage | 1.6 GB | 3.3 GB |
| Calculation time | 37 s | 1087 s |
We chose a similarity threshold of 0.7507 for CLIP and -69.96 for SAE. One can pick a different threshold to adjust the trade-off between Precision and Recall.
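As an illustration, applying these thresholds to the scores from the earlier sketch could look like:

```python
CLIP_THRESHOLD = 0.7507  # cosine similarity (higher = more similar)
SAE_THRESHOLD = -69.96   # negative Manhattan distance (higher = more similar)

def is_match(score: float, threshold: float) -> bool:
    # A query-reference pair is predicted to match if its score clears the
    # threshold; raising the threshold trades Recall for Precision.
    return score >= threshold
```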
```bash
streamlit run path_to_app
```
[1] Douze, M., Tolias, G., Pizzi, E., Papakipos, Z., Chanussot, L., Radenovic, F., Jenicek, T., Maximov, M., Leal-Taixé, L., Elezi, I., Chum, O., & Canton Ferrer, C. "The 2021 Image Similarity Dataset and Challenge." arXiv preprint arXiv:2106.09672 (2021). https://arxiv.org/pdf/2106.09672