This repository was archived by the owner on Mar 26, 2025. It is now read-only.
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# rsimsimd
<!-- badges: start -->
[R-CMD-check](https://github.com/DavZim/rsimsimd/actions/workflows/R-CMD-check.yaml)
[CRAN status](https://CRAN.R-project.org/package=rsimsimd)
<!-- badges: end -->
`{rsimsimd}` is a light R wrapper around the Rust crate [`simsimd`](https://github.com/ashvardanian/SimSIMD).
It allows efficient calculation of similarity metrics using [SIMD operations](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data).
## Installation
You can install the package like so:
``` r
# dev version
# install.packages("remotes")
# remotes::install_github("DavZim/rsimsimd")
# CRAN version
install.packages("rsimsimd")
```
## Example
The example below calls `dist_cosine()` on a single pair of vectors and then on lists of embedding vectors:
```{r example}
library(rsimsimd)
# a simple cosine similarity calculation
dist_cosine(c(1, 2, 3),
            c(4, 5, 6))
# more realistic embedding use case
# with 1536 (OpenAI) embedding dimensions
n_dimensions <- 1536
set.seed(123)
vec1 <- rnorm(n_dimensions)
vec2 <- rnorm(n_dimensions)
dist_cosine(vec1, vec2)
# if you have a list (or a vector) of embeddings (e.g., a database)
# and want to compare it against a list of lookup vectors, pass both lists
# simulate a DB of 1000 embedding vectors
db <- lapply(seq(1000), function(x) rnorm(n_dimensions))
# simulate a lookup of 3 embedding vectors
lookup <- lapply(seq(3), function(x) rnorm(n_dimensions))
res <- dist_cosine(lookup, db)
# one row for each lookup, one column for each DB entry
dim(res)
```
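To make the shape of `res` concrete, here is a minimal base-R sketch of a pairwise cosine matrix. It does not call `{rsimsimd}`: `cosine_matrix_ref()` is a hypothetical helper written only for illustration, and it assumes the conventional definition of cosine similarity (dot product divided by the product of the vector norms). The inputs are deliberately smaller than in the example above.

```r
# Hypothetical base-R reference, not part of rsimsimd.
# Rows correspond to `lookup` vectors, columns to `db` vectors.
cosine_matrix_ref <- function(lookup, db) {
  # stack each list into a matrix with one vector per row
  a <- do.call(rbind, lookup)
  b <- do.call(rbind, db)
  # scale every row to unit length
  a <- a / sqrt(rowSums(a^2))
  b <- b / sqrt(rowSums(b^2))
  # dot products of unit vectors are cosine similarities
  tcrossprod(a, b)
}

set.seed(123)
db_small <- lapply(seq(5), function(i) rnorm(8))
lookup_small <- lapply(seq(2), function(i) rnorm(8))

ref <- cosine_matrix_ref(lookup_small, db_small)
dim(ref) # 2 rows (lookups), 5 columns (DB entries)

# column index of the most similar DB entry for each lookup
best <- max.col(ref, ties.method = "first")
```

Reading the matrix row by row gives, for each lookup vector, its score against every database entry, which is why a per-row maximum is a natural way to pick the closest match.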
## Functions
- [`dist_cosine()`] Cosine Similarity Matrix
- [ ] `dist_dot`
- [ ] `dist_sqeuclidean`
- [ ] `div_jensenshannon`
- [ ] `div_kullbackleibler`
- [`get_capabilities()`] reports current hardware capabilities
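
For orientation, the metric names listed above conventionally refer to the quantities sketched below in base R. These `*_ref()` helpers are hypothetical and are not the `{rsimsimd}` implementations. Conventions vary between libraries (for example the logarithm base, or whether a square root is applied to the Jensen-Shannon divergence), so treat this as an assumption-laden reference rather than a specification.

```r
# Hypothetical base-R reference definitions, not the rsimsimd implementations.

# dot product
dot_ref <- function(x, y) sum(x * y)

# squared Euclidean distance
sqeuclidean_ref <- function(x, y) sum((x - y)^2)

# Kullback-Leibler divergence (natural log) for probability vectors
# with strictly positive entries
kl_ref <- function(p, q) sum(p * log(p / q))

# Jensen-Shannon divergence: mean KL divergence to the midpoint distribution
js_ref <- function(p, q) {
  m <- (p + q) / 2
  (kl_ref(p, m) + kl_ref(q, m)) / 2
}

dot_ref(c(1, 2, 3), c(4, 5, 6))         # 32
sqeuclidean_ref(c(1, 2, 3), c(4, 5, 6)) # 27

p <- c(0.5, 0.3, 0.2)
q <- c(0.2, 0.3, 0.5)
kl_ref(p, q)
js_ref(p, q)
```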
## Benchmark
```{r benchmark, eval=requireNamespace("bench", quietly = TRUE)}
# alternative implementation of cosine, taken from lsa
# see also https://github.com/cran/lsa/blob/master/R/cosine.R
lsa_cosine <- function(x, y) {
  as.vector(crossprod(x, y) / sqrt(crossprod(x) * crossprod(y)))
}
set.seed(123)
n_dimensions <- 1536
v1 <- rnorm(n_dimensions)
v2 <- rnorm(n_dimensions)
bench::mark(
  lsa_cosine(v1, v2),
  dist_cosine(v1, v2)
)
# compare 1 embedding to 1'000 embeddings
ll_1k <- lapply(seq(1000), function(i) rnorm(n_dimensions))
bench::mark(
  sapply(ll_1k, function(ll) lsa_cosine(v1, ll)),
  dist_cosine(v1, ll_1k),
  check = FALSE # rounding errors
)
# compare 1 embedding to 100'000 embeddings
ll_100k <- lapply(seq(100000), function(i) rnorm(n_dimensions))
bench::mark(
  sapply(ll_100k, function(ll) lsa_cosine(v1, ll)),
  dist_cosine(v1, ll_100k),
  check = FALSE # rounding errors
)
# 1k x 1k comparisons => 1 million comparisons
bench::mark(
  dist_cosine(ll_1k)
)
```
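
Several of the benchmarks above pass `check = FALSE` because the two implementations differ by rounding. A minimal sketch of comparing such results with a tolerance instead of exact equality follows. It uses two base-R formulations of cosine similarity as stand-ins (`cosine_crossprod()` and `cosine_normalised()` are hypothetical helpers, and `{rsimsimd}` is not called), and the tolerance of `1e-8` is an arbitrary choice for illustration.

```r
# Two algebraically equivalent formulations that may differ in the last bits.
cosine_crossprod <- function(x, y) {
  as.vector(crossprod(x, y) / sqrt(crossprod(x) * crossprod(y)))
}
cosine_normalised <- function(x, y) {
  # normalise each vector first, then take the dot product
  sum((x / sqrt(sum(x^2))) * (y / sqrt(sum(y^2))))
}

set.seed(123)
x <- rnorm(1536)
y <- rnorm(1536)

a <- cosine_crossprod(x, y)
b <- cosine_normalised(x, y)

# all.equal() compares numerics up to a tolerance rather than bit for bit
same_within_tol <- isTRUE(all.equal(a, b, tolerance = 1e-8))
same_within_tol
```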