Commit 53b8b9d

Add BM25 search
1 parent 3c40788 commit 53b8b9d

10 files changed: +841 -23 lines changed

.github/workflows/ci.yml

Lines changed: 6 additions & 3 deletions
@@ -23,7 +23,7 @@ jobs:
     # Additional services can be defined here if required.
     services:
       db:
-        image: postgres:15
+        image: postgres:17
         ports:
           - 5432/tcp
         env:
@@ -40,8 +40,8 @@ jobs:
     name: Test on OTP ${{matrix.otp}} / Elixir ${{matrix.elixir}}
     strategy:
       matrix:
-        otp: ['27.2']
-        elixir: ['1.18.0']
+        otp: ['28.3']
+        elixir: ['1.19.5']
     steps:
       # Step: Setup Elixir + Erlang image as the base.
       - name: Set up Elixir
@@ -86,6 +86,9 @@ jobs:
       - name: Install dependencies
         run: mix deps.get --check-locked

+      - name: Install pg_textsearch extension
+        run: |
+          docker exec ${{ job.services.db.id }} bash -lc "apt-get update && apt-get install -y wget && wget -q -O /tmp/pg_textsearch.deb https://github.com/timescale/pg_textsearch/releases/download/v0.5.0/pg-textsearch-postgresql-17_0.5.0-1_amd64.deb && dpkg -i /tmp/pg_textsearch.deb"
       # TODO These steps can be moved to `mix.exs`

       # Step: Compile the project treating any warnings as errors.

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -1,3 +1,32 @@
+# v0.6.0
+
+## New 🔥
+
+**BM25 Full-Text Search** is now available via the new `Torus.bm25/5` macro!
+
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a modern ranking algorithm that generally provides superior relevance scoring compared to traditional TF-IDF (used by `full_text/5`). This integration uses the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension by Timescale.
+
+Key features:
+
+- State-of-the-art BM25 ranking with configurable index parameters (k1, b)
+- Blazingly fast top-k queries via Block-Max WAND optimization (`Torus.bm25/5` + `limit`)
+- Simple syntax: `Post |> Torus.bm25([p], p.body, "search term") |> limit(10)`
+- Score selection with `:score_key` and post-filtering with `:score_threshold`
+- Language/stemming configured at index creation via `text_config`
+
+Requirements:
+
+- PostgreSQL 17+
+- pg_textsearch extension installed
+- BM25 index on the search column (with `text_config` for language)
+
+See the [BM25 Search Guide](https://dimamik.com/posts/bm25_search) for detailed setup instructions and examples.
+
+**When to use BM25 vs full_text:**
+
+- Use `bm25/5` for fast single-column search with modern relevance ranking
+- Use `full_text/5` for multi-column search with weights or when using stored tsvector columns
+
 # v0.5.3

 ## Fixes
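
Reviewer note on `:score_threshold`: the changelog mentions post-filtering by score, and the option docs added in this commit state that BM25 scores are negative with lower being better. The following is a minimal in-memory sketch of that comparison direction only. The module `ScoreThresholdSketch` and the sample rows are hypothetical and not part of Torus, and the sketch makes no claim about where the database applies the filter relative to `LIMIT`.

```elixir
defmodule ScoreThresholdSketch do
  @moduledoc """
  In-memory illustration of negative BM25 scores.
  Hypothetical module, not part of Torus.
  """

  # Sort ascending so the most negative (best) score comes first,
  # mirroring the documented default `order: :asc`.
  def best_first(rows), do: Enum.sort_by(rows, & &1.relevance)

  # Keep only rows whose score is strictly below the threshold,
  # i.e. matches that are better than the threshold.
  def below_threshold(rows, threshold) do
    Enum.filter(rows, &(&1.relevance < threshold))
  end
end

# Made-up rows standing in for query results selected with `score_key: :relevance`.
rows = [
  %{title: "a", relevance: -1.0},
  %{title: "b", relevance: -5.2},
  %{title: "c", relevance: -3.5}
]

kept =
  rows
  |> ScoreThresholdSketch.best_first()
  |> ScoreThresholdSketch.below_threshold(-3.0)
  |> Enum.map(& &1.title)

IO.inspect(kept)
```

With a threshold of `-3.0`, the row scored `-1.0` is dropped because it is a weaker match, while `-5.2` and `-3.5` are kept.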

README.md

Lines changed: 36 additions & 16 deletions
@@ -38,7 +38,7 @@ Post

 See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more details.

-## 6 types of search:
+## 7 types of search:

 1. **Pattern matching**: Searches for a specific pattern in a string.

@@ -58,12 +58,13 @@ See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more de
 1. **Similarity:** Searches for records that closely match the input text using trigram distance.

    ```elixir
-   iex> insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"])
-   ...> Post
-   ...> |> Torus.similarity([p], [p.title], "hoggwarrds")
-   ...> |> limit(2)
-   ...> |> select([p], p.title)
-   ...> |> Repo.all()
+   insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"])
+
+   Post
+   |> Torus.similarity([p], [p.title], "hoggwarrds")
+   |> limit(2)
+   |> select([p], p.title)
+   |> Repo.all()
    ["Hogwarts Secrets", "Hogwart’s Secret"]
    ```

@@ -74,20 +75,39 @@ See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more de
 1. **Full text**: Uses term-document matrix vectors, enabling efficient querying and ranking based on term frequency. Supports prefix search and is great for large datasets to quickly return relevant results. See [PostgreSQL Full Text Search](https://www.postgresql.org/docs/current/textsearch.html) for internal implementation details.

    ```elixir
-   iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
-   ...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
-   ...> insert_post!(title: "Completely unrelated", body: "No magic here!")
-   ...> Post
-   ...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
-   ...> |> select([p], p.title)
-   ...> |> Repo.all()
+   insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
+   insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
+   insert_post!(title: "Completely unrelated", body: "No magic here!")
+
+   Post
+   |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
+   |> select([p], p.title)
+   |> Repo.all()
    ["Diagon Bombshell"]
    ```

-   Use it when you dont care about spelling, the documents are long, or if you need to order the results by rank.
+   Use it when you don't care about spelling, the documents are long, you need multi-column search with weights, or if you need to order the results by rank.

    See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more details.

+1. **BM25 full text**: Modern BM25 ranking algorithm for superior relevance scoring using the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension. BM25 generally provides better ranking than traditional built-in TF-IDF full text search and is optimized for top-k queries.
+
+   ```elixir
+   insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
+   insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
+   insert_post!(title: "Completely unrelated", body: "No magic here!")
+
+   Post
+   |> Torus.bm25([p], p.body, "secrets hogwarts")
+   |> select([p], p.title)
+   |> Repo.all()
+   ["Diagon Bombshell"]
+   ```
+
+   Use it when you need state-of-the-art relevance ranking for single-column search, especially with LIMIT clauses. Requires PostgreSQL 17+.
+
+   See [`bm25/5`](https://hexdocs.pm/torus/Torus.html#bm25/5) and the [BM25 Search Guide](https://dimamik.com/posts/bm25_search) for detailed setup instructions and examples.
+
 1. **Semantic Search**: Understands the contextual meaning of queries to match and retrieve related content utilizing natural language processing. Read more about semantic search in [Semantic search with Torus guide](/guides/semantic_search.md).

    ```elixir
@@ -131,7 +151,7 @@ Torus offers a few helpers to debug, explain, and analyze your queries before us

 ## Torus support

-For now, Torus supports pattern match, similarity, full-text, and semantic search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).
+For now, Torus supports pattern match, similarity, full-text (TF-IDF and BM25), and semantic search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).

 <!-- MDOC -->


lib/torus.ex

Lines changed: 151 additions & 0 deletions
@@ -366,6 +366,157 @@ defmodule Torus do
     Torus.Search.FullText.to_tsquery(column, query_text, opts)
   end

+  @doc group: "Full text"
+  @doc """
+  BM25 ranked full-text search using the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension.
+
+  BM25 is a modern ranking function that generally provides better relevance than traditional
+  TF-IDF (used by `full_text/5`). It's particularly effective for top-k queries with LIMIT clauses
+  due to Block-Max WAND optimization.
+
+  For detailed usage examples, performance tips, and migration guide, see the [BM25 Search guide](https://dimamik.com/posts/bm25_search).
+
+  > #### Requirements {: .warning}
+  >
+  > - Requires the `pg_textsearch` extension to be installed
+  > - PostgreSQL 17+ only
+  > - Requires a BM25 index on the search column
+  > - **Single column only** - unlike `full_text/5`, BM25 indexes work on one column at a time
+  > - **Language is set at index creation** - use `text_config` in the index `WITH` clause
+  >
+  > ```elixir
+  > defmodule YourApp.Repo.Migrations.CreatePgTextsearchExtension do
+  >   use Ecto.Migration
+  >
+  >   def change do
+  >     execute "CREATE EXTENSION IF NOT EXISTS pg_textsearch", "DROP EXTENSION IF EXISTS pg_textsearch"
+  >
+  >     # Create BM25 index with language configuration
+  >     execute \"\"\"
+  >     CREATE INDEX posts_body_bm25_idx ON posts
+  >     USING bm25(body) WITH (text_config='english')
+  >     \"\"\", "DROP INDEX posts_body_bm25_idx"
+  >   end
+  > end
+  > ```
+
+  ## Options
+
+    * `:order` - Ordering of results. Note that BM25 returns **negative scores** (lower is better):
+      - `:asc` (default) - orders by score ascending (best matches first)
+      - `:desc` - orders by score descending (worst matches first)
+      - `:none` - no ordering applied
+    * `:index_name` - Explicit index name. Required when using `score_threshold`.
+    * `:score_key` - Atom key to select the BM25 score into the result map.
+      - `:none` (default) - score is not selected
+      - `atom` - selects score as this key (use with `select_merge/3`)
+    * `:score_threshold` - Post-filter results by BM25 score (applied after ORDER BY).
+      Since scores are negative and lower is better, use negative thresholds (e.g., `-3.0`
+      keeps only results with score < -3.0, i.e., scores like -4.0, -5.0 which are better matches).
+      May return fewer results than LIMIT.
+    * `:pre_filter` - Whether to exclude non-matching rows.
+      - `false` (default) - no pre-filtering
+      - `true` - adds a `WHERE score < 0` clause to exclude non-matches
+
+  ## Examples
+
+  Basic search - returns top 10 most relevant posts:
+
+      Post
+      |> Torus.bm25([p], p.body, "database search")
+      |> limit(10)
+      |> select([p], p.body)
+      |> Repo.all()
+
+  With score selection:
+
+      Post
+      |> Torus.bm25([p], p.body, "database", score_key: :relevance)
+      |> limit(5)
+      |> select([p], %{body: p.body})
+      |> Repo.all()
+      # => [%{body: "...", relevance: -2.5}, ...]
+
+  With WHERE clause pre-filtering:
+
+      Post
+      |> where([p], p.category_id == 123)
+      |> Torus.bm25([p], p.body, "database")
+      |> limit(10)
+      |> Repo.all()
+
+  With score threshold (post-filtering, may return fewer than LIMIT, `index_name` is required):
+
+      Post
+      |> Torus.bm25([p], p.body, "database", score_threshold: -5.0, index_name: "posts_body_bm25_idx")
+      |> limit(10)
+      |> Repo.all()
+
+  ## When to use `bm25/5` vs `full_text/5`
+
+  **Use `bm25/5` when:**
+  - You need better relevance ranking than TF-IDF
+  - You need faster search with large datasets
+  - You have large result sets with LIMIT (top-k queries)
+  - Single column search is sufficient
+  - You're on PostgreSQL 17+
+
+  **Use `full_text/5` when:**
+  - You need multi-column search with different weights per column
+  - You want to use stored tsvector columns
+  - You're on PostgreSQL < 17
+  - You need the `concat` filter type
+
+  ## Multi-column search workaround
+
+  Since BM25 indexes work on single columns, you can create a generated column:
+
+  ```sql
+  ALTER TABLE posts
+  ADD COLUMN searchable_text TEXT
+  GENERATED ALWAYS AS (title || ' ' || body) STORED;
+
+  CREATE INDEX posts_searchable_bm25_idx
+  ON posts USING bm25(searchable_text)
+  WITH (text_config='english');
+  ```
+
+  Then search the generated column:
+
+  ```elixir
+  Post
+  |> Torus.bm25([p], p.searchable_text, "search term")
+  |> limit(10)
+  |> Repo.all()
+  ```
+
+  ## Index options
+
+  BM25 indexes support these parameters in the `WITH` clause:
+
+  - `text_config` - PostgreSQL text search configuration (required). This determines
+    the language/stemming rules. Available configs: `'english'`, `'french'`, `'german'`,
+    `'simple'` (no stemming), etc. Run `SELECT cfgname FROM pg_ts_config;` to list all.
+  - `k1` - Term frequency saturation (default: 1.2, range: 0.1-10.0)
+  - `b` - Length normalization (default: 0.75, range: 0.0-1.0)
+
+  ```sql
+  CREATE INDEX custom_idx ON documents
+  USING bm25(content)
+  WITH (text_config='english', k1=1.5, b=0.8);
+  ```
+
+  ## Performance tips
+
+  - BM25 is most efficient with `ORDER BY + LIMIT` (enables Block-Max WAND optimization)
+  - For filtered searches, create a separate B-tree index on the filter column
+  - Pre-filtering works best when the filter is selective (<10% of rows)
+  - Post-filtering with `score_threshold` may return fewer results than LIMIT
+  """
+  defmacro bm25(query, bindings, qualifier, term, opts \\ []) do
+    Torus.Search.BM25.bm25(query, bindings, qualifier, term, opts)
+  end
+
   @doc group: "Pattern matching"
   @doc """
   The substring function with three parameters provides extraction of a substring
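
Reviewer note on the `k1` and `b` index options documented in this hunk: a rough sketch of what the two parameters control, using the textbook Okapi BM25 per-term formula. This is an illustration under stated assumptions, not pg_textsearch's implementation. The module `BM25Sketch`, its function names, and the sample numbers are hypothetical; the exact IDF variant the extension uses may differ; and the sketch returns the positive textbook score, whereas the docs above describe the extension's scores as negative.

```elixir
defmodule BM25Sketch do
  @moduledoc "Textbook Okapi BM25 term score (illustrative only, not pg_textsearch code)."

  # Smoothed inverse document frequency; the `1 +` keeps the value positive.
  def idf(total_docs, docs_with_term) do
    :math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
  end

  # Contribution of one query term to one document's score.
  # `k1` caps how much repeated occurrences of the term can add (saturation);
  # `b` sets how strongly documents longer than average are penalized.
  def term_score(tf, doc_len, avg_doc_len, idf_value, k1 \\ 1.2, b \\ 0.75) do
    length_norm = 1 - b + b * doc_len / avg_doc_len
    idf_value * tf * (k1 + 1) / (tf + k1 * length_norm)
  end
end

# Made-up corpus statistics: 1,000 documents, 10 of which contain the term.
idf = BM25Sketch.idf(1_000, 10)

# Same term frequency, different document lengths (average length is 100).
short = BM25Sketch.term_score(3, 50, 100, idf)
long = BM25Sketch.term_score(3, 400, 100, idf)

IO.inspect({short, long})
```

With the default `b`, the shorter document scores higher for the same term frequency. Setting `b` to `0.0` removes the length effect entirely, and whatever the term frequency, the per-term score stays below `idf_value * (k1 + 1)`.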
