discuss: similarity algorithms

For years now we've been fighting the `TF/IDF` algorithm. More recently we changed to the `BM25` similarity algorithm, which is much better for short texts like ours, but it's still not perfect.

There is a really [great article here](https://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/) which talks about the caveats of scoring short title fields.

The cool thing about [BM25](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25) (and [other similarity algos](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html)) is that they have some tunable parameters, albeit considered 'expert settings'.

One setting that interests me in particular is the [k1 value](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25), which "Controls non-linear term frequency normalization (saturation)."

The default settings for `BM25` are `k1=1.2` and `b=0.75`. These are really nice settings for general use of Elasticsearch: they work well for short fields like titles as well as for large fields such as a whole chapter of a book.
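To make the saturation effect concrete, here is a minimal sketch of the term-frequency part of the classic BM25 formula (the IDF factor is left out). The function name and the `doc_len`/`avg_doc_len` parameters are my own for illustration, and Lucene's actual implementation may differ in constant factors, so treat the numbers as relative rather than as real Elasticsearch scores.

```python
def bm25_tf_weight(tf, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """Term-frequency component of classic BM25 (IDF omitted).

    Hypothetical helper for illustration only, not Lucene's code.
    """
    if tf <= 0:
        return 0.0
    # b controls how strongly the field length is normalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated terms stop adding to the score.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    for k1 in (1.2, 0.0):
        weights = [round(bm25_tf_weight(tf, k1=k1), 3) for tf in (1, 2, 3)]
        print(f"k1={k1}: tf=1,2,3 -> {weights}")
```

With the defaults and an average-length field, a term appearing twice scores higher than a term appearing once. With `k1=0` the weight is the same for any `tf > 0`, so a match becomes effectively binary.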

For geocoding specifically we almost exclusively deal with short strings (<50 chars).
I also personally feel that term frequencies are much less relevant for geocoding, because they can cause [issues like this](https://github.com/pelias/openstreetmap/issues/507).

I'd like to open this up to @pelias/contributors to discuss introducing our own custom similarity configuration (or multiple if required).
In particular, I would like to investigate the effects of setting `k1=0` (or a very low value).
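As a starting point for discussion, a custom similarity could look roughly like the sketch below, which builds an index body as a Python dict. The `similarity` block under index settings and the per-field `similarity` mapping parameter are documented Elasticsearch features; the similarity name `pelias_bm25`, the field name `name`, and the helper function are placeholders I made up, not the actual Pelias schema.

```python
import json


def build_index_body(k1=0.0, b=0.75, similarity_name="pelias_bm25"):
    """Build an index body with a custom BM25 similarity.

    The similarity name and the 'name' field are hypothetical examples.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    # A named similarity with tuned BM25 parameters.
                    similarity_name: {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                # Each text field opts in to the custom similarity by name.
                "name": {"type": "text", "similarity": similarity_name}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(k1=0.0), indent=2))
```

Since similarity is chosen per field, we could define more than one named similarity and assign them to different fields if a single configuration turns out not to suit everything.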

Thoughts?
