
[Request] Clarify token pruning docs #481

Open
@kderusso

Description
See Slack for more context.

Right now we say:

tokens_weight_threshold: Tokens whose weight is less than tokens_weight_threshold are considered insignificant and pruned. This value must be between 0 and 1. Default: 0.4.

This is misleading.

By setting the tokens_freq_ratio_threshold to 10, you are saying that in order to be pruned, a token must be at least 10x more frequent than the average token across all tokens in all documents for that field. This is higher than the default of 5, so you're dialing this back and requiring tokens to be even more frequent in order to be pruned. In practice, I would expect this to prune only extremely common tokens; think of common words like "is" and "the", for example.
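To make the frequency criterion concrete, here is a minimal sketch of the check as described above. The function name, the precomputed average frequency, and whether the comparison is strict or inclusive are all illustrative assumptions, not the actual Elasticsearch implementation:

```python
def exceeds_freq_ratio(token_freq, avg_token_freq, tokens_freq_ratio_threshold=5):
    """Illustrative only: a token satisfies the frequency criterion when it is
    at least `tokens_freq_ratio_threshold` times more frequent than the average
    token across all tokens in all documents for the field."""
    return token_freq / avg_token_freq >= tokens_freq_ratio_threshold

# With tokens_freq_ratio_threshold=10, only very common tokens qualify.
# A token seen 1,200 times against an average of 100 qualifies (ratio 12):
print(exceeds_freq_ratio(1200, 100, tokens_freq_ratio_threshold=10))  # True
# A token seen 600 times does not (ratio 6):
print(exceeds_freq_ratio(600, 100, tokens_freq_ratio_threshold=10))   # False
```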

By setting the tokens_weight_threshold to 0.4, you are saying that you want to take the best-scoring token and never prune anything whose weight is more than 40% of that token's score. Because scores can vary so widely across text search results, we can't declare a blanket "this is the minimum score" and still expect consistently good results. Instead, say your top score is 0.2. That means that in order to be pruned, a token's score would have to be below 0.08 (40% of 0.2).
Both of those criteria must be met for a token to be pruned.
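As a sketch of how the two criteria combine, using the numbers above (again, names and signatures here are made up for illustration, not the real implementation):

```python
def should_prune(weight, best_weight, freq_ratio,
                 tokens_freq_ratio_threshold=5,
                 tokens_weight_threshold=0.4):
    """Illustrative only: a token is pruned when BOTH criteria are met:
    it is unusually frequent for the field AND its weight is a small
    fraction of the best-scoring token's weight."""
    too_frequent = freq_ratio >= tokens_freq_ratio_threshold
    too_light = weight < tokens_weight_threshold * best_weight
    return too_frequent and too_light

# Top-scoring token weight is 0.2, so the weight cutoff is 0.4 * 0.2 = 0.08.
print(should_prune(weight=0.05, best_weight=0.2, freq_ratio=12))  # True: both criteria met
print(should_prune(weight=0.15, best_weight=0.2, freq_ratio=12))  # False: weight above 0.08
print(should_prune(weight=0.05, best_weight=0.2, freq_ratio=3))   # False: not frequent enough
```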

Metadata


Labels

    Team:Search (Issues owned by the Search Docs Team)
