
[Request] Clarify token pruning docs #481

Open
@kderusso

Description
See Slack for more context.

Right now we say:

tokens_weight_threshold: Tokens whose weight is less than tokens_weight_threshold are considered insignificant and pruned. This value must be between 0 and 1. Default: 0.4.

This is misleading.

By setting the tokens_freq_ratio_threshold to 10, you are saying that in order to be pruned, a token must be at least 10x more frequent than the average token across all tokens in all documents for that field. This is higher than the default of 5, so you're dialing this back and requiring tokens to be even more frequent in order to be pruned. In practice, I would expect this to prune only extremely common tokens; think of common words like "is" and "the", for example.
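To make the frequency criterion concrete, here is a minimal sketch of the check as described above. The function name, the precomputed average frequency, and whether the comparison is strict or inclusive are all illustrative assumptions, not the actual Elasticsearch implementation:

```python
def exceeds_freq_ratio(token_freq, avg_token_freq, tokens_freq_ratio_threshold=5):
    """Illustrative only: a token satisfies the frequency criterion when it is
    at least `tokens_freq_ratio_threshold` times more frequent than the average
    token across all tokens in all documents for the field."""
    return token_freq / avg_token_freq >= tokens_freq_ratio_threshold

# With tokens_freq_ratio_threshold=10, only very common tokens qualify.
# A token seen 1,200 times against an average of 100 qualifies (ratio 12):
print(exceeds_freq_ratio(1200, 100, tokens_freq_ratio_threshold=10))  # True
# A token seen 600 times does not (ratio 6):
print(exceeds_freq_ratio(600, 100, tokens_freq_ratio_threshold=10))   # False
```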

By setting the tokens_weight_threshold to 0.4, you are saying that you want to take the best-scoring token and never prune anything whose weight is more than 40% of that token's score. Because scores can vary so widely across text search results, we can't declare a blanket "this is the minimum score" and still expect consistently good results. Instead, say your top score is 0.2. That means that in order to be pruned, a token's score would have to be below 0.08 (40% of 0.2).
Both of those criteria must be met for a token to be pruned.
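As a sketch of how the two criteria combine, using the numbers above (again, names and signatures here are made up for illustration, not the real implementation):

```python
def should_prune(weight, best_weight, freq_ratio,
                 tokens_freq_ratio_threshold=5,
                 tokens_weight_threshold=0.4):
    """Illustrative only: a token is pruned when BOTH criteria are met:
    it is unusually frequent for the field AND its weight is a small
    fraction of the best-scoring token's weight."""
    too_frequent = freq_ratio >= tokens_freq_ratio_threshold
    too_light = weight < tokens_weight_threshold * best_weight
    return too_frequent and too_light

# Top-scoring token weight is 0.2, so the weight cutoff is 0.4 * 0.2 = 0.08.
print(should_prune(weight=0.05, best_weight=0.2, freq_ratio=12))  # True: both criteria met
print(should_prune(weight=0.15, best_weight=0.2, freq_ratio=12))  # False: weight above 0.08
print(should_prune(weight=0.05, best_weight=0.2, freq_ratio=3))   # False: not frequent enough
```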

Metadata


Labels

    Team:Search (Issues owned by the Search Docs Team)
