From 63632e96665fcc33a817ca153c3f51ae0d158d7d Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 14:54:42 +0200 Subject: [PATCH 01/17] docs(stemming): add custom stemming dictionaries - add new `stemming.md` documentation explaining basic and custom stemming - add stemming menu item in navigation config.js - update collections schema docs with custom stemming functionality - update FAQs with custom stemming example and explanation --- docs-site/content/.vuepress/config.js | 1 + docs-site/content/28.0/api/collections.md | 2 +- docs-site/content/28.0/api/stemming.md | 166 ++++++++++++++++++++++ docs-site/content/guide/faqs.md | 49 ++++++- 4 files changed, 213 insertions(+), 5 deletions(-) create mode 100644 docs-site/content/28.0/api/stemming.md diff --git a/docs-site/content/.vuepress/config.js b/docs-site/content/.vuepress/config.js index 27509087..1c45360d 100644 --- a/docs-site/content/.vuepress/config.js +++ b/docs-site/content/.vuepress/config.js @@ -315,6 +315,7 @@ let config = { ['/28.0/api/curation', 'Curation'], ['/28.0/api/collection-alias', 'Collection Alias'], ['/28.0/api/synonyms', 'Synonyms'], + ['/28.0/api/stemming', 'Stemming'], ['/28.0/api/stopwords', 'Stopwords'], ['/28.0/api/cluster-operations', 'Cluster Operations'], ], diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 6099c4ff..64ffe2aa 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -404,7 +404,7 @@ string, then the next document that contains the field named `title` will be exp | Parameter | Required | Description | 
|:----------------------|:---------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | name | yes | Name of the collection you wish to create. | -| fields | yes | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least it's `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. Or you can also use a RegEx to specify field names matching a pattern. For eg: if you want to specify that all fields starting with `score_` should be an integer, you can set name as `"name": "score_.*"`.

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results)..

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plurals, tense changes) of the same root word. For eg: searching for `walking`, will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled.

Enable stemming on the contents of the field during indexing and querying by setting `"stem": true`. The actual value stored on disk is not affected.

We use the [Snowball stemmer](https://snowballstem.org/). Language selection for stemmer is automatically made from the value of the `locale` property associated with the field.

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale specific customizations for the following languages. You can enable these customizations for a field, by setting a field called `locale` inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

Here's a non-exhaustive list of language-specific locales: Read this guide article for more information regarding Locale-Specific search. | +| fields | yes | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least its `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. You can also use a RegEx to specify field names matching a pattern. For eg: if you want to specify that all fields starting with `score_` should be integers, you can set the name as `"name": "score_.*"`.

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results).

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plurals, tense changes) of the same root word. For eg: searching for `walking` will also return results with `walk`, `walked`, `walks`, etc. when stemming is enabled.

Stemming can be enabled via two methods: basic (algorithmic) stemming, by setting `"stem": true` on the field, or a custom stemming dictionary, by setting `"stem_dictionary"` to the id of an uploaded dictionary. Eg: ```{"name": "title", "type": "string", "stem": true}```. For more details regarding stemming, read the stemming documentation.

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale-specific customizations for the following languages. You can enable these customizations for a field by setting a property called `locale` inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

For a non-exhaustive list of language-specific locales, read this guide article for more information regarding locale-specific search. | | token_separators | no | List of symbols or special characters to be used for splitting the text into individual words _**in addition**_ to space and new-line characters.

For e.g. you can add `-` (hyphen) to this list to make a word like `non-stick` be split on the hyphen and indexed as two separate words.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | | symbols_to_index | no | List of symbols or special characters to be indexed.

For e.g. you can add `+` to this list to make the word `c++` indexable verbatim.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | | default_sorting_field | no | The name of an `int32 / float` field that determines the order in which the search results are ranked when a `sort_by` clause is not provided during searching.

This field must indicate some kind of popularity. For example, in a product search application, you could define `num_reviews` field as the `default_sorting_field` to rank products that have the most reviews higher by default.

Additionally, when a word in a search query matches multiple possible words (either during a prefix (partial word) search or because of a typo), this parameter is used to rank such equally matching records.

For e.g. searching for "ap" will match records with "apple", "apply", "apart", "apron", or any of hundreds of similar words that start with "ap" in your dataset. Also, searching for "jofn" will match records with "john", "joan" and all similar variations that are 1-typo away in your dataset.

For performance reasons though, Typesense will only consider the top `4` prefixes or typo variations by default (configurable via the [`max_candidates`](./search.md#ranking-and-sorting-parameters) search parameter, which defaults to `4`).

If `default_sorting_field` is NOT specified in the collection schema, then "top" is defined as the prefixes or typo variations with the most number of matching records.

But let's say you have a field called `popularity` in each record, and you want Typesense to use the value in that field to define the "top" records, you'd set that field as `default_sorting_field: popularity`. Typesense will then use the value of that field to fetch the top `max_candidates` number of terms that are most popular, and as users type in more characters, it will refine the search further to always rank the most popular prefixes highest. | diff --git a/docs-site/content/28.0/api/stemming.md b/docs-site/content/28.0/api/stemming.md new file mode 100644 index 00000000..810187f5 --- /dev/null +++ b/docs-site/content/28.0/api/stemming.md @@ -0,0 +1,166 @@ +--- +sidebarDepth: 1 +sitemap: + priority: 0.7 +--- + +# Stemming + +Stemming is a technique that helps handle variations of words during search. When stemming is enabled, a search for one form of a word will also match other grammatical forms of that word. For example: + +- Searching for "run" would match "running", "runs", "ran" +- Searching for "walk" would match "walking", "walked", "walks" +- Searching for "company" would match "companies" + +Typesense provides two approaches to handle word variations: + +## Basic Stemming + +Basic stemming uses the [Snowball stemmer](https://snowballstem.org/) algorithm to automatically detect and handle word variations. This works well for common word patterns in the configured language. + +To enable basic stemming for a field, set `"stem": true` in your collection schema: + + + + + +The language used for stemming is automatically determined from the `locale` parameter of the field. For example, setting `"locale": "fr"` will use French-specific stemming rules. + +## Custom Stemming Dictionaries + +For cases where you need more precise control over word variations, or when dealing with irregular forms that algorithmic stemming can't handle well, you can use stemming dictionaries. These allow you to define exact mappings between words and their root forms. 
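As a sketch of the flow detailed in the sections below: a dictionary is uploaded under an id of your choosing, and the collection schema then references that id. This assumes a Typesense server on `localhost:8108`; the `my_dictionary` id, the `irregulars.jsonl` file name, and the collection/field names are all illustrative:

```bash
# Upload a JSONL word-to-root dictionary under the id "my_dictionary"
curl -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  --data-binary @irregulars.jsonl \
  "http://localhost:8108/stemming/dictionary/import?id=my_dictionary"

# Create a collection whose "title" field stems using that dictionary
curl -X POST "http://localhost:8108/collections" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "products",
    "fields": [
      {"name": "title", "type": "string", "stem_dictionary": "my_dictionary"}
    ]
  }'
```

With a mapping like `{"word": "people", "root": "person"}` in the dictionary, both indexed documents and queries containing `people` are normalized to the root `person`, so searches for either form match each other.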
+ +### Creating a Stemming Dictionary + +First, create a JSONL file with your word mappings: + +```json +{"word": "people", "root": "person"} +{"word": "children", "root": "child"} +{"word": "geese", "root": "goose"} +``` + +Then upload it using the stemming dictionary API: + + + + + +#### Sample Response + + + + + +### Using a Stemming Dictionary + +To use a stemming dictionary, specify it in your collection schema using the `stem_dictionary` parameter: + + + + + +:::tip Combining Both Approaches +You can use both basic stemming (`"stem": true`) and dictionary stemming (`"stem_dictionary": "dictionary_name"`) on the same field. When both are enabled, dictionary stemming takes precedence for words that exist in the dictionary. +::: + +### Managing Dictionaries + +#### Retrieve a Dictionary + + + + + +#### List All Dictionaries + + + + + +#### Sample Response + + + + + +## Best Practices + +1. **Start with Basic Stemming**: For most use cases, basic stemming with the appropriate locale setting will handle common word variations well. + +2. **Use Dictionaries for Exceptions**: Add stemming dictionaries when you need to handle: + - Domain-specific variations + - Cases where basic stemming doesn't give desired results + +3. **Language-Specific Considerations**: Remember that basic stemming behavior changes based on the `locale` parameter. Set this appropriately for your content's language. diff --git a/docs-site/content/guide/faqs.md b/docs-site/content/guide/faqs.md index 016d6d13..2b085527 100644 --- a/docs-site/content/guide/faqs.md +++ b/docs-site/content/guide/faqs.md @@ -56,11 +56,52 @@ You can use the `token_separators` and `symbols_to_index` parameters to control ### How do I handle singular / plural variations of a keyword? -You can use the stemming feature to allow search queries that contain variations of a word in your dataset (eg: singular / plurals, tense changes, etc) to still match the record. 
+There are two ways to handle word variations (like singular/plural forms) in Typesense: -For eg: searching for `walking`, will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled. +#### 1. Using Basic Stemming -You can enable stemming by setting the `stem: true` parameter in the field definition in the collection schema. +You can use the built-in stemming feature to automatically handle common variations of words in your dataset (eg: singular/plurals, tense changes, etc). +For eg: searching for `walking` will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled. + +You can enable stemming by setting the `stem: true` parameter in the field definition in the collection schema. + +#### 2. Using Custom Stemming Dictionaries + +:::warning NOTE +Custom stemming dictionaries are only available in `v28.0` and above. +::: + +For more precise control over word variations, you can use custom stemming dictionaries that define exact mappings between words and their root forms. + +First, create a dictionary by uploading a JSONL file that contains your word mappings: + +```json +{"word": "meetings", "root":"meeting"} +{"word": "people", "root":"person"} +{"word": "children", "root":"child"} +``` + +Upload this dictionary using the stemming dictionary API: + +```bash +curl -X POST \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + --data-binary @plurals.jsonl \ + "http://localhost:8108/stemming/dictionary/import?id=my_dictionary" +``` + +Then enable the dictionary in your collection schema by setting the `stem_dictionary` parameter: + +```json +{ + "name": "companies", + "fields": [ + {"name": "title", "type": "string", "stem_dictionary": "my_dictionary"} + ] +} +``` + +For more details on stemming, read the stemming documentation. ### When I search for a short string, I don't get all results. How do I address this? 
@@ -354,4 +395,4 @@ Here's how Typesense Cloud and Self-Hosted (on any VPS or other cloud) compare: ### I don't see my question answered here or in the docs. What do I do? -Read our [Help](/help.md) section for information on how to get additional help. \ No newline at end of file +Read our [Help](/help.md) section for information on how to get additional help. From 065778000202d040af46a7824a70c30a80932b60 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:03:43 +0200 Subject: [PATCH 02/17] docs(sort_by): add random sorting functionality - add documentation for `_rand()` sorting parameter - document seed value behavior and constraints - add examples of random sorting with and without seeds - include tips about timestamp usage and combining with other sorts --- docs-site/content/28.0/api/search.md | 38 ++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 9e3bee93..b41d9bb6 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -489,6 +489,44 @@ sort_by=title(missing_values: last):desc The possible values of `missing_values` are: `first` or `last`. + +### Random Sorting + +You can randomly sort search results using the special `_rand()` parameter in `sort_by`. You can optionally provide a seed value, which must be a positive integer. + +For example, with a specific seed: + +```json +{ + "sort_by": "_rand(42)" +} +``` + +Or without a seed value, which will use the current timestamp as the seed: + +```json +{ + "sort_by": "_rand()" +} +``` + +Using a specific seed value will produce the same random ordering across searches, which is useful when you want consistent randomization (e.g., for A/B testing or result sampling). Using `_rand()` without a seed will produce different random orderings on each request. 
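In a full search request, the seeded form is passed like any other `sort_by` value. For example (assuming a `products` collection with a `product_name` field, as used elsewhere on this page):

```bash
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/collections/products/documents/search\
?q=*\
&query_by=product_name\
&sort_by=_rand(42)"
```

Repeating this exact request returns results in the same seeded random order; dropping the `42` gives a fresh ordering on every request.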
+ +You can combine random sorting with other sort fields: + +```json +{ + "sort_by": "_rand():desc,popularity:desc" +} +``` + +:::tip +- When no seed is provided, the current timestamp is used as the seed +- When a seed is provided, it must be a positive integer +- Using the same seed will produce the same random ordering +- Different seed values (or no seed) will produce different random orderings +::: + ## Group Results You can aggregate search results into groups or buckets by specify one or more `group_by` fields. From ff2a54fd8803d1f3f63260f2e8446faeaafaf7dd Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:18:13 +0200 Subject: [PATCH 03/17] docs(hybrid-search): add hybrid match re-ranking functionality - add documentation for `rerank_hybrid_matches` parameter - update vector-search.md with re-ranking behavior and examples - expand semantic-search guide with detailed re-ranking explanation - add code samples showing score differences with re-ranking --- docs-site/content/28.0/api/vector-search.md | 40 +++++++++ docs-site/content/guide/semantic-search.md | 98 +++++++++++++++++++++ 2 files changed, 138 insertions(+) diff --git a/docs-site/content/28.0/api/vector-search.md b/docs-site/content/28.0/api/vector-search.md index 35ba4736..f912891c 100644 --- a/docs-site/content/28.0/api/vector-search.md +++ b/docs-site/content/28.0/api/vector-search.md @@ -2726,6 +2726,46 @@ won't have an impact on an embedding field mentioned in `query_by`. However, sin must match the length of `query_by`, you can use a placeholder value like `0`. ::: +### Re-ranking Hybrid Matches + +By default, during hybrid search: +- Documents found through keyword search but not through vector search will only have a text match score +- Documents found through vector search but not through keyword search will only have a vector distance score + +You can optionally compute both scores for all matches by setting `rerank_hybrid_matches: true` in your search parameters. 
When enabled: +- Documents found only through keyword search will also get a vector distance score +- Documents found only through vector search will also get a text match score + +This allows for more comprehensive ranking of results, at the cost of additional computation time. + +Example: + + + + + +Each hit in the response will contain a `text_match_info` and a `vector_distance` score, regardless of whether it was initially found through keyword or vector search. + ### Distance Threshold You can also set a maximum vector distance threshold for results of semantic search and hybrid search. You should set `distance_threshold` in `vector_query` parameter for this. diff --git a/docs-site/content/guide/semantic-search.md b/docs-site/content/guide/semantic-search.md index d08ae35a..d87c0b5c 100644 --- a/docs-site/content/guide/semantic-search.md +++ b/docs-site/content/guide/semantic-search.md @@ -442,6 +442,104 @@ Notice how searching for `Desktop copier` returns `Desktop` as a result which is } ``` +### Re-ranking Hybrid Matches + +When doing hybrid search, by default Typesense returns both keyword matches and semantic matches in the results. 
For example: + +```bash +curl 'http://localhost:8108/multi_search' \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + -X POST \ + -d '{ + "searches": [ + { + "query_by": "product_name,embedding", + "q": "desktop copier", + "collection": "products", + "prefix": "false", + "exclude_fields": "embedding", + "per_page": 2 + } + ] + }' +``` + +A search for "desktop copier" might return results like this: + +```json{18,22,32,36} +{ + "hits": [ + { + "document": { + "id": "2", + "product_name": "Desktop" + }, + "highlight": { + "product_name": { + "matched_tokens": ["Desktop"], + "snippet": "Desktop" + } + }, + "hybrid_search_info": { + "rank_fusion_score": 0.8500000238418579 + }, + "text_match": 1060320051, + "text_match_info": { + "best_field_score": "517734" + }, + "vector_distance": 0.510231614112854 + }, + { + "document": { + "id": "3", + "product_name": "Printer" + }, + "hybrid_search_info": { + "rank_fusion_score": 0.30000001192092896 + }, + "text_match": 0, + "text_match_info": { + "best_field_score": "0" + }, + "vector_distance": 0.4459354281425476 + } + ] +} +``` + +Notice how: +- The first result "Desktop" is a keyword match (high text_match score) +- The second result "Printer" is a semantic match (low vector_distance but zero text_match) + +By default: +- Documents found through keyword search but not through vector search will only have a text match score +- Documents found through vector search but not through keyword search will only have a vector distance score + +You can optionally compute both scores for all matches by setting `rerank_hybrid_matches: true`. 
When enabled: +- Documents found only through keyword search will also get a vector distance score +- Documents found only through vector search will also get a text match score + +Example with re-ranking enabled: + +```bash{10} +curl 'http://localhost:8108/multi_search' \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + -X POST \ + -d '{ + "searches": [ + { + "collection": "products", + "query_by": "embedding,product_name", + "q": "desktop copier", + "rerank_hybrid_matches": true, + "vector_query": "embedding:([], alpha: 0.8)", + "exclude_fields": "embedding" + } + ] + }' +``` + +This provides more comprehensive ranking of results by computing both scores for all matches, at the cost of additional computation time. ### Pagination From 347ee476e9d517559f04e001efd0d5899fdf7298 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:23:08 +0200 Subject: [PATCH 04/17] docs(sort_by): add pivot sorting functionality - add documentation for `pivot` sorting parameter - describe ascending and descending pivot sort behavior - include example with timestamp pivot sorting - document use cases and combination with other sort fields --- docs-site/content/28.0/api/search.md | 38 ++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index b41d9bb6..6a5e443f 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -527,6 +527,44 @@ You can combine random sorting with other sort fields: - Different seed values (or no seed) will produce different random orderings ::: +### Sorting with a Pivot Value + +You can sort results relative to a specific pivot value using the `pivot` parameter in `sort_by`. This is particularly useful when you want to order items based on their distance from a reference point. 
+ +For example, if you have timestamps and want to sort based on proximity to a specific timestamp: + +```json +{ + "sort_by": "timestamp(pivot: 1728386250):asc" +} +``` + +This will sort results so that: +- With `asc`: Values closest to the pivot value appear first, followed by values further away +- With `desc`: Values furthest from the pivot value appear first, followed by values closer to it + +Example results when sorting in ascending order relative to pivot value 1728386250: +``` +timestamp: 1728386250 (exact match to pivot) +timestamp: 1728387250 (1000 away from pivot) +timestamp: 1728385250 (1000 away from pivot) +timestamp: 1728384250 (2000 away from pivot) +timestamp: 1728383250 (3000 away from pivot) +``` + +You can combine pivot sorting with other sort fields: + +```json +{ + "sort_by": "timestamp(pivot: 1728386250):asc,popularity:desc" +} +``` + +This feature is useful for: +- Sorting by proximity to a reference date/time +- Organizing numerical values around a target number +- Creating "closer to" style sorting experiences + ## Group Results You can aggregate search results into groups or buckets by specify one or more `group_by` fields. 
From 6b16e7b628e9afaa6a17c63b0a496160e2c85953 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:33:06 +0200 Subject: [PATCH 05/17] docs(stemming): clarify porter stemming behavior - add disclaimer about rules-based stemming limitations - explain potential side effects with brand names and locations - clarify impact on search relevance for specialized content --- docs-site/content/28.0/api/stemming.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs-site/content/28.0/api/stemming.md b/docs-site/content/28.0/api/stemming.md index 810187f5..7bd1f266 100644 --- a/docs-site/content/28.0/api/stemming.md +++ b/docs-site/content/28.0/api/stemming.md @@ -16,7 +16,7 @@ Typesense provides two approaches to handle word variations: ## Basic Stemming -Basic stemming uses the [Snowball stemmer](https://snowballstem.org/) algorithm to automatically detect and handle word variations. This works well for common word patterns in the configured language. +Basic stemming uses the [Snowball stemmer](https://snowballstem.org/) algorithm to automatically detect and handle word variations. Being rules-based, it works well for common word patterns in the configured language, but may produce unintended side effects with brand names, proper nouns, and locations. Since these rules are designed primarily for common nouns, applying them to specialized content like company names or locations can sometimes degrade search relevance. 
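As an illustration, the same rules that correctly fold regular variations can also rewrite proper nouns. The English Snowball rules produce mappings along these lines (illustrative output; verify against the stemmer version your Typesense build ships with):

```
companies -> compani   (intended: plural folded to a common root)
running   -> run       (intended: tense variation folded)
philips   -> philip    (side effect: a brand name loses its trailing "s")
```

For cases like the last one, a custom stemming dictionary gives you exact control over which words are rewritten.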
To enable basic stemming for a field, set `"stem": true` in your collection schema: From dd9c641eddeb6c3113c49b49965e82bd132a652c Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:33:46 +0200 Subject: [PATCH 06/17] docs(sort_by): add decay function sorting - add decay function sorting documentation with gauss, linear, exp functions - include implementation details and parameter descriptions - add examples using timestamp-based decay sorting - document best practices and tips for each decay function --- docs-site/content/28.0/api/search.md | 67 ++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 6a5e443f..97ea1508 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -565,6 +565,73 @@ This feature is useful for: - Organizing numerical values around a target number - Creating "closer to" style sorting experiences +### Decay Function Sorting + +Decay functions allow you to score and sort results based on how far they are from a target value, with the score decreasing according to various mathematical functions. This is particularly useful for: + +- Boosting recent items in time-based sorting +- Implementing distance-based relevance +- Creating smooth falloffs in numeric ranges + +You can use decay functions in the `sort_by` parameter with the following syntax: + +```json +{ + "sort_by": "field_name(origin: value, func: function_name, scale: value, decay: rate):direction" +} +``` + +#### Parameters + +| Parameter | Required | Description | +|-----------|----------|---------------------------------------------------------------------------------------------------------| +| `origin` | Yes | The reference point from which the decay function is calculated. Must be an integer. | +| `func` | Yes | The decay function to use. 
Supported values: `gauss`, `linear`, `exp`, or `diff` | +| `scale` | Yes | The distance from origin at which the score should decay by the decay rate. Must be a non-zero integer. | +| `offset` | No | An offset to apply to the origin point. Must be an integer. Defaults to 0. | +| `decay` | No | The rate at which scores decay, between 0.0 and 1.0. Defaults to 0.5. | + +#### Example + +Let's say you have a collection of products with timestamps and want to sort them giving preference to items closer to a specific date: + +```bash +curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ +"http://localhost:8108/collections/products/documents/search\ +?q=smartphone\ +&query_by=product_name\ +&sort_by=timestamp(origin: 1728385250, func: gauss, scale: 1000, decay: 0.5):desc" +``` + +For a dataset with these records: +```json +{"product_name": "Samsung Smartphone", "timestamp": 1728383250} +{"product_name": "Vivo Smartphone", "timestamp": 1728384250} +{"product_name": "Oneplus Smartphone", "timestamp": 1728385250} +{"product_name": "Pixel Smartphone", "timestamp": 1728386250} +{"product_name": "Moto Smartphone", "timestamp": 1728387250} +``` + +The results would be ordered based on how close each timestamp is to the origin (1728385250), with scores decreasing according to the gaussian function: +1. Oneplus Smartphone (exact match with origin - highest score) +2. Pixel Smartphone (1000 units from origin - decayed score) +3. Vivo Smartphone (1000 units from origin - decayed score) +4. Moto Smartphone (2000 units from origin - further decayed score) +5. 
Samsung Smartphone (2000 units from origin - further decayed score) + +#### Supported Functions + +- `gauss`: Gaussian decay - smooth bell curve falloff +- `linear`: Linear decay - constant rate of decrease +- `exp`: Exponential decay - rapidly decreasing scores +- `diff`: Difference-based decay - simple linear difference from origin + +:::tip +- The `decay` parameter determines how quickly scores decrease - lower values mean faster decay +- Use `gauss` for smooth falloff, `linear` for constant decrease, and `exp` for rapid decrease +- `scale` determines the distance at which scores decay by the specified rate +::: + ## Group Results You can aggregate search results into groups or buckets by specify one or more `group_by` fields. From 7a1bcbe356e49c733e9257564063159c6619e473 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 15:38:18 +0200 Subject: [PATCH 07/17] docs(stemming): add pre-made english plurals dictionary - add section about pre-made stemming dictionaries - include download link for english plurals dictionary - document benefits of using pre-made dictionary vs algorithmic stemming --- docs-site/content/28.0/api/stemming.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs-site/content/28.0/api/stemming.md b/docs-site/content/28.0/api/stemming.md index 7bd1f266..ef7530f8 100644 --- a/docs-site/content/28.0/api/stemming.md +++ b/docs-site/content/28.0/api/stemming.md @@ -43,6 +43,12 @@ The language used for stemming is automatically determined from the `locale` par For cases where you need more precise control over word variations, or when dealing with irregular forms that algorithmic stemming can't handle well, you can use stemming dictionaries. These allow you to define exact mappings between words and their root forms. +### Pre-made Dictionaries + +Typesense provides a pre-made English plurals dictionary that handles common singular/plural variations. 
You can download it [here](https://dl.typesense.org/data/stemming/plurals_en_v1.jsonl). + +This dictionary is particularly useful when you need reliable handling of English plural forms without the potential side effects of algorithmic stemming. + ### Creating a Stemming Dictionary First, create a JSONL file with your word mappings: From 0e07bdfdfe7d354400381bc7634991065be62177 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 16:00:18 +0200 Subject: [PATCH 08/17] docs(fields): add field-level token separators and symbols - add support for field-level `token_separators` and `symbols_to_index` config - update collections schema documentation with field-level parameters - add example in search tips guide - clarify precedence over collection-level settings --- docs-site/content/28.0/api/collections.md | 12 ++++++------ .../tips-for-searching-common-types-of-data.md | 18 ++++++++++++++++++ 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 64ffe2aa..0367c4b2 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -402,12 +402,12 @@ string, then the next document that contains the field named `title` will be exp ### Schema parameters | Parameter | Required | Description |
-|:----------------------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| name | yes | Name of the collection you wish to create. | -| fields | yes | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least it's `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. Or you can also use a RegEx to specify field names matching a pattern. For eg: if you want to specify that all fields starting with `score_` should be an integer, you can set name as `"name": "score_.*"`.

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results)..

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plurals, tense changes) of the same root word. For eg: searching for `walking`, will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled.

Stemming can be enabled via two methods:
  • **Basic Stemming**
    Enable stemming on the contents of the field during indexing and querying by setting `"stem": true`. The actual value stored on disk is not affected.

    We use the [Snowball stemmer](https://snowballstem.org/). Language selection for stemmer is automatically made from the value of the `locale` property associated with the field.

  • **Custom Stemming**
    For more precise control over word variations, you can create a custom stemming dictionary and use it by setting `"stem_dictionary": ""`.

For more details regarding stemming, read the stemming documentation.

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale specific customizations for the following languages. You can enable these customizations for a field, by setting a field called `locale` inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

Here's a non-exhaustive list of language-specific locales:
  • `ja` - Japanese
  • `zh` - Chinese
  • `ko` - Korean
  • `th` - Thai
  • `el` - Greek
  • `ru` - Russian
  • `sr` - Serbian / Cyrillic
  • `uk` - Ukrainian
  • `be` - Belarusian
  • For other languages, please refer to the list of two letter [ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).
Read this guide article for more information regarding Locale-Specific search. | -| token_separators | no | List of symbols or special characters to be used for splitting the text into individual words _**in addition**_ to space and new-line characters.

For e.g. you can add `-` (hyphen) to this list to make a word like `non-stick` to be split on hyphen and indexed as two separate words.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | -| symbols_to_index | no | List of symbols or special characters to be indexed.

For e.g. you can add `+` to this list to make the word `c++` indexable verbatim.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | -| default_sorting_field | no | The name of an `int32 / float` field that determines the order in which the search results are ranked when a `sort_by` clause is not provided during searching.

This field must indicate some kind of popularity. For example, in a product search application, you could define `num_reviews` field as the `default_sorting_field` to rank products that have the most reviews higher by default.

Additionally, when a word in a search query matches multiple possible words (either during a prefix (partial word) search or because of a typo), this parameter is used to rank such equally matching records.

For e.g. Searching for "ap", will match records with "apple", "apply", "apart", "apron", or any of hundreds of similar words that start with "ap" in your dataset. Also, searching for "jofn", will match records with "john", "joan" and all similar variations that are 1-typo away in your dataset.

For performance reasons though, Typesense will only consider the top `4` prefixes or typo variations by default (the `4` is configurable using the [`max_candidates`](./search.md#ranking-and-sorting-parameters) search parameter, which defaults to `4`).

If `default_sorting_field` is NOT specified in the collection schema, then "top" is defined as the prefixes or typo variations with the most number of matching records.

But let's say you have a field called `popularity` in each record, and you want Typesense to use the value in that field to define the "top" records, you'd set that field as `default_sorting_field: popularity`. Typesense will then use the value of that field to fetch the top `max_candidates` number of terms that are most popular, and as users type in more characters, it will refine the search further to always rank the most popular prefixes highest. | +|:----------------------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------| +| name | yes | Name of the collection you wish to create. | +| fields | yes | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least its `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. Or you can also use a RegEx to specify field names matching a pattern. For eg: if you want to specify that all fields starting with `score_` should be an integer, you can set name as `"name": "score_.*"`.

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results).

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plurals, tense changes) of the same root word. For eg: searching for `walking` will also return results with `walk`, `walked`, `walks`, etc. when stemming is enabled.

Stemming can be enabled via two methods:
  • **Basic Stemming**
    Enable stemming on the contents of the field during indexing and querying by setting `"stem": true`. The actual value stored on disk is not affected.

    We use the [Snowball stemmer](https://snowballstem.org/). Language selection for stemmer is automatically made from the value of the `locale` property associated with the field.

  • **Custom Stemming**
    For more precise control over word variations, you can create a custom stemming dictionary and use it by setting `"stem_dictionary": "dictionary_name"`.

For more details regarding stemming, read the [stemming documentation](./stemming.md).

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale specific customizations for the following languages. You can enable these customizations for a field, by setting a field called `locale` inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

Here's a non-exhaustive list of language-specific locales:
  • `ja` - Japanese
  • `zh` - Chinese
  • `ko` - Korean
  • `th` - Thai
  • `el` - Greek
  • `ru` - Russian
  • `sr` - Serbian / Cyrillic
  • `uk` - Ukrainian
  • `be` - Belarusian
  • For other languages, please refer to the list of two letter [ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).
Read this guide article for more information regarding Locale-Specific search.

**Field-level token separators and symbols**

You can configure tokenization at a field level using `token_separators` and `symbols_to_index`.

Eg: ```{"name": "title", "type": "string", "token_separators": ["-"], "symbols_to_index": ["_"]}```

Field-level settings take precedence over collection-level settings. | +| token_separators | no | List of symbols or special characters to be used for splitting the text into individual words _**in addition**_ to space and new-line characters.

For e.g. you can add `-` (hyphen) to this list to make a word like `non-stick` to be split on hyphen and indexed as two separate words.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | +| symbols_to_index | no | List of symbols or special characters to be indexed.

For e.g. you can add `+` to this list to make the word `c++` indexable verbatim.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | +| default_sorting_field | no | The name of an `int32 / float` field that determines the order in which the search results are ranked when a `sort_by` clause is not provided during searching.

This field must indicate some kind of popularity. For example, in a product search application, you could define `num_reviews` field as the `default_sorting_field` to rank products that have the most reviews higher by default.

Additionally, when a word in a search query matches multiple possible words (either during a prefix (partial word) search or because of a typo), this parameter is used to rank such equally matching records.

For e.g. Searching for "ap" will match records with "apple", "apply", "apart", "apron", or any of hundreds of similar words that start with "ap" in your dataset. Also, searching for "jofn" will match records with "john", "joan" and all similar variations that are 1-typo away in your dataset.

For performance reasons though, Typesense will only consider the top `4` prefixes or typo variations by default (the `4` is configurable using the [`max_candidates`](./search.md#ranking-and-sorting-parameters) search parameter, which defaults to `4`).

If `default_sorting_field` is NOT specified in the collection schema, then "top" is defined as the prefixes or typo variations with the most number of matching records.

But let's say you have a field called `popularity` in each record, and you want Typesense to use the value in that field to define the "top" records, you'd set that field as `default_sorting_field: popularity`. Typesense will then use the value of that field to fetch the top `max_candidates` number of terms that are most popular, and as users type in more characters, it will refine the search further to always rank the most popular prefixes highest. | ### Field parameters diff --git a/docs-site/content/guide/tips-for-searching-common-types-of-data.md b/docs-site/content/guide/tips-for-searching-common-types-of-data.md index 50bc56f5..85d57b30 100644 --- a/docs-site/content/guide/tips-for-searching-common-types-of-data.md +++ b/docs-site/content/guide/tips-for-searching-common-types-of-data.md @@ -55,6 +55,24 @@ You can do this by setting Date: Thu, 30 Jan 2025 16:08:05 +0200 Subject: [PATCH 09/17] docs(sort_by): add text match score bucketing - add `bucket_size` parameter for text match score sorting - implement grouping of results into relevance buckets - add examples demonstrating bucketing with secondary sort criteria - document bucket size behavior and best practices --- docs-site/content/28.0/api/search.md | 39 ++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 97ea1508..793dd0bb 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -632,6 +632,45 @@ The results would be ordered based on how close each timestamp is to the origin - `scale` determines the distance at which scores decay by the specified rate ::: +### Text Match Score Bucketing + +When sorting by text match score (`_text_match`), you can optionally group results into buckets of similar text match scores and then apply additional sorting within each bucket. 
This is useful when you want to maintain approximate relevance groupings while applying secondary sorting criteria. + +You can enable this by using the `bucket_size` parameter: + +```json +{ + "sort_by": "_text_match(bucket_size: 3):desc,points:desc" +} +``` + +For example, if you search for "mark" against these records: +```json +[ + {"title": "Mark Antony", "points": 100}, + {"title": "Marks Spencer", "points": 200}, + {"title": "Mark Twain", "points": 100}, + {"title": "Mark Payne", "points": 300}, + {"title": "Marks Henry", "points": 200}, + {"title": "Mark Aurelius", "points": 200} +] +``` + +With `bucket_size: 3`, Typesense will: +1. First group the results into buckets of 3 records based on text match scores +2. Then within each bucket, sort by points in descending order + +So records with similar text match relevance stay together, while being ordered by points within their relevance group. + +The `bucket_size` parameter accepts: +- Any positive integer: Groups results into buckets of that size +- `0`: Disables bucketing (default behavior) + +:::tip +- A larger bucket size means more emphasis on the secondary sort field +- When `bucket_size` is larger than the number of results, no bucketing occurs +::: + ## Group Results You can aggregate search results into groups or buckets by specify one or more `group_by` fields. 
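Assembled into a full search request, the bucketing setup above can be sketched with a curl call like the one below. The collection name `titles` is an assumption based on the sample records:

```bash
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/collections/titles/documents/search\
?q=mark\
&query_by=title\
&sort_by=_text_match(bucket_size: 3):desc,points:desc"
```

Records whose text match scores fall into the same bucket of 3 are then ranked among themselves by `points` in descending order.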
From 43b16a248119f2372c374485647f25f61ef69918 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 16:14:11 +0200 Subject: [PATCH 10/17] docs(collections): add collection truncate operation - add truncate collection endpoint documentation - implement code examples in all supported languages - include sample response format - explain difference between truncate and delete operations --- docs-site/content/28.0/api/collections.md | 85 +++++++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 0367c4b2..647dab40 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -932,6 +932,91 @@ via the [API](../api/cluster-operations.md#compacting-the-on-disk-database). **Definition** `DELETE ${TYPESENSE_HOST}/collections/:collection` +## Truncate a collection + +You can remove all documents from a collection while keeping the collection and schema intact by using the truncate operation. + + + + + + + + + + + + + +**Sample Response** + + + + + +**Definition** +`DELETE ${TYPESENSE_HOST}/collections/:collection/documents?truncate=true` + +The response includes the number of documents that were deleted in the `num_deleted` field. For an empty collection, this value will be 0. + ## Update or alter a collection Typesense supports adding or removing fields to a collection's schema in-place. 
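Going by the endpoint definition above, a truncate request can be sketched with curl as follows; the collection name `products` is an assumption:

```bash
curl -X DELETE \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/collections/products/documents?truncate=true"
```

The `num_deleted` field in the response then reports how many documents were removed, while the collection and its schema remain in place.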
From b19a8b979b1744e9c51d7408c735a025b22960d6 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 16:27:40 +0200 Subject: [PATCH 11/17] docs(geo-poly): add support for geographic polygons - add new `geopolygon` field type - implement polygon area storage and point-in-polygon queries - update field types documentation with geopolygon details - add examples of creating and searching polygon territories --- docs-site/content/28.0/api/collections.md | 39 +++++----- docs-site/content/28.0/api/geosearch.md | 94 ++++++++++++++++++++++- 2 files changed, 113 insertions(+), 20 deletions(-) diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 647dab40..7649cbbc 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -432,25 +432,26 @@ string, then the next document that contains the field named `title` will be exp Typesense allows you to index the following types of fields: -| `type` | Description | -|:-------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `string` | String values | -| `string[]` | Array of strings | -| `int32` | Integer values up to 2,147,483,647 | -| `int32[]` | Array of `int32` | -| `int64` | Integer values larger than 2,147,483,647 | -| `int64[]` | Array of `int64` | -| `float` | Floating point / decimal numbers | -| `float[]` | Array of floating point / decimal numbers | -| `bool` | `true` or `false` | -| `bool[]` | Array of booleans | -| `geopoint` | Latitude and longitude specified as `[lat, lng]`. Read more [here](geosearch.md). | -| `geopoint[]` | Arrays of Latitude and longitude specified as `[[lat1, lng1], [lat2, lng2]]`. Read more [here](geosearch.md). | -| `object` | Nested objects. Read more [here](#indexing-nested-fields). | -| `object[]` | Arrays of nested objects. 
Read more [here](#indexing-nested-fields). | -| `string*` | Special type that automatically converts values to a `string` or `string[]`. | -| `image` | Special type that is used to indicate a base64 encoded string of an image used for [Image search](./image-search.md). | -| `auto` | Special type that automatically attempts to infer the data type based on the documents added to the collection. See [automatic schema detection](#with-auto-schema-detection). | +| `type` | Description | +|:-------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `string` | String values | +| `string[]` | Array of strings | +| `int32` | Integer values up to 2,147,483,647 | +| `int32[]` | Array of `int32` | +| `int64` | Integer values larger than 2,147,483,647 | +| `int64[]` | Array of `int64` | +| `float` | Floating point / decimal numbers | +| `float[]` | Array of floating point / decimal numbers | +| `bool` | `true` or `false` | +| `bool[]` | Array of booleans | +| `geopoint` | Latitude and longitude specified as `[lat, lng]`. Read more [here](geosearch.md). | +| `geopoint[]` | Arrays of Latitude and longitude specified as `[[lat1, lng1], [lat2, lng2]]`. Read more [here](geosearch.md). | +| `geopolygon` | Geographic polygon defined by an array of coordinates specified as `[lat1, lng1, lat2, lng2, ...]`. Latitude/longitude pairs must be in counter-clockwise (CCW) or clockwise (CW) order. Read more here. | +| `object` | Nested objects. Read more [here](#indexing-nested-fields). | +| `object[]` | Arrays of nested objects. Read more [here](#indexing-nested-fields). | +| `string*` | Special type that automatically converts values to a `string` or `string[]`. 
| +| `image` | Special type that is used to indicate a base64 encoded string of an image used for [Image search](./image-search.md). | +| `auto` | Special type that automatically attempts to infer the data type based on the documents added to the collection. See [automatic schema detection](#with-auto-schema-detection). | ### Cloning a collection schema diff --git a/docs-site/content/28.0/api/geosearch.md b/docs-site/content/28.0/api/geosearch.md index a90b6dee..112c029d 100644 --- a/docs-site/content/28.0/api/geosearch.md +++ b/docs-site/content/28.0/api/geosearch.md @@ -425,6 +425,98 @@ You want to specify the geo-points of the polygon as lat, lng pairs. 'filter_by' : 'location:(48.8662, 2.3255, 48.8581, 2.3209, 48.8561, 2.3448, 48.8641, 2.3469)' ``` +## Geographic Polygons + +You can also store polygonal geographic areas using the `geopolygon` field type and then check if points fall within these areas. + +### Creating a Collection with Geopolygons + +Let's create a collection with a field to store polygon areas: + + + + + +### Adding Polygon Areas + +Add documents containing polygon areas by specifying the coordinates in counter-clockwise (CCW) or clockwise (CW) order: + + + + + +:::warning NOTE +Coordinates must be specified in proper CCW or CW order to form a valid polygon. Incorrect ordering will result in an error. +::: + +### Searching Points in Polygons + +You can search for documents whose polygon areas contain a specific point: + + + + + +This will return all polygons that contain the point (0.5, 0.5). 
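To build intuition for what the `geopolygon` filter above computes, here is a self-contained ray-casting sketch of a point-in-polygon test using the same flat `[lat1, lng1, lat2, lng2, ...]` coordinate layout. This only illustrates the semantics — it is not Typesense's internal implementation.

```python
def point_in_polygon(lat, lng, coords):
    """coords is a flat [lat1, lng1, lat2, lng2, ...] list, CW or CCW."""
    pts = [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]
    inside = False
    j = len(pts) - 1
    for i in range(len(pts)):
        lat_i, lng_i = pts[i]
        lat_j, lng_j = pts[j]
        # Count how many polygon edges a ray cast from the point crosses;
        # an odd crossing count means the point lies inside.
        if (lat_i > lat) != (lat_j > lat):
            lng_cross = (lng_j - lng_i) * (lat - lat_i) / (lat_j - lat_i) + lng_i
            if lng < lng_cross:
                inside = not inside
        j = i
    return inside

# Unit square with corners (0,0), (1,0), (1,1), (0,1):
square = [0, 0, 1, 0, 1, 1, 0, 1]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(2.0, 0.5, square))  # False
```

A query point like `(0.5, 0.5)` from the example above falls inside this square, so a document storing it would match.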
+ +**Sample Response** + + + + + ## Sorting by Additional Attributes within a Radius ### exclude_radius @@ -448,4 +540,4 @@ Similarly, you can bucket all geo points into "groups" using the `precision` par 'sort_by' : 'location(48.853, 2.344, precision: 2mi):asc, popularity:desc' ``` -This will bucket the results into 2-mile groups and force records within each bucket into a tie for "geo score", so that the popularity metric can be used to tie-break and sort results within each bucket. \ No newline at end of file +This will bucket the results into 2-mile groups and force records within each bucket into a tie for "geo score", so that the popularity metric can be used to tie-break and sort results within each bucket. From 64667909317cd85116a0f47e7eeca01ea0c28029 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 16:51:09 +0200 Subject: [PATCH 12/17] docs(search): add max_filter_by_candidates parameter - add control over fuzzy filter_by candidates limit - update documentation for filter parameters - add parameter description and default value - document use case for prefix filtering control --- docs-site/content/28.0/api/search.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 793dd0bb..f356ab86 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -222,10 +222,11 @@ When a `string[]` field is queried, the `highlights` structure will include the ### Filter parameters -| Parameter | Required | Description | 
-|:-------------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Parameter | Required | Description | +|:-------------------|:---------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | filter_by | no | Filter conditions for refining your search results.

A field can be matched against one or more values.

Examples:
- `country: USA`
- `country: [USA, UK]` returns documents that have `country` of `USA` OR `UK`.

**Exact vs Non-Exact Filtering:**
To match a string field's full value verbatim, you can use the `:=` (exact match) operator. For eg: `category := Shoe` will match documents with `category` set as `Shoe` and not documents with a `category` field set as `Shoe Rack`.

Using the `:` (non-exact) operator will do a word-level partial match on the field, without taking token position into account (so it is usually faster). Eg: `category:Shoe` will match records with `category` of `Shoe` or `Shoe Rack` or `Outdoor Shoe`.

Tip: If you have a field that doesn't have any spaces in the values across any documents and want to filter on it, you should use the `:` operator to improve performance, since it will avoid doing token position checks.

**Escaping Special Characters:**
You can also filter using multiple values and use the backtick character to denote a string literal: category:= [\`Running Shoes, Men\`, \`Sneaker (Men)\`, Boots].

**Negation:**
Not equals / negation is supported via the `:!=` operator, e.g. `author:!=JK Rowling` or `id:!=[id1, id2]`. You can also negate multiple values: `author:!=[JK Rowling, Gilbert Patten]`

To exclude results that _contain_ a specific string during filtering, you can use `artist:! Jackson`, which will exclude all documents whose `artist` field value contains the word `jackson`.

**Numeric Filtering:**
Filter documents with numeric values between a min and max value, using the range operator `[min..max]` or using simple comparison operators `>`, `>=`, `<`, `<=`, `=`.

You can enable `"range_index": true` on the numerical field schema for fast range queries (will incur additional memory usage for the index though).

Examples:
- `num_employees:<40`
- `num_employees:[10..100]`
- `num_employees:[<10, >100]`
- `num_employees:[10..100, 140]` (Filter docs where value is between 10 and 100, or exactly 140).
- `num_employees:!= [10, 100, 140]` (Filter docs where value is **NOT** 10, 100 or 140).

**Multiple Conditions:**
You can separate multiple conditions with the `&&` operator.

Examples:
- `num_employees:>100 && country: [USA, UK]`
- `categories:=Shoes && categories:=Outdoor`

To do ORs across _different_ fields (eg: color is blue OR category is Shoe), you can use the `||` operator.

Examples:
- `color: blue || category: shoe`
- `(color: blue || category: shoe) && in_stock: true`

**Filtering Arrays:**
filter_by can be used with array fields as well.

For eg: If `genres` is a `string[]` field:

- `genres:=[Rock, Pop]` will return documents where the `genres` array field contains `Rock OR Pop`.
- `genres:=Rock && genres:=Acoustic` will return documents where the `genres` array field contains both `Rock AND Acoustic`.

**Prefix filtering:**
You can filter on records that begin with a given prefix string like this:

`company_name: Acm*`

This will return documents where any of the words in the `company_name` field begin with `acm`, e.g. a name like `Corporation of Acme`.

You can combine the field-level match operator `:=` with prefix filtering like this:

`name := S*`

This will return documents that have `name: Steve Jobs` but not documents that have `name: Adam Stator`.

**Geo Filtering:**
Read more about [GeoSearch and filtering](geosearch.md) in this dedicated section.

**Embedding Filters in API Keys:**
You can embed the `filter_by` parameter (or parts of it) in a Scoped Search API Key to set up conditional access control for documents and/or enforce filters for any search requests that use that API key. Read more about [Scoped Search API Key](api-keys.md#generate-scoped-search-key) in this dedicated section. | -| enable_lazy_filter | no | Applies the filtering operation incrementally / lazily. Set this to `true` when you are potentially filtering on large values but the tokens in the query are expected to match very few documents. Default: `false`. | +| enable_lazy_filter | no | Applies the filtering operation incrementally / lazily. Set this to `true` when you are potentially filtering on large values but the tokens in the query are expected to match very few documents. Default: `false`. | +| max_filter_by_candidates | no | Controls the number of similar words that Typesense considers during fuzzy search on `filter_by` values. Useful for controlling prefix matches like `company_name:Acm*`. Default: 4. 
| ### Ranking and Sorting parameters From b1eb0f4b8197955b35e02b29eddb77900b3c77e3 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 16:55:50 +0200 Subject: [PATCH 13/17] docs(collections): add schema change status endpoint - add GET /operations/schema_changes endpoint documentation - include sample response showing progress metrics - document validation and alteration status tracking - explain empty response behavior for no changes --- docs-site/content/28.0/api/collections.md | 43 +++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 7649cbbc..8bec5686 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -1249,6 +1249,49 @@ curl "http://localhost:8108/collections/companies" \ }' ``` +### Get Schema Change Status + +You can check the status of in-progress schema change operations by using the schema changes endpoint. + + + + + +If no schema changes are in progress, you'll get an empty response. 
When a schema change is in progress, you'll get details about the operation:
+
+
+
+
+
+The response shows:
+- Which collection is being altered
+- Number of documents validated against the new schema
+- Number of documents that have been altered
+
+**Definition**
+`GET ${TYPESENSE_HOST}/operations/schema_changes`
+
 ### Using an alias
 
 If you need to do zero-downtime schema changes, you could also re-create the collection fully with the updated schema and use

From bb2ebf542a134345c03b952418a87606ae787d25 Mon Sep 17 00:00:00 2001
From: Fanis Tharropoulos
Date: Thu, 30 Jan 2025 17:03:58 +0200
Subject: [PATCH 14/17] docs(vector): add remote model API key update support

- add API endpoint for updating embedding model API keys
- document PATCH request format for key updates
- include example for OpenAI embedding model
- add warning about required field parameters
---
 docs-site/content/28.0/api/vector-search.md | 34 +++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/docs-site/content/28.0/api/vector-search.md b/docs-site/content/28.0/api/vector-search.md
index f912891c..7b891817 100644
--- a/docs-site/content/28.0/api/vector-search.md
+++ b/docs-site/content/28.0/api/vector-search.md
@@ -1390,6 +1390,40 @@ curl 'http://localhost:8108/collections' \
 
 **Note:** The only supported model is `embedding-gecko-001` for now.
 
+### Updating Remote Model API Key
+
+You can update the API key used for remote embedding models (like OpenAI) without recreating the collection:
+
+
+
+
+
+:::warning
+Note: All field parameters (`name`, `embed.from`, and `model_config`) must be included in the update request.
+:::
 ### Using GCP Vertex AI API
 
 This API also provided by Google under the Google Cloud Platform (GCP) umbrella.
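Since the example tabs in the patch above are empty in this export, here is a hedged sketch of what such a PATCH request could look like. The collection name, field name, model name, and the exact payload shape are illustrative assumptions — only the requirement from the warning (include `name`, `embed.from`, and `model_config`) is taken from the text.

```python
import json
import urllib.request

# Hypothetical payload: re-declare the embedding field with its full embed
# config so a new api_key can take effect. Field/model names are placeholders
# and the payload shape is an assumption, not the official API reference.
payload = {
    "fields": [
        {
            "name": "embedding",
            "embed": {
                "from": ["title"],
                "model_config": {
                    "model_name": "openai/text-embedding-3-small",
                    "api_key": "NEW_OPENAI_API_KEY",
                },
            },
        }
    ]
}

req = urllib.request.Request(
    "http://localhost:8108/collections/products",
    data=json.dumps(payload).encode(),
    method="PATCH",
    headers={"X-TYPESENSE-API-KEY": "YOUR_API_KEY"},
)
print(req.get_method(), req.full_url)
```

Per the warning, omitting `name`, `embed.from`, or the `model_config` parameters from the update request would make it invalid.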
From ca96eeed503761d3ea14da734aa1a56a804e5d32 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Thu, 30 Jan 2025 17:08:28 +0200 Subject: [PATCH 15/17] docs(vector): clarify distance metrics behavior - document default cosine similarity metric - explain distance_threshold behavior in different contexts - add details about sorting with distance thresholds - include examples of threshold usage --- docs-site/content/28.0/api/vector-search.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/docs-site/content/28.0/api/vector-search.md b/docs-site/content/28.0/api/vector-search.md index 7b891817..2ae38f10 100644 --- a/docs-site/content/28.0/api/vector-search.md +++ b/docs-site/content/28.0/api/vector-search.md @@ -1424,6 +1424,7 @@ curl "http://localhost:8108/collections/products" \ :::warning Note: All fields parameters (`name`, `embed.from`, and `model_config` parameters) must be included in the update request. ::: + ### Using GCP Vertex AI API This API also provided by Google under the Google Cloud Platform (GCP) umbrella. @@ -2997,6 +2998,24 @@ You can set a custom `ef` via the `vector_query` parameter (default value is `10 } ``` +## Distance Metrics + +By default, Typesense uses cosine similarity as the distance metric for vector search. When you use a `distance_threshold` parameter, documents with cosine distances larger than the threshold will: + +- In standalone vector search: be excluded from results +- When used in sorting (`sort_by`): get the maximum possible distance score but remain in results + +You can use this with both cosine similarity and inner product distance metrics. For example: + +```json +{ + "vector_query": "embedding:([], distance_threshold: 0.30)" +} +``` + +This helps filter out less relevant results while still allowing other sort conditions to take effect. 
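As a sanity check on the threshold behavior described above, here is a small self-contained sketch: cosine distance is computed per document, documents beyond the threshold are dropped in standalone vector search, and kept-but-penalized when the distance is used inside `sort_by`. The vectors and the `MAX_DIST` penalty value are illustrative assumptions; the exact "maximum possible distance score" Typesense assigns is internal.

```python
import math

# Cosine distance = 1 - cosine similarity (Typesense's default metric).
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}  # made-up embeddings
dists = {k: cosine_distance(query, v) for k, v in docs.items()}

# Standalone vector search: documents beyond distance_threshold are excluded.
kept = [k for k, d in dists.items() if d <= 0.30]

# Inside sort_by: such documents stay in the results but receive the worst
# possible score (MAX_DIST here is illustrative; cosine distance <= 2).
MAX_DIST = 2.0
scores = {k: (d if d <= 0.30 else MAX_DIST) for k, d in dists.items()}
print(kept)    # ['doc_a']
print(scores)
```

`doc_b` is orthogonal to the query (distance 1.0 > 0.30), so it is dropped in the standalone case but merely demoted when other sort conditions are in play.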
+``` + ## Vector Search Parameters Here are all the possible parameters you can use inside the `vector_query` search parameter, that we've covered in the various sections above: From 87b0b912bcbcd0cec50458abea682bf03345e698 Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Fri, 31 Jan 2025 10:01:04 +0200 Subject: [PATCH 16/17] docs(search): enhance documentation for text match score bucketing - Add explanation of `buckets` and `bucket_size` parameters in API docs - Restructure ranking documentation for better readability - Add detailed examples for both bucketing approaches --- docs-site/content/28.0/api/search.md | 40 ++++++++++++++----- .../content/guide/ranking-and-relevance.md | 35 +++++++++++++--- 2 files changed, 60 insertions(+), 15 deletions(-) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index f356ab86..61fa8e23 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -633,11 +633,31 @@ The results would be ordered based on how close each timestamp is to the origin - `scale` determines the distance at which scores decay by the specified rate ::: -### Text Match Score Bucketing +## Text Match Score Bucketing -When sorting by text match score (`_text_match`), you can optionally group results into buckets of similar text match scores and then apply additional sorting within each bucket. This is useful when you want to maintain approximate relevance groupings while applying secondary sorting criteria. +When sorting by text match score (`_text_match`), Typesense offers two different approaches to bucket your results: `buckets` and `bucket_size`. Both parameters allow you to group results with similar relevance scores together before applying secondary sorting criteria. 
-You can enable this by using the `bucket_size` parameter: +### Using the `buckets` Parameter + +The `buckets` parameter divides your results into a specified number of equal-sized groups: + +```json +{ + "sort_by": "_text_match(buckets: 10):desc,weighted_score:desc" +} +``` + +This approach: +1. Takes all matching results +2. Divides them into the specified number of equal-sized buckets (e.g., 10 buckets) +3. Forces all results within each bucket to have the same text match score +4. Applies the secondary sort criteria (e.g., `weighted_score`) within each bucket + +For example, if you have 100 results and specify `buckets: 10`, each bucket will contain 10 results that will be treated as having equal relevance, then sorted by the secondary criterion. + +### Using the `bucket_size` Parameter + +Alternatively, the `bucket_size` parameter groups results into fixed-size buckets: ```json { @@ -658,19 +678,21 @@ For example, if you search for "mark" against these records: ``` With `bucket_size: 3`, Typesense will: -1. First group the results into buckets of 3 records based on text match scores -2. Then within each bucket, sort by points in descending order +1. Group the results into buckets of 3 records based on text match scores +2. Within each bucket, sort by points in descending order -So records with similar text match relevance stay together, while being ordered by points within their relevance group. +This ensures that records with similar text match relevance stay together while being ordered by points within their relevance group. 
The `bucket_size` parameter accepts: - Any positive integer: Groups results into buckets of that size - `0`: Disables bucketing (default behavior) -:::tip -- A larger bucket size means more emphasis on the secondary sort field +### Choosing Between `buckets` and `bucket_size` + +- Use `buckets` when you want to ensure a specific number of relevance groups, regardless of the total number of results +- Use `bucket_size` when you want to maintain consistent bucket sizes, regardless of the total number of results +- A larger number of `buckets` or larger `bucket_size` means more emphasis on the secondary sort field - When `bucket_size` is larger than the number of results, no bucketing occurs -::: ## Group Results diff --git a/docs-site/content/guide/ranking-and-relevance.md b/docs-site/content/guide/ranking-and-relevance.md index b9af7c4a..127174c4 100644 --- a/docs-site/content/guide/ranking-and-relevance.md +++ b/docs-site/content/guide/ranking-and-relevance.md @@ -87,13 +87,14 @@ If you wish to sort the documents strictly by an indexed numerical or string fie ## Ranking based on Relevance and Popularity If you have a popularity score for your documents that you have either: - 1) calculated on your end in your application using any formula of your choice or 2) calculated using a counter analytics rule in Typesense -You can have Typesense mix your custom scores with the text relevance score it calculates, so results that are more popular (as defined by your custom score) are boosted more in ranking. +You can have Typesense mix your custom scores with the text relevance score it calculates, so results that are more popular (as defined by your custom score) are boosted more in ranking. 
There are two approaches you can use: + +### Using Fixed Number of Buckets -Here's the search parameter to achieve this: +Here's how to divide results into a specific number of relevance groups: ```json { @@ -104,7 +105,6 @@ Here's the search parameter to achieve this: Where `weighted_score` is a field in your document with your custom score. This will do the following: - 1. Fetch all results matching the query 2. Sort them by text relevance (text match score desc) 3. Divide the results into equal-sized 10 buckets (with the first bucket containing the most relevant results) @@ -112,8 +112,31 @@ This will do the following: 5. This will cause a tie inside each bucket, and then the `weighted_score` will be used to break the tie and re-rank results within each bucket. The higher the number of buckets, the more granular the re-ranking based on your weighted score will be. -For eg, if you have 100 results, and `buckets: 50`, then each bucket will have 2 results, those two results within each bucket will be re-ranked based on your `weighted_score`. +For example, if you have 100 results, and `buckets: 50`, then each bucket will have 2 results, those two results within each bucket will be re-ranked based on your `weighted_score`. + +### Using Fixed Bucket Size + +Alternatively, you can group results into fixed-size relevance groups: + +```json +{ + "sort_by": "_text_match(bucket_size: 3):desc,weighted_score:desc" +} +``` + +This approach will: +1. Fetch all results matching the query +2. Sort them by text relevance +3. Group results into fixed-size buckets (e.g., 3 results per bucket) +4. Apply the `weighted_score` sorting within each fixed-size bucket + +For example, if you have 100 results and `bucket_size: 3`, Typesense will create approximately 33 buckets with 3 results each. Each group of 3 results with similar text relevance will be sorted by their `weighted_score`. 
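The mechanics of both approaches can be simulated in a few lines. This toy sketch (made-up data, not Typesense code) shows how results already sorted by text match get re-ranked by `weighted_score` inside each bucket:

```python
# Toy re-ranking simulation: `buckets` divides the result list into that many
# groups, `bucket_size` uses fixed-size groups; within a group the text-match
# scores are tied, so weighted_score decides the order.
def rerank(results, *, buckets=None, bucket_size=None):
    """results: list of (doc_id, weighted_score), pre-sorted by text match."""
    if buckets:
        bucket_size = max(1, len(results) // buckets)
    out = []
    for start in range(0, len(results), bucket_size):
        chunk = results[start:start + bucket_size]
        # Text-match scores tie within a bucket; sort by weighted_score desc.
        out.extend(sorted(chunk, key=lambda r: r[1], reverse=True))
    return out

results = [("a", 10), ("b", 50), ("c", 30), ("d", 99)]  # text-match order
print(rerank(results, bucket_size=2))
# [('b', 50), ('a', 10), ('d', 99), ('c', 30)]
```

Note that `buckets: 2` over these 4 results is equivalent to `bucket_size: 2`, while `bucket_size: 4` collapses everything into one bucket and sorts purely by `weighted_score`.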
+ +### Choosing Between Approaches +- Use `buckets` when you want a specific number of relevance groups +- Use `bucket_size` when you want to ensure a consistent number of results are compared by popularity at a time +- Both approaches help you balance between text relevance and popularity in your search results ## Ranking based on Relevance and Recency A common need is to rank results have been published recently higher than older results. @@ -212,4 +235,4 @@ In such cases, you can have Typesense automatically drop words / tokens from the This behavior is controlled by the `drop_tokens_threshold` search parameter, which has a default value of `1`. This means that if a search query only returns 1 or 0 results, Typesense will start dropping search keywords and repeat the search until at least 1 result is found. -To turn this behavior off, set `drop_tokens_threshold=0` \ No newline at end of file +To turn this behavior off, set `drop_tokens_threshold=0` From 98d63d03cb31f2c145bd807560ab11578e8620db Mon Sep 17 00:00:00 2001 From: Fanis Tharropoulos Date: Fri, 31 Jan 2025 12:44:40 +0200 Subject: [PATCH 17/17] docs(buckets): fix secondary field emphasis mention --- docs-site/content/28.0/api/search.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 61fa8e23..06f1a062 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -691,7 +691,7 @@ The `bucket_size` parameter accepts: - Use `buckets` when you want to ensure a specific number of relevance groups, regardless of the total number of results - Use `bucket_size` when you want to maintain consistent bucket sizes, regardless of the total number of results -- A larger number of `buckets` or larger `bucket_size` means more emphasis on the secondary sort field +- A smaller number of `buckets` or larger `bucket_size` means more emphasis on the secondary sort field - When `bucket_size` is 
larger than the number of results, no bucketing occurs ## Group Results