diff --git a/docs-site/content/.vuepress/config.js b/docs-site/content/.vuepress/config.js index 27509087..1c45360d 100644 --- a/docs-site/content/.vuepress/config.js +++ b/docs-site/content/.vuepress/config.js @@ -315,6 +315,7 @@ let config = { ['/28.0/api/curation', 'Curation'], ['/28.0/api/collection-alias', 'Collection Alias'], ['/28.0/api/synonyms', 'Synonyms'], + ['/28.0/api/stemming', 'Stemming'], ['/28.0/api/stopwords', 'Stopwords'], ['/28.0/api/cluster-operations', 'Cluster Operations'], ], diff --git a/docs-site/content/28.0/api/collections.md b/docs-site/content/28.0/api/collections.md index 6099c4ff..8bec5686 100644 --- a/docs-site/content/28.0/api/collections.md +++ b/docs-site/content/28.0/api/collections.md @@ -402,12 +402,12 @@ string, then the next document that contains the field named `title` will be exp ### Schema parameters | Parameter | Required | Description | -|:----------------------|:---------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| name | yes | Name of the collection you wish to create. | -| fields | yes | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least it's `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. Or you can also use a RegEx to specify field names matching a pattern. For eg: if you want to specify that all fields starting with `score_` should be an integer, you can set name as `"name": "score_.*"`.

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results)..

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plurals, tense changes) of the same root word. For eg: searching for `walking`, will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled.

Enable stemming on the contents of the field during indexing and querying by setting `"stem": true`. The actual value stored on disk is not affected.

We use the [Snowball stemmer](https://snowballstem.org/). Language selection for stemmer is automatically made from the value of the `locale` property associated with the field.

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale specific customizations for the following languages. You can enable these customizations for a field, by setting a field called `locale` inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

Here's a non-exhaustive list of language-specific locales: Read this guide article for more information regarding Locale-Specific search. | -| token_separators | no | List of symbols or special characters to be used for splitting the text into individual words _**in addition**_ to space and new-line characters.

For e.g. you can add `-` (hyphen) to this list to make a word like `non-stick` to be split on hyphen and indexed as two separate words.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | -| symbols_to_index | no | List of symbols or special characters to be indexed.

For e.g. you can add `+` to this list to make the word `c++` indexable verbatim.

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | -| default_sorting_field | no | The name of an `int32 / float` field that determines the order in which the search results are ranked when a `sort_by` clause is not provided during searching.

This field must indicate some kind of popularity. For example, in a product search application, you could define `num_reviews` field as the `default_sorting_field` to rank products that have the most reviews higher by default.

Additionally, when a word in a search query matches multiple possible words (either during a prefix (partial word) search or because of a typo), this parameter is used to rank such equally matching records.

For e.g. Searching for "ap", will match records with "apple", "apply", "apart", "apron", or any of hundreds of similar words that start with "ap" in your dataset. Also, searching for "jofn", will match records with "john", "joan" and all similar variations that are 1-typo away in your dataset.

For performance reasons though, Typesense will only consider the top `4` prefixes or typo variations by default (the `4` is configurable using the [`max_candidates`](./search.md#ranking-and-sorting-parameters) search parameter, which defaults to `4`).

If `default_sorting_field` is NOT specified in the collection schema, then "top" is defined as the prefixes or typo variations with the most number of matching records.

But let's say you have a field called `popularity` in each record, and you want Typesense to use the value in that field to define the "top" records, you'd set that field as `default_sorting_field: popularity`. Typesense will then use the value of that field to fetch the top `max_candidates` number of terms that are most popular, and as users type in more characters, it will refine the search further to always rank the most popular prefixes highest. | +|:----------------------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------| +| name                  | yes      | Name of the collection you wish to create. | +| fields                | yes      | A list of fields that you wish to index for [querying](./search.md#query-parameters), [filtering](./search.md#filter-results), [faceting](./search.md#facet-results), [grouping](./search.md#group-results) and [sorting](./search.md#sort-results). For each field, you have to specify at least its `name` and [`type`](#field-types).

Eg: ```{"name": "title", "type": "string", "facet": false, "index": true}```

`name` can be a simple string like `"name": "score"`. Alternatively, you can use a RegEx to specify field names matching a pattern. For example, to specify that all fields starting with `score_` should be integers, you can set the name as `"name": "score_.*"`.
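To make the pattern concrete, here is a sketch of a schema combining a plain field name with a RegEx field name (the collection name `products` and the field names are purely illustrative):

```python
import json

# Illustrative schema: "title" is a plain field name, while "score_.*"
# is a RegEx name that makes every field starting with "score_" an int32.
schema = {
    "name": "products",  # assumed collection name
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "score_.*", "type": "int32"},
    ],
}
print(json.dumps(schema, indent=2))
```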

**Declaring a field as optional**
A field can be declared as optional by setting `"optional": true`.

**Declaring a field as a facet**
A field can be declared as a facetable field by setting `"facet": true`. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, `color` and `brand` could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the [`facet_by` search parameter](./search.md#facet-results).

**Enabling stemming**
Stemming allows you to handle common word variations (singular / plural, tense changes) of the same root word. For example, when stemming is enabled, searching for `walking` will also return results containing `walk`, `walked`, `walks`, etc.

Stemming can be enabled via two methods; for more details, read the [stemming documentation](./stemming.md).
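As an illustration, a field definition enabling per-field stemming via the `stem` property might look like this (the field name `description` is an assumption for the example):

```python
import json

# Illustrative field definition: with "stem": true, queries for
# "walking" also match documents containing "walk", "walked", "walks".
field = {
    "name": "description",  # assumed field name
    "type": "string",
    "stem": True,
}
print(json.dumps(field))
```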

**Declaring a field as un-indexed**
You can set a field as un-indexed (you can't search/sort/filter/facet on it) by setting `"index": false`. This is useful when used along with [auto schema detection](#with-auto-schema-detection) and you need to [exclude certain fields from indexing](#indexing-all-but-some-fields).

**Prevent field from being stored on disk**:
Set `"store": false` to ensure that a field value is removed from the document before the document is saved to disk.

**Configuring language-specific tokenization:**
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale-specific customizations for several languages. You can enable these customizations for a field by setting a `locale` property inside the field definition. Eg: `{name: 'title', type: 'string', locale: 'ja'}` will enable the Japanese locale customizations for the field named `title`.

If you are looking to retain the diacritics, setting the `locale` for your language will help.

Read this guide article for more information regarding locale-specific search.

**Field-level token separators and symbols**

You can configure tokenization at a field level using `token_separators` and `symbols_to_index`.

Eg: ```{"name": "title", "type": "string", "token_separators": ["-"], "symbols_to_index": ["_"]}```

Field-level settings take precedence over collection-level settings.| +| token_separators | no | List of symbols or special characters to be used for splitting the text into individual words _**in addition**_ to space and new-line characters.

For example, you can add `-` (hyphen) to this list so that a word like `non-stick` is split on the hyphen and indexed as two separate words.
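The effect of extra separators can be sketched locally like this (a conceptual simulation of the splitting behaviour, not Typesense's actual tokenizer):

```python
import re

def tokenize(text, token_separators=()):
    # Split on whitespace plus any configured extra separators,
    # mimicking the token_separators setting.
    pattern = "[\\s" + re.escape("".join(token_separators)) + "]+"
    return [token for token in re.split(pattern, text) if token]

print(tokenize("non-stick pan"))          # ['non-stick', 'pan']
print(tokenize("non-stick pan", ["-"]))   # ['non', 'stick', 'pan']
```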

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | +| symbols_to_index | no | List of symbols or special characters to be indexed.

For example, you can add `+` to this list to make the word `c++` indexable verbatim.
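Conceptually, the difference can be sketched as follows (again a local simulation of the behaviour, not the real tokenizer):

```python
import re

def index_token(word, symbols_to_index=()):
    # By default, special characters are dropped before indexing;
    # symbols listed in symbols_to_index are kept verbatim.
    keep = re.escape("".join(symbols_to_index))
    return re.sub("[^\\w" + keep + "]", "", word)

print(index_token("c++"))          # 'c'   -> '+' is stripped by default
print(index_token("c++", ["+"]))   # 'c++' -> '+' is kept verbatim
```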

Read [this guide article](../../guide/tips-for-searching-common-types-of-data.md) for more examples on how to use this setting. | +| default_sorting_field | no | The name of an `int32 / float` field that determines the order in which the search results are ranked when a `sort_by` clause is not provided during searching.

This field must indicate some kind of popularity. For example, in a product search application, you could define `num_reviews` field as the `default_sorting_field` to rank products that have the most reviews higher by default.

Additionally, when a word in a search query matches multiple possible words (either during a prefix (partial word) search or because of a typo), this parameter is used to rank such equally matching records.

For example, searching for "ap" will match records with "apple", "apply", "apart", "apron", or any of hundreds of similar words that start with "ap" in your dataset. Similarly, searching for "jofn" will match records with "john", "joan" and all similar variations that are one typo away in your dataset.

For performance reasons, Typesense will only consider the top `4` prefixes or typo variations by default (the limit is configurable using the [`max_candidates`](./search.md#ranking-and-sorting-parameters) search parameter).

If `default_sorting_field` is NOT specified in the collection schema, then "top" is defined as the prefixes or typo variations with the greatest number of matching records.

But let's say you have a field called `popularity` in each record, and you want Typesense to use the value in that field to define the "top" records, you'd set that field as `default_sorting_field: popularity`. Typesense will then use the value of that field to fetch the top `max_candidates` number of terms that are most popular, and as users type in more characters, it will refine the search further to always rank the most popular prefixes highest. | ### Field parameters @@ -432,25 +432,26 @@ string, then the next document that contains the field named `title` will be exp Typesense allows you to index the following types of fields: -| `type` | Description | -|:-------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `string` | String values | -| `string[]` | Array of strings | -| `int32` | Integer values up to 2,147,483,647 | -| `int32[]` | Array of `int32` | -| `int64` | Integer values larger than 2,147,483,647 | -| `int64[]` | Array of `int64` | -| `float` | Floating point / decimal numbers | -| `float[]` | Array of floating point / decimal numbers | -| `bool` | `true` or `false` | -| `bool[]` | Array of booleans | -| `geopoint` | Latitude and longitude specified as `[lat, lng]`. Read more [here](geosearch.md). | -| `geopoint[]` | Arrays of Latitude and longitude specified as `[[lat1, lng1], [lat2, lng2]]`. Read more [here](geosearch.md). | -| `object` | Nested objects. Read more [here](#indexing-nested-fields). | -| `object[]` | Arrays of nested objects. Read more [here](#indexing-nested-fields). | -| `string*` | Special type that automatically converts values to a `string` or `string[]`. | -| `image` | Special type that is used to indicate a base64 encoded string of an image used for [Image search](./image-search.md). 
| -| `auto` | Special type that automatically attempts to infer the data type based on the documents added to the collection. See [automatic schema detection](#with-auto-schema-detection). | +| `type` | Description | +|:-------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `string` | String values | +| `string[]` | Array of strings | +| `int32` | Integer values up to 2,147,483,647 | +| `int32[]` | Array of `int32` | +| `int64` | Integer values larger than 2,147,483,647 | +| `int64[]` | Array of `int64` | +| `float` | Floating point / decimal numbers | +| `float[]` | Array of floating point / decimal numbers | +| `bool` | `true` or `false` | +| `bool[]` | Array of booleans | +| `geopoint` | Latitude and longitude specified as `[lat, lng]`. Read more [here](geosearch.md). | +| `geopoint[]` | Arrays of Latitude and longitude specified as `[[lat1, lng1], [lat2, lng2]]`. Read more [here](geosearch.md). | +| `geopolygon` | Geographic polygon defined by an array of coordinates specified as `[lat1, lng1, lat2, lng2, ...]`. Latitude/longitude pairs must be in counter-clockwise (CCW) or clockwise (CW) order. Read more here. | +| `object` | Nested objects. Read more [here](#indexing-nested-fields). | +| `object[]` | Arrays of nested objects. Read more [here](#indexing-nested-fields). | +| `string*` | Special type that automatically converts values to a `string` or `string[]`. | +| `image` | Special type that is used to indicate a base64 encoded string of an image used for [Image search](./image-search.md). | +| `auto` | Special type that automatically attempts to infer the data type based on the documents added to the collection. See [automatic schema detection](#with-auto-schema-detection). 
| ### Cloning a collection schema @@ -932,6 +933,91 @@ via the [API](../api/cluster-operations.md#compacting-the-on-disk-database). **Definition** `DELETE ${TYPESENSE_HOST}/collections/:collection` +## Truncate a collection + +You can remove all documents from a collection while keeping the collection and schema intact by using the truncate operation. + + + + + + + + + + + + + +**Sample Response** + + + + + +**Definition** +`DELETE ${TYPESENSE_HOST}/collections/:collection/documents?truncate=true` + +The response includes the number of documents that were deleted in the `num_deleted` field. For an empty collection, this value will be 0. + ## Update or alter a collection Typesense supports adding or removing fields to a collection's schema in-place. @@ -1163,6 +1249,49 @@ curl "http://localhost:8108/collections/companies" \ }' ``` +### Get Schema Change Status + +You can check the status of in-progress schema change operations by using the schema changes endpoint. + + + + + +If no schema changes are in progress, you'll get an empty response. When a schema change is in progress, you'll get details about the operation: + + + + + +The response shows: +- Which collection is being altered +- Number of documents validated against the new schema +- Number of documents that have been altered + +**Definition** +`GET ${TYPESENSE_HOST}/operations/schema_changes` + ### Using an alias If you need to do zero-downtime schema changes, you could also re-create the collection fully with the updated schema and use diff --git a/docs-site/content/28.0/api/geosearch.md b/docs-site/content/28.0/api/geosearch.md index a90b6dee..112c029d 100644 --- a/docs-site/content/28.0/api/geosearch.md +++ b/docs-site/content/28.0/api/geosearch.md @@ -425,6 +425,98 @@ You want to specify the geo-points of the polygon as lat, lng pairs. 
'filter_by' : 'location:(48.8662, 2.3255, 48.8581, 2.3209, 48.8561, 2.3448, 48.8641, 2.3469)' ``` +## Geographic Polygons + +You can also store polygonal geographic areas using the `geopolygon` field type and then check if points fall within these areas. + +### Creating a Collection with Geopolygons + +Let's create a collection with a field to store polygon areas: + + + + + +### Adding Polygon Areas + +Add documents containing polygon areas by specifying the coordinates in counter-clockwise (CCW) or clockwise (CW) order: + + + + + +:::warning NOTE +Coordinates must be specified in proper CCW or CW order to form a valid polygon. Incorrect ordering will result in an error. +::: + +### Searching Points in Polygons + +You can search for documents whose polygon areas contain a specific point: + + + + + +This will return all polygons that contain the point (0.5, 0.5). + +**Sample Response** + + + + + ## Sorting by Additional Attributes within a Radius ### exclude_radius @@ -448,4 +540,4 @@ Similarly, you can bucket all geo points into "groups" using the `precision` par 'sort_by' : 'location(48.853, 2.344, precision: 2mi):asc, popularity:desc' ``` -This will bucket the results into 2-mile groups and force records within each bucket into a tie for "geo score", so that the popularity metric can be used to tie-break and sort results within each bucket. \ No newline at end of file +This will bucket the results into 2-mile groups and force records within each bucket into a tie for "geo score", so that the popularity metric can be used to tie-break and sort results within each bucket. 
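The containment check behind geopolygon search can be illustrated with a standard ray-casting point-in-polygon test. This is a conceptual sketch only — Typesense performs this check server-side on `geopolygon` fields — using the flat `[lat1, lng1, lat2, lng2, ...]` coordinate layout described above:

```python
def point_in_polygon(lat, lng, coords):
    # coords is a flat [lat1, lng1, lat2, lng2, ...] list, as in the
    # geopolygon field type. Ray casting: a point is inside if a ray
    # from it crosses the polygon's edges an odd number of times.
    pts = list(zip(coords[0::2], coords[1::2]))
    inside = False
    j = len(pts) - 1
    for i, (lat_i, lng_i) in enumerate(pts):
        lat_j, lng_j = pts[j]
        if (lng_i > lng) != (lng_j > lng) and \
           lat < (lat_j - lat_i) * (lng - lng_i) / (lng_j - lng_i) + lat_i:
            inside = not inside
        j = i
    return inside

# A unit square: (0,0) -> (0,1) -> (1,1) -> (1,0)
square = [0, 0, 0, 1, 1, 1, 1, 0]
print(point_in_polygon(0.5, 0.5, square))  # True  -> point is inside
print(point_in_polygon(1.5, 0.5, square))  # False -> point is outside
```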
diff --git a/docs-site/content/28.0/api/search.md b/docs-site/content/28.0/api/search.md index 9e3bee93..06f1a062 100644 --- a/docs-site/content/28.0/api/search.md +++ b/docs-site/content/28.0/api/search.md @@ -222,10 +222,11 @@ When a `string[]` field is queried, the `highlights` structure will include the ### Filter parameters -| Parameter | Required | Description | -|:-------------------|:---------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Parameter | Required | Description | 
+|:-------------------|:---------|:------------|
| filter_by | no | Filter conditions for refining your search results.

A field can be matched against one or more values.

Examples:
- `country: USA`
- `country: [USA, UK]` returns documents that have `country` of `USA` OR `UK`.

**Exact vs Non-Exact Filtering:**
To match a string field's full value verbatim, you can use the `:=` (exact match) operator. For example, `category := Shoe` will match documents with `category` set to `Shoe`, but not documents with `category` set to `Shoe Rack`.

Using the `:` (non-exact) operator will do a word-level partial match on the field, without taking token position into account (so it is usually faster). For example, `category:Shoe` will match records with a `category` of `Shoe`, `Shoe Rack`, or `Outdoor Shoe`.

Tip: If you have a field whose values never contain spaces in any document and you want to filter on it, use the `:` operator to improve performance, since it avoids token position checks.
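For instance, a complete search request body using the exact-match operator might look like this (the collection fields here are illustrative):

```json
{
  "q": "shoe",
  "query_by": "product_name",
  "filter_by": "category := Shoe"
}
```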

**Escaping Special Characters:**
You can also filter using multiple values and use the backtick character to denote a string literal: category:= [\`Running Shoes, Men\`, \`Sneaker (Men)\`, Boots].

**Negation:**
Not equals / negation is supported via the `:!=` operator, e.g. `author:!=JK Rowling` or `id:!=[id1, id2]`. You can also negate multiple values: `author:!=[JK Rowling, Gilbert Patten]`

To exclude results that _contain_ a specific string during filtering, use `artist:! Jackson`. This will exclude all documents whose `artist` field value contains the word `jackson`.

**Numeric Filtering:**
Filter documents with numeric values between a min and max value, using the range operator `[min..max]`, or using simple comparison operators `>`, `>=`, `<`, `<=`, `=`.

You can enable `"range_index": true` on the numerical field schema for fast range queries (though this will incur additional memory usage for the index).

Examples:
- `num_employees:<40`
- `num_employees:[10..100]`
- `num_employees:[<10, >100]`
- `num_employees:[10..100, 140]` (Filter docs where value is between 10 and 100, or exactly 140).
- `num_employees:!= [10, 100, 140]` (Filter docs where value is **NOT** 10, 100 or 140).
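A numeric range filter slots into a search request like any other filter, for example (collection and fields illustrative):

```json
{
  "q": "*",
  "query_by": "company_name",
  "filter_by": "num_employees:[10..100]"
}
```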

**Multiple Conditions:**
You can separate multiple conditions with the `&&` operator.

Examples:
- `num_employees:>100 && country: [USA, UK]`
- `categories:=Shoes && categories:=Outdoor`

To do ORs across _different_ fields (e.g. color is blue OR category is Shoe), you can use the `||` operator.

Examples:
- `color: blue || category: shoe`
- `(color: blue || category: shoe) && in_stock: true`

**Filtering Arrays:**
`filter_by` can be used with array fields as well.

For example, if `genres` is a `string[]` field:

- `genres:=[Rock, Pop]` will return documents where the `genres` array field contains `Rock OR Pop`.
- `genres:=Rock && genres:=Acoustic` will return documents where the `genres` array field contains both `Rock AND Acoustic`.
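In a request body, an array filter looks the same as a scalar one (collection and fields illustrative):

```json
{
  "q": "*",
  "query_by": "title",
  "filter_by": "genres:=[Rock, Pop]"
}
```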

**Prefix filtering:**
You can filter on records that begin with a given prefix string like this:

`company_name: Acm*`

This will return documents where any of the words in the `company_name` field begin with `acm`, e.g. a name like `Corporation of Acme`.

You can combine the field-level match operator `:=` with prefix filtering like this:

`name := S*`

This will return documents that have `name: Steve Jobs` but not documents that have `name: Adam Stator`.
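As a full search request, an exact prefix filter could look like this (the `companies` collection here is hypothetical):

```json
{
  "q": "*",
  "query_by": "company_name",
  "filter_by": "company_name := S*"
}
```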

**Geo Filtering:**
Read more about [GeoSearch and filtering](geosearch.md) in this dedicated section.

**Embedding Filters in API Keys:**
You can embed the `filter_by` parameter (or parts of it) in a Scoped Search API Key to set up conditional access control for documents and/or enforce filters for any search requests that use that API key. Read more about [Scoped Search API Key](api-keys.md#generate-scoped-search-key) in this dedicated section. | -| enable_lazy_filter | no | Applies the filtering operation incrementally / lazily. Set this to `true` when you are potentially filtering on large values but the tokens in the query are expected to match very few documents. Default: `false`. | +| enable_lazy_filter | no | Applies the filtering operation incrementally / lazily. Set this to `true` when you are potentially filtering on large values but the tokens in the query are expected to match very few documents. Default: `false`. | +| max_filter_by_candidates | no | Controls the number of similar words that Typesense considers during fuzzy search on `filter_by` values. Useful for controlling prefix matches like `company_name:Acm*`. Default: 4. | ### Ranking and Sorting parameters @@ -489,6 +490,210 @@ sort_by=title(missing_values: last):desc The possible values of `missing_values` are: `first` or `last`. + +### Random Sorting + +You can randomly sort search results using the special `_rand()` parameter in `sort_by`. You can optionally provide a seed value, which must be a positive integer. + +For example, with a specific seed: + +```json +{ + "sort_by": "_rand(42)" +} +``` + +Or without a seed value, which will use the current timestamp as the seed: + +```json +{ + "sort_by": "_rand()" +} +``` + +Using a specific seed value will produce the same random ordering across searches, which is useful when you want consistent randomization (e.g., for A/B testing or result sampling). Using `_rand()` without a seed will produce different random orderings on each request. 
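Putting this into a full search request body (collection and fields illustrative), a seeded random ordering of results could be fetched like this:

```json
{
  "q": "*",
  "query_by": "product_name",
  "sort_by": "_rand(42)",
  "per_page": 5
}
```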
+ +You can combine random sorting with other sort fields: + +```json +{ + "sort_by": "_rand():desc,popularity:desc" +} +``` + +:::tip +- When no seed is provided, the current timestamp is used as the seed +- When a seed is provided, it must be a positive integer +- Using the same seed will produce the same random ordering +- Different seed values (or no seed) will produce different random orderings +::: + +### Sorting with a Pivot Value + +You can sort results relative to a specific pivot value using the `pivot` parameter in `sort_by`. This is particularly useful when you want to order items based on their distance from a reference point. + +For example, if you have timestamps and want to sort based on proximity to a specific timestamp: + +```json +{ + "sort_by": "timestamp(pivot: 1728386250):asc" +} +``` + +This will sort results so that: +- With `asc`: Values closest to the pivot value appear first, followed by values further away +- With `desc`: Values furthest from the pivot value appear first, followed by values closer to it + +Example results when sorting in ascending order relative to pivot value 1728386250: +``` +timestamp: 1728386250 (exact match to pivot) +timestamp: 1728387250 (1000 away from pivot) +timestamp: 1728385250 (1000 away from pivot) +timestamp: 1728384250 (2000 away from pivot) +timestamp: 1728383250 (3000 away from pivot) +``` + +You can combine pivot sorting with other sort fields: + +```json +{ + "sort_by": "timestamp(pivot: 1728386250):asc,popularity:desc" +} +``` + +This feature is useful for: +- Sorting by proximity to a reference date/time +- Organizing numerical values around a target number +- Creating "closer to" style sorting experiences + +### Decay Function Sorting + +Decay functions allow you to score and sort results based on how far they are from a target value, with the score decreasing according to various mathematical functions. 
This is particularly useful for: + +- Boosting recent items in time-based sorting +- Implementing distance-based relevance +- Creating smooth falloffs in numeric ranges + +You can use decay functions in the `sort_by` parameter with the following syntax: + +```json +{ + "sort_by": "field_name(origin: value, func: function_name, scale: value, decay: rate):direction" +} +``` + +#### Parameters + +| Parameter | Required | Description | +|-----------|----------|---------------------------------------------------------------------------------------------------------| +| `origin` | Yes | The reference point from which the decay function is calculated. Must be an integer. | +| `func` | Yes | The decay function to use. Supported values: `gauss`, `linear`, `exp`, or `diff` | +| `scale` | Yes | The distance from origin at which the score should decay by the decay rate. Must be a non-zero integer. | +| `offset` | No | An offset to apply to the origin point. Must be an integer. Defaults to 0. | +| `decay` | No | The rate at which scores decay, between 0.0 and 1.0. Defaults to 0.5. 
| + +#### Example + +Let's say you have a collection of products with timestamps and want to sort them giving preference to items closer to a specific date: + +```bash +curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ +"http://localhost:8108/collections/products/documents/search\ +?q=smartphone\ +&query_by=product_name\ +&sort_by=timestamp(origin: 1728385250, func: gauss, scale: 1000, decay: 0.5):desc" +``` + +For a dataset with these records: +```json +{"product_name": "Samsung Smartphone", "timestamp": 1728383250} +{"product_name": "Vivo Smartphone", "timestamp": 1728384250} +{"product_name": "Oneplus Smartphone", "timestamp": 1728385250} +{"product_name": "Pixel Smartphone", "timestamp": 1728386250} +{"product_name": "Moto Smartphone", "timestamp": 1728387250} +``` + +The results would be ordered based on how close each timestamp is to the origin (1728385250), with scores decreasing according to the gaussian function: +1. Oneplus Smartphone (exact match with origin - highest score) +2. Pixel Smartphone (1000 units from origin - decayed score) +3. Vivo Smartphone (1000 units from origin - decayed score) +4. Moto Smartphone (2000 units from origin - further decayed score) +5. 
Samsung Smartphone (2000 units from origin - further decayed score) + +#### Supported Functions + +- `gauss`: Gaussian decay - smooth bell curve falloff +- `linear`: Linear decay - constant rate of decrease +- `exp`: Exponential decay - rapidly decreasing scores +- `diff`: Difference-based decay - simple linear difference from origin + +:::tip +- The `decay` parameter determines how quickly scores decrease - lower values mean faster decay +- Use `gauss` for smooth falloff, `linear` for constant decrease, and `exp` for rapid decrease +- `scale` determines the distance at which scores decay by the specified rate +::: + +## Text Match Score Bucketing + +When sorting by text match score (`_text_match`), Typesense offers two different approaches to bucket your results: `buckets` and `bucket_size`. Both parameters allow you to group results with similar relevance scores together before applying secondary sorting criteria. + +### Using the `buckets` Parameter + +The `buckets` parameter divides your results into a specified number of equal-sized groups: + +```json +{ + "sort_by": "_text_match(buckets: 10):desc,weighted_score:desc" +} +``` + +This approach: +1. Takes all matching results +2. Divides them into the specified number of equal-sized buckets (e.g., 10 buckets) +3. Forces all results within each bucket to have the same text match score +4. Applies the secondary sort criteria (e.g., `weighted_score`) within each bucket + +For example, if you have 100 results and specify `buckets: 10`, each bucket will contain 10 results that will be treated as having equal relevance, then sorted by the secondary criterion. 
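The bucketing behavior described above can be sketched in a few lines. This is an illustrative model only (not Typesense's actual implementation), and it assumes the hits are already sorted by text match score:

```python
def bucketed_sort(hits, buckets, secondary_key):
    """Divide pre-ranked hits into `buckets` equal groups, then re-sort
    each group by a secondary key (a simplified sketch: remainders are
    not distributed the way a real implementation might)."""
    bucket_size = max(1, len(hits) // buckets)
    out = []
    for i in range(0, len(hits), bucket_size):
        group = hits[i:i + bucket_size]
        # Within a bucket, text match scores are treated as tied,
        # so the secondary field decides the order
        group.sort(key=secondary_key, reverse=True)
        out.extend(group)
    return out

# Hypothetical hits, already ordered by descending text match score
hits = [{"id": n, "weighted_score": s} for n, s in
        [(1, 10), (2, 50), (3, 30), (4, 90), (5, 20), (6, 70)]]
print(bucketed_sort(hits, buckets=3, secondary_key=lambda h: h["weighted_score"]))
```

With 6 hits and 3 buckets, each pair of adjacent hits is treated as tied on relevance and reordered by `weighted_score`.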
+ +### Using the `bucket_size` Parameter + +Alternatively, the `bucket_size` parameter groups results into fixed-size buckets: + +```json +{ + "sort_by": "_text_match(bucket_size: 3):desc,points:desc" +} +``` + +For example, if you search for "mark" against these records: +```json +[ + {"title": "Mark Antony", "points": 100}, + {"title": "Marks Spencer", "points": 200}, + {"title": "Mark Twain", "points": 100}, + {"title": "Mark Payne", "points": 300}, + {"title": "Marks Henry", "points": 200}, + {"title": "Mark Aurelius", "points": 200} +] +``` + +With `bucket_size: 3`, Typesense will: +1. Group the results into buckets of 3 records based on text match scores +2. Within each bucket, sort by points in descending order + +This ensures that records with similar text match relevance stay together while being ordered by points within their relevance group. + +The `bucket_size` parameter accepts: +- Any positive integer: Groups results into buckets of that size +- `0`: Disables bucketing (default behavior) + +### Choosing Between `buckets` and `bucket_size` + +- Use `buckets` when you want to ensure a specific number of relevance groups, regardless of the total number of results +- Use `bucket_size` when you want to maintain consistent bucket sizes, regardless of the total number of results +- A smaller number of `buckets` or larger `bucket_size` means more emphasis on the secondary sort field +- When `bucket_size` is larger than the number of results, no bucketing occurs + ## Group Results You can aggregate search results into groups or buckets by specify one or more `group_by` fields. diff --git a/docs-site/content/28.0/api/stemming.md b/docs-site/content/28.0/api/stemming.md new file mode 100644 index 00000000..ef7530f8 --- /dev/null +++ b/docs-site/content/28.0/api/stemming.md @@ -0,0 +1,172 @@ +--- +sidebarDepth: 1 +sitemap: + priority: 0.7 +--- + +# Stemming + +Stemming is a technique that helps handle variations of words during search. 
When stemming is enabled, a search for one form of a word will also match other grammatical forms of that word. For example: + +- Searching for "run" would match "running", "runs", "ran" +- Searching for "walk" would match "walking", "walked", "walks" +- Searching for "company" would match "companies" + +Typesense provides two approaches to handle word variations: + +## Basic Stemming + +Basic stemming uses the [Snowball stemmer](https://snowballstem.org/) algorithm to automatically detect and handle word variations. Being rules-based, it works well for common word patterns in the configured language, but may produce unintended side effects with brand names, proper nouns, and locations. Since these rules are designed primarily for common nouns, applying them to specialized content like company names or locations can sometimes degrade search relevance. + +To enable basic stemming for a field, set `"stem": true` in your collection schema: + + + + + +The language used for stemming is automatically determined from the `locale` parameter of the field. For example, setting `"locale": "fr"` will use French-specific stemming rules. + +## Custom Stemming Dictionaries + +For cases where you need more precise control over word variations, or when dealing with irregular forms that algorithmic stemming can't handle well, you can use stemming dictionaries. These allow you to define exact mappings between words and their root forms. + +### Pre-made Dictionaries + +Typesense provides a pre-made English plurals dictionary that handles common singular/plural variations. You can download it [here](dl.typesense.org/data/stemming/plurals_en_v1.jsonl) + +This dictionary is particularly useful when you need reliable handling of English plural forms without the potential side effects of algorithmic stemming. 
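If you are generating a dictionary programmatically, the JSONL format is straightforward to produce. A sketch (the word/root pairs here are only examples):

```python
import json

# Hypothetical word -> root mappings, in Typesense's stemming
# dictionary shape: one {"word": ..., "root": ...} object per line
mappings = {"people": "person", "children": "child", "geese": "goose"}

with open("custom_stems.jsonl", "w") as f:
    for word, root in mappings.items():
        f.write(json.dumps({"word": word, "root": root}) + "\n")
```

The resulting file can then be uploaded via the stemming dictionary import API described below.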
+ +### Creating a Stemming Dictionary + +First, create a JSONL file with your word mappings: + +```json +{"word": "people", "root": "person"} +{"word": "children", "root": "child"} +{"word": "geese", "root": "goose"} +``` + +Then upload it using the stemming dictionary API: + + + + + +#### Sample Response + + + + + +### Using a Stemming Dictionary + +To use a stemming dictionary, specify it in your collection schema using the `stem_dictionary` parameter: + + + + + +:::tip Combining Both Approaches +You can use both basic stemming (`"stem": true`) and dictionary stemming (`"stem_dictionary": "dictionary_name"`) on the same field. When both are enabled, dictionary stemming takes precedence for words that exist in the dictionary. +::: + +### Managing Dictionaries + +#### Retrieve a Dictionary + + + + + +#### List All Dictionaries + + + + + +#### Sample Response + + + + + +## Best Practices + +1. **Start with Basic Stemming**: For most use cases, basic stemming with the appropriate locale setting will handle common word variations well. + +2. **Use Dictionaries for Exceptions**: Add stemming dictionaries when you need to handle: + - Domain-specific variations + - Cases where basic stemming doesn't give desired results + +3. **Language-Specific Considerations**: Remember that basic stemming behavior changes based on the `locale` parameter. Set this appropriately for your content's language. diff --git a/docs-site/content/28.0/api/vector-search.md b/docs-site/content/28.0/api/vector-search.md index 35ba4736..2ae38f10 100644 --- a/docs-site/content/28.0/api/vector-search.md +++ b/docs-site/content/28.0/api/vector-search.md @@ -1390,6 +1390,41 @@ curl 'http://localhost:8108/collections' \ **Note:** The only supported model is `embedding-gecko-001` for now. 
+### Updating Remote Model API Key + +You can update the API key used for remote embedding models (like OpenAI) without recreating the collection: + + + + + +:::warning +Note: All fields parameters (`name`, `embed.from`, and `model_config` parameters) must be included in the update request. +::: + ### Using GCP Vertex AI API This API also provided by Google under the Google Cloud Platform (GCP) umbrella. @@ -2726,6 +2761,46 @@ won't have an impact on an embedding field mentioned in `query_by`. However, sin must match the length of `query_by`, you can use a placeholder value like `0`. ::: +### Re-ranking Hybrid Matches + +By default, during hybrid search: +- Documents found through keyword search but not through vector search will only have a text match score +- Documents found through vector search but not through keyword search will only have a vector distance score + +You can optionally compute both scores for all matches by setting `rerank_hybrid_matches: true` in your search parameters. When enabled: +- Documents found only through keyword search will also get a vector distance score +- Documents found only through vector search will also get a text match score + +This allows for more comprehensive ranking of results, at the cost of additional computation time. + +Example: + + + + + +Each hit in the response will contain a `text_match_info` and a `vector_distance` score, regardless of whether it was initially found through keyword or vector search. + ### Distance Threshold You can also set a maximum vector distance threshold for results of semantic search and hybrid search. You should set `distance_threshold` in `vector_query` parameter for this. @@ -2923,6 +2998,24 @@ You can set a custom `ef` via the `vector_query` parameter (default value is `10 } ``` +## Distance Metrics + +By default, Typesense uses cosine similarity as the distance metric for vector search. 
When you use a `distance_threshold` parameter, documents with cosine distances larger than the threshold will: + +- In standalone vector search: be excluded from results +- When used in sorting (`sort_by`): get the maximum possible distance score but remain in results + +You can use this with both cosine similarity and inner product distance metrics. For example: + +```json +{ + "vector_query": "embedding:([], distance_threshold: 0.30)" +} +``` + +This helps filter out less relevant results while still allowing other sort conditions to take effect. +``` + ## Vector Search Parameters Here are all the possible parameters you can use inside the `vector_query` search parameter, that we've covered in the various sections above: diff --git a/docs-site/content/guide/faqs.md b/docs-site/content/guide/faqs.md index 016d6d13..2b085527 100644 --- a/docs-site/content/guide/faqs.md +++ b/docs-site/content/guide/faqs.md @@ -56,11 +56,52 @@ You can use the `token_separators` and `symbols_to_index` parameters to control ### How do I handle singular / plural variations of a keyword? -You can use the stemming feature to allow search queries that contain variations of a word in your dataset (eg: singular / plurals, tense changes, etc) to still match the record. +There are two ways to handle word variations (like singular/plural forms) in Typesense: -For eg: searching for `walking`, will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled. +#### 1. Using Basic Stemming -You can enable stemming by setting the `stem: true` parameter in the field definition in the collection schema. +You can use the built-in stemming feature to automatically handle common variations of words in your dataset (eg: singular/plurals, tense changes, etc). +For eg: searching for `walking` will also return results with `walk`, `walked`, `walks`, etc when stemming is enabled. 
+ +You can enable stemming by setting the `stem: true` parameter in the field definition in the collection schema. + +#### 2. Using Custom Stemming Dictionaries + +:::warning NOTE +Custom stemming dictionaries are only available in `v28.0` and above. +::: + +For more precise control over word variations, you can use custom stemming dictionaries that define exact mappings between words and their root forms. + +First, create a dictionary by uploading a JSONL file that contains your word mappings: + +```json +{"word": "meetings", "root":"meeting"} +{"word": "people", "root":"person"} +{"word": "children", "root":"child"} +``` + +Upload this dictionary using the stemming dictionary API: + +```bash +curl -X POST \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + --data-binary @plurals.jsonl \ + "http://localhost:8108/stemming/dictionary/import?id=my_dictionary" +``` + +Then enable the dictionary in your collection schema by setting the `stem_dictionary` parameter: + +```json +{ + "name": "companies", + "fields": [ + {"name": "title", "type": "string", "stem_dictionary": "my_dictionary"} + ] +} +``` + +For more details on stemming, read the stemming documentation. ### When I search for a short string, I don't get all results. How do I address this? @@ -354,4 +395,4 @@ Here's how Typesense Cloud and Self-Hosted (on any VPS or other cloud) compare: ### I don't see my question answered here or in the docs. What do I do? -Read our [Help](/help.md) section for information on how to get additional help. \ No newline at end of file +Read our [Help](/help.md) section for information on how to get additional help. 
diff --git a/docs-site/content/guide/ranking-and-relevance.md b/docs-site/content/guide/ranking-and-relevance.md index b9af7c4a..127174c4 100644 --- a/docs-site/content/guide/ranking-and-relevance.md +++ b/docs-site/content/guide/ranking-and-relevance.md @@ -87,13 +87,14 @@ If you wish to sort the documents strictly by an indexed numerical or string fie ## Ranking based on Relevance and Popularity If you have a popularity score for your documents that you have either: - 1) calculated on your end in your application using any formula of your choice or 2) calculated using a counter analytics rule in Typesense -You can have Typesense mix your custom scores with the text relevance score it calculates, so results that are more popular (as defined by your custom score) are boosted more in ranking. +You can have Typesense mix your custom scores with the text relevance score it calculates, so results that are more popular (as defined by your custom score) are boosted more in ranking. There are two approaches you can use: + +### Using Fixed Number of Buckets -Here's the search parameter to achieve this: +Here's how to divide results into a specific number of relevance groups: ```json { @@ -104,7 +105,6 @@ Here's the search parameter to achieve this: Where `weighted_score` is a field in your document with your custom score. This will do the following: - 1. Fetch all results matching the query 2. Sort them by text relevance (text match score desc) 3. Divide the results into equal-sized 10 buckets (with the first bucket containing the most relevant results) @@ -112,8 +112,31 @@ This will do the following: 5. This will cause a tie inside each bucket, and then the `weighted_score` will be used to break the tie and re-rank results within each bucket. The higher the number of buckets, the more granular the re-ranking based on your weighted score will be. 
-For eg, if you have 100 results, and `buckets: 50`, then each bucket will have 2 results, those two results within each bucket will be re-ranked based on your `weighted_score`. +For example, if you have 100 results, and `buckets: 50`, then each bucket will have 2 results, those two results within each bucket will be re-ranked based on your `weighted_score`. + +### Using Fixed Bucket Size + +Alternatively, you can group results into fixed-size relevance groups: + +```json +{ + "sort_by": "_text_match(bucket_size: 3):desc,weighted_score:desc" +} +``` + +This approach will: +1. Fetch all results matching the query +2. Sort them by text relevance +3. Group results into fixed-size buckets (e.g., 3 results per bucket) +4. Apply the `weighted_score` sorting within each fixed-size bucket + +For example, if you have 100 results and `bucket_size: 3`, Typesense will create approximately 33 buckets with 3 results each. Each group of 3 results with similar text relevance will be sorted by their `weighted_score`. + +### Choosing Between Approaches +- Use `buckets` when you want a specific number of relevance groups +- Use `bucket_size` when you want to ensure a consistent number of results are compared by popularity at a time +- Both approaches help you balance between text relevance and popularity in your search results ## Ranking based on Relevance and Recency A common need is to rank results have been published recently higher than older results. @@ -212,4 +235,4 @@ In such cases, you can have Typesense automatically drop words / tokens from the This behavior is controlled by the `drop_tokens_threshold` search parameter, which has a default value of `1`. This means that if a search query only returns 1 or 0 results, Typesense will start dropping search keywords and repeat the search until at least 1 result is found. 
-To turn this behavior off, set `drop_tokens_threshold=0` \ No newline at end of file +To turn this behavior off, set `drop_tokens_threshold=0` diff --git a/docs-site/content/guide/semantic-search.md b/docs-site/content/guide/semantic-search.md index d08ae35a..d87c0b5c 100644 --- a/docs-site/content/guide/semantic-search.md +++ b/docs-site/content/guide/semantic-search.md @@ -442,6 +442,104 @@ Notice how searching for `Desktop copier` returns `Desktop` as a result which is } ``` +### Re-ranking Hybrid Matches + +When doing hybrid search, by default Typesense returns both keyword matches and semantic matches in the results. For example: + +```bash +curl 'http://localhost:8108/multi_search' \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + -X POST \ + -d '{ + "searches": [ + { + "query_by": "product_name,embedding", + "q": "desktop copier", + "collection": "products", + "prefix": "false", + "exclude_fields": "embedding", + "per_page": 2 + } + ] + }' +``` + +A search for "desktop copier" might return results like this: + +```json{18,22,32,36} +{ + "hits": [ + { + "document": { + "id": "2", + "product_name": "Desktop" + }, + "highlight": { + "product_name": { + "matched_tokens": ["Desktop"], + "snippet": "Desktop" + } + }, + "hybrid_search_info": { + "rank_fusion_score": 0.8500000238418579 + }, + "text_match": 1060320051, + "text_match_info": { + "best_field_score": "517734" + }, + "vector_distance": 0.510231614112854 + }, + { + "document": { + "id": "3", + "product_name": "Printer" + }, + "hybrid_search_info": { + "rank_fusion_score": 0.30000001192092896 + }, + "text_match": 0, + "text_match_info": { + "best_field_score": "0" + }, + "vector_distance": 0.4459354281425476 + } + ] +} +``` + +Notice how: +- The first result "Desktop" is a keyword match (high text_match score) +- The second result "Printer" is a semantic match (low vector_distance but zero text_match) + +By default: +- Documents found through keyword search but not through vector search will only have 
a text match score +- Documents found through vector search but not through keyword search will only have a vector distance score + +You can optionally compute both scores for all matches by setting `rerank_hybrid_matches: true`. When enabled: +- Documents found only through keyword search will also get a vector distance score +- Documents found only through vector search will also get a text match score + +Example with re-ranking enabled: + +```bash{10} +curl 'http://localhost:8108/multi_search' \ + -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ + -X POST \ + -d '{ + "searches": [ + { + "collection": "products", + "query_by": "embedding,product_name", + "q": "desktop copier", + "rerank_hybrid_matches": true, + "vector_query": "embedding:([], alpha: 0.8)", + "exclude_fields": "embedding" + } + ] + }' +``` + +This provides more comprehensive ranking of results by computing both scores for all matches, at the cost of additional computation time. ### Pagination diff --git a/docs-site/content/guide/tips-for-searching-common-types-of-data.md b/docs-site/content/guide/tips-for-searching-common-types-of-data.md index 50bc56f5..85d57b30 100644 --- a/docs-site/content/guide/tips-for-searching-common-types-of-data.md +++ b/docs-site/content/guide/tips-for-searching-common-types-of-data.md @@ -55,6 +55,24 @@ You can do this by setting