description: In SurrealDB, Full-Text Search supports advanced features like basic and advanced text matching, proximity searches, result ranking, and keyword highlighting.
3
+
sidebar_label: Working with text
4
+
title: Working with text | Reference guides
5
+
description: In SurrealDB, Full-Text Search supports advanced features like basic and advanced text matching, proximity searches, result ranking, and keyword highlighting.
6
6
---
7
7
8
-
# Full-Text Search
9
-
In SurrealDB, Full-Text Search supports text matching, proximity searches, result ranking, and keyword highlighting.
10
-
It is also [ACID-compliant](https://en.wikipedia.org/wiki/ACID), which ensures data integrity and reliability.
8
+
# Working with text in SurrealDB
11
9
12
-
## Define tables and fields
13
-
To implement Full-Text Search it is important that your data is first defined using SurrealDB [`DEFINE TABLE`](/docs/surrealql/statements/define/table) and [`DEFINE FIELD`](/docs/surrealql/statements/define/field).
10
+
SurrealDB offers a large variety of ways to work with text, from simple operators to fuzzy searching, customized ordering, full-text search and more.
11
+
12
+
## Comparing and sorting text
13
+
14
+
### In `SELECT` queries
15
+
16
+
Take the following data for example.
17
+
18
+
```surql
19
+
CREATE data SET val = 'Zoo';
20
+
CREATE data SET val = 'Ábaco';
21
+
CREATE data SET val = '1';
22
+
CREATE data SET val = '2';
23
+
CREATE data SET val = '11';
24
+
CREATE data SET val = 'kitty';
25
+
```
26
+
27
+
Inside a `SELECT` query, an `ORDER BY` clause can be used to order the output by one or more field names. For the above data, an ordered `SELECT` query looks like this.
28
+
29
+
```surql
30
+
SELECT VALUE val FROM data ORDER BY val;
31
+
```
32
+
33
+
However, strings are sorted by Unicode code point, which often leads to output that seems out of order to the human eye. The above query returns the following:
34
+
35
+
```surql title="Output"
36
+
[
37
+
'1',
38
+
'11',
39
+
'2',
40
+
'Zoo',
41
+
'kitty',
42
+
'Ábaco'
43
+
]
44
+
```
45
+
46
+
This is because:
47
+
48
+
* '11' is ordered before '2', because the first character in the string '2' is greater than the first character in the string '11'.
49
+
* 'Zoo' is ordered before 'kitty', because the first character in the string 'Zoo' is 'Z', code point U+005A in the [list of Unicode characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin). A lowercase 'k' is U+006B and thus "greater", while 'Á', registered as "Latin Capital Letter A with acute", is U+00C1 and therefore sorts last.
50
+
51
+
To sort strings in a way that looks more natural to the human eye, the keywords `NUMERIC`, `COLLATE`, or both can be used. `NUMERIC` causes strings that parse as numbers to be sorted by their numeric value.
52
+
53
+
```surql
54
+
SELECT VALUE val FROM data ORDER BY val NUMERIC;
55
+
```
56
+
57
+
```surql title="Numeric strings now sorted as numbers"
58
+
[
59
+
'1',
60
+
'2',
61
+
'11',
62
+
'Zoo',
63
+
'kitty',
64
+
'Ábaco'
65
+
]
66
+
```
67
+
68
+
`COLLATE` sorts strings in alphabetical order rather than by Unicode code point.
69
+
70
+
```surql
71
+
SELECT VALUE val FROM data ORDER BY val COLLATE;
72
+
```
73
+
74
+
```surql title="Output"
75
+
[
76
+
'1',
77
+
'11',
78
+
'2',
79
+
'Ábaco',
80
+
'kitty',
81
+
'Zoo'
82
+
]
83
+
```
84
+
85
+
For the data in this example, `COLLATE NUMERIC` is likely the desired combination.
86
+
87
+
```surql
88
+
SELECT VALUE val FROM data ORDER BY val COLLATE NUMERIC;
89
+
```
90
+
91
+
```surql title="Output"
92
+
[
93
+
'1',
94
+
'2',
95
+
'11',
96
+
'Ábaco',
97
+
'kitty',
98
+
'Zoo'
99
+
]
100
+
```
101
+
102
+
As of SurrealDB 2.2.2, the functions `array::sort_natural`, `array::sort_lexical`, and `array::sort_natural_lexical` can be used on ad-hoc data to return the same output as the `COLLATE` and `NUMERIC` clauses in a `SELECT` statement.
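
As a rough sketch of how two of these functions might be applied to the same values without a table (the `$vals` parameter is illustrative, and the mapping to each clause in the comments is an assumption to check against your SurrealDB version):

```surql
-- Ad-hoc data mirroring the `val` fields used above
LET $vals = ['Zoo', 'Ábaco', '1', '2', '11', 'kitty'];

-- Assumed to correspond to ORDER BY val NUMERIC
array::sort_natural($vals);

-- Assumed to correspond to ORDER BY val COLLATE
array::sort_lexical($vals);
```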
103
+
104
+
## Contains
105
+
106
+
```surql
107
+
-- false
108
+
"Umple" IN "Rumplestiltskin";
109
+
"Rumplestiltskin".contains("Umple");
110
+
string::contains("Rumplestiltskin", "Umple");
111
+
112
+
-- true
113
+
"umple" IN "Rumplestiltskin";
114
+
"Rumplestiltskin".contains("umple");
115
+
string::contains("Rumplestiltskin", "umple");
116
+
```
117
+
118
+
SurrealDB has a number of operators to determine if all or some of the values of one array are contained in another, such as `CONTAINSALL` and `CONTAINSANY`, or `ALLINSIDE` and `ANYINSIDE`. The `CONTAINS` and `INSIDE` operators behave the same way, just with the operands in the opposite order.
119
+
120
+
```surql
121
+
-- If 1,2,3 contains each item in 1,2
122
+
[1,2,3] CONTAINSALL [1,2];
123
+
-- then each item in 1,2 is inside 1,2,3
124
+
[1,2] ALLINSIDE [1,2,3];
125
+
```
126
+
127
+
Because strings are essentially arrays of characters, these operators work with strings as well. (Note: this capability was added in SurrealDB version 2.2.2.)
128
+
129
+
Both of these queries will return `true`.
130
+
131
+
```surql
132
+
"Rumplestiltskin" CONTAINSALL ["umple", "kin"];
133
+
"kin" ALLINSIDE "Rumplestiltskin";
134
+
```
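
The `ANY` variants mentioned earlier can presumably be used with strings in the same way. A minimal sketch, assuming they accept the same operand shapes as `CONTAINSALL` and `ALLINSIDE` in the block above (the substring "queen" is made up for illustration, and no return values are asserted here):

```surql
-- Does the string contain at least one of these substrings?
"Rumplestiltskin" CONTAINSANY ["umple", "queen"];
-- The same kind of check with the operands reversed
"kin" ANYINSIDE "Rumplestiltskin";
```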
135
+
136
+
## Equality and fuzzy equality
137
+
138
+
While strings can be compared for strict equality in the same way as with other values, fuzzy searching can also be used to return `true` if two strings are approximately equal.
139
+
140
+
* `~` to check if two strings have fuzzy equality
141
+
* `!~` to check if two strings do not have fuzzy equality
142
+
* `?~` to check if any strings have fuzzy equality
143
+
* `*~` to check if all strings have fuzzy equality
144
+
145
+
All of the following will return true.
146
+
147
+
```surql
148
+
"big" ~ "Big";
149
+
"big" !~ "small";
150
+
["Big", "small"] ?~ "big";
151
+
["Big", "big"] *~ "big";
152
+
```
153
+
154
+
Fuzzy matching is based on the [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm), which takes some time to understand. The `~` operator makes it convenient to use, but it can sometimes produce surprising results.
155
+
156
+
```surql
157
+
-- true
158
+
"United Kingdom" ~ "United kingdom";
159
+
-- true (second string entirely contained in first)
160
+
"United Kingdom" ~ "ited";
161
+
-- Also true!
162
+
"United Kingdom" ~ "i";
163
+
-- false
164
+
"United Kingdom" ~ "United Kingdóm";
165
+
```
166
+
167
+
The `string::similarity::fuzzy` function can be useful in this case, as it returns a number showing the similarity between strings, not just whether they count as a fuzzy match. In the following example, while the strings "ited" and "i" do have a similarity score above 0, they are ranked much lower than the better matches "United kingdom" and "United Kingdom".
SELECT * FROM $similarities ORDER BY similarity DESC;
177
+
```
178
+
179
+
```surql title="Output"
180
+
[
181
+
{
182
+
similarity: 295,
183
+
word: 'United Kingdom'
184
+
},
185
+
{
186
+
similarity: 293,
187
+
word: 'United kingdom'
188
+
},
189
+
{
190
+
similarity: 75,
191
+
word: 'ited'
192
+
},
193
+
{
194
+
similarity: 15,
195
+
word: 'i'
196
+
},
197
+
{
198
+
similarity: 0,
199
+
word: 'United Kingdóm'
200
+
}
201
+
]
202
+
```
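
The query above reads from a `$similarities` parameter. One way such a parameter might be built is sketched here, assuming the five candidate strings shown in the output and a comparison against "United Kingdom" (the closure parameter `$word` is a hypothetical name):

```surql
-- Hypothetical construction of $similarities: pair each candidate
-- with its fuzzy similarity score against "United Kingdom"
LET $similarities = ["United Kingdom", "United kingdom", "ited", "i", "United Kingdóm"].map(|$word| {
    word: $word,
    similarity: string::similarity::fuzzy("United Kingdom", $word)
});
```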
203
+
204
+
Also note that similarity and distance scores are not a measure of absolute equality, and ordered similarity scores should only be used in comparisons against the same string. Take the following two queries for example:
While "United" is clearly more similar to "United" than to "United Kingdom", the output of each one is 131. This number is generated from the point of view of the second "United" string, which finds an exact match for itself inside the first string.
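
A pair of queries matching this description might look like the following sketch, assuming `string::similarity::fuzzy` is the function being compared:

```surql
-- Both are described above as returning the same score,
-- because the second argument is fully matched in each case
string::similarity::fuzzy("United Kingdom", "United");
string::similarity::fuzzy("United", "United");
```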
212
+
213
+
## Other fuzzy match algorithms
214
+
215
+
SurrealDB offers quite a few other algorithms inside the [string functions module](/docs/surrealql/functions/database/string) for distance or similarity comparison. They are:
These resemble fuzzy searching to a certain extent, but have a different output and may have different requirements. For example, the Hamming distance algorithm was made for strings of equal length, so a query comparing "United Kingdom" to "United" will not work.
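
As a small sketch of that difference in requirements, assuming the functions `string::distance::hamming` and `string::distance::levenshtein` from the string functions module (the sample words "karolin" and "kathrin" are illustrative):

```surql
-- Equal-length inputs: three positions differ, so the Hamming distance is 3
string::distance::hamming("karolin", "kathrin");

-- Unequal lengths: expected to fail rather than return a number
string::distance::hamming("United Kingdom", "United");

-- Levenshtein distance has no equal-length requirement
string::distance::levenshtein("United Kingdom", "United");
```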
For more customized text searching, full-text search can be used.
241
+
242
+
## Full-text search
243
+
244
+
Full-text search supports text matching, proximity searches, result ranking, and keyword highlighting, making it a much more comprehensive solution when precise text searching is required.
245
+
246
+
It is also [ACID-compliant](https://en.wikipedia.org/wiki/ACID), which ensures data integrity and reliability.
247
+
248
+
### Analyzers
249
+
250
+
The first step to using full-text search is to [define an analyzer](/docs/surrealql/statements/define/analyzer) using a `DEFINE ANALYZER` statement. An analyzer is not defined on a table; it is a standalone set of tokenizers (to break up text) and filters (to modify text).
251
+
252
+
The `DEFINE ANALYZER` page contains a detailed explanation of each type of tokenizer and filter to choose from. To find the analyzer that best suits your needs, it is recommended to use the [`search::analyze`](/docs/surrealql/functions/database/search#searchanalyze) function, which returns the output of an analyzer for an input string.
253
+
254
+
Take the following analyzer for example, which uses `blank` to split a string by whitespace, and `edgengram(3, 10)` to output all of the instances of the first three to ten letters of a word.
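
A minimal sketch of an analyzer matching this description, using the hypothetical name `edgengram_sketch` and a shortened sample string:

```surql
-- Hypothetical analyzer: split on whitespace only, then emit
-- the leading 3 to 10 characters of each token
DEFINE ANALYZER edgengram_sketch TOKENIZERS blank FILTERS edgengram(3, 10);

-- Inspect the tokens the analyzer produces for a sample string
search::analyze("edgengram_sketch", "The Wheel of Time turns, and Ages come and pass");
```

Because `blank` splits on whitespace only, punctuation attached to a word stays part of its token.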
If this is not desired, some looking through the `DEFINE ANALYZER` page will turn up another tokenizer called `punct` that can be included, now creating an analyzer that splits on whitespace as well as punctuation. Since punctuation on its own will not
268
+
269
+
```surql
270
+
DEFINE ANALYZER blank_edgengram TOKENIZERS blank, punct FILTERS edgengram(3, 10);
271
+
search::analyze("blank_edgengram", "The Wheel of Time turns, and Ages come and pass, leaving memories that become legend.");
Once the data is in your tables, you can use customized [analyzers](/docs/surrealql/statements/define/analyzer) to define rules for how your textual data should be searched.
17
-
An analyzer includes [tokenizers](/docs/surrealql/statements/define/analyzer#tokenizers) and [filters](/docs/surrealql/statements/define/analyzer#filters) which help break down text into manageable tokens and refine the search.
279
+
280
+
An analyzer includes [tokenizers](/docs/surrealql/statements/define/analyzer#tokenizers) and [filters](/docs/surrealql/statements/define/analyzer#filters) which help break down text into manageable tokens and refine the search.
18
281
19
282
```surql
20
283
-- Combining tokenizers and filters into a custom analyzer
To make a text field searchable, you need to set up a [full-text index](/docs/surrealql/statements/define/indexes#full-text-search-index) on it by using the `SEARCH` keyword.
25
289
26
290
Depending on the use case, each field or column can be associated with a different analyzer.
@@ -31,17 +295,19 @@ To enable text highlight on searches, use the `HIGHLIGHTS` keyword when defining
31
295
DEFINE INDEX book_title ON book FIELDS title SEARCH ANALYZER custom_analyzer BM25;
32
296
DEFINE INDEX book_content ON book FIELDS content SEARCH ANALYZER custom_analyzer BM25 HIGHLIGHTS;
33
297
```
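
To illustrate associating different analyzers with different fields, here is a sketch that could be used in place of the definitions above. The analyzer and index names are hypothetical, and the tokenizer and filter choices are only examples:

```surql
-- Hypothetical analyzer for short titles: split on whitespace, lowercase
DEFINE ANALYZER title_analyzer TOKENIZERS blank FILTERS lowercase;
-- Hypothetical analyzer for long-form content: also split on
-- punctuation and stem English words
DEFINE ANALYZER content_analyzer TOKENIZERS blank, punct FILTERS lowercase, snowball(english);

-- Each field gets its own index and its own analyzer
DEFINE INDEX book_title_idx ON book FIELDS title SEARCH ANALYZER title_analyzer BM25;
DEFINE INDEX book_content_idx ON book FIELDS content SEARCH ANALYZER content_analyzer BM25 HIGHLIGHTS;
```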
34
-
## The MATCHES Operator
298
+
299
+
### The MATCHES Operator
35
300
To find documents that contain the given keywords based on the full-text indexes, the [matches](/docs/surrealql/operators#matches) operator (`@@`) is used in queries.
36
301
37
302
```surql
38
303
-- Using the MATCHES (@@) operator in a query
39
304
SELECT * FROM book WHERE content @@ 'tools';
40
305
```
41
-
## Search Functions
306
+
307
+
### Search Functions
42
308
If you want to do more with your search results, SurrealDB offers three search functions that accompany the matches operator.
43
-
-[`search::highlight`](/docs/surrealql/functions/database/search#searchhighlight):Highlights the matching keywords for the predicate reference number.
44
-
-[`search::offsets`](/docs/surrealql/functions/database/search#searchoffsets):Returns the position of the matching keywords for the predicate reference number.
309
+
- [`search::highlight`](/docs/surrealql/functions/database/search#searchhighlight): Highlights the matching keywords for the predicate reference number.
310
+
- [`search::offsets`](/docs/surrealql/functions/database/search#searchoffsets): Returns the position of the matching keywords for the predicate reference number.
45
311
- [`search::score`](/docs/surrealql/functions/database/search#searchscore): Helps with scoring and ranking the search results based on their relevance to the search terms.
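
The three functions can be combined in a single query. The following is a sketch that assumes the `book_title` and `book_content` indexes defined earlier, and uses the numbered form of the matches operator (`@0@`, `@1@`) to supply the predicate reference numbers. The 2x weighting on title matches is an arbitrary choice for illustration:

```surql
SELECT
    id,
    -- Wrap matched terms from predicate 1 (content) in <b> tags
    search::highlight('<b>', '</b>', 1) AS content,
    -- Positions of matched terms for predicate 1
    search::offsets(1) AS offsets,
    -- Weight title matches (predicate 0) twice as heavily as content matches
    search::score(0) * 2 + search::score(1) AS score
FROM book
WHERE title @0@ 'tools'
    OR content @1@ 'tools'
ORDER BY score DESC
LIMIT 10;
```

Highlighting and offsets rely on the index having been defined with `HIGHLIGHTS`, which is why they reference the `content` predicate here.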