
Commit

improve descriptions
maxjakob committed Jan 25, 2024
1 parent e24a7b5 commit 41741c3
Showing 1 changed file with 15 additions and 9 deletions.
24 changes: 15 additions & 9 deletions notebooks/search/tokenization.ipynb
@@ -11,11 +11,11 @@
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/tokenization.ipynb)\n",
"\n",
"Elasticsearch offers some [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a _menaningful_ way. Part of the process is breaking up texts (both for indexing documents and for queries) into tokens. Tokens are commonly thought of as words, but this is not accurate. Other substrings in the text also carry meaning to the semantic models and therefore have to be split out separately. For ELSER, our English-only model, this is done with the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n",
"Elasticsearch offers [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a way that takes the text's meaning into account. Part of the semantic search process is breaking up texts into tokens (both for documents and for queries). Tokens are commonly thought of as words, but this is not completely accurate. Different semantic models use different concepts of tokens. Many treat punctuation separately and some break up compound words. For example ELSER (our English language model) uses the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n",
"\n",
"For Elasticsearch users it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512 will not be represented in your semantic search. Hence it is valuable to know the number of tokens for your input texts.\n",
"For users of Elasticsearch it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n",
"\n",
"Currently it is not possible to get the token count information via the API, so we share the code for calculating token counts here. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing, which has to be done by the user (as of version 8.12, future version will remove the necessity and auto-chunk behind the scenes).\n"
"Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes."
]
},
{
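For reference on the token counts discussed in this cell, the check can be done locally with the Hugging Face `transformers` library before indexing. The snippet below is a minimal sketch added for illustration, not part of the notebook diff; it assumes `transformers` is installed and the example text is made up.

```python
# Minimal sketch: count the BERT word-piece tokens of a text before indexing.
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Berlin is the capital of Germany."  # illustrative text
token_ids = bert_tokenizer.encode(text)  # includes the [CLS] and [SEP] special tokens

# Anything beyond 512 tokens would not be represented in the semantic search.
print(f"{len(token_ids)} tokens")
```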
@@ -192,7 +192,7 @@
"print()\n",
"\n",
"movie_tokens = bert_tokenizer.encode(example_movie)\n",
"print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))\n"
"print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))"
]
},
{
@@ -201,7 +201,7 @@
"source": [
"We can observe:\n",
"- There are special tokens `[CLS]` and `[SEP]` to model the the beginning and end of the text. These two extra tokens will become relevant below.\n",
"- Punctuations are they own tokens.\n",
"- Punctuations are their own tokens.\n",
"- Compounds words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n",
"\n",
"Given this behavior, it is easy to see how longer tests yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above."
@@ -215,7 +215,7 @@
"\n",
"We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.\n",
"\n",
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw."
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw."
]
},
{
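The 510 figure follows from the two special tokens: `encode` wraps every text in `[CLS]` and `[SEP]`, leaving 512 - 2 = 510 positions for the content itself. A quick check of that arithmetic, again as a sketch reusing the tokenizer from above with an invented sample text:

```python
# Quick check of the 510-token budget: `encode` adds [CLS] and [SEP] around the
# content, so the text itself may use at most 512 - 2 = 510 word pieces.
sample = "Tokenization limits apply per field."  # illustrative text
with_special = bert_tokenizer.encode(sample)
without_special = bert_tokenizer.encode(sample, add_special_tokens=False)
assert len(with_special) == len(without_special) + 2
```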
@@ -235,7 +235,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a longer example text:"
"Here we load a longer example text:"
]
},
{
@@ -254,7 +254,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we tokenize the long text, exclude the special tokens, create chunks of size 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself is warning us about the 512 tokens limitation."
"Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts. Notice that on the first run of this cell the BERT tokenizer itself is warning us about the 512 tokens limitation of the model."
]
},
{
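The chunking step described in this cell could look roughly like the following sketch. It is not the notebook's exact code; `long_text` stands in for the example text loaded above, and the tokenizer is the one from earlier. Note that chunk boundaries are drawn purely by token count, so they can fall mid-sentence or even mid-word.

```python
# Sketch of the chunking idea: tokenize without special tokens, split the ids
# into groups of at most 510, and decode each group back to text.
CHUNK_SIZE = 510  # leave room for [CLS] and [SEP]

token_ids = bert_tokenizer.encode(long_text, add_special_tokens=False)
chunks = [token_ids[i : i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]
chunk_texts = [bert_tokenizer.decode(chunk) for chunk in chunks]

print(f"{len(token_ids)} tokens split into {len(chunks)} chunks")
```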
@@ -288,8 +288,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now these chunks can be indexed and we can be sure the semantic search model consideres our whole text."
"---\n",
"And there we go. Now these chunks can be indexed together on the same document in a [nested field](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and we can be sure the semantic search model considers our whole text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
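As a rough illustration of the indexing step mentioned in the last cell, the chunks of one text could be stored together on a single document in a nested field. Everything below is invented for the example (index name, field names, connection details), and the ELSER setup that would add embeddings to each chunk, such as an ingest pipeline, is not shown.

```python
# Rough illustration only: keep the chunks of one text together on a single
# document in a nested field. Index/field names and the connection are made up;
# the ELSER inference setup for embedding each chunk is omitted.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="long-texts",
    mappings={
        "properties": {
            "passages": {
                "type": "nested",
                "properties": {"text": {"type": "text"}},
            }
        }
    },
)

es.index(
    index="long-texts",
    document={"passages": [{"text": chunk} for chunk in chunk_texts]},
)
```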
