Commit

reformulate recommendation
maxjakob committed Jan 26, 2024
1 parent 4f6dfaa commit c956be6
Showing 1 changed file with 16 additions and 7 deletions.
23 changes: 16 additions & 7 deletions notebooks/search/tokenization.ipynb
@@ -54,7 +54,7 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -192,9 +192,9 @@
"We can observe:\n",
"- There are special tokens `[CLS]` and `[SEP]` to model the the beginning and end of the text. These two extra tokens will become relevant below.\n",
"- Punctuations are their own tokens.\n",
-"- Compounds words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n",
+"- Compound words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n",
"\n",
-"Given this behavior, it is easy to see how longer tests yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above."
+"Given this behavior, it is easy to see how longer texts yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above."
]
},
{
@@ -207,7 +207,7 @@
"\n",
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n",
"\n",
-"Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 256 token stride)."
+"Furthermore, in practice we often see improved performance when using overlapping chunks. With ELSER, we recommend 50% token overlap (i.e. a 255 token stride)."
]
},
{
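Note on the hunk above: the change from a 256 to a 255 token stride follows from the 510 token chunk size, since half of 510 is 255. The notebook's own chunking code is not visible in this diff, so the following is only a minimal sketch of overlapping chunking under that recommendation. The function name `chunk_tokens` and the use of plain integers as stand-in token IDs are assumptions for illustration, not the notebook's implementation.

```python
from typing import List, Sequence


def chunk_tokens(
    token_ids: Sequence[int],
    chunk_size: int = 510,  # 512 minus the two special tokens [CLS] and [SEP]
    stride: int = 255,      # 50% overlap between consecutive chunks
) -> List[List[int]]:
    """Split token IDs into overlapping chunks (hypothetical helper)."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    if not token_ids:
        return []

    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the input.
        if start + chunk_size >= len(token_ids):
            break
        start += stride
    return chunks


# Stand-in for a tokenized text of 1242 tokens, special tokens already excluded.
example_ids = list(range(1242))
example_chunks = chunk_tokens(example_ids)
```

With a real tokenizer, each chunk would then be decoded back to text before being sent to Elasticsearch; that step is left out here because it depends on the tokenizer in use.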
@@ -249,14 +249,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts."
+"Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts.\n",
+"\n",
+"Side note: Be aware that tokenisation involves a normalisation step that strips away [nonspacing marks](https://www.fileformat.info/info/unicode/category/Mn/list.htm). If decoding is implemented as a reverse lookup from token IDs to vocabulary entries those stripped marks will not be recovered resulting in decoded text that could be slightly different to the original."
]
},
{
"cell_type": "code",
-"execution_count": 10,
+"execution_count": 9,
"metadata": {},
"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"Token indices sequence length is longer than the specified maximum sequence length for this model (1242 > 512). Running this sequence through the model will result in indexing errors\n"
+]
+},
{
"data": {
"text/plain": [
@@ -267,7 +276,7 @@
" 'later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.']"
]
},
-"execution_count": 10,
+"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
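Note on the side note added in this commit: the effect of stripping nonspacing marks can be illustrated with the Python standard library alone. This is a rough sketch of that kind of normalisation, assuming a decompose-then-filter approach on Unicode category `Mn`. The helper name `strip_nonspacing_marks` is made up for illustration, and the tokenizer's actual normalisation may include further steps such as lowercasing.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Remove Unicode nonspacing marks (category Mn) from text (hypothetical helper)."""
    # NFD splits characters such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


original = "Café Noël"
stripped = strip_nonspacing_marks(original)
```

Because the marks are dropped before the vocabulary lookup, decoding token IDs back to text cannot restore them, which is why the decoded chunks may differ slightly from the input.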