Commit

remove paragraph I wanted to omit
maxjakob committed Jan 26, 2024
1 parent c956be6 commit 5c8313b
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions notebooks/search/tokenization.ipynb
@@ -249,9 +249,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts.\n",
-"\n",
-"Side note: Be aware that tokenisation involves a normalisation step that strips away [nonspacing marks](https://www.fileformat.info/info/unicode/category/Mn/list.htm). If decoding is implemented as a reverse lookup from token IDs to vocabulary entries those stripped marks will not be recovered resulting in decoded text that could be slightly different to the original."
+"Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts."
]
},
{
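For readers of this diff, the two ideas in the changed cell can be sketched with the standard library alone. This is not the notebook's code: `chunk_token_ids` and `strip_nonspacing_marks` are hypothetical helper names, the IDs 101 and 102 are placeholders for whatever special tokens the tokenizer adds, and the NFD-plus-filter step only approximates the kind of normalisation the removed side note describes.

```python
import unicodedata


def chunk_token_ids(token_ids, chunk_size=510):
    """Drop the first and last ID (assumed to be special tokens) and split the rest."""
    inner = token_ids[1:-1]
    return [inner[i:i + chunk_size] for i in range(0, len(inner), chunk_size)]


def strip_nonspacing_marks(text):
    """Approximate the normalisation step: decompose, then drop category-Mn characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical ID sequence: 101 and 102 stand in for the special tokens.
ids = [101] + list(range(1000, 2200)) + [102]
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # 1200 inner tokens -> [510, 510, 180]

# The accent is a nonspacing mark after decomposition, so it is lost.
print(strip_nonspacing_marks("caf\u00e9"))  # cafe
```

The second function shows why text decoded by a plain ID-to-vocabulary lookup can differ slightly from the original: once the marks are stripped before tokenization, nothing in the token IDs records that they were ever there.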
