Corrected the error probability values in Chapter 6, Section 7 in the zh-CN and English versions. #841

Open · wants to merge 2 commits into main
10 changes: 5 additions & 5 deletions chapters/en/chapter6/7.mdx
@@ -64,20 +64,20 @@ So, the sum of all frequencies is 210, and the probability of the subword `"ug"`

Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probability of each token. For instance, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:

$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{5}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.000389$$
$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, the tokenization `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676$$
$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

so that one is way more likely. In general, tokenizations with the least tokens possible will have the highest probability (because of that division by 210 repeated for each token), which corresponds to what we want intuitively: to split a word into the least number of tokens possible.

The tokenization of a word with the Unigram model is then the tokenization with the highest probability. In the example of `"pug"`, here are the probabilities we would get for each possible segmentation:

```
-["p", "u", "g"] : 0.000389
-["p", "ug"] : 0.0022676
-["pu", "g"] : 0.0022676
+["p", "u", "g"] : 0.001322
+["p", "ug"] : 0.007710
+["pu", "g"] : 0.007710
```

So, `"pug"` would be tokenized as `["p", "ug"]` or `["pu", "g"]`, depending on which of those segmentations is encountered first (note that in a larger corpus, equality cases like this will be rare).
12 changes: 6 additions & 6 deletions chapters/zh-CN/chapter6/7.mdx
@@ -65,20 +65,20 @@ A Unigram model is a language model that considers each token to be independent of the tokens before it
Now, to tokenize a given word, we look at all possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, the probability of a segmentation is simply the product of the probabilities of its subwords. For example, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:


$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{5}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.000389$$
$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, tokenizing `"pug"` as `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676$$
$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

So the latter is much more likely. In general, segmentations with the fewest tokens will have the highest probability (because of the division by 210 repeated for each token), which matches our intuition: split a word into as few subwords as possible.

Tokenizing a word with the Unigram model then means finding the segmentation with the highest probability. For `"pug"`, here are the probabilities we get for each possible segmentation:

```
-["p", "u", "g"] : 0.000389
-["p", "ug"] : 0.0022676
-["pu", "g"] : 0.0022676
+["p", "u", "g"] : 0.001322
+["p", "ug"] : 0.007710
+["pu", "g"] : 0.007710
```

因此, `"pug"` 将被分词为 `["p", "ug"]` 或 `["pu", "g"]` ,取决于哪种分词方式排在前面(注意,在更大的语料库中,像这样的相等情况将很少见)。
@@ -380,4 +380,4 @@ tokenize("This is the Hugging Face course.", model)
['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']
```
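
For context, the `tokenize("This is the Hugging Face course.", model)` call in the hunk above depends on finding the most probable segmentation of each pre-tokenized word. Below is a minimal Viterbi-style sketch of that search, assuming `model` maps tokens to negative log-probabilities; it is an illustration under those assumptions, not the chapter's exact implementation.

```python
import math


def best_segmentation(word, model):
    """Return a most-probable segmentation of `word` under a Unigram model.

    `model` is assumed to map each vocabulary token to -log(probability),
    so the best segmentation is the one with the lowest summed score.
    """
    # best[i] = (lowest score for word[:i], start index of the last token)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            score = best[start][0] + model.get(piece, math.inf)
            if score < best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.insert(0, word[start:end])
        end = start
    return tokens


# With the chapter's example frequencies, e.g.
# model = {t: -math.log(f / 210) for t, f in
#          {"p": 17, "u": 36, "g": 20, "pu": 17, "ug": 20}.items()}
# best_segmentation("pug", model) returns ['p', 'ug'] (tied with ['pu', 'g']).
```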

-That concludes our introduction to Unigram! Hopefully by now you feel like an expert in the field. In the next section, we will take a deep dive into the building blocks of the 🤗 Tokenizers library and show how to use them to build your own tokenizer.
+That concludes our introduction to Unigram! Hopefully by now you feel like an expert in the field. In the next section, we will take a deep dive into the building blocks of the 🤗 Tokenizers library and show how to use them to build your own tokenizer.