Corrected the error probability values in Chapter 6, Section 7 in the zh-CN and English versions. #841

Open · wants to merge 2 commits into main
10 changes: 5 additions & 5 deletions chapters/en/chapter6/7.mdx
@@ -64,20 +64,20 @@ So, the sum of all frequencies is 210, and the probability of the subword `"ug"`

Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probability of each token. For instance, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:

$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{5}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.000389$$
$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, the tokenization `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676$$
$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

so that one is way more likely. In general, tokenizations with the least tokens possible will have the highest probability (because of that division by 210 repeated for each token), which corresponds to what we want intuitively: to split a word into the least number of tokens possible.

The tokenization of a word with the Unigram model is then the tokenization with the highest probability. In the example of `"pug"`, here are the probabilities we would get for each possible segmentation:

```
-["p", "u", "g"] : 0.000389
-["p", "ug"] : 0.0022676
-["pu", "g"] : 0.0022676
+["p", "u", "g"] : 0.001322
+["p", "ug"] : 0.007710
+["pu", "g"] : 0.007710
```

So, `"pug"` would be tokenized as `["p", "ug"]` or `["pu", "g"]`, depending on which of those segmentations is encountered first (note that in a larger corpus, equality cases like this will be rare).
12 changes: 6 additions & 6 deletions chapters/zh-CN/chapter6/7.mdx
@@ -65,20 +65,20 @@ A Unigram model is a language model that considers each token to be independent of the tokens before it
Now, to tokenize a given word, we look at all possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, the probability of a segmentation is simply the product of the probabilities of its subwords. For example, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:


$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{5}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.000389$$
$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, tokenizing `"pug"` as `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676$$
$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

So the latter is much more likely. In general, segmentations with the fewest tokens will have the highest probability (because of the division by 210 repeated for each token), which matches our intuition: split a word into as few subwords as possible.

Tokenizing a word with the Unigram model then means finding the segmentation with the highest probability. For `"pug"`, here are the probabilities we get for each possible segmentation:

```
-["p", "u", "g"] : 0.000389
-["p", "ug"] : 0.0022676
-["pu", "g"] : 0.0022676
+["p", "u", "g"] : 0.001322
+["p", "ug"] : 0.007710
+["pu", "g"] : 0.007710
```

因此, `"pug"` 将被分词为 `["p", "ug"]` 或 `["pu", "g"]` ,取决于哪种分词方式排在前面(注意,在更大的语料库中,像这样的相等情况将很少见)。
@@ -380,4 +380,4 @@ tokenize("This is the Hugging Face course.", model)
['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']
```
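
For context, the `tokenize("This is the Hugging Face course.", model)` call in the hunk above depends on finding the most probable segmentation of each pre-tokenized word. Below is a minimal Viterbi-style sketch of that search, assuming `model` maps tokens to negative log-probabilities; it is an illustration under those assumptions, not the chapter's exact implementation.

```python
import math


def best_segmentation(word, model):
    """Return a most-probable segmentation of `word` under a Unigram model.

    `model` is assumed to map each vocabulary token to -log(probability),
    so the best segmentation is the one with the lowest summed score.
    """
    # best[i] = (lowest score for word[:i], start index of the last token)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            score = best[start][0] + model.get(piece, math.inf)
            if score < best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.insert(0, word[start:end])
        end = start
    return tokens


# With the chapter's example frequencies, e.g.
# model = {t: -math.log(f / 210) for t, f in
#          {"p": 17, "u": 36, "g": 20, "pu": 17, "ug": 20}.items()}
# best_segmentation("pug", model) returns ['p', 'ug'] (tied with ['pu', 'g']).
```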

-That concludes our introduction to Unigram! Hopefully by now you feel like an expert in the field. In the next section, we will take a deep dive into the building blocks of the 🤗 Tokenizers library and show how to use them to build your own tokenizer.
+That concludes our introduction to Unigram! Hopefully by now you feel like an expert in the field. In the next section, we will take a deep dive into the building blocks of the 🤗 Tokenizers library and show how to use them to build your own tokenizer.