
Commit d611aec (parent: c491bc1)

Update documentation to reflect tokenizers refactor under the transforms module

2 files changed (+11, -11)

docs/source/api_ref_modules.rst (+6, -6)
@@ -48,10 +48,10 @@ model specific tokenizers.
    :toctree: generated/
    :nosignatures:

-   tokenizers.SentencePieceBaseTokenizer
-   tokenizers.TikTokenBaseTokenizer
-   tokenizers.ModelTokenizer
-   tokenizers.BaseTokenizer
+   transforms.tokenizers.SentencePieceBaseTokenizer
+   transforms.tokenizers.TikTokenBaseTokenizer
+   transforms.tokenizers.ModelTokenizer
+   transforms.tokenizers.BaseTokenizer

 Tokenizer Utilities
 -------------------
@@ -61,8 +61,8 @@ These are helper methods that can be used by any tokenizer.
    :toctree: generated/
    :nosignatures:

-   tokenizers.tokenize_messages_no_special_tokens
-   tokenizers.parse_hf_tokenizer_json
+   transforms.tokenizers.tokenize_messages_no_special_tokens
+   transforms.tokenizers.parse_hf_tokenizer_json


 PEFT Components

docs/source/basics/tokenizers.rst (+5, -5)
@@ -168,7 +168,7 @@ For example, here we change the ``"<|begin_of_text|>"`` and ``"<|end_of_text|>"``
 Base tokenizers
 ---------------

-:class:`~torchtune.modules.tokenizers.BaseTokenizer` are the underlying byte-pair encoding modules that perform the actual raw string to token ID conversion and back.
+:class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` are the underlying byte-pair encoding modules that perform the actual raw string to token ID conversion and back.
 In torchtune, they are required to implement ``encode`` and ``decode`` methods, which are called by the :ref:`model_tokenizers` to convert
 between raw text and token IDs.
@@ -202,13 +202,13 @@ between raw text and token IDs.
         """
         pass

-If you load any :ref:`model_tokenizers`, you can see that it calls its underlying :class:`~torchtune.modules.tokenizers.BaseTokenizer`
+If you load any :ref:`model_tokenizers`, you can see that it calls its underlying :class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer`
 to do the actual encoding and decoding.

 .. code-block:: python

     from torchtune.models.mistral import mistral_tokenizer
-    from torchtune.modules.tokenizers import SentencePieceBaseTokenizer
+    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer

     m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
     # Mistral uses SentencePiece for its underlying BPE
@@ -227,7 +227,7 @@ to do the actual encoding and decoding.
 Model tokenizers
 ----------------

-:class:`~torchtune.modules.tokenizers.ModelTokenizer` are specific to a particular model. They are required to implement the ``tokenize_messages`` method,
+:class:`~torchtune.modules.transforms.tokenizers.ModelTokenizer` are specific to a particular model. They are required to implement the ``tokenize_messages`` method,
 which converts a list of Messages into a list of token IDs.

 .. code-block:: python
@@ -259,7 +259,7 @@ is because they add all the necessary special tokens or prompt templates require
 .. code-block:: python

     from torchtune.models.mistral import mistral_tokenizer
-    from torchtune.modules.tokenizers import SentencePieceBaseTokenizer
+    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer
     from torchtune.data import Message

     m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
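For module moves like this one (``torchtune.modules.tokenizers`` to ``torchtune.modules.transforms.tokenizers``), old import paths can be kept working during a deprecation window with a re-export shim. The sketch below is purely illustrative, using stand-in module objects rather than torchtune's actual packaging; it shows the PEP 562 module-level ``__getattr__`` pattern such shims typically use.

```python
# Hypothetical sketch of a backwards-compat shim for a module move.
# All names here are illustrative stand-ins, not torchtune's actual code.
import types
import warnings

# Stand-in for the new module location, exposing one public name.
new_location = types.ModuleType("transforms_tokenizers")
new_location.SentencePieceBaseTokenizer = type(
    "SentencePieceBaseTokenizer", (), {}
)

# Stand-in for the old module location: attributes not found here are
# fetched from new_location, with a DeprecationWarning pointing users
# at the new import path (module-level __getattr__, PEP 562).
old_location = types.ModuleType("tokenizers_old")

def _forward(name: str):
    warnings.warn(
        f"'{name}' has moved to the transforms.tokenizers namespace",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(new_location, name)

old_location.__getattr__ = _forward
```

Accessing ``old_location.SentencePieceBaseTokenizer`` then resolves to the class at the new location while warning about the moved path, so downstream code has a release cycle to update its imports.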
