Commit af14f55

[ggma] Fix hardcoded bos_id in TokenizerSentencePiece (#16275)
TokenizerSentencePiece should use bos_id from vocabulary. Hardcoded 1 was wrong.

ONE-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>
1 parent 8407043 commit af14f55

1 file changed

Lines changed: 3 additions & 2 deletions

runtime/ggma/src/tokenize/TokenizerSentencePiece.cc

@@ -76,10 +76,11 @@ size_t SentencePieceTokenizer::tokenize(const char *text, size_t text_len, int32
   int bos_id = _processor->bos_id();
   size_t bos_offset = 0;
 
+  // TODO: Make BOS token prepending configurable
   if (bos_id >= 0 && max_tokens > 0)
   {
-    tokens[0] = 1; // Add BOS token
-    bos_offset = 1; // Start actual tokens from index 1
+    tokens[0] = bos_id; // Add BOS token
+    bos_offset = 1;     // Start actual tokens from index 1
   }
 
   size_t available_space = max_tokens - bos_offset;
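
The fix can be illustrated in isolation with a minimal sketch. The helper name `prepend_bos` is hypothetical and does not exist in the ggma sources. The sketch assumes, as the `bos_id >= 0` check in the diff suggests, that a negative id means the vocabulary defines no BOS token.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of ggma): writes the BOS token into
// tokens[0] when the vocabulary defines one and the buffer has room.
// Returns the index at which the caller should start writing text tokens.
inline std::size_t prepend_bos(int bos_id, std::int32_t *tokens, std::size_t max_tokens)
{
  assert(tokens != nullptr || max_tokens == 0);

  // Assumption: a negative id means "no BOS token in this vocabulary".
  if (bos_id < 0 || max_tokens == 0)
    return 0;

  tokens[0] = bos_id; // use the vocabulary's id rather than a literal 1
  return 1;
}
```

With a vocabulary whose BOS id is, say, 2, the helper stores 2 in the first slot, whereas the pre-fix code would have stored 1 regardless of the vocabulary.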
