Commit 79fd101

yuzhu-cai authored
fix(data): correct length filtering from character to token level (THUDM#548)
Co-authored-by: yuzhu-cai <caiyuzhu@gmail.com>
1 parent 48c16f1 commit 79fd101

1 file changed, 2 additions and 1 deletion

slime/utils/data.py

@@ -72,8 +72,9 @@ def __init__(
 
             # TODO: this is slow.
             if max_length is not None:
+                raw_prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
                 if not multimodal_keys:
-                    if len(prompt) > max_length:
+                    if len(raw_prompt_ids) > max_length:
                         continue
 
             self.origin_samples.append(
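
To illustrate why the change matters, here is a minimal sketch that contrasts the two filters on the same prompts. It is not code from slime: `WhitespaceTokenizer` is a hypothetical stand-in that only mimics the `encode(prompt, add_special_tokens=False)` call seen in the diff, and `filter_by_chars` and `filter_by_tokens` are illustrative helper names. A real tokenizer would produce different counts, but the mismatch between character length and token length is the same in kind.

```python
from typing import List, Optional


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        # Dummy ids; only the number of tokens matters for length filtering.
        return list(range(len(text.split())))


def filter_by_chars(prompts: List[str], max_length: Optional[int]) -> List[str]:
    """Behaviour before the commit: compare the character count to max_length."""
    return [p for p in prompts if max_length is None or len(p) <= max_length]


def filter_by_tokens(
    prompts: List[str], tokenizer: WhitespaceTokenizer, max_length: Optional[int]
) -> List[str]:
    """Behaviour after the commit: compare the token count to max_length."""
    kept = []
    for prompt in prompts:
        if max_length is not None:
            raw_prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
            if len(raw_prompt_ids) > max_length:
                continue  # too many tokens, skip this sample
        kept.append(prompt)
    return kept


prompts = [
    "What is the capital of France?",                    # 30 characters, 6 toy tokens
    "one two three four five six seven eight nine ten",  # 48 characters, 10 toy tokens
]
tokenizer = WhitespaceTokenizer()

char_kept = filter_by_chars(prompts, max_length=8)
token_kept = filter_by_tokens(prompts, tokenizer, max_length=8)

print(char_kept)   # character-level filter drops both prompts
print(token_kept)  # token-level filter keeps only the 6-token prompt
```

With `max_length=8`, the character-level check discards a prompt that is well within the token budget, which is the over-filtering this commit addresses.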
