Description
System Info
H20
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Filter function implementation; see the corresponding source: https://github.com/verl-project/verl/blob/main/verl/utils/dataset/rl_dataset.py#L241
def doc2len(doc) -> int:
    try:
        apply_kwargs = dict(**self.apply_chat_template_kwargs)
        if self.tool_schemas is not None:
            apply_kwargs["tools"] = self.tool_schemas
        # Keep explicit tokenization to avoid transformers version default changes.
        apply_kwargs.pop("tokenize", None)
        apply_kwargs.pop("return_dict", None)
        apply_kwargs.pop("return_tensors", None)
        tokenized_prompt = tokenizer.apply_chat_template(
            doc[prompt_key], add_generation_prompt=True, tokenize=True, **apply_kwargs
        )
        return len(normalize_token_ids(tokenized_prompt))
    except Exception:
        print("Error processing one of the samples, skipping...")
        traceback.print_exc()
        return self.max_prompt_length + 1

dataframe = dataframe.filter(
    lambda doc: doc2len(doc) <= self.max_prompt_length,
    num_proc=self.num_workers,
    desc=f"Filtering prompts longer than {self.max_prompt_length} tokens",
)
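To narrow down where the regression lives, it helps to time the tokenization step in isolation, outside of `dataframe.filter` and its worker processes. A minimal, self-contained timing harness is sketched below; `fake_tokenize` is a stub standing in for the real `tokenizer.apply_chat_template(..., tokenize=True)` call (which requires downloading a model), and would be swapped for the real call when profiling each transformers version:

```python
import time

def fake_tokenize(text: str) -> list[int]:
    # Stand-in for tokenizer.apply_chat_template(...); replace with the real
    # call when profiling an actual transformers version.
    return [ord(c) for c in text]

def time_calls(fn, sample: str, n: int = 10_000) -> float:
    """Return total wall-clock seconds for n calls of fn(sample)."""
    start = time.perf_counter()
    for _ in range(n):
        fn(sample)
    return time.perf_counter() - start

elapsed = time_calls(fake_tokenize, "Solve: what is 2 + 2?")
print(f"{elapsed:.4f}s total for 10000 calls")
```

Running the same harness under 4.57.3 and 5.3.0 with the real tokenizer call would show whether the slowdown is in `apply_chat_template` itself or elsewhere (e.g. in multiprocessing overhead of `filter`).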
Running this code with transformers 4.57.3 takes roughly ten minutes; after upgrading to 5.3.0 it takes about two hours. Could you please take a look? Model: qwen3.5-35b-A22; dataset: aime-2024.
Results with 4.57.3

Results with 5.3.0

Expected behavior
After upgrading transformers, the tokenizer speed should remain unchanged.
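If the regression turns out to come from calling `apply_chat_template` once per sample, one possible workaround (not verified against either version) is to render the chat template to strings with `tokenize=False` and then batch-tokenize all strings in a single tokenizer call, keeping the fast tokenizer's batched path. A pure-Python sketch of that restructuring, with stubs (`render_prompt`, `batch_token_lengths`) in place of the real tokenizer calls:

```python
def render_prompt(doc: dict) -> str:
    # Stub for tokenizer.apply_chat_template(doc["prompt"], tokenize=False, ...).
    return "user: " + doc["prompt"]

def batch_token_lengths(texts: list[str]) -> list[int]:
    # Stub for one batched tokenizer(texts) call returning len(ids) per text;
    # word count is a crude placeholder for token count.
    return [len(t.split()) for t in texts]

docs = [{"prompt": "what is 2 + 2"}, {"prompt": "prove " + "x " * 50}]
max_prompt_length = 16

lengths = batch_token_lengths([render_prompt(d) for d in docs])
kept = [d for d, n in zip(docs, lengths) if n <= max_prompt_length]
print(len(kept))  # → 1: the long prompt is filtered out
```

This trades the per-sample `tokenize=True` calls inside `filter` for one rendering pass plus one batched encode, which may sidestep whatever per-call overhead changed between versions.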