
How to override model's max_seq_length? #3575

@Samoed

Description

It seems to be impossible to override the model's max length that is loaded from sentence_bert_config.json via tokenizer_kwargs:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small", tokenizer_kwargs={"model_max_length":3})
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#          7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#         7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
print(m[0].tokenizer(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': [[101, 7632, 102]], 'token_type_ids': [[0, 0, 0]], 'attention_mask': [[1, 1, 1]]}

m.max_seq_length = 3
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
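
Until this is fixed, the only way that seems to take effect is setting max_seq_length on the already-loaded model, as in the last snippet above. A minimal sketch of that workaround (the shape check is just illustrative):

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small")
m.max_seq_length = 3  # takes effect because tokenize() passes max_length=self.max_seq_length
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"])["input_ids"].shape)
# torch.Size([1, 3]) -> the input is now truncated as expected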

This happens because max_seq_length is loaded from sentence_bert_config.json during model loading, and in Transformer it is only inferred from the tokenizer's model_max_length when it was not already set in sentence_bert_config.json:

if max_seq_length is not None and "model_max_length" not in tokenizer_args:
    tokenizer_args["model_max_length"] = max_seq_length
self.tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path,
    cache_dir=cache_dir,
    **tokenizer_args,
)

# No max_seq_length set. Try to infer from model
if max_seq_length is None:
    if (
        hasattr(self.auto_model, "config")
        and hasattr(self.auto_model.config, "max_position_embeddings")
        and hasattr(self.tokenizer, "model_max_length")
    ):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

self.max_seq_length = max_seq_length
So even when model_max_length is passed via tokenizer_kwargs, the max_seq_length from sentence_bert_config.json is later used as max_length during tokenization instead of the passed value:
output.update(
    self.tokenizer(
        *to_tokenize,
        padding=padding,
        truncation="longest_first",
        return_tensors="pt",
        max_length=self.max_seq_length,
    )
)

This could probably be fixed by adding:

max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)
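
For illustration, here is roughly where that min() could be applied in the loading logic shown above. This is only a sketch of the idea, not a tested patch:

# Sketch only (untested): cap max_seq_length from sentence_bert_config.json by the
# tokenizer's model_max_length, so a model_max_length passed via tokenizer_kwargs wins.
if max_seq_length is None:
    if (
        hasattr(self.auto_model, "config")
        and hasattr(self.auto_model.config, "max_position_embeddings")
        and hasattr(self.tokenizer, "model_max_length")
    ):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)
elif hasattr(self.tokenizer, "model_max_length"):
    # sentence_bert_config.json provided a value; still respect an explicitly
    # smaller tokenizer model_max_length (it defaults to a very large int when
    # unset, so min() is a no-op in the common case).
    max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)

self.max_seq_length = max_seq_length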

Source: embeddings-benchmark/mteb#3587 (comment)
I think this is the cause of #3187.
