It seems that it is impossible to override the model's max length, loaded from sentence_bert_config.json, via tokenizer_kwargs:
```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small", tokenizer_kwargs={"model_max_length": 3})

print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#          7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#          7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

print(m[0].tokenizer(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': [[101, 7632, 102]], 'token_type_ids': [[0, 0, 0]], 'attention_mask': [[1, 1, 1]]}

m.max_seq_length = 3
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
```

This is happening because, during loading, max_seq_length is read from sentence_bert_config.json, and Transformer only infers max_seq_length from the tokenizer when it wasn't set in sentence_bert_config.json:
sentence-transformers/sentence_transformers/models/Transformer.py, lines 101 to 118 in ad28c0a:

```python
if max_seq_length is not None and "model_max_length" not in tokenizer_args:
    tokenizer_args["model_max_length"] = max_seq_length

self.tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path,
    cache_dir=cache_dir,
    **tokenizer_args,
)

# No max_seq_length set. Try to infer from model
if max_seq_length is None:
    if (
        hasattr(self.auto_model, "config")
        and hasattr(self.auto_model.config, "max_position_embeddings")
        and hasattr(self.tokenizer, "model_max_length")
    ):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

self.max_seq_length = max_seq_length
```
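To make the mismatch concrete, here is a small check of the two values after loading; the 512 is an assumption about what e5-small's sentence_bert_config.json sets max_seq_length to:

```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small", tokenizer_kwargs={"model_max_length": 3})

# The kwarg does reach the tokenizer, which is why calling it directly truncates:
print(m[0].tokenizer.model_max_length)  # 3
# But max_seq_length keeps the value from sentence_bert_config.json
# (assumed 512 for e5-small), and that is what model.tokenize() uses:
print(m.max_seq_length)  # 512
```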
So even though model_max_length is passed in tokenizer_kwargs, self.max_seq_length is then used as max_length during tokenization instead of the value passed in the kwargs:

sentence-transformers/sentence_transformers/models/Transformer.py, lines 319 to 327 in ad28c0a:
```python
output.update(
    self.tokenizer(
        *to_tokenize,
        padding=padding,
        truncation="longest_first",
        return_tensors="pt",
        max_length=self.max_seq_length,
    )
)
```
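As the reproduction above already shows, setting max_seq_length directly after loading works around this:

```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small")
# Assigning the attribute bypasses the tokenizer_kwargs precedence problem:
m.max_seq_length = 3
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"])["input_ids"].shape)
# torch.Size([1, 3])
```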
Probably this can be fixed by

```python
max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)
```

Source: embeddings-benchmark/mteb#3587 (comment)
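A minimal sketch of where that clamp could sit in Transformer.__init__ (my placement, not an actual patch), right before self.max_seq_length is assigned in the snippet above:

```python
# Sketch: after the tokenizer has been constructed, clamp the value from
# sentence_bert_config.json against the tokenizer's limit, so that a smaller
# model_max_length passed via tokenizer_kwargs actually takes effect.
if max_seq_length is not None and hasattr(self.tokenizer, "model_max_length"):
    max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)
self.max_seq_length = max_seq_length
```

With a change along these lines, the first example above would truncate to 3 tokens without having to set m.max_seq_length manually.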
I think this is the cause of #3187.