
How to override model's max_seq_length? #3575

@Samoed

Description

It seems to be impossible to override the model's max length that is loaded from sentence_bert_config.json via tokenizer_kwargs:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small", tokenizer_kwargs={"model_max_length":3})
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#          7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': tensor([[ 101, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632, 7632,
#         7632, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
print(m[0].tokenizer(["hi hi hi hi hi hi hi hi hi hi hi hi hi"], truncation=True))
# {'input_ids': [[101, 7632, 102]], 'token_type_ids': [[0, 0, 0]], 'attention_mask': [[1, 1, 1]]}

m.max_seq_length = 3
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"]))
# {'input_ids': tensor([[ 101, 7632,  102]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
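
Until this is fixed, the only way that seems to take effect is setting max_seq_length on the already-loaded model, as in the last snippet above. A minimal sketch of that workaround (the shape check is just illustrative):

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("intfloat/e5-small")
m.max_seq_length = 3  # takes effect because tokenize() passes max_length=self.max_seq_length
print(m.tokenize(["hi hi hi hi hi hi hi hi hi hi hi hi hi"])["input_ids"].shape)
# torch.Size([1, 3]) -> the input is now truncated as expected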

This happens because max_seq_length is loaded from sentence_bert_config.json during model loading, and in Transformer it is only inferred from the tokenizer's model_max_length when it was not already set in sentence_bert_config.json:

if max_seq_length is not None and "model_max_length" not in tokenizer_args:
    tokenizer_args["model_max_length"] = max_seq_length
self.tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path,
    cache_dir=cache_dir,
    **tokenizer_args,
)

# No max_seq_length set. Try to infer from model
if max_seq_length is None:
    if (
        hasattr(self.auto_model, "config")
        and hasattr(self.auto_model.config, "max_position_embeddings")
        and hasattr(self.tokenizer, "model_max_length")
    ):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

self.max_seq_length = max_seq_length
So even when model_max_length is passed via tokenizer_kwargs, the max_seq_length from sentence_bert_config.json is later used as max_length during tokenization instead of the passed value:
output.update(
    self.tokenizer(
        *to_tokenize,
        padding=padding,
        truncation="longest_first",
        return_tensors="pt",
        max_length=self.max_seq_length,
    )
)

This could probably be fixed by adding:

max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)
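
For illustration, here is roughly where that min() could be applied in the loading logic shown above. This is only a sketch of the idea, not a tested patch:

# Sketch only (untested): cap max_seq_length from sentence_bert_config.json by the
# tokenizer's model_max_length, so a model_max_length passed via tokenizer_kwargs wins.
if max_seq_length is None:
    if (
        hasattr(self.auto_model, "config")
        and hasattr(self.auto_model.config, "max_position_embeddings")
        and hasattr(self.tokenizer, "model_max_length")
    ):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)
elif hasattr(self.tokenizer, "model_max_length"):
    # sentence_bert_config.json provided a value; still respect an explicitly
    # smaller tokenizer model_max_length (it defaults to a very large int when
    # unset, so min() is a no-op in the common case).
    max_seq_length = min(max_seq_length, self.tokenizer.model_max_length)

self.max_seq_length = max_seq_length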

Source: embeddings-benchmark/mteb#3587 (comment)
I think this is the cause of #3187.
