
max_seq_length should not be larger than any options #3255


Open
wants to merge 2 commits into master

Conversation

@amitport (Contributor) commented Mar 1, 2025

Hi,

when loading an auto-model, max_seq_length is read directly from Hugging Face and it cannot be overridden easily.

example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})

# Fails on the current release: the value from sentence_bert_config.json wins
assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"

This PR ensures that max_seq_length can be overridden via tokenizer_kwargs even when the model config already provides a value.

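For context, here is a rough sketch of the kind of precedence change being proposed; this is not the actual diff, and the helper name resolve_max_seq_length as well as the sbert_config/tokenizer_kwargs variable names are illustrative only:

# Illustrative sketch only -- not the real sentence_transformers code or this PR's diff.
# Assumed inputs: sbert_config is the dict loaded from sentence_bert_config.json
# (may contain "max_seq_length"), tokenizer_kwargs is what the user passed to
# SentenceTransformer(..., tokenizer_kwargs=...), tokenizer is the loaded tokenizer.

def resolve_max_seq_length(sbert_config, tokenizer_kwargs, tokenizer):
    # Current behavior (simplified): a value in sentence_bert_config.json is treated
    # as "user-provided" and wins, so tokenizer_kwargs["model_max_length"] is ignored.
    # Proposed behavior: an explicit tokenizer_kwargs value takes priority.
    if tokenizer_kwargs and "model_max_length" in tokenizer_kwargs:
        return tokenizer_kwargs["model_max_length"]
    if "max_seq_length" in sbert_config:
        return sbert_config["max_seq_length"]
    # Fall back to whatever the loaded tokenizer reports.
    return getattr(tokenizer, "model_max_length", None)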
@tomaarsen (Collaborator) commented
Hello!

This seems to be an issue only for the models where a sentence_bert_config.json specifies a max_seq_length: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/main/sentence_bert_config.json

This value is indeed seen as "user-provided", which has priority over any values from transformers (including the values you provide with tokenizer_kwargs). This is indeed a bit frustrating, but I don't really want to change it to just min(...) as I'd like to allow users to set whatever maximum sequence length they want (even if the tokenizer/model disagrees). I also risk backwards incompatibility if I change this.

You can avoid this with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})
# Set it explicitly after loading, since the value from sentence_bert_config.json wins otherwise
model.max_seq_length = 32

assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"
assert model[0].max_seq_length == 32, f"expected 32, but got {model[0].max_seq_length=}"

but that too isn't ideal.
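If you need this in several places, one option is to wrap the workaround in a small helper; this is purely illustrative and the load_with_max_seq_length name is made up:

from sentence_transformers import SentenceTransformer

def load_with_max_seq_length(model_name, max_seq_length):
    # Hypothetical convenience wrapper around the workaround above:
    # pass the limit to the tokenizer and also set it on the loaded model,
    # since the value from sentence_bert_config.json would otherwise win.
    model = SentenceTransformer(
        model_name,
        tokenizer_kwargs={"model_max_length": max_seq_length},
    )
    model.max_seq_length = max_seq_length
    return model

model = load_with_max_seq_length("BAAI/bge-small-en-v1.5", 32)
assert model.max_seq_length == 32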

  • Tom Aarsen

@amitport (Contributor, Author) commented
@tomaarsen I used the workaround and it's fine, but the current behavior is still a bug IMHO (and a silent one that may make models fail unexpectedly for the user).
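For example, nothing fails loudly; the model just keeps the config value (512 is what bge-small-en-v1.5's sentence_bert_config.json ships with, so the exact number depends on the model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})

# No error or warning: inputs are truncated at the config value (512 here),
# not at the 32 tokens the caller asked for.
print(model.max_seq_length)  # 512, not 32
model.encode("some text " * 400)  # runs fine, just not with the expected limit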

@tomaarsen (Collaborator) commented
Fair enough, I'll try to revisit this PR and see if there's a solid solution that doesn't break backwards compatibility, but also fixes this issue.
I merged master to avoid some issues with the CI tests.

  • Tom Aarsen
