
max_seq_length should not be larger than any options #3255


Open
wants to merge 2 commits into master

Conversation

@amitport (Contributor) commented Mar 1, 2025

Hi,

when loading an auto-model, max_seq_length is read directly from Hugging Face and it cannot be overridden easily.

example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})

# Fails on the current release: the value from sentence_bert_config.json wins
assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"

This PR ensures that max_seq_length can be overridden via tokenizer_kwargs even when the model config already provides a value.

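For context, here is a rough sketch of the kind of precedence change being proposed; this is not the actual diff, and the helper name resolve_max_seq_length as well as the sbert_config/tokenizer_kwargs variable names are illustrative only:

# Illustrative sketch only -- not the real sentence_transformers code or this PR's diff.
# Assumed inputs: sbert_config is the dict loaded from sentence_bert_config.json
# (may contain "max_seq_length"), tokenizer_kwargs is what the user passed to
# SentenceTransformer(..., tokenizer_kwargs=...), tokenizer is the loaded tokenizer.

def resolve_max_seq_length(sbert_config, tokenizer_kwargs, tokenizer):
    # Current behavior (simplified): a value in sentence_bert_config.json is treated
    # as "user-provided" and wins, so tokenizer_kwargs["model_max_length"] is ignored.
    # Proposed behavior: an explicit tokenizer_kwargs value takes priority.
    if tokenizer_kwargs and "model_max_length" in tokenizer_kwargs:
        return tokenizer_kwargs["model_max_length"]
    if "max_seq_length" in sbert_config:
        return sbert_config["max_seq_length"]
    # Fall back to whatever the loaded tokenizer reports.
    return getattr(tokenizer, "model_max_length", None)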
@tomaarsen (Collaborator) commented
Hello!

This seems to be an issue only for the models where a sentence_bert_config.json specifies a max_seq_length: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/main/sentence_bert_config.json

This value is indeed seen as "user-provided", which has priority over any values from transformers (including the values you provide with tokenizer_kwargs). This is indeed a bit frustrating, but I don't really want to change it to just min(...) as I'd like to allow users to set whatever maximum sequence length they want (even if the tokenizer/model disagrees). I also risk backwards incompatibility if I change this.

You can avoid this with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})
# Set it explicitly after loading, since the value from sentence_bert_config.json wins otherwise
model.max_seq_length = 32

assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"
assert model[0].max_seq_length == 32, f"expected 32, but got {model[0].max_seq_length=}"

but that too isn't ideal.
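If you need this in several places, one option is to wrap the workaround in a small helper; this is purely illustrative and the load_with_max_seq_length name is made up:

from sentence_transformers import SentenceTransformer

def load_with_max_seq_length(model_name, max_seq_length):
    # Hypothetical convenience wrapper around the workaround above:
    # pass the limit to the tokenizer and also set it on the loaded model,
    # since the value from sentence_bert_config.json would otherwise win.
    model = SentenceTransformer(
        model_name,
        tokenizer_kwargs={"model_max_length": max_seq_length},
    )
    model.max_seq_length = max_seq_length
    return model

model = load_with_max_seq_length("BAAI/bge-small-en-v1.5", 32)
assert model.max_seq_length == 32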

  • Tom Aarsen

@amitport (Contributor, Author) commented
@tomaarsen I used the workaround and it's fine, but the current behavior is still a bug IMHO (and a silent one that may make models fail unexpectedly for the user).
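For example, nothing fails loudly; the model just keeps the config value (512 is what bge-small-en-v1.5's sentence_bert_config.json ships with, so the exact number depends on the model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})

# No error or warning: inputs are truncated at the config value (512 here),
# not at the 32 tokens the caller asked for.
print(model.max_seq_length)  # 512, not 32
model.encode("some text " * 400)  # runs fine, just not with the expected limit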

@tomaarsen (Collaborator) commented
Fair enough, I'll try to revisit this PR and see if there's a solid solution that doesn't break backwards compatibility, but also fixes this issue.
I merged master to avoid some issues with the CI tests.

  • Tom Aarsen
