nomic2vec.py: Work around missing NomicBertModel::*_input_embeddings #32
vsrinivas wants to merge 1 commit into facebookexperimental:main
Conversation
Hi @vsrinivas! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
There's no need to sign the CLA; just make sure your commits have Signed-off-by: tags. I'll take a look. Thanks for sending this in!
Thanks! By the way, one thing I was curious about in code close to this: nomic2vec.py can fall back to model2vec.distill_from_sentence_transformer, but I couldn't find distill_from_sentence_transformer anywhere, even in the history of model2vec. How were you using that fallback?
NomicBertModel is missing get_input_embeddings/set_input_embeddings,
which are required by model2vec's distillation pipeline. Patch in the
two methods.
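A minimal sketch of what such a patch could look like, written as a generic helper. The default attribute path `embeddings.word_embeddings` is an assumption about where NomicBertModel keeps its token-embedding table, not something confirmed from its source; adjust it to the model's actual layout.

```python
def patch_input_embeddings(model_cls, attr_path=("embeddings", "word_embeddings")):
    """Attach the get/set_input_embeddings accessors that model2vec's
    distillation pipeline expects but the model class does not define.

    attr_path names where the token-embedding table lives; the default
    "embeddings.word_embeddings" is an assumed layout.
    """
    def get_input_embeddings(self):
        # Walk the attribute path down to the embedding table.
        obj = self
        for name in attr_path:
            obj = getattr(obj, name)
        return obj

    def set_input_embeddings(self, value):
        # Walk to the parent object, then replace the final attribute.
        parent = self
        for name in attr_path[:-1]:
            parent = getattr(parent, name)
        setattr(parent, attr_path[-1], value)

    model_cls.get_input_embeddings = get_input_embeddings
    model_cls.set_input_embeddings = set_input_embeddings
```

Calling `patch_input_embeddings(NomicBertModel)` once, before `distill_from_model`, would be enough for skeletoken's `reshape_embeddings` to find the accessors.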
Before:
...
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Full traceback of the error:
Traceback (most recent call last):
File "/home/vsrinivas/WORK/semcode/./scripts/nomic2vec.py", line 501, in main
m2v = distill_from_model(
model=model,
...<3 lines>...
device=args.device
)
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/model2vec/distill/distillation.py", line 107, in distill_from_model
model = reshape_embeddings(model, original_tokenizer_model)
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/skeletoken/external/transformers.py", line 58, in reshape_embeddings
embedding = model.get_input_embeddings()
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1036, in get_input_embeddings
raise NotImplementedError(
f"`get_input_embeddings` not auto‑handled for {self.__class__.__name__}; please override in the subclass."
)
NotImplementedError: `get_input_embeddings` not auto‑handled for NomicBertModel; please override in the subclass.
After:
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Encoding tokens: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249999/249999 [2:12:08<00:00, 31.53 tokens/s]
✓ Saved Model2Vec static model to /home/vsrinivas/WORK/semcode/nomic_v2_m2v
Fixing tokenizer configuration for semcode compatibility...
Adding [UNK] token to vocabulary
Set unk_id to 1 for [UNK] token
✓ Updated tokenizer configuration
✓ Updated tokenizer_config.json
⚠ Verification warning: Number of tokens (250000) does not match number of vectors (249999). Please provide a token mapping or ensure the number of tokens matches the number of vectors.
The model was saved successfully but may need additional configuration for some tools
All done. The static model is ready for high‑throughput CPU embedding.
Assisted-by: Claude:gemma-4-26B-A4B
Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
Updated with an Assisted-by footer; just found out about them -- this was assisted by Claude Code running against a local gemma4 model.