
nomic2vec.py: Work around missing NomicBertModel::*_input_embeddings #32

Open
vsrinivas wants to merge 1 commit into facebookexperimental:main from vsrinivas:vec2
Conversation

@vsrinivas

NomicBertModel is missing get_input_embeddings/set_input_embeddings, which are required by model2vec's distillation pipeline. Patch in the two methods.
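As a rough sketch of what "patch in the two methods" looks like (a stub class stands in for the real NomicBertModel here, and the `embeddings.word_embeddings` attribute path is an assumption -- check where your NomicBertModel revision actually stores its token table):

```python
# Minimal sketch of the workaround. NomicBertModelStub is a stand-in for the
# real NomicBertModel, which ships without either accessor.
class _Embeddings:
    def __init__(self, table):
        self.word_embeddings = table  # stand-in for an nn.Embedding

class NomicBertModelStub:
    def __init__(self, table):
        self.embeddings = _Embeddings(table)

def get_input_embeddings(self):
    # Assumed attribute path; adjust for the actual model layout.
    return self.embeddings.word_embeddings

def set_input_embeddings(self, value):
    self.embeddings.word_embeddings = value

# Patch the two methods onto the class before calling distill_from_model():
NomicBertModelStub.get_input_embeddings = get_input_embeddings
NomicBertModelStub.set_input_embeddings = set_input_embeddings

model = NomicBertModelStub(table="token-embedding-table")
print(model.get_input_embeddings())  # → token-embedding-table
```

With both accessors in place, model2vec's `reshape_embeddings` step can read and replace the embedding table instead of hitting the `NotImplementedError` shown below.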

Before:
...
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.

Full traceback of the error:
Traceback (most recent call last):
  File "/home/vsrinivas/WORK/semcode/./scripts/nomic2vec.py", line 501, in main
    m2v = distill_from_model(
        model=model,
    ...<3 lines>...
        device=args.device
    )
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/model2vec/distill/distillation.py", line 107, in distill_from_model
    model = reshape_embeddings(model, original_tokenizer_model)
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/skeletoken/external/transformers.py", line 58, in reshape_embeddings
    embedding = model.get_input_embeddings()
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1036, in get_input_embeddings
    raise NotImplementedError(
        f"`get_input_embeddings` not auto-handled for {self.__class__.__name__}; please override in the subclass."
    )
NotImplementedError: `get_input_embeddings` not auto-handled for NomicBertModel; please override in the subclass.

After:
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Encoding tokens: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249999/249999 [2:12:08<00:00, 31.53 tokens/s]
✓ Saved Model2Vec static model to /home/vsrinivas/WORK/semcode/nomic_v2_m2v
Fixing tokenizer configuration for semcode compatibility...
  Adding [UNK] token to vocabulary
  Set unk_id to 1 for [UNK] token
  ✓ Updated tokenizer configuration
  ✓ Updated tokenizer_config.json
⚠ Verification warning: Number of tokens (250000) does not match number of vectors (249999). Please provide a token mapping or ensure the number of tokens matches the number of vectors.
  The model was saved successfully but may need additional configuration for some tools

All done. The static model is ready for high-throughput CPU embedding.
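The one-off mismatch in the verification warning comes from adding [UNK] to the vocabulary without adding a matching row to the embedding matrix. One hypothetical way to reconcile it (not what nomic2vec.py does; `unk_vec`, the mean-vector choice, and the tiny matrix are all illustrative) is to append a vector for the new token:

```python
# Hypothetical repair sketch for the 250000-vs-249999 mismatch: the tokenizer
# gained [UNK] but the embedding matrix did not. Appending a row for [UNK]
# (here, the mean of the existing rows) restores the 1:1 token-to-vector
# mapping. Sizes are shrunk for illustration; the real matrix has 249999 rows.
vectors = [[1.0, 3.0], [3.0, 5.0]]  # stand-in embedding matrix
dim = len(vectors[0])
unk_vec = [sum(row[i] for row in vectors) / len(vectors) for i in range(dim)]
vectors.append(unk_vec)  # one vector per token again
print(len(vectors), unk_vec)  # → 3 [2.0, 4.0]
```

Whether the mean vector (versus zeros or a random row) is the right choice for [UNK] depends on how the downstream similarity search treats unknown tokens.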

@meta-cla

meta-cla Bot commented Apr 10, 2026

Hi @vsrinivas!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@masoncl
Contributor

masoncl commented Apr 10, 2026

There's no need to sign the CLA, just make sure your commits have Signed-off-by: tags. I'll take a look, thanks for sending this in

@vsrinivas
Author

Thanks!

By the way, one thing I was curious about in code close to this -- nomic2vec.py can fall back to model2vec.distill_from_sentence_transformer -- I couldn't find distill_from_sentence_transformer anywhere, even in the history of model2vec. How were you using that fallback?
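For what it's worth, a defensive way to express that fallback (a generic sketch: `resolve_fallback` is hypothetical, and a synthetic module stands in for model2vec here) is to resolve the symbol at runtime so a missing export fails with a clear message instead of an AttributeError:

```python
import types

def resolve_fallback(mod):
    # Hypothetical guard: look the symbol up at runtime rather than assuming
    # the installed model2vec exports distill_from_sentence_transformer.
    fn = getattr(mod, "distill_from_sentence_transformer", None)
    if fn is None:
        raise RuntimeError(
            "model2vec exposes no distill_from_sentence_transformer; "
            "use distill_from_model instead"
        )
    return fn

# Stand-in module without the symbol, mimicking the releases checked above:
fake_distill = types.ModuleType("model2vec.distill")
try:
    resolve_fallback(fake_distill)
except RuntimeError as err:
    print(err)
```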

NomicBertModel is missing get_input_embeddings/set_input_embeddings,
which are required by model2vec's distillation pipeline. Patch in the
two methods.

Before:
...
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.

Full traceback of the error:
Traceback (most recent call last):
  File "/home/vsrinivas/WORK/semcode/./scripts/nomic2vec.py", line 501, in main
    m2v = distill_from_model(
        model=model,
    ...<3 lines>...
        device=args.device
    )
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/model2vec/distill/distillation.py", line 107, in distill_from_model
    model = reshape_embeddings(model, original_tokenizer_model)
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/skeletoken/external/transformers.py", line 58, in reshape_embeddings
    embedding = model.get_input_embeddings()
  File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1036, in get_input_embeddings
    raise NotImplementedError(
        f"`get_input_embeddings` not auto-handled for {self.__class__.__name__}; please override in the subclass."
    )
NotImplementedError: `get_input_embeddings` not auto-handled for NomicBertModel; please override in the subclass.

After:
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Encoding tokens: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249999/249999 [2:12:08<00:00, 31.53 tokens/s]
✓ Saved Model2Vec static model to /home/vsrinivas/WORK/semcode/nomic_v2_m2v
Fixing tokenizer configuration for semcode compatibility...
  Adding [UNK] token to vocabulary
  Set unk_id to 1 for [UNK] token
  ✓ Updated tokenizer configuration
  ✓ Updated tokenizer_config.json
⚠ Verification warning: Number of tokens (250000) does not match number of vectors (249999). Please provide a token mapping or ensure the number of tokens matches the number of vectors.
  The model was saved successfully but may need additional configuration for some tools

All done. The static model is ready for high-throughput CPU embedding.

Assisted-by: Claude:gemma-4-26B-A4B
Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
@vsrinivas
Author

> There's no need to sign the CLA, just make sure your commits have Signed-off-by: tags. I'll take a look, thanks for sending this in

Updated w/ an Assisted-by footer; just found out about them -- this was assisted by Claude Code running against a local gemma4 model.
