nomic2vec.py: Work around missing NomicBertModel::*_input_embeddings #32
vsrinivas wants to merge 1 commit into facebookexperimental:main
Conversation
Hi @vsrinivas! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
There's no need to sign the CLA; just make sure your commits have Signed-off-by: tags. I'll take a look. Thanks for sending this in!
Thanks! By the way, one thing I was curious about in code close to this: nomic2vec.py can fall back to model2vec.distill_from_sentence_transformer, but I couldn't find distill_from_sentence_transformer anywhere, even in the history of model2vec. How were you using that fallback?
NomicBertModel is missing get_input_embeddings/set_input_embeddings,
which are required by model2vec's distillation pipeline. Patch in the
two methods.
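A minimal sketch of what such a patch could look like, written as a generic helper. The default attribute path `embeddings.word_embeddings` is an assumption about where NomicBertModel keeps its token-embedding table, not something confirmed from its source; adjust it to the model's actual layout.

```python
def patch_input_embeddings(model_cls, attr_path=("embeddings", "word_embeddings")):
    """Attach the get/set_input_embeddings accessors that model2vec's
    distillation pipeline expects but the model class does not define.

    attr_path names where the token-embedding table lives; the default
    "embeddings.word_embeddings" is an assumed layout.
    """
    def get_input_embeddings(self):
        # Walk the attribute path down to the embedding table.
        obj = self
        for name in attr_path:
            obj = getattr(obj, name)
        return obj

    def set_input_embeddings(self, value):
        # Walk to the parent object, then replace the final attribute.
        parent = self
        for name in attr_path[:-1]:
            parent = getattr(parent, name)
        setattr(parent, attr_path[-1], value)

    model_cls.get_input_embeddings = get_input_embeddings
    model_cls.set_input_embeddings = set_input_embeddings
```

Calling `patch_input_embeddings(NomicBertModel)` once, before `distill_from_model`, would be enough for skeletoken's `reshape_embeddings` to find the accessors.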
Before:
...
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Full traceback of the error:
Traceback (most recent call last):
File "/home/vsrinivas/WORK/semcode/./scripts/nomic2vec.py", line 501, in main
m2v = distill_from_model(
model=model,
...<3 lines>...
device=args.device
)
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/model2vec/distill/distillation.py", line 107, in distill_from_model
model = reshape_embeddings(model, original_tokenizer_model)
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/skeletoken/external/transformers.py", line 58, in reshape_embeddings
embedding = model.get_input_embeddings()
File "/home/vsrinivas/semcode-vectors2/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1036, in get_input_embeddings
raise NotImplementedError(
f"`get_input_embeddings` not auto‑handled for {self.__class__.__name__}; please override in the subclass."
)
NotImplementedError: `get_input_embeddings` not auto‑handled for NomicBertModel; please override in the subclass.
After:
HuggingFace tokenizer defines a pad_token, but the Skeletoken model does not. Setting it to '<pad>'.
Encoding tokens: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249999/249999 [2:12:08<00:00, 31.53 tokens/s]
✓ Saved Model2Vec static model to /home/vsrinivas/WORK/semcode/nomic_v2_m2v
Fixing tokenizer configuration for semcode compatibility...
Adding [UNK] token to vocabulary
Set unk_id to 1 for [UNK] token
✓ Updated tokenizer configuration
✓ Updated tokenizer_config.json
⚠ Verification warning: Number of tokens (250000) does not match number of vectors (249999). Please provide a token mapping or ensure the number of tokens matches the number of vectors.
The model was saved successfully but may need additional configuration for some tools
All done. The static model is ready for high‑throughput CPU embedding.
Assisted-by: Claude:gemma-4-26B-A4B
Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
Updated with an Assisted-by footer; just found out about them -- this was assisted by Claude Code running against a local gemma4 model.