Hi,
Thank you for this tremendously useful codebase! I am playing around with extending the TextTokenizer vocabulary and noticed that the size of the text embedding table, i.e. model.encoder.embed_tokens.weight.shape[0], is smaller than the size of the vocabulary, i.e. len(model.tokenizer.token_from_subword). Here is the code I am using to get those numbers:
import torch
from min_dalle import MinDalle

model = MinDalle(
    models_root='./pretrained',
    dtype=torch.float32,
    device='cuda',
    is_mega=False,
    is_reusable=True
)
print(model.encoder.embed_tokens.weight.shape, len(model.tokenizer.token_from_subword))
The output is as follows:
torch.Size([50264, 1024]) 50265
In the case of DALL-E Mega (is_mega=True), the embedding table is instead larger than the vocabulary size:
torch.Size([50272, 2048]) 50265
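To make the off-by-one concrete, here is a self-contained snippet (independent of min_dalle) showing that an embedding table with 50264 rows cannot look up token ID 50264:

import torch

embed = torch.nn.Embedding(50264, 1024)  # same row count as the mini model's table
try:
    embed(torch.tensor([50264]))  # one past the last valid row
except IndexError as err:
    print('lookup failed:', err)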
In practice, these discrepancies can be worked around by bounding the text token IDs to the size of the embedding table (a sketch of what I mean is below), so I am not too concerned about it. I just wanted to flag it as a potential issue. Thanks!
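For reference, the bounding I have in mind is just clamping every token ID to the last valid row of the embedding table before the lookup; a minimal sketch (the bound_tokens helper is purely illustrative, not something that exists in min_dalle):

def bound_tokens(token_ids, embed_tokens):
    # Clamp each ID so it never indexes past the embedding table.
    limit = embed_tokens.weight.shape[0]
    return [min(token_id, limit - 1) for token_id in token_ids]

print(bound_tokens([0, 50263, 50264], model.encoder.embed_tokens))  # [0, 50263, 50263] for the mini model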