Most files share similar data-reading code, for example:
Lines 18 to 22 in a9e8be5:

```python
train = list(read_dataset("../data/classes/train.txt"))
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))
nwords = len(w2i)
ntags = len(t2i)
```
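For context, `read_dataset` in these examples presumably indexes every token through `w2i` as it reads. The sketch below is an assumption about its shape (the exact parsing and file format may differ); the key point is that looking a token up in a `defaultdict` silently inserts it:

```python
from collections import defaultdict

# Assumed setup: indices are assigned on first lookup, and "<unk>" gets index 0.
w2i = defaultdict(lambda: len(w2i))
t2i = defaultdict(lambda: len(t2i))
UNK = w2i["<unk>"]

def read_dataset(filename):
    # Assumed "tag ||| sentence" format; every w2i[x] access inserts unseen tokens.
    with open(filename, "r") as f:
        for line in f:
            tag, words = line.lower().strip().split(" ||| ")
            yield ([w2i[x] for x in words.split(" ")], t2i[tag])
```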
In most of the examples, the variable nwords is used as the effective vocabulary size, for instance when allocating parameters for the embedding matrix:
Line 30 in a9e8be5:

```python
W_emb = model.add_lookup_parameters((nwords, EMB_SIZE)) # Word embeddings
```
However, the dev/test set likely contains many new words that get added to w2i: because w2i is a defaultdict, merely looking a word up inserts it. Their values are mapped to UNK, but they are still counted by len(w2i), which is probably not intended. Often this overcounting does not change the results, but it can be problematic in some cases.
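A minimal sketch of one way to avoid the overcounting (assuming the reader shape above; not necessarily the fix the maintainers would prefer) is to take the vocabulary size before dev/test is read, or to read dev/test with a lookup that never mutates w2i:

```python
# Option 1: freeze nwords right after the training data is read,
# so words that only appear in dev/test are not counted.
train = list(read_dataset("../data/classes/train.txt"))
nwords = len(w2i)          # vocabulary size from training data only
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))
ntags = len(t2i)

# Option 2: read dev/test with a non-inserting lookup so w2i stays unchanged.
# (read_dataset_frozen is a hypothetical helper for illustration.)
def read_dataset_frozen(filename, w2i, t2i, unk):
    with open(filename, "r") as f:
        for line in f:
            tag, words = line.lower().strip().split(" ||| ")
            yield ([w2i.get(x, unk) for x in words.split(" ")], t2i[tag])
```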