Most files share similar data-reading code, for example:
Lines 18 to 22 in a9e8be5:

```python
train = list(read_dataset("../data/classes/train.txt"))
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))
nwords = len(w2i)
ntags = len(t2i)
```
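For context, `read_dataset` in these examples presumably indexes every token through `w2i` as it reads. The sketch below is an assumption about its shape (the exact parsing and file format may differ); the key point is that looking a token up in a `defaultdict` silently inserts it:

```python
from collections import defaultdict

# Assumed setup: indices are assigned on first lookup, and "<unk>" gets index 0.
w2i = defaultdict(lambda: len(w2i))
t2i = defaultdict(lambda: len(t2i))
UNK = w2i["<unk>"]

def read_dataset(filename):
    # Assumed "tag ||| sentence" format; every w2i[x] access inserts unseen tokens.
    with open(filename, "r") as f:
        for line in f:
            tag, words = line.lower().strip().split(" ||| ")
            yield ([w2i[x] for x in words.split(" ")], t2i[tag])
```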
In most of the examples, the variable nwords is used as the effective vocabulary size, for instance when allocating parameters for the embedding matrix:
Line 30 in a9e8be5:

```python
W_emb = model.add_lookup_parameters((nwords, EMB_SIZE)) # Word embeddings
```
However, the dev/test set likely contains many new words that get added to w2i: because w2i is a defaultdict, merely looking a word up inserts it. Their values are mapped to UNK, but they are still counted by len(w2i), which is probably not intended. Often this overcounting does not change the results, but it can be problematic in some cases.
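A minimal sketch of one way to avoid the overcounting (assuming the reader shape above; not necessarily the fix the maintainers would prefer) is to take the vocabulary size before dev/test is read, or to read dev/test with a lookup that never mutates w2i:

```python
# Option 1: freeze nwords right after the training data is read,
# so words that only appear in dev/test are not counted.
train = list(read_dataset("../data/classes/train.txt"))
nwords = len(w2i)          # vocabulary size from training data only
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))
ntags = len(t2i)

# Option 2: read dev/test with a non-inserting lookup so w2i stays unchanged.
# (read_dataset_frozen is a hypothetical helper for illustration.)
def read_dataset_frozen(filename, w2i, t2i, unk):
    with open(filename, "r") as f:
        for line in f:
            tag, words = line.lower().strip().split(" ||| ")
            yield ([w2i.get(x, unk) for x in words.split(" ")], t2i[tag])
```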