This repository was archived by the owner on Jul 12, 2021. It is now read-only.

Subsampling of frequent words #8

@thomalm

I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, specifically how you generate your sampling table in get_word_discard_probas(). It looks like your implementation differs slightly from the original word2vec implementation: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L407.
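
For reference, the keep-probability that both the C code and the snippet below compute can be written out in plain terms as:

    P(keep w) = (sqrt(f(w) / (s * T)) + 1) * (s * T) / f(w)

where f(w) is the raw count of word w, T is the total token count, and s is the sample threshold.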

Something like this worked for me when passing a collections.Counter or a dict of item counts.

import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    """Keep-probability per item, following the original word2vec C implementation."""
    counts = np.array(list(item_counts.values()))
    total_count = counts.sum()
    # Mikolov's formula: (sqrt(f / (s * T)) + 1) * (s * T) / f
    probabilities = (np.sqrt(counts / (sample * total_count)) + 1) * (sample * total_count) / counts
    # Only useful if you wish to plot the probability distribution
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
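
For illustration, here is a minimal usage sketch. The toy corpus is made up, and the sample threshold is deliberately exaggerated, since 1e-5 would discard nearly everything in a corpus this small:

import collections
import random

tokens = "the cat sat on the mat the cat".split()  # toy corpus for illustration
counts = collections.Counter(tokens)
probas = sampling_probabilities(counts, sample=0.05)  # exaggerated threshold for the toy corpus

# Draw a keep/drop decision per occurrence, as the C code does per token
subsampled = [w for w in tokens if random.random() < probas[w]]

Note that the probabilities are intentionally not clipped to 1.0: random.random() < p keeps such words unconditionally anyway, which is why the np.minimum line above is only needed for plotting.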

Using sample=1e-5 on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course vary a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
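
For what it's worth, that figure is easy to check on any dataset, assuming the counts and probas variables from the sketch above:

most_common_item, _ = counts.most_common(1)[0]
print(f"{most_common_item}: kept with probability {probas[most_common_item]:.2%}")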
