This repository was archived by the owner on Jul 12, 2021. It is now read-only.

Subsampling of frequent words #8

@thomalm

I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, specifically how you generate your sampling table in get_word_discard_probas(). It looks like your implementation differs slightly from the original word2vec implementation: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L407.
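
For reference, the keep-probability that both the C code and the snippet below compute can be written out in plain terms as:

    P(keep w) = (sqrt(f(w) / (s * T)) + 1) * (s * T) / f(w)

where f(w) is the raw count of word w, T is the total token count, and s is the sample threshold.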

Something like this worked for me when passing a collections.Counter or a dict of item counts.

import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    """Keep-probability per item, following the original word2vec C implementation."""
    counts = np.array(list(item_counts.values()))
    total_count = counts.sum()
    # Mikolov's formula: (sqrt(f / (s * T)) + 1) * (s * T) / f
    probabilities = (np.sqrt(counts / (sample * total_count)) + 1) * (sample * total_count) / counts
    # Only useful if you wish to plot the probability distribution
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
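
For illustration, here is a minimal usage sketch. The toy corpus is made up, and the sample threshold is deliberately exaggerated, since 1e-5 would discard nearly everything in a corpus this small:

import collections
import random

tokens = "the cat sat on the mat the cat".split()  # toy corpus for illustration
counts = collections.Counter(tokens)
probas = sampling_probabilities(counts, sample=0.05)  # exaggerated threshold for the toy corpus

# Draw a keep/drop decision per occurrence, as the C code does per token
subsampled = [w for w in tokens if random.random() < probas[w]]

Note that the probabilities are intentionally not clipped to 1.0: random.random() < p keeps such words unconditionally anyway, which is why the np.minimum line above is only needed for plotting.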

Using sample=1e-5 on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course vary a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
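
For what it's worth, that figure is easy to check on any dataset, assuming the counts and probas variables from the sketch above:

most_common_item, _ = counts.most_common(1)[0]
print(f"{most_common_item}: kept with probability {probas[most_common_item]:.2%}")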
