This repository was archived by the owner on Jul 12, 2021. It is now read-only.
Subsampling of frequent words #8
Open
Description
I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, and specifically how you generate your sampling table in get_word_discard_probas().
Something like this worked for me when passing a collections.Counter or dict with the item counts:
import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    """Return the keep probability for each item, following the word2vec subsampling formula."""
    counts = np.array(list(item_counts.values()))
    total_count = counts.sum()
    # p_keep = (sqrt(f / sample) + 1) * sample / f, where f = count / total_count
    probabilities = (np.sqrt(counts / (sample * total_count)) + 1) * (sample * total_count) / counts
    # Only useful if you wish to plot the probability distribution
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}
Using 1e-5 for sample on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course differ a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
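For what it's worth, here is a minimal way to exercise the function above; the counts are made up for illustration and not taken from any real dataset:

import collections

# Toy counts, purely illustrative.
toy_counts = collections.Counter({"the": 50_000, "cat": 500, "sat": 300, "hapax": 1})
probas = sampling_probabilities(toy_counts, sample=1e-5)
for word, p in probas.items():
    # Values above 1.0 just mean the item is always kept;
    # very frequent items get a much smaller keep probability.
    print(word, round(p, 4))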