---
title: "Word Embeddings: Finding Word Analogies"
author: "Ryan Wesslen"
date: "July 27, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
## Word Embeddings - GloVe model
### References
The first two references are demos of running GloVe with the R `text2vec` package. The third reference, from the original authors, summarizes the details of the model.
* [Reference 1](http://dsnotes.com/post/glove-enwiki/)
* [Reference 2](https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html)
* [Original GloVe Website](http://nlp.stanford.edu/projects/glove/)
### Load the Data
First, let's load the data. We'll use the `readr` package, which is included in the `tidyverse`.
```{r load data}
library(tidyverse) #install.packages("tidyverse") if you do not have the package
# one-column csv of tweets -- the text column is named `body`
file <- "../data/pres_tweets.csv"
data <- read_csv(file)
```
### Preprocessing
Let's first run preprocessing to tokenize the text and create the vocabulary.
Notice that `text2vec` does not use the standard `quanteda` preprocessing framework because it builds a term co-occurrence matrix rather than a document-term matrix (`dfm`).
```{r}
library(text2vec) # install.packages("text2vec") if needed; stringr and quanteda are also used below

# remove punctuation
text <- stringr::str_replace_all(data$body, "[[:punct:]]", "")

# lowercase the text and tokenize on spaces
tokens <- regexp_tokenizer(quanteda::char_tolower(text), " ", TRUE)

# create an iterator over the tokens
it <- itoken(tokens, progressbar = FALSE)

# create the vocabulary: terms will be unigrams (single words), with stopwords removed
vocab <- create_vocabulary(it, stopwords = c(quanteda::stopwords("english"), ""))
```
Let's remove sparse and very frequent terms.
```{r}
# keep terms that appear in at least 0.5% and at most 99% of documents
vocab <- prune_vocabulary(vocab, doc_proportion_min = 0.005, doc_proportion_max = 0.99)
```
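As a quick (optional) check, we can look at how many terms survived pruning and which of them are most frequent. This sketch assumes the pruned vocabulary keeps the `$vocab` table with `terms` and `terms_counts` columns that we use below.
```{r}
# Optional diagnostic: vocabulary size and most frequent surviving terms
nrow(vocab$vocab)
head(vocab$vocab[order(-vocab$vocab$terms_counts), ], 10)
```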
Let's plot each term's word count against its document count.
```{r}
library(scatterD3)
scatterD3(vocab$vocab$terms_counts,
          vocab$vocab$doc_counts,
          lab = vocab$vocab$terms,
          xlab = "Word Counts (# of times word is used in corpus)",
          ylab = "Document Counts (# of docs word is in)")
```
Let's now create the co-occurrence matrix.
```{r}
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab,
                               # don't vectorize input (no document-term matrix)
                               grow_dtm = FALSE,
                               # use a window of 5 words for context
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)
```
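As a sanity check, the term co-occurrence matrix should be square, with one row and one column per term in our pruned vocabulary.
```{r}
# Sanity check: the TCM is square, one row/column per vocabulary term
dim(tcm)
nrow(vocab$vocab)
```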
Let's use three out of my computer's four cores and run the GloVe model.
```{r}
RcppParallel::setThreadOptions(numThreads = 3)
GloVe = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
GloVe$fit(tcm, n_iter = 50)
```
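Note that this constructor matches the `text2vec` release used here. If you are on a newer release, the interface has changed; the sketch below shows roughly what the newer-style call looks like, but check `?GlobalVectors` for your installed version before relying on it.
```{r, eval=FALSE}
# Rough sketch for newer text2vec releases (not run; verify against your version's docs):
# the embedding size is set with `rank`, fitting returns the main word vectors,
# and the context vectors live in `$components`.
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 50)
word_vectors_new <- wv_main + t(glove$components)
```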
Let's get the 50-dimensional feature vectors for the words, then use t-SNE to project them down to two dimensions so we can plot each word...
```{r}
word_vectors <- GloVe$get_word_vectors()
#write.csv(word_vectors, "./GloVe-features.csv", row.names = T)
# install.packages("Rtsne")
tsne <- Rtsne::Rtsne(word_vectors, dims = 2, verbose=TRUE, max_iter = 500)
# put the two t-SNE dimensions into a data frame
df <- data.frame(X = tsne$Y[,1], Y = tsne$Y[,2])
scatterD3(df$X,
          df$Y,
          lab = row.names(word_vectors),
          xlab = "t-SNE Dimension 1",
          ylab = "t-SNE Dimension 2")
```
This doesn't look like much because we have squeezed all 50 dimensions into a two-dimensional plot. In this sense, the method is like principal components analysis: we have "reduced" the information in the co-occurrence counts down to 50 features per word. To compare words directly, we need a better way to measure the similarity between them.
So let's use cosine similarity, a measure (ranging from -1 to 1) of how similar two words are based on their 50-dimensional vectors.
Let's compute cosine similarity to create a matrix of how similar each word is to every other word.
```{r}
cos_sim = sim2(x = word_vectors, y = word_vectors, method = "cosine", norm = "l2")
#write_csv(as.data.frame(cos_sim), "cos-sim.csv")
```
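To make the cosine measure concrete, here is the same calculation done "by hand" for a single pair of words. The pair "border"/"wall" is just an example; substitute any two terms that survived pruning in your vocabulary.
```{r}
# Cosine similarity "by hand" for one example pair -- should match sim2()
# up to floating-point error ("border" and "wall" are example terms)
a <- word_vectors["border", ]
b <- word_vectors["wall", ]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim["border", "wall"]
```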
First, let's build the raw (unfiltered) similarity network.
```{r}
library(visNetwork); library(igraph)
g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)
t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F),
           data.frame(name = as.character(vocab$vocab$terms),
                      Count = vocab$vocab$terms_counts, stringsAsFactors = F),
           by = "name")
```
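Because almost every pair of terms has some non-zero cosine similarity, this raw network is essentially fully connected, which is why we threshold it in the next step. A quick look at its size:
```{r}
# The unfiltered network connects nearly every pair of terms,
# so expect on the order of n*(n-1)/2 edges
vcount(g)
ecount(g)
```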
Next, using an arbitrary cosine-similarity threshold (which we can modify), we can create a network plot of how similar the words are under the GloVe model.
```{r}
cutoff = 0.47
g <- igraph::graph.adjacency(cos_sim, mode="undirected", weighted=TRUE, diag=FALSE)
E(g)$width <- abs(E(g)$weight)*10
t <- merge(data.frame(name = as.character(V(g)$name), stringsAsFactors = F),
           data.frame(name = as.character(vocab$vocab$terms),
                      Count = vocab$vocab$terms_counts, stringsAsFactors = F),
           by = "name")
V(g)$size <- t$Count[match(V(g)$name, t$name)] / 20 + 5
g <- delete.edges(g, E(g)[ abs(weight) < cutoff])
E(g)$color <- ifelse(E(g)$weight > 0, "blue","red")
E(g)$weight <- abs(E(g)$weight)
iso <- V(g)[degree(g)==0]
g <- delete.vertices(g, iso)
clp <- igraph::cluster_label_prop(g)
class(clp)
V(g)$color <- RColorBrewer::brewer.pal(12, "Set3")[as.factor(clp$membership)]
visIgraph(g) %>% visOptions(highlightNearest = list(enabled = TRUE, algorithm = "hierarchical")) %>%
visInteraction(navigationButtons = TRUE)
```
This shows connections between words that are used in similar contexts (within five words of each other). Think of this like a "localized" or "micro" topic model in which the document level is replaced by a rolling five-word context window.
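If you would rather read the clusters as plain word lists instead of picking them out of the interactive plot, you can split the vertex names by the label-propagation membership found above.
```{r}
# Inspect the label-propagation clusters as word lists
# (uses the `g` and `clp` objects from the previous chunk)
clusters <- split(V(g)$name, clp$membership)
head(clusters, 5)
```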
## Linear Substructures
First, given a word (say "border"), we can find which words were used most similarly (i.e., in the same contexts).
```{r}
word <- word_vectors["border", , drop = FALSE]
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
```
Linear substructures allow "algebra" on words: we can add (+) or subtract (-) word vectors. For example, we can see which words are closest to "trump" plus "president" minus "wall".
```{r}
word <- word_vectors["trump", , drop = FALSE] +
word_vectors["president", , drop = FALSE] -
word_vectors["wall", , drop = FALSE]
cos_sim_ex = sim2(x = word_vectors, y = word, method = "cosine", norm = "l2")
head(sort(cos_sim_ex[,1], decreasing = TRUE), 10)
```
See the [GloVe website](http://nlp.stanford.edu/projects/glove/) for a better explanation.