---
title: "Unsupervised: Topic Modeling"
author: Ryan Wesslen
date: July 27, 2017
output: html_document
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
### Unsupervised machine learning
In this exercise, we're going to run topic modeling on Tweets.
Topic modeling is an unsupervised method that finds hidden word co-occurrence patterns ("topics") within our data.
One problem with topic modeling on Twitter is that Tweets are so short that it's tough to find topics at the individual-Tweet level. One "quick fix" is to aggregate Tweets by user (e.g., combine all of my Tweets and treat them as one document).
A benefit of topic modeling is that each document gets a probability for every topic, so you can measure the mix of topics across documents. In this example, our documents "are" Twitter users. Therefore, we can analyze a user-topic probability matrix and get an idea of who is tweeting about what.
### Step 1: Read in the data.
Let's read in our Charlotte sample dataset.
```{r}
library(tidyverse)
#remove one of the "." if you are running as chunks
tweets <- read_csv('../data/CharlotteTweets20Sample.csv')
source('./functions.R')
#updated for quanteda version 0.9.9-50
library(quanteda)
twcorpus <- corpus(tweets$body)
```
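Optionally, we can take a quick look at the corpus before building the DFM. This is just a small sketch using quanteda's `ndoc` and `summary` functions.
```{r}
# how many Tweets are in the corpus, and what do the first few look like?
ndoc(twcorpus)
summary(twcorpus, n = 5)
```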
Let's add the user ID number as a document-level attribute (the documents are originally at the Tweet level). We'll use the `groups` option in `dfm` to aggregate our Tweets by user (`actor.id`).
Then let's create the DFM object and trim out words that appear in fewer than three documents.
```{r}
docvars(twcorpus, "actor.id") <- as.character(tweets$actor.id)
twdfm <- dfm(twcorpus,
             groups = "actor.id",
             remove = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"),
             remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE,
             remove_url = TRUE,
             ngrams = 1L)
twdfm <- dfm_trim(twdfm, min_docfreq = 3)
```
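As a quick check, we can see how many documents (users) and features remain after grouping and trimming. This is a minimal sketch.
```{r}
# after grouping by user and trimming, each row is one user and each
# column is one remaining word
dim(twdfm)
```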
Let's look at the top 50 words.
```{r}
topfeatures(twdfm, 50)
```
No surprise that Charlotte-related terms are the most popular. Recall that this dataset contains only geolocated Tweets (non-geolocated Tweets were excluded).
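If you prefer a plot over a printed list, a quick bar chart of the most frequent terms is one option. This is a small sketch using base R graphics.
```{r}
# bar chart of the 20 most frequent terms (base R)
topterms <- topfeatures(twdfm, 20)
barplot(sort(topterms), horiz = TRUE, las = 1, cex.names = 0.7,
        main = "Top 20 terms")
```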
### Step 2: Text exploratory analysis.
Let's do some exploratory analysis to understand the content.
```{r warning=FALSE}
library(RColorBrewer)
textplot_wordcloud(twdfm,
                   scale = c(3.5, .75),
                   colors = brewer.pal(8, "Dark2"),
                   random.order = FALSE,
                   rot.per = 0.1,
                   max.words = 250)
```
We can also rerun the word cloud, weighting not by raw term frequency but by TF-IDF.
```{r warning = FALSE}
textplot_wordcloud(tfidf(twdfm),
                   scale = c(3.5, .75),
                   colors = brewer.pal(8, "Dark2"),
                   random.order = FALSE,
                   rot.per = 0.1,
                   max.words = 250)
```
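We can also print the top TF-IDF-weighted features directly, to compare against the raw counts above. This is a small sketch; `topfeatures` can be applied to the weighted DFM just as it was to the raw one.
```{r}
# top 20 features by TF-IDF weight rather than raw counts
topfeatures(tfidf(twdfm), 20)
```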
We can also run a hierarchical clustering of the top words, using TF-IDF weights.
```{r}
numWords <- 35
wordDfm <- sort(tfidf(twdfm))        # weight by TF-IDF, as in the word cloud above
wordDfm <- t(wordDfm)[1:numWords, ]  # keep the top numWords words
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="TF-IDF Frequency weighting")
```
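If you want to cut the dendrogram into a fixed number of word groups, `cutree` from base R is one option. A minimal sketch follows; the choice of 5 groups is arbitrary.
```{r}
# cut the dendrogram into 5 groups of words (the number is arbitrary)
cutree(wordCluster, k = 5)
```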
### Step 3: Run topic modeling.
Now, let's run a simple 20-topic model.
We'll use the package `topicmodels`.
```{r}
# install.packages("topicmodels")
library(topicmodels)
# we now export to a format that we can run the topic model with
dtm <- convert(twdfm, to="topicmodels")
# estimate LDA with K topics
K <- 20
lda <- LDA(dtm, k = K, method = "Gibbs",
           control = list(verbose = 25L, seed = 123, burnin = 100, iter = 500))
```
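We chose `K = 20` somewhat arbitrarily. One rough way to compare choices of K is to refit the model for a few values and compare log-likelihoods. This is only a sketch: fitting several models is slow, and log-likelihood alone shouldn't decide K.
```{r, eval=FALSE}
# rough comparison of a few values of K by log-likelihood (slow!)
Ks <- c(10, 20, 30)
fits <- lapply(Ks, function(k)
  LDA(dtm, k = k, method = "Gibbs",
      control = list(seed = 123, burnin = 100, iter = 500)))
# inspect the log-likelihood of each fit (higher is better, all else equal)
lapply(fits, logLik)
```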
Let's explore the topics.
```{r}
term <- terms(lda, 10)
colnames(term) <- paste("Topic",1:K)
term
```
How can we interpret these topics?
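`terms` shows only the top words. To see how concentrated a topic is, we can also look at the word probabilities from the fitted model. A small sketch using `posterior`, here for Topic 8:
```{r}
# word probabilities for a single topic (topic 8), sorted
probterms <- posterior(lda)$terms
sort(probterms[8, ], decreasing = TRUE)[1:10]
```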
### Step 4: Interactive topic visualization.
If you want to run an interactive LDA visualization, run this chunk:
```{r, eval=FALSE, include=T}
#Create Json for LDAVis
library(LDAvis)
json <- topicmodels_json_ldavis(lda, twdfm, dtm)
new.order <- RJSONIO::fromJSON(json)$topic.order
# note: LDAvis reorders the topics, so realign the term matrix
term <- term[,new.order]
serVis(json, out.dir = 'charlotteLDA', open.browser = TRUE)
```
### Step 5: Finding Topic "Experts"
Just as topics are probability distributions over words, in LDA documents are probability distributions over topics. In our case, since documents are at the user level, each user's combined Tweets are treated as a distribution over topics.
Accordingly, we can rank Twitter users by their probability for a given topic.
First, let's extract the document-topic probability matrix.
```{r}
# to get topic probabilities per actor ID (Twitter user)
postlist <- posterior(lda)
probtopics <- data.frame(postlist$topics)
#probtopics <- probtopics[,new.order]
colnames(probtopics) <- paste("Topic",1:K)
```
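Before picking a single topic, it can help to see which topics are most prevalent overall by averaging each topic's probability across users. A minimal sketch:
```{r}
# average probability of each topic across all users
sort(colMeans(probtopics), decreasing = TRUE)
```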
Next, let's find the "expert" for Topic 8.
```{r}
filter.topic <- "Topic 8"
row <- order(-probtopics[,filter.topic])
actorid <- rownames(probtopics[row[1],])
filter.data <- subset(tweets, actor.id == actorid)
filter.data <- filter.data[order(filter.data$postedTime),]
```
Explore "experts" for other topics. Do they make sense?
### Step 6: Deeper LDA Analysis
Further analysis could incorporate user-level covariates such as:
1. Device (e.g. Foursquare vs Untappd vs Twitter iPhone, etc.)
2. Profile location description (e.g. Charlotte, Fort Mill, Gastonia, etc.)
3. Profile description keywords (e.g. husband, wife, student, etc.)
Social scientists have developed an extension to LDA, the structural topic model (STM), which provides a framework for measuring how covariates affect topic model results. For more details on running more advanced versions of LDA like CTM and STM, see my newer [workshop on topic modeling](https://github.com/wesslen/Topic-Modeling-Workshop-with-R).
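For a rough idea of what that looks like, here is a minimal, untested sketch of fitting an STM with a hypothetical per-user `device` covariate. It assumes the `stm` package is installed, that this version of quanteda supports `convert(..., to = "stm")`, and that `user.meta` is a data frame of user-level metadata aligned with the rows of `twdfm`.
```{r, eval=FALSE}
library(stm)
# convert the quanteda dfm to stm's input format
stmInput <- convert(twdfm, to = "stm")
# user.meta is a hypothetical data frame (one row per user, same order
# as the dfm documents) with a `device` column
stmFit <- stm(documents = stmInput$documents, vocab = stmInput$vocab,
              K = 20, prevalence = ~ device, data = user.meta,
              seed = 123)
```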