- Saturday 5 March - Researching Tf-Idf and exploring possible solutions - 2 Hours
- Saturday 19 March - Manual news article dissection - 2 Hours
Some ideas from manually going through news articles
- Dissection of title into individual words and phrases
- Cross-check the dissected title terms against their frequency within the article
- The frequency of a word or phrase becomes significant when it exceeds a threshold derived from the article's length (3 mentions in a 100-word article are more significant than 10 mentions across 2,000 words, etc.); see the sketch after this list
- Top phrases of 3 words
- Top phrases of 2 words
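A rough Python sketch putting these ideas together: title dissection, 2- and 3-word phrases, and length-normalised frequency. The stop-word list and the 0.01 significance threshold are placeholders for illustration, not values settled on here:

```python
import re
from collections import Counter

STOPWORDS = {"the", "in", "a", "of", "to", "and", "on", "at", "for"}  # small illustrative list

def tokenize(text):
    """Lowercase the text, split it into word tokens and drop common words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens, n):
    """All consecutive n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def significant_terms(title, article, threshold=0.01):
    """Cross-check title words/phrases against their relative frequency in the article.

    A term counts as significant when its mentions divided by the article length
    exceed `threshold`: 3 mentions in 100 words (0.03) passes, while 10 mentions
    in 2,000 words (0.005) does not.
    """
    body = tokenize(article)
    counts = Counter(body + ngrams(body, 2) + ngrams(body, 3))
    title_tokens = tokenize(title)
    candidates = set(title_tokens + ngrams(title_tokens, 2) + ngrams(title_tokens, 3))
    length = max(len(body), 1)
    return {t: counts[t] for t in candidates if counts[t] / length > threshold}
```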
Related resources
http://www.online-utility.org/text/analyzer.jsp
Running this news article through the text analyser:
http://www.cbsnews.com/news/thousands-protest-donald-trump-in-new-york-city-election-2016/
(a recent article about protests against Donald Trump in Manhattan, New York)
the top individual words, excluding common words such as "the", "in", "a", etc., are:
"trump" at 12 counts "protest" at 6 counts "protesters" at 5 counts "donald" at 5 counts "york" at 4 counts "new" at 4 counts
So in this example, plain word frequency already picks out the article's key terms.
Porter's suffix-stripping algorithm: http://tartarus.org/martin/PorterStemmer/
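Stemming would let counts like "protest" and "protesters" above be merged into a single term. NLTK ships an implementation of the Porter stemmer; a minimal usage sketch, assuming NLTK is installed:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stemming maps inflected forms onto a shared root, so "protest",
# "protesters" and "protesting" can be counted as one term.
print([stemmer.stem(w) for w in ["protest", "protesters", "protesting"]])
# ['protest', 'protest', 'protest']
```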
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). - Introduction to Information Retrieval
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes.
Given a set of documents D = {d1, d2, d3, ..., dN}, a desired number of clusters K, and an objective function that evaluates the quality of the clustering, we want to compute an assignment γ : D → {1, ..., K} that minimizes (or, in other cases, maximizes) the objective function. The objective function is often described in terms of the distance or similarity between documents. This can be done through Vector Space Classification.
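As a hypothetical illustration of such an objective, the sketch below scores an assignment by the summed squared distance between each document vector and its cluster's centroid; a clustering algorithm (e.g. k-means) would search for the assignment that minimizes this value:

```python
import numpy as np

def clustering_objective(doc_vectors, assignment, centroids):
    """Sum of squared distances from each document vector to its cluster's centroid.

    `assignment` maps a document index to a cluster id in {0, ..., K-1}.
    Lower values mean documents sit closer to their cluster centres.
    """
    return sum(
        np.sum((doc_vectors[i] - centroids[k]) ** 2)
        for i, k in assignment.items()
    )
```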
Vector Space Classification
Represents each document as a vector of real-valued components (such as tf-idf weights).
The vector for a document has one component for each term that occurs in it. The set of documents in a collection can then be viewed as a set of vectors in a vector space.
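A small sketch of this representation using scikit-learn's TfidfVectorizer (the example documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Thousands protest Donald Trump in New York City",
    "Protesters gather in Manhattan after the election",
    "New York traffic report for the weekend",
]

# Each row of the matrix is one document vector; each column is one term,
# and each component is that term's tf-idf weight in that document.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)
print(doc_vectors.shape)  # (number of documents, number of distinct terms)
```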
Quantifying Distance
1.) One option is the magnitude of the vector difference between two document vectors. This has a drawback: two documents with very similar content can still be far apart simply because they differ in length.
2.) To compensate for this, we can compute the cosine similarity of their vector representations.
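A minimal sketch of cosine similarity between two document vectors, showing why it ignores the length effect described above (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors.

    Dividing by the vector magnitudes removes the effect of document length,
    so a short and a long document about the same topic can still score near 1.
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Scaling a vector (a much longer document with the same term proportions)
# keeps the cosine similarity at ~1.0 even though the Euclidean distance is large.
a = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(a, 10 * a))  # ~1.0
```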