Zoe

Zoe Cai edited this page Sep 18, 2016 · 59 revisions

Worklog

Thursday 3 March: Research tf-idf and possible solutions and tools (2 hours)
Friday 4 March: Research k-means clustering and tools – Sklearn, Weka, JavaML (5 hours)
Sunday 13 March: Research articles & papers (2 hours)
Sunday 20 March: Set up Maven project, added tf-idf code, watched Weka tutorials (4 hours)
Friday 22 April: Planning, research (4 hours)
Saturday 23 April: Research, learning about Python and NLTK and Scikit-learn. Implemented preprocessor and tf-idf. (12 hours)
Sunday 24 April: Research, implemented cosine similarity, PCA, and visualisation. (10 hours)
Saturday 30 April: Interim report and research (10 hours)
Saturday 7 May: Interim report and research (5 hours)
Sunday 8 May: Interim report and research (8 hours)
June: Last weeks of the semester, everything from other courses is due! Exams and semester break
Wednesday 6 July: Hierarchical clustering (10 hours)
Sunday 10 July: Testing tf-idf and clustering accuracy (2 hours)
Monday 11 July: Testing (2 hours)
Tuesday 12 July: Research for web app: Udacity courses (5 hours)
Friday 15 July: Research for web app: Udacity courses (3 hours)
Saturday 16 July: Research for web app: Udacity courses (4 hours)
Monday 18 July: Research for web app: Completed Udacity course: Intro to Backend (2 hours)
Tuesday 19 July: Research for web app: Udacity courses (4 hours)
Thursday 21 July — Sunday 31 July: Laptop water damage
Saturday 6 August: Udacity course (6 hours)
Sunday 7 August: Udacity course (4 hours)
Tuesday 9 August: Udacity course (3 hours)
Wednesday 10 August: Udacity course (4 hours)
Thursday 11 August: Completed Udacity course: Full Stack Foundations, experimented with the Guardian API, added API result retrieval from Guardian (4 hours)
Friday 12 August: Completed Udacity course: AJAX, looked into graphing/charting libraries (5 hours)
Saturday 13 August: Refactored processing & clustering code, improved accuracy of results (now completely correct with one test set! 😃 ), set up web application (12 hours)
Sunday 14 August: Set up data visualisation working with web app, configuring and debugging visualisation (15 hours)
Monday 15 August: Added labels to nodes to show titles of articles upon hover, added option to retrieve a different number of articles (instead of the default 10), added zoom and pan and responsive canvas, made length between nodes and centroids represent the dissimilarity between articles and centroids, improved UI and visualisation (8 hours)
Tuesday 16 August: Linked centroids to main centroid, added loading screen, refactored CSS out of HTML, added loading text and animations, added background image, substantially improved UI (natural-language form input, date-picker calendars, options to specify the number of clusters and number of articles to retrieve), added and styled a responsive article panel that displays an article's contents when its node is clicked, added top feature words to each centroid. 🎉 WEB APP COMPLETE! 🎉 Now on to refining the clustering algorithm: looked at DBSCAN, the hashing vectorizer, and LSA. Brought back 2D plots to show accuracy of clustering. K-means seems unstable and not the most accurate, so will need to use agglomerative clustering or another method. Added x-means and changed the way top feature words are retrieved for centroids to check for a suspected bug; the existing code turned out to be correct. (15 hours)
Wednesday 17 August: Altered article retrieval to properly retrieve live blogs that have more than one article on a page. Increased default limit of articles retrieved. Tried to deploy to AWS. Added requirements.txt for project setup. (6 hours)
Thursday 18 August: Set up AWS EC2 (still having trouble deploying though). Researched Stanford NLP for use in tf-idf calculations. Added Named Entity Recognition to improve accuracy of input data to clustering algorithms. Fixed major bug, clustering working much better now! Reverting to the original tf-idf approach (12 hours)
Friday 19 August: Testing and validating clusters, setting up EC2 (4 hours)
Sunday 21 August: Integrated clustering algorithms to improve accuracy, improved hierarchical clustering, refactored code, cleaned up console output, and improved commenting. Fine-tuned hierarchical clustering and removed "unfair" comparison of the "goodness" of the clusters (8 hours)
Monday 22 August: Better fonts for UI, testing responsiveness. Deployed web app and it's finally up! 🎆 Added review app & git workflow (14 hours)
Saturday 27 August: Lowered default settings due to deployment server restrictions (0.5 hours)
Tuesday 30 August: Seminar prep (4 hours)
Thursday 1 September: Poster (10 hours)
Friday 2 September: Poster (5 hours)
Saturday 3 September: Seminar prep (3 hours)
Sunday 4 September: Poster done + Seminar prep (5 hours)
Monday 5 September: Seminar prep (5 hours)
Tuesday 6 September: Seminar prep (7 hours)
Wednesday 7 September: Seminar prep (8 hours)
Thursday 8 September: Seminar prep (8 hours)
Friday 9 September: Seminar! (4 hours)
Saturday-Monday 10-12 September: Final report (25 hours)
Thursday 15 September: Set up poster for exhibition (1 hour)
Friday 16 September: Exhibition (9.5 hours)
Sunday 18 September: Final code commenting & documentation for Compendium submission (2.5 hours)