This is my personal project on exploring how text data is used to do analysis before creating a small data product (classifier).

Ken-NPK/NST-Classifier


NST-Classifier

Description

This is my personal project exploring how text data can be analysed before building a small data product (a classifier).

Objective

The goal of this project is to explore the news data and build a topic classifier (Exclusive, Crime-courts, Nation, Government-public-policy, Politics) based on news articles collected from the NST website.

Part I Web Scraping (Completed)

I need to scrape some web data from a news portal. My idea was to find a local English news website so that the content is easy for me to understand, and the NST website fits this purpose. However, its news items are loaded dynamically with JavaScript, so a plain HTTP request does not return them. Therefore, I need to use a screen scraper that renders the page, then scrape the data and store it in a structured format.
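
A minimal sketch of what such a scraper might look like, assuming Selenium 4 with a local Chrome driver. The CSS selectors (`.article-teaser`, `.title`, and so on), the `Article` fields and the helper names are hypothetical placeholders, not the ones used in this project; the real NST markup has to be inspected first.

```python
import csv
from dataclasses import asdict, dataclass, fields
from typing import Iterable, List


@dataclass
class Article:
    # Hypothetical record layout; the project's real columns may differ.
    title: str
    teaser: str
    category: str
    time_created: str


def scrape_listing(url: str, wait_seconds: int = 10) -> List[Article]:
    """Render a JavaScript-driven listing page and collect its article cards."""
    # Imported lazily so save_articles() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the page's JavaScript has injected at least one card.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".article-teaser"))
        )
        articles = []
        for card in driver.find_elements(By.CSS_SELECTOR, ".article-teaser"):
            # All selectors here are assumed, not taken from the NST site.
            articles.append(
                Article(
                    title=card.find_element(By.CSS_SELECTOR, ".title").text,
                    teaser=card.find_element(By.CSS_SELECTOR, ".teaser").text,
                    category=card.find_element(By.CSS_SELECTOR, ".category").text,
                    time_created=card.find_element(By.CSS_SELECTOR, ".created").text,
                )
            )
        return articles
    finally:
        driver.quit()  # Always release the browser, even on errors.


def save_articles(articles: Iterable[Article], path: str) -> None:
    """Store the scraped records in a structured format (CSV)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(Article)])
        writer.writeheader()
        for article in articles:
            writer.writerow(asdict(article))
```

The explicit wait is the part that matters for dynamically loaded pages: without it, `find_elements` may run before the news cards exist and return an empty list.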

Part II Early Data Cleaning (Completed)

Before getting into the Exploratory Data Analysis, I feel that data cleaning should be carried out so that the statistics computed later are more accurate. The following steps are taken during this part:

Essential Steps

  1. Checking for and dropping duplicated rows.
  2. Cleaning the front part of each teaser.
  3. Dropping the rows that do not contain a teaser.
  4. Transforming contractions to full English text.
  5. Formatting the category.
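
The essential steps could be sketched with Pandas roughly as follows. This is an illustration under assumptions, not the project's actual code: the column names `Teaser` and `Category`, the reading of "front part" as a leading upper-case dateline such as `KUALA LUMPUR: `, the category formatting rule, and the tiny contraction map are all hypothetical.

```python
import re

import pandas as pd

# Tiny illustrative map; a real one would cover many more contractions.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)
# Assumed "front part" of a teaser: an upper-case dateline like "KUALA LUMPUR: ".
_DATELINE_RE = re.compile(r"^[A-Z][A-Z .'-]*:\s*")


def _expand(match: "re.Match[str]") -> str:
    full = CONTRACTIONS[match.group(0).lower()]
    # Keep a leading capital, e.g. "It's" becomes "It is".
    return full[0].upper() + full[1:] if match.group(0)[0].isupper() else full


def expand_contractions(text: str) -> str:
    # Normalise curly apostrophes before matching.
    return _CONTRACTION_RE.sub(_expand, text.replace("\u2019", "'"))


def clean_early(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop_duplicates().copy()                                # step 1
    out["Teaser"] = (
        out["Teaser"].fillna("").map(lambda s: _DATELINE_RE.sub("", s).strip())
    )                                                                # step 2
    out = out[out["Teaser"] != ""].copy()                            # step 3
    out["Teaser"] = out["Teaser"].map(expand_contractions)           # step 4
    out["Category"] = out["Category"].str.strip().str.capitalize()   # step 5
    return out.reset_index(drop=True)
```

Running the dateline removal before the empty-teaser check means a teaser that consists only of a dateline is also dropped in step 3.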

Extra Steps

  1. Dropping the rows that do not contain an exact time.
  2. Converting the feature Time_Created from string to datetime.

The outputs are two CSV files, stored as early_data_cleaning_without_time.csv and early_data_cleaning_with_time.csv.
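
A hedged sketch of the extra steps, assuming that rows without an exact time hold something unparseable (for example a relative time such as "2 hours ago") and that exact times follow the format passed as `time_format`. Both the default format and the function name `split_by_time` are hypothetical, as is the reading that the "without time" output keeps every row while the "with time" output keeps only rows with a parsed timestamp.

```python
import pandas as pd


def split_by_time(df: pd.DataFrame, time_format: str = "%Y-%m-%d %H:%M"):
    """Return (without_time, with_time) frames from the cleaned data."""
    # Output 1: all rows, Time_Created left untouched as a string.
    without_time = df.copy()

    # Values that do not match time_format become NaT instead of raising.
    parsed = pd.to_datetime(df["Time_Created"], format=time_format, errors="coerce")

    # Output 2: only rows with an exact time, Time_Created as datetime64.
    with_time = (
        df.assign(Time_Created=parsed)
        .dropna(subset=["Time_Created"])
        .reset_index(drop=True)
    )
    return without_time, with_time
```

Each frame can then be written with `to_csv(..., index=False)` under the two file names above.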

External Python libraries used in this project

  1. Pandas
  2. Selenium

Resources

  1. New Straits Times main page
    URL: https://www.nst.com.my/
    Used for: Raw data collection

  2. Expanding English Language Contractions in Python
    URL: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
    Used for: Transforming contractions to full English text
