NST-Classifier

Description

This is a personal project exploring how text data can be analysed before it is turned into a small data product (a classifier).

Objective

The goal of this project is to explore news data and build a topic classifier (Exclusive, Crime-courts, Nation, Government-public-policy, Politics) from news articles obtained from the NST website.

Part I Web Scraping (Completed)

I need to scrape data from a news portal. I looked for a local English-language news website so that I could easily understand the content, and found that the NST website suits this purpose. However, the news is loaded dynamically with JavaScript, so I need a screen scraper to collect the data and store it in a structured format.
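
As a rough illustration of the screen-scraping step, the sketch below waits for JavaScript-rendered elements with Selenium and stores their text as CSV rows. The function names, the CSS selector argument, the Category and Teaser column names, and the use of Chrome are assumptions for illustration, not the project's actual scraper.

```python
import csv


def scrape_teasers(url, css_selector, timeout=10):
    """Return the visible text of every element matching css_selector.

    Selenium is imported lazily so that save_teasers() can be used
    on machines without a browser driver installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Block until the JavaScript-loaded elements are present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()


def save_teasers(teasers, category, path):
    """Write one CSV row per teaser (hypothetical columns: Category, Teaser)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Category", "Teaser"])
        writer.writeheader()
        for teaser in teasers:
            writer.writerow({"Category": category, "Teaser": teaser})
```

The selector passed to scrape_teasers has to be found by inspecting the page, since the site's markup is not described here. The explicit wait is the important part: a plain HTTP request would return the page before the news items are loaded.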

Part II Early Data Cleaning (Completed)

Before starting the exploratory data analysis, I cleaned the data so that the statistics computed from it are more accurate. The following steps were taken in this part:

Essential Steps

Checking for and dropping duplicated data.
Cleaning the front part of each teaser.
Dropping rows that do not contain a teaser.
Transforming contractions into full English text.
Formatting the category.
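
The essential steps above could be sketched with Pandas roughly as follows. The column names Teaser and Category, the small contraction table, the dateline pattern used for the front-part cleaning (for example a leading "KUALA LUMPUR: "), and the category format are all assumptions made for illustration; the project's actual rules may differ.

```python
import re

import pandas as pd

# Assumption: a tiny sample table; a real one would cover many more contractions.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}
CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)
# Assumption: the "front part" of a teaser is an upper-case dateline.
DATELINE_PATTERN = r"^[A-Z][A-Z .]+:\s*"


def expand_contractions(text):
    """Replace known contractions, keeping a leading capital letter."""
    def replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        return expanded[0].upper() + expanded[1:] if found[0].isupper() else expanded

    return CONTRACTION_RE.sub(replace, text)


def clean_essential(df):
    """Apply the five essential cleaning steps and return a new DataFrame."""
    df = df.drop_duplicates().copy()                      # drop duplicated rows
    df["Teaser"] = (
        df["Teaser"].str.replace(DATELINE_PATTERN, "", regex=True).str.strip()
    )                                                     # clean the front part
    df = df[df["Teaser"].notna() & (df["Teaser"] != "")].copy()  # drop empty teasers
    df["Teaser"] = df["Teaser"].map(expand_contractions)  # expand contractions
    df["Category"] = (
        df["Category"].str.strip()
        .str.replace(r"\s+", "-", regex=True)
        .str.capitalize()
    )                                                     # e.g. "Crime-courts"
    return df.reset_index(drop=True)
```

Duplicates are dropped before any text is modified, so only rows that were identical in the raw data are removed.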

Extra Steps

Dropping rows that do not contain an exact time.
Converting the Time_Created feature from string to datetime.

The outputs are two CSV files, stored as early_data_cleaning_without_time.csv and early_data_cleaning_with_time.csv.
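
A minimal sketch of the extra steps and the two output files, assuming that rows without an exact time hold text that does not match a fixed timestamp format. The format string, the function names, and the choice to write the essential-only data to the "without time" file are assumptions, not confirmed details of the project.

```python
import os

import pandas as pd


def keep_exact_times(df, time_format="%B %d, %Y @ %I:%M%p"):
    """Convert Time_Created to datetime and drop rows that fail to parse.

    time_format is an assumed layout such as "December 5, 2019 @ 3:45pm".
    """
    parsed = pd.to_datetime(df["Time_Created"], format=time_format, errors="coerce")
    # Values that do not match the format become NaT and are dropped.
    result = df.assign(Time_Created=parsed).dropna(subset=["Time_Created"])
    return result.reset_index(drop=True)


def write_outputs(cleaned, out_dir="."):
    """Write both CSV files named in this README."""
    cleaned.to_csv(
        os.path.join(out_dir, "early_data_cleaning_without_time.csv"), index=False
    )
    keep_exact_times(cleaned).to_csv(
        os.path.join(out_dir, "early_data_cleaning_with_time.csv"), index=False
    )
```

Using errors="coerce" turns unparseable values into missing timestamps instead of raising an error, so dropping the rows and converting the column happen in one pass.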

External Python libraries used in this project

Pandas
Selenium

Resources

New Straits Times main page
URL: https://www.nst.com.my/
Used on: Raw data collection
Expanding English Language Contractions in Python
URL: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
Used on: Transforming contractions to full English text