This is a personal project exploring how text data can be analysed before building a small data product (a classifier).
The goal of this project is to explore news data and build a topic classifier (Exclusive, Crime-courts, Nation, Government-public-policy, Politics) from news obtained from the NST website.
I need to scrape data from a news portal. I looked for a local English-language news website so that I could easily understand the content, and found that the NST website suits this purpose. However, its news is loaded dynamically with JavaScript, so I need a screen scraper to collect the data and store it in a structured format.
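The scraping step could be sketched roughly as follows with Selenium. This is only an illustration under stated assumptions, not the project's actual scraper: the CSS selector passed in, the column names Teaser and Link, and the use of Chrome are all hypothetical choices, and the real page structure of the NST website may differ.

```python
import csv


def rows_from_elements(elements):
    """Turn rendered link elements into plain dictionaries."""
    rows = []
    for element in elements:
        text = element.text.strip()
        if text:  # skip placeholders whose text has not rendered
            rows.append({"Teaser": text, "Link": element.get_attribute("href")})
    return rows


def scrape_teasers(url, css_selector, timeout=10):
    """Load a JavaScript-rendered page and collect the matching elements."""
    # Imported here so rows_from_elements can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return rows_from_elements(elements)
    finally:
        driver.quit()


def save_rows(rows, path):
    """Store the scraped rows in a structured (CSV) format."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Teaser", "Link"])
        writer.writeheader()
        writer.writerows(rows)
```

The explicit wait is the part that matters for a dynamically loaded page: reading the page source immediately after loading would usually miss the content that JavaScript inserts later.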
Before getting into the Exploratory Data Analysis, I feel that data cleaning should be carried out so that the resulting statistics are more accurate. The following steps are taken in this part:
Essential Steps
- Checking for and dropping duplicated data.
- Cleaning the front part of each teaser.
- Dropping rows that do not contain a teaser.
- Expanding contractions into full English text.
- Formatting the category.
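The essential steps above might look something like this in Pandas. It is a minimal sketch under several assumptions: the column names Teaser and Category are hypothetical, "front part" cleaning is guessed to mean removing a leading upper-case dateline such as "KUALA LUMPUR: ", the contraction mapping is a tiny illustrative subset, and the category format is inferred from labels like Crime-courts.

```python
import re

import pandas as pd

# Illustrative subset only; a real run would use a much fuller mapping.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its (lower-case) full form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def clean_essential(df):
    """Apply the essential cleaning steps to a raw news DataFrame."""
    # 1. Drop exact duplicate rows.
    df = df.drop_duplicates().copy()
    # 2. Assumed "front part" cleaning: strip a leading upper-case dateline.
    df["Teaser"] = df["Teaser"].str.replace(
        r"^[A-Z][A-Z .,'-]*:\s*", "", regex=True
    ).str.strip()
    # 3. Drop rows whose teaser is missing or empty after cleaning.
    df = df.dropna(subset=["Teaser"])
    df = df[df["Teaser"] != ""].copy()
    # 4. Expand contractions into full English text.
    df["Teaser"] = df["Teaser"].map(expand_contractions)
    # 5. Format the category, e.g. " crime courts " becomes "Crime-courts".
    df["Category"] = (
        df["Category"]
        .str.strip()
        .str.replace(r"\s+", "-", regex=True)
        .str.capitalize()
    )
    return df.reset_index(drop=True)
```

Note that this simple expansion lower-cases the replaced words, so "It's" becomes "it is"; that is acceptable if the text is lower-cased later anyway, but would need handling otherwise.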
Extra Steps
- Drop the rows that do not contain an exact time.
- Convert the feature Time_Created from string to datetime.
The outputs are two CSV files, stored as early_data_cleaning_without_time.csv and early_data_cleaning_with_time.csv.
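The extra steps and the two outputs could be sketched as below. This assumes that a row "without exact time" is one whose Time_Created value cannot be parsed as a timestamp (for example a relative string like "2 hours ago"), and that the without-time file is simply the data after the essential steps; both are guesses, and the function names are hypothetical.

```python
import pandas as pd


def keep_exact_time(df, time_column="Time_Created"):
    """Convert the time column to datetime and keep only rows that parse."""
    result = df.copy()
    # errors="coerce" turns unparseable strings into NaT so they can be dropped.
    result[time_column] = pd.to_datetime(result[time_column], errors="coerce")
    return result.dropna(subset=[time_column]).reset_index(drop=True)


def write_outputs(df, directory="."):
    """Write both cleaned variants using the file names from this README."""
    df.to_csv(f"{directory}/early_data_cleaning_without_time.csv", index=False)
    keep_exact_time(df).to_csv(
        f"{directory}/early_data_cleaning_with_time.csv", index=False
    )
```

Keeping the two variants separate means analyses that do not need timestamps can still use the larger dataset.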
- Pandas
- Selenium
- New Straits Times main page
  URL: https://www.nst.com.my/
  Used on: Raw data collection
- Expanding English Language Contractions in Python
  URL: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
  Used on: Transforming contractions to full English text