Group: 8
Class: CS236
Collaborators: Kelsey Musolf, Shikhar Kumar, Ahmad Mersaghian, Viswanadh Rahul, Demetreous Stillman
In this project we are going to analyze three different datasets using MySQL and create a web interface at the end of our analysis.
Dataset → Spark (processing) → MySQL (store result) → Flask API → Web UI (JS, HTML)
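To make the last two stages of the pipeline concrete, here is a minimal sketch of one Flask endpoint reading from MySQL. The database name (`cs236`), table (`review_stats`), columns, and credentials are placeholders, not our final schema:

```python
# Minimal Flask API sketch. Database name ("cs236"), table
# ("review_stats"), columns, and credentials are placeholders.
from flask import Flask, jsonify
import mysql.connector

app = Flask(__name__)

@app.route("/api/review-stats")
def review_stats():
    # Open a connection per request for simplicity (fine for a demo).
    conn = mysql.connector.connect(
        host="localhost", user="root", password="", database="cs236"
    )
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        "SELECT job_title, AVG(rating) AS avg_rating "
        "FROM review_stats GROUP BY job_title"
    )
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    # The JS/HTML front end fetches this JSON to render its views.
    return jsonify(rows)

if __name__ == "__main__":
    app.run(debug=True)
```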
We’ve started preprocessing and cleaning the Glassdoor reviews dataset.
The raw dataset should be placed here:
`data/glassdoor/all_reviews.csv`
The cleaned dataset will be stored here after running the script:
`data/cleaned_data/glassdoor/all_reviews_cleaned.csv`
The script for cleaning and analyzing this dataset is:
`glassdoor_analysis.py`
It performs the following (a sketch of these steps follows the list):
- Cleans invalid rows (missing job titles, ratings)
- Standardizes text (e.g., job titles)
- Adds length metadata for textual fields
- Saves the cleaned data as a single CSV for database integration
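As a rough illustration, the steps above map to pandas operations along these lines. Column names such as `job_title`, `rating`, `pros`, and `cons` are assumptions; `glassdoor_analysis.py` is the authoritative version:

```python
# Sketch of the cleaning steps, assuming the raw CSV has "job_title",
# "rating", and free-text "pros"/"cons" columns (column names are
# assumptions; see glassdoor_analysis.py for the real ones).
import pandas as pd

df = pd.read_csv("data/glassdoor/all_reviews.csv")

# Drop invalid rows (missing job titles or ratings)
df = df.dropna(subset=["job_title", "rating"])

# Standardize text, e.g. normalize job-title whitespace and casing
df["job_title"] = df["job_title"].str.strip().str.title()

# Add length metadata for textual fields
for col in ["pros", "cons"]:
    df[f"{col}_length"] = df[col].fillna("").str.len()

# Save the cleaned data as a single CSV for database integration
df.to_csv("data/cleaned_data/glassdoor/all_reviews_cleaned.csv",
          index=False)
```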
Due to its size (~8GB after preprocessing), the dataset is not pushed to GitHub. Each team member must download it locally.
Install the Kaggle CLI:

```
pip install kaggle
```

- Visit: https://www.kaggle.com/account
- Click “Create New API Token”
- Move the downloaded `kaggle.json` file:

```
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Then download and unzip the dataset:

```
mkdir -p data/glassdoor_jobs
kaggle datasets download -d davidgauthier/glassdoor-job-reviews-2 -p data/glassdoor_jobs
unzip data/glassdoor_jobs/glassdoor-job-reviews-2.zip -d data/glassdoor_jobs
```

Once unzipped, the main file will be:
`data/glassdoor_jobs/all_reviews.csv`
After setup, run:

```
python3 glassdoor_analysis.py
```

This will:
- Load `all_reviews.csv`
- Clean and analyze it
- Output a cleaned version to `data/cleaned_data/glassdoor/all_reviews_cleaned.csv`
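For the MySQL stage of the pipeline, the cleaned CSV can then be loaded into a table. Below is a hedged sketch using pandas with SQLAlchemy; the connection string, database name (`cs236`), and table name (`glassdoor_reviews`) are placeholders, and chunked reading keeps memory bounded given the ~8GB file:

```python
# Load the cleaned CSV into MySQL in chunks. The connection string,
# database, and table name are placeholders for local development.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+mysqlconnector://root:password@localhost/cs236"
)

path = "data/cleaned_data/glassdoor/all_reviews_cleaned.csv"
for chunk in pd.read_csv(path, chunksize=100_000):
    # "append" creates the table on the first chunk if it is missing.
    chunk.to_sql("glassdoor_reviews", engine, index=False,
                 if_exists="append")
```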
You will repeat a similar process for:
- IMDB Ratings
- Artworks Metadata
Details and scripts for those will be added as development progresses.
- All `.csv` files are `.gitignored` to avoid repository bloat.
- Intermediate outputs are stored in `output/` or `data/cleaned_data/`.