Group: 8
Class: CS236
Collaborators: Kelsey Musolf, Shikhar Kumar, Ahmad Mersaghian, Viswanadh Rahul, Demetreous Stillman
In this project we are going to analyze three different datasets using MySQL and create a web interface at the end of our analysis.
Dataset → Spark (processing) → MySQL (store result) → Flask API → Web UI (JS, HTML)
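To make the last two stages of the pipeline concrete, here is a minimal sketch of one Flask endpoint reading from MySQL. The database name (`cs236`), table (`review_stats`), columns, and credentials are placeholders, not our final schema:

```python
# Minimal Flask API sketch. Database name ("cs236"), table
# ("review_stats"), columns, and credentials are placeholders.
from flask import Flask, jsonify
import mysql.connector

app = Flask(__name__)

@app.route("/api/review-stats")
def review_stats():
    # Open a connection per request for simplicity (fine for a demo).
    conn = mysql.connector.connect(
        host="localhost", user="root", password="", database="cs236"
    )
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        "SELECT job_title, AVG(rating) AS avg_rating "
        "FROM review_stats GROUP BY job_title"
    )
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    # The JS/HTML front end fetches this JSON to render its views.
    return jsonify(rows)

if __name__ == "__main__":
    app.run(debug=True)
```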
We’ve started preprocessing and cleaning the Glassdoor reviews dataset.
The raw dataset should be placed here:
`data/glassdoor/all_reviews.csv`
The cleaned dataset will be stored here after running the script:
`data/cleaned_data/glassdoor/all_reviews_cleaned.csv`
The script for cleaning and analyzing this dataset is:
`glassdoor_analysis.py`
It performs the following (a sketch of these steps follows the list):
- Cleans invalid rows (missing job titles, ratings)
- Standardizes text (e.g., job titles)
- Adds length metadata for textual fields
- Saves the cleaned data as a single CSV for database integration
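As a rough illustration, the steps above map to pandas operations along these lines. Column names such as `job_title`, `rating`, `pros`, and `cons` are assumptions; `glassdoor_analysis.py` is the authoritative version:

```python
# Sketch of the cleaning steps, assuming the raw CSV has "job_title",
# "rating", and free-text "pros"/"cons" columns (column names are
# assumptions; see glassdoor_analysis.py for the real ones).
import pandas as pd

df = pd.read_csv("data/glassdoor/all_reviews.csv")

# Drop invalid rows (missing job titles or ratings)
df = df.dropna(subset=["job_title", "rating"])

# Standardize text, e.g. normalize job-title whitespace and casing
df["job_title"] = df["job_title"].str.strip().str.title()

# Add length metadata for textual fields
for col in ["pros", "cons"]:
    df[f"{col}_length"] = df[col].fillna("").str.len()

# Save the cleaned data as a single CSV for database integration
df.to_csv("data/cleaned_data/glassdoor/all_reviews_cleaned.csv",
          index=False)
```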
Due to its size (~8GB after preprocessing), the dataset is not pushed to GitHub. Each team member must download it locally.
Install the Kaggle CLI:

```
pip install kaggle
```

- Visit: https://www.kaggle.com/account
- Click “Create New API Token”
- Move the downloaded `kaggle.json` file:

```
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Then download and unzip the dataset:

```
mkdir -p data/glassdoor_jobs
kaggle datasets download -d davidgauthier/glassdoor-job-reviews-2 -p data/glassdoor_jobs
unzip data/glassdoor_jobs/glassdoor-job-reviews-2.zip -d data/glassdoor_jobs
```

Once unzipped, the main file will be:
`data/glassdoor_jobs/all_reviews.csv`
After setup, run:

```
python3 glassdoor_analysis.py
```

This will:
- Load `all_reviews.csv`
- Clean and analyze it
- Output a cleaned version to `data/cleaned_data/glassdoor/all_reviews_cleaned.csv`
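For the MySQL stage of the pipeline, the cleaned CSV can then be loaded into a table. Below is a hedged sketch using pandas with SQLAlchemy; the connection string, database name (`cs236`), and table name (`glassdoor_reviews`) are placeholders, and chunked reading keeps memory bounded given the ~8GB file:

```python
# Load the cleaned CSV into MySQL in chunks. The connection string,
# database, and table name are placeholders for local development.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+mysqlconnector://root:password@localhost/cs236"
)

path = "data/cleaned_data/glassdoor/all_reviews_cleaned.csv"
for chunk in pd.read_csv(path, chunksize=100_000):
    # "append" creates the table on the first chunk if it is missing.
    chunk.to_sql("glassdoor_reviews", engine, index=False,
                 if_exists="append")
```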
You will repeat a similar process for:
- IMDB Ratings
- Artworks Metadata
Details and scripts for those will be added as development progresses.
- All `.csv` files are `.gitignored` to avoid repository bloat.
- Intermediate outputs are stored in `output/` or `data/cleaned_data/`.