CS236-project

Group: 8

Class: CS236

Collaborators: Kelsey Musolf, Shikhar Kumar, Ahmad Mersaghian, Viswanadh Rahul, Demetreous Stillman

In this project we analyze three different datasets using MySQL and build a web interface to present the results of our analysis.

Dataset → Spark (processing) → MySQL (store result) → Flask API → Web UI (JS, HTML)
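As a rough sketch of the Spark → MySQL portion of this pipeline, the snippet below reads a cleaned CSV with PySpark and writes it to a MySQL table over JDBC. The connection URL, table name, credentials, and driver jar path are placeholders for illustration, not values from this repository.

# Sketch only: assumes PySpark and the MySQL Connector/J JDBC driver are available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cs236-load-reviews")
    # Path to the MySQL JDBC driver jar is an assumption; adjust for your setup.
    .config("spark.jars", "mysql-connector-j-8.4.0.jar")
    .getOrCreate()
)

# Read the cleaned Glassdoor reviews produced by glassdoor_analysis.py.
df = spark.read.csv(
    "data/cleaned_data/glassdoor/all_reviews_cleaned.csv",
    header=True,
    inferSchema=True,
)

# Write the result into MySQL; database, table, and credentials are placeholders.
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/cs236")
    .option("dbtable", "glassdoor_reviews")
    .option("user", "cs236_user")
    .option("password", "change_me")
    .mode("overwrite")
    .save()
)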

Dataset 1: Glassdoor Job Reviews

We’ve started preprocessing and cleaning the Glassdoor reviews dataset.

Folder Structure

The raw dataset should be placed here:

data/glassdoor/all_reviews.csv

The cleaned dataset will be stored here after running the script:

data/cleaned_data/glassdoor/all_reviews_cleaned.csv

Cleaning & Analysis

The script for cleaning and analyzing this dataset is:

glassdoor_analysis.py

It performs the following:

  • Removes invalid rows (missing job titles or ratings)
  • Standardizes text (e.g., job titles)
  • Adds length metadata for textual fields
  • Saves the cleaned data as a single CSV for database integration (a sketch of these steps follows below)
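A minimal pandas sketch of these cleaning steps is shown below; the column names (job_title, rating, pros, cons) are assumptions for illustration and may differ from the actual schema handled by glassdoor_analysis.py.

import os
import pandas as pd

df = pd.read_csv("data/glassdoor/all_reviews.csv")

# Drop rows with missing job titles or ratings (invalid for our analysis).
df = df.dropna(subset=["job_title", "rating"])

# Standardize text fields, e.g. trim whitespace and lowercase job titles.
df["job_title"] = df["job_title"].str.strip().str.lower()

# Add length metadata for textual fields (character counts of pros/cons).
df["pros_length"] = df["pros"].fillna("").str.len()
df["cons_length"] = df["cons"].fillna("").str.len()

# Save the cleaned data as a single CSV for database integration.
os.makedirs("data/cleaned_data/glassdoor", exist_ok=True)
df.to_csv("data/cleaned_data/glassdoor/all_reviews_cleaned.csv", index=False)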

Why the Dataset Isn’t Included

Due to its size (~8GB after preprocessing), the dataset is not pushed to GitHub. Each team member must download it locally.


How to Download the Dataset

Step 1: Install Kaggle CLI

pip install kaggle

Step 2: Set Up Your Kaggle API Token

Create an API token from your Kaggle account settings (this downloads kaggle.json), then move it into place:

mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Step 3: Download & Unzip Dataset

mkdir -p data/glassdoor_jobs

kaggle datasets download -d davidgauthier/glassdoor-job-reviews-2 -p data/glassdoor_jobs

unzip data/glassdoor_jobs/glassdoor-job-reviews-2.zip -d data/glassdoor_jobs

Once unzipped, the main file will be:

data/glassdoor_jobs/all_reviews.csv

Running the Script

After setup, run:

python3 glassdoor_analysis.py

This will:

  • Load all_reviews.csv
  • Clean and analyze it
  • Output a cleaned version to:
    data/cleaned_data/glassdoor/all_reviews_cleaned.csv
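Looking ahead to the Flask API stage of the pipeline, a minimal endpoint that serves aggregated review data from MySQL could look like the sketch below. The database, table, and credential values are placeholders, and the queried columns are assumptions for illustration.

# Sketch only: assumes Flask and mysql-connector-python are installed,
# and that the cleaned reviews have been loaded into a MySQL table.
from flask import Flask, jsonify
import mysql.connector

app = Flask(__name__)

def get_connection():
    # Connection details are placeholders; adjust for your local MySQL setup.
    return mysql.connector.connect(
        host="localhost",
        user="cs236_user",
        password="change_me",
        database="cs236",
    )

@app.route("/api/ratings")
def average_ratings():
    conn = get_connection()
    cursor = conn.cursor(dictionary=True)
    # Table and column names are assumptions for illustration.
    cursor.execute(
        "SELECT job_title, AVG(rating) AS avg_rating "
        "FROM glassdoor_reviews GROUP BY job_title LIMIT 50"
    )
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    # Convert Decimal averages to float so they serialize cleanly to JSON.
    results = [
        {"job_title": r["job_title"], "avg_rating": float(r["avg_rating"])}
        for r in rows
    ]
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)

With this saved as, say, app.py, running python3 app.py starts a development server and the data is available at http://localhost:5000/api/ratings.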
    

Other Datasets

You will repeat a similar process for:

  • IMDB Ratings
  • Artworks Metadata

Details and scripts for those will be added as development progresses.


Notes

  • All .csv files are .gitignored to avoid repository bloat.
  • Intermediate outputs are stored in output/ or data/cleaned_data/.
