Identified $1.8M/year in potential savings by analyzing HR data and building predictive models to reduce employee turnover.
- 7-Project Trap: 100% attrition rate for employees handling 7+ projects
- 5-Year Cliff: Employees at 5-year tenure quit 2X more than average
- Overwork Signal: 275+ monthly hours → 3X higher quit probability
This project analyzes an HR dataset and builds predictive models that provide insights to the Human Resources (HR) department of a large consulting firm.
Salifort’s senior leadership team is concerned about how many employees are leaving the company. Salifort strives to create a corporate culture that supports employee success and professional development. Further, the high turnover rate is financially costly, since Salifort makes a big investment in recruiting, training, and upskilling its employees. As a first step, the leadership team asked Human Resources to survey a sample of employees to learn more about what might be driving turnover. The dataset used in this project contains 15,000 rows and 10 variable columns and is available on Kaggle. We used the PACE workflow to structure the analysis and modeling:
- Import packages and load the dataset (a minimal data-preparation sketch follows this list)
- Understand the variables and clean the dataset (missing data, redundant data, outliers)
- Continue EDA and visualisation
- Fit models that predict the outcome variable using two or more independent variables (Logistic Regression, Decision Trees, and Random Forest), check model assumptions, and evaluate the models
- Interpret the models, evaluate model performance using metrics, and prepare results, visualizations, and actionable steps to share with stakeholders
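Below is a minimal sketch of the data-preparation steps, assuming the Kaggle CSV is saved locally as `HR_capstone_dataset.csv`; the file name and the `tenure` column used in the outlier check are assumptions, not taken from the actual notebook.

```python
import pandas as pd

# Load the Kaggle HR dataset (file name is an assumption)
df = pd.read_csv("HR_capstone_dataset.csv")

# Understand the variables
print(df.info())
print(df.describe())

# Clean: missing values and redundant (duplicate) rows
print(df.isna().sum())
df = df.drop_duplicates()

# Flag outliers in tenure using the IQR rule (column name is an assumption)
q1, q3 = df["tenure"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["tenure"] < q1 - 1.5 * iqr) | (df["tenure"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential tenure outliers")
```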
On the test set, the logistic regression model achieved a precision of 79%, recall of 82%, and F1-score of 80% (all weighted averages), with an accuracy of 82%.
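As a reference for how such a model could be fit and scored with scikit-learn, here is a hedged sketch that continues from the cleaned `df` above; the target column name `left` and the one-hot encoding choices are assumptions rather than the notebook's exact code.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# One-hot encode assumed categorical columns and split off the assumed target "left"
X = pd.get_dummies(df.drop(columns=["left"]), drop_first=True)
y = df["left"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the logistic regression and report weighted-average precision/recall/F1 and accuracy
log_clf = LogisticRegression(max_iter=500, random_state=42)
log_clf.fit(X_train, y_train)
print(classification_report(y_test, log_clf.predict(X_test)))
```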
After feature engineering, the random forest model achieved a ROC AUC of 93.72%, precision of 95.91%, recall of 85.71%, F1-score of 90.45%, and accuracy of 88.01% on the test set. The random forest modestly outperformed the decision tree model.
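A corresponding sketch for the tree-based models, reusing the train/test split above and scoring with ROC AUC; the grid-search hyperparameters shown are illustrative only, not the ones actually tuned in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; the real notebook's hyperparameters may differ
param_grid = {"n_estimators": [100, 300], "max_depth": [5, None], "min_samples_leaf": [1, 2]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    scoring="roc_auc", cv=4)
grid.fit(X_train, y_train)

# Evaluate the best random forest on the held-out test set
best_rf = grid.best_estimator_
preds = best_rf.predict(X_test)
probs = best_rf.predict_proba(X_test)[:, 1]

print("ROC AUC  :", roc_auc_score(y_test, probs))
print("Precision:", precision_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))
print("F1       :", f1_score(y_test, preds))
print("Accuracy :", accuracy_score(y_test, preds))
```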