-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathEDA n Analysis.Rmd
More file actions
2727 lines (2288 loc) · 145 KB
/
EDA n Analysis.Rmd
File metadata and controls
2727 lines (2288 loc) · 145 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "EDA n Analysis"
author: "quanggiang169"
date: "2025-05-16"
output:
html_document
---
# A. Data cleanning
## 1. Loading libraries
```{r , include=FALSE}
library(tidyverse)
library(janitor)
library(stringr)
library(dplyr)
library(e1071)
library(knitr)
library(tibble)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(ggcorrplot)
library(reshape2)
library(ggpubr)
library(rlang)
library(rstatix)
library(rcompanion)
library(DescTools)
library(scales)
library(broom)
library(vip)
library(ranger)
library(caret)
```
library(tidyverse)
library(janitor)
library(stringr)
library(dplyr)
library(e1071)
library(knitr)
library(tibble)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(ggcorrplot)
library(reshape2)
library(ggpubr)
library(rlang)
library(rstatix)
library(rcompanion)
library(DescTools)
library(scales)
library(car)
library(broom)
library(vip)
library(ranger)
library(caret)
## 2. Loading data set
```{r}
df_hratt <- read_csv("hr_attrition_data.csv",col_types = cols())
```
## 3. Previewing the first 10 rows of the dataset
```{r echo=FALSE}
kable(head(df_hratt, 10))
```
## 4. Renaming columns
```{r echo=FALSE}
df_hratt_v2 <- clean_names(df_hratt)
```
I also used the `clean_names` function from the `janitor` package to standardize the column names. This helps ensure consistency by converting all column names to lowercase and replacing spaces or special characters with underscores, making the data easier to work with throughout the analysis.
## 5. Data Dictionary
The IBM HR Analytics Employee Attrition & Performance dataset is a fictional dataset created by IBM data scientists. Since not all features have descriptions, I have made interpretations of what they represent.
You can find the dataset [here](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset).
| Feature Name | Data Type | Description |
|--------------------------------|--------------------|------------------------------------------------------------------------------------------------------|
| Age | integer | Employee's age in years |
| Attrition | factor | If the employee stayed or left the company (Yes ; No) |
| BusinessTravel | factor | How often the employee has business travel (Travel_Rarely ; Travel_Frequently ; Non-Travel) |
| DailyRate | integer | Employee's daily rate in USD. |
| Department | factor | Which department the employee belongs to (Sales ; Research & Development ; Human Resources) |
| DistanceFromHome | integer | How far the employee lives from work in kilometers. |
| Education | integer | Employee's education level (1 = Below College ; 2 = College ; 3 = Bachelor ; 4 = Master ; 5 = Doctor)|
| EducationField | factor | Employee's education field (Life Sciences ; Medical ; Marketing ; Technical Degree, Human Resources ; Other)|
| EmployeeCount | integer | How many employees the current record represents. |
| EmployeeNumber | integer | Employee's unique identification number |
| EnvironmentSatisfaction | integer | How satisfied the employee is with the company's environment (1 = Low ; 2 = Medium ; 3 = High ; 4 = Very High) |
| Gender | factor | Employee's gender (Female ; Male) |
| HourlyRate | integer | Employee's hourly rate in USD |
| JobInvolvement | integer | How involved the employee feels with his/her job (1 = Low ; 2 = Medium ; 3 = High ; 4 = Very High) |
| JobLevel | integer | Employee's job level (1 = Junior ; 2 = Mid ; 3 = Senior ; 4 = Manager ; 5 = Director) |
| JobRole | factor | Employee's job role (Sales Executive ; Research Scientist ; Laboratory Technician ; Manufacturing Director ; Healthcare Representative ; Manager ; Sales Representative ; Research Director ; Human Resources) |
| JobSatisfaction | integer | How satisfied the employee feels with his/her job (1 = Low ; 2 = Medium ; 3 = High ; 4 = Very High) |
| MaritalStatus | factor | Employee's marital status (Single ; Married ; Divorced) |
| MonthlyIncome | integer | Employee's monthly income in USD |
| MonthlyRate | integer | Employee's monthly rate in USD |
| NumCompaniesWorked | integer | Number of companies that the employee has already worked. |
| Over18 | factor | If the employee is over 18 years old (Yes) |
| OverTime | factor | If the employee makes overtime (Yes ; No) |
| PercentSalaryHike | integer | The percentage of the amount a salary is increased |
| PerformanceRating | integer | Employee's performance rating (1 = Low ; 2 = Good ; 3 = Excellent ; 4 = Outstanding) |
| RelationshipSatisfaction | integer | How satisfied the employee feels with the relationship with his/her manager (1 = Low ; 2 = Medium ; 3 = High ; 4 = Very High) |
| StandardHours | integer | Employee's standard hours of work per day |
| StockOptionLevel | integer | Employee's stock option level (refer to: What You Should Know About Option Trading Levels) |
| TotalWorkingYears | integer | Total years that the employee has professionally worked |
| TrainingTimesLastYear | integer | Total times that the employee had a training session the last year |
| WorkLifeBalance | integer | How the employee feels about his/her work-life balance (1 = Bad ; 2 = Good ; 3 = Better ; 4 = Best) |
| YearsAtCompany | integer | Total years that the employee has worked at the company |
| YearsInCurrentRole | integer | Total years that the employee has worked in his/her current job role |
| YearsSinceLastPromotion | integer | Total years since the employee had his/her last promotion at the company |
| YearsWithCurrManager | integer | Total years that the employee has worked under his/her current manager |
## 6. Checking data dimensions
```{r echo = FALSE}
# Print the number of rows and columns in the data frame
cat('Number of rows:', nrow(df_hratt_v2), '\n')
cat('Number of cols:', ncol(df_hratt_v2), '\n')
```
## 7. Checking data types
```{r echo = FALSE}
# Identify character columns
char_cols <- sapply(df_hratt_v2, is.character)
# Convert character columns to factor
df_hratt_v2[char_cols] <- lapply(df_hratt_v2[char_cols], as.factor)
# Display the structure of the dataframe after conversion
str(df_hratt_v2)
# Count the number of columns for each data type after conversion
dtype_counts <- table(sapply(df_hratt_v2, class))
cat("Count of columns by data type:\n")
print(dtype_counts)
```
After checking the data types, I found that the dataset contains two main types of variables: factor, representing ordinal categorical data, and numeric, which includes both continuous data and some ordinal categorical variables encoded as numbers. This distinction is important because it guides my analysis approach. For factor variables, I will focus on frequency counts and mode-based summaries, while for numeric variables, I will apply statistical measures such as mean, median, and standard deviation.
Additionally, I need to be cautious with numeric variables that actually represent ordinal categories and treat them as categorical to avoid misleading interpretations.
## 8. Checking missing data
```{r echo = FALSE}
# Count the number of missing values in each column of the data frame
missing_values <- colSums(is.na(df_hratt_v2))
print(missing_values)
```
## 9. Checking duplicate data
```{r echo = FALSE}
# Print the number of duplicate rows in the data frame
num_duplicates <- sum(duplicated(df_hratt_v2))
cat('Number of duplicate rows:', num_duplicates, '\n')
```
## 10. Trim whitespace from character columns
```{r echo=FALSE}
# Identify factor columns
factor_cols <- sapply(df_hratt_v2, is.factor)
# Trim whitespace for factor columns
df_hratt_v2[factor_cols] <- lapply(df_hratt_v2[factor_cols], function(x) {
factor(str_trim(as.character(x)))
})
```
I have trimmed leading and trailing whitespace characters from all character columns to ensure data consistency and prevent errors during analysis.
## 11. Feature Engineering
### 11.1. Column filtering
```{r echo = FALSE}
# Define columns to drop
cols_drop <- c('over18', 'standard_hours', 'employee_count', 'employee_number')
# Drop the columns
df_hratt_v2 <- df_hratt_v2[, !(names(df_hratt_v2) %in% cols_drop)]
# Select a random sample row as a tibble for nicer formatting
sample_row <- as_tibble(df_hratt_v2[sample(nrow(df_hratt_v2), 1), ])
# Print the sample row as a table
kable(sample_row, caption = "Random Sample from Cleaned Dataset")
```
The dataset contains several columns that do not add value to the analysis:
| Column Name | Reason for Removal |
|------------------|-----------------------------------------------------------------|
| `over18` | All employees are over 18, so this column has no variation. |
| `standard_hours` | Same standard working hours for all employees, no differentiation. |
| `employee_count` | Constant total number of employees, offers no unique insight. |
| `employee_number`| Serves as an ID, but does not contribute to analysis or prediction.|
To simplify the dataset and focus on relevant features for analyzing employee attrition, these columns are removed.
### 11.2. Label Encoding to Categorical Mapping
```{r echo=FALSE}
# Convert education levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(education = as.factor(recode(education,
`1` = "Below College",
`2` = "College",
`3` = "Bachelor",
`4` = "Master",
`5` = "Doctor")))
# Convert environment satisfaction levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(environment_satisfaction = as.factor(recode(environment_satisfaction,
`1` = "Low",
`2` = "Medium",
`3` = "High",
`4` = "Very High")))
# Convert job involvement levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(job_involvement = as.factor(recode(job_involvement,
`1` = "Low",
`2` = "Medium",
`3` = "High",
`4` = "Very High")))
# Convert job levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(job_level = as.factor(recode(job_level,
`1` = "Junior",
`2` = "Mid",
`3` = "Senior",
`4` = "Manager",
`5` = "Director")))
# Convert job satisfaction levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(job_satisfaction = as.factor(recode(job_satisfaction,
`1` = "Low",
`2` = "Medium",
`3` = "High",
`4` = "Very High")))
# Convert performance rating levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(performance_rating = as.factor(recode(performance_rating,
`1` = "Low",
`2` = "Good",
`3` = "Excellent",
`4` = "Outstanding")))
# Convert relationship satisfaction levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(relationship_satisfaction = as.factor(recode(relationship_satisfaction,
`1` = "Low",
`2` = "Medium",
`3` = "High",
`4` = "Very High")))
# Convert work life balance levels
df_hratt_v2 <- df_hratt_v2 %>%
mutate(work_life_balance = as.factor(recode(work_life_balance,
`1` = "Bad",
`2` = "Good",
`3` = "Better",
`4` = "Best")))
```
This step converts several numeric variables representing categorical information into factors with meaningful labels. Doing so improves clarity and interpretability when analyzing or visualizing the data.
| Variable Name | Original Codes | Recoded Labels | Description |
|-------------------------|----------------|------------------------------------------|------------------------------------|
| `education` | 1, 2, 3, 4, 5 | Below College, College, Bachelor, Master, Doctor | Levels of educational attainment |
| `environment_satisfaction` | 1, 2, 3, 4 | Low, Medium, High, Very High | Satisfaction with work environment |
| `job_involvement` | 1, 2, 3, 4 | Low, Medium, High, Very High | Degree of job involvement |
| `job_level` | 1, 2, 3, 4, 5 | Junior, Mid, Senior, Manager, Director | Hierarchical job position levels |
| `job_satisfaction` | 1, 2, 3, 4 | Low, Medium, High, Very High | Satisfaction with the job |
| `performance_rating` | 1, 2, 3, 4 | Low, Good, Excellent, Outstanding | Performance evaluation ratings |
| `relationship_satisfaction` | 1, 2, 3, 4 | Low, Medium, High, Very High | Satisfaction with workplace relationships |
| `work_life_balance` | 1, 2, 3, 4 | Bad, Good, Better, Best | Quality of work-life balance |
Each variable is recoded from numeric codes into descriptive factor levels to facilitate better data understanding and analysis.
### 11.3. Checkpoint
```{r echo=FALSE}
# Save the current data set state
write.csv(df_hratt_v2, file = "hr_attrition_data_v2.csv", row.names = FALSE)
```
This step saves the cleaned and processed dataset to a CSV file named `"hr_attrition_data_v2.csv"`. Saving the file ensures that the current version of the data can be easily accessed and reused for future analysis without repeating earlier processing steps.
# B. Descriptive statistics
## 1. Univariate Analysis
```{r echo=FALSE}
#selects only numerical attributes
num_attributes <- df_hratt_v2 %>% select(where(is.numeric))
#selects only categorical attributes
cat_attributes <- df_hratt_v2 %>% select(where(negate(is.numeric)))
```
### 1.1. Univariate Analysis: Numerical Variables
```{r, echo=FALSE}
# Calculate central tendency: mean, median
mean_values <- colMeans(num_attributes, na.rm = TRUE)
median_values <- apply(num_attributes, 2, median, na.rm = TRUE)
# Calculate distribution: std, min, max, range, skew, kurtosis
std_values <- apply(num_attributes, 2, sd, na.rm = TRUE)
min_values <- apply(num_attributes, 2, min, na.rm = TRUE)
max_values <- apply(num_attributes, 2, max, na.rm = TRUE)
range_values <- max_values - min_values
skewness_values <- apply(num_attributes, 2, function(x) skewness(x, na.rm = TRUE))
kurtosis_values <- apply(num_attributes, 2, function(x) kurtosis(x, na.rm = TRUE))
# Create summary statistics table
summary_stats <- data.frame(
attributes = names(num_attributes),
min = min_values,
max = max_values,
range = range_values,
mean = mean_values,
median = median_values,
std = std_values,
skewness = skewness_values,
kurtosis = kurtosis_values
)
# Display the summary statistics table
kable(summary_stats, format = "simple", row.names = FALSE)
```
```{r histogram-plotting-num, echo=FALSE, fig.width=10, fig.height=8}
# Set up to show two plots per row
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1, oma = c(2, 2, 2, 2))
# Loop through each numerical attribute to create a clear histogram
for (col in names(num_attributes)) {
hist(num_attributes[[col]],
main = paste("Histogram of", col),
xlab = col,
col = "lightblue",
breaks = 30,
cex.main = 1.5, # Title size for readability
cex.lab = 1.3, # Label size
cex.axis = 1.2, # Axis size
border = "darkblue",
las = 1) # Make axis labels horizontal for clarity
}
# Reset layout after plotting
par(mfrow = c(1, 1))
```
| **Key Findings** | **Explanation** |
|------------------|-----------------|
| **Age**: Approximately symmetric distribution with slight right skewness (skewness ~0.41), median age is 36 years | Most employees are middle-aged; age distribution is fairly balanced without strong skew. |
| **Daily Rate & Hourly Rate**: Nearly symmetric distributions with very small skewness (close to 0) | Compensation rates per day and hour are fairly evenly distributed without extreme low or high values. |
| **Distance from Home**: Right skewed (skewness ~0.96), indicating many employees live close to the workplace and fewer live farther away | Majority of employees live near the office; a small number reside at greater distances. |
| **Monthly Income**: Noticeably right skewed (skewness ~1.37) with positive kurtosis (~1), indicating a long tail on the high-income side | Many employees earn low to medium income, with a few earning significantly higher incomes. |
| **Number of Companies Worked**: Right skewed (skewness ~1.02), suggesting most employees have worked at few companies | Majority of employees have limited job-hopping experience; few have worked at multiple companies. |
| **Percent Salary Hike**: Mild right skew (skewness ~0.82) | Most salary increases are moderate, with a few instances of higher raises. |
| **Years at Company** and **Years Since Last Promotion**: Strong right skewness (skewness ~1.76 and ~1.98) with high kurtosis (3.9 and 3.6 respectively) | Many employees are relatively new or have been promoted recently; a smaller group has long tenure and longer duration since last promotion. |
| **Other variables (stock option level, training times last year, years in current role, years with current manager)**: Slight to moderate right skew | These variables mostly concentrate at lower to medium values, with fewer high-value cases. |
**Summary**
- Time-related variables (tenure, promotion history, salary hike) show right-skewed distributions, reflecting that most employees are clustered at lower to medium values, with a small group exhibiting higher values.
- Monthly income and distance from home also show right skewness with long right tails.
- Compensation rates (daily and hourly) are more symmetrically distributed.
### 1.2. Univariate Analysis: Categorical Variables
#### 1.2.1. Check Unique Value Counts for Categorical Attributes
```{r, echo=FALSE}
# Create a data frame with variable name, number of unique values, and the unique values themselves
unique_summary <- data.frame(
`no.unique values` = sapply(cat_attributes, function(x) length(unique(x))),
`unique values` = sapply(cat_attributes, function(x) paste(unique(x), collapse = ", "))
)
# Print the table nicely using kable
kable(unique_summary, caption = "Summary of Unique Values in Categorical Variables")
```
#### 1.2.2 Frequency Distribution of Categorical Variables
This section provides a descriptive overview of the categorical variables in the dataset to establish a foundational understanding of the employee profile and organizational characteristics.
```{r frequency-tables-cat, echo=FALSE, results='asis'}
# Function to create and print a frequency table with counts and percentages
print_freq_table <- function(vec, var_name) {
counts <- table(vec) # Calculate counts of each category
props <- prop.table(counts) * 100 # Calculate percentage for each category
freq_table <- data.frame(
Category = names(counts), # Category names
Count = as.vector(counts), # Counts as numeric vector
Percentage = round(as.vector(props), 2) # Percentages rounded to 2 decimals
)
# Print a header for the frequency table
cat("### Frequency Table for", var_name, "\n")
# Generate a formatted table with caption and column names
kable(freq_table,
caption = paste("Distribution of", var_name),
col.names = c("Category", "Count", "Percentage (%)"),
align = c("l", "r", "r")) %>%
kable_styling(full_width = FALSE, position = "left") %>% # Styling the table for better display
print() # Print the table to the output
cat("\n") # Add a line break after each table for spacing
}
# Loop through all categorical variables to print frequency tables
for (i in seq_along(cat_attributes)) {
print_freq_table(cat_attributes[[i]], names(cat_attributes)[i])
}
```
```{r histogram-plotting-cat, echo=FALSE, fig.width=14, fig.height=12}
# Define a custom pastel color palette
pastel_colors <- c("#1F4E79", "#76B041", "#F4A261", "#2A9D8F", "#E9C46A", "#6A4C93", "#F08A5D")
# Loop through each categorical column to create horizontal count plots
# Setting up a layout of 3x3 for every 9 plots
num_plots <- ncol(cat_attributes)
plots_per_page <- 9
for (i in seq_along(cat_attributes)) {
# Check if a new page is needed and set up layout
if ((i - 1) %% plots_per_page == 0) {
par(mfrow = c(3, 3), mar = c(5, 10, 4, 2) + 0.1, oma = c(2, 2, 2, 2)) # Increased left margin
}
# Create a table of counts for the category
counts <- table(cat_attributes[[i]])
colors <- pastel_colors[1:length(counts)]
# Create a horizontal bar plot with narrow bars and clear labels
barplot(counts,
main = paste("Count for", names(cat_attributes)[i]),
xlab = names(cat_attributes)[i],
col = colors,
border = "darkblue",
cex.main = 1.2, # Title size for readability
cex.lab = 1.1, # Label size
cex.axis = 1.1, # Axis size
cex.names = 1.1, # Label size on bars
horiz = TRUE, # Horizontal bar plot
las = 1, # Horizontal axis labels
space = 1.5, # Increased spacing between bars
width = 0.6) # Narrower bar width for better spacing
}
# Reset layout after plotting
par(mfrow = c(1, 1))
```
## 2. Coss-variable connections
### 2.1. Correlation between numerical attributes
```{r, echo=FALSE, fig.width= 12, fig.height= 6}
# Calculate the correlation matrix
correlation <- cor(num_attributes, method = "pearson")
# Plot the correlation heatmap
ggcorrplot(correlation,
method = "square",
lab = TRUE, # Display correlation coefficients in cells
lab_size = 3, # Slightly larger font size for better readability
colors = c("blue", "white", "orange"), # High-contrast color palette
title = "Correlation Heatmap of Numerical Variables",
ggtheme = theme_minimal(base_size = 12), # Increase base font size for larger plot
legend.title = "Correlation") +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 10)
)
```
### 2.2. Associations between categorical attributes
```{r, echo=FALSE, fig.width= 12, fig.height= 6}
# Function to compute Cramér's V between two categorical variables
get_cramers_v <- function(x, y) {
tbl <- table(x, y)
return(DescTools::CramerV(tbl, bias.correct = TRUE))
}
# Create a matrix of Cramér's V values
vars <- names(cat_attributes)
n <- length(vars)
cramer_matrix <- matrix(NA, nrow = n, ncol = n, dimnames = list(vars, vars))
for (i in seq_len(n)) {
for (j in seq_len(n)) {
if (i <= j) {
cramer_matrix[i, j] <- get_cramers_v(cat_attributes[[i]], cat_attributes[[j]])
} else {
cramer_matrix[i, j] <- cramer_matrix[j, i] # Symmetric
}
}
}
# Reshape for heatmap
cramer_long <- melt(cramer_matrix, na.rm = TRUE)
colnames(cramer_long) <- c("Var1", "Var2", "CramerV")
# Draw the heatmap
ggplot(cramer_long, aes(x = Var1, y = Var2, fill = CramerV)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "white", name = "Cramér's V") +
geom_text(aes(label = round(CramerV, 2)), size = 3, color = "white") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
axis.title = element_blank(),
legend.position = "right")
```
### 2.3. Interaction Between Numeric and Categorical Variables
```{r, echo=FALSE, fig.width= 12, fig.height= 6}
# Combine factor and numeric variables into one data frame
df <- cbind(cat_attributes, num_attributes)
# Initialize result storage
results <- data.frame(
Factor_Variable = character(),
Numeric_Variable = character(),
P_Value = numeric(),
stringsAsFactors = FALSE
)
# Perform Kruskal-Wallis test for each factor-numeric pair
for (cat_var in colnames(cat_attributes)) {
for (num_var in colnames(num_attributes)) {
sub_df <- df[, c(cat_var, num_var)] %>% na.omit()
formula <- as.formula(paste(num_var, "~", cat_var))
test_result <- kruskal.test(formula, data = sub_df)
results <- rbind(results, data.frame(
Factor_Variable = cat_var,
Numeric_Variable = num_var,
P_Value = round(test_result$p.value, 5)
))
}
}
# Sort by p-value
results <- results %>% arrange(P_Value)
# Filter for p-value < 0.05
significant <- results %>% filter(P_Value < 0.05)
# Display only significant results
kable(significant, caption = "Statistically Significant Factor-Numeric Relationships (Kruskal-Wallis, p < 0.05)") %>%
kable_styling(full_width = FALSE, position = "center")
```
# C. Inferential Analysis
## 1. Bivariate Analysis with Attrition
```{r, include=FALSE}
#Loading data set
df_hratt <- read_csv("hr_attrition_data_v2.csv",col_types = cols())
```
In this step, I explore how each selected feature relates to employee attrition. Using visualizations, I compare the attrition rate across different groups or ranges of each variable. This helps me identify patterns or potential drivers that may point to HR policies needing improvement.
### 1.1. Compensation & Rewards
#### 1.1.1. Monthly Income
```{r, echo=FALSE, fig.width=12, fig.height=6}
# Divide monthly_income into 10 equal-sized groups (deciles)
df_hratt_grouped <- df_hratt %>%
mutate(income_bin = ntile(monthly_income, 10)) %>%
group_by(income_bin) %>%
summarise(attrition_rate = mean(attrition == "Yes"))
# Calculate min and max income for each decile
income_ranges <- df_hratt %>%
mutate(income_bin = ntile(monthly_income, 10)) %>%
group_by(income_bin) %>%
summarise(min_income = min(monthly_income),
max_income = max(monthly_income)) %>%
mutate(label = paste0("$", format(min_income, big.mark = ","), " – $", format(max_income, big.mark = ",")))
# Merge labels into grouped data
df_hratt_grouped <- df_hratt_grouped %>%
left_join(income_ranges, by = "income_bin")
# Bar chart: Attrition rate across income deciles with labels
p1 <- ggplot(df_hratt_grouped, aes(x = factor(income_bin), y = attrition_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Attrition Rate across Monthly Income Levels",
x = "Monthly Income Range (Deciles)",
y = "Attrition Rate") +
scale_x_discrete(labels = df_hratt_grouped$label) +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust = 1)) # rotate x-axis labels for readability
# Boxplot: Distribution of Monthly Income by Attrition
p2 <- ggplot(df_hratt, aes(x = attrition, y = monthly_income, fill = attrition)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Monthly Income Distribution by Attrition",
x = "Attrition",
y = "Monthly Income") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.position = "none")
# Arrange both plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
The variable Monthly Income shows a clear relationship with attrition when analyzed through two charts. A bar chart dividing income into 10 deciles indicates that attrition rates are significantly high among the lowest income groups (decile 1 and 2), exceeding 0.3. However, at decile 3, the rate drops sharply to below 0.2—marking a notable turning point. From decile 4 to decile 6, the attrition rate continues to decline gradually, reaching below 0.1. From decile 7 onward, the rate fluctuates slightly, increasing again at decile 9 (around 0.15) before falling dramatically to under 0.05 in decile 10, the highest income group. This suggests that the relationship between income and attrition is not linear, but may be influenced by certain income thresholds that affect employee behavior. The boxplot further supports this insight, showing that the group of employees who left had noticeably lower median, Q1, and Q3 income levels than those who stayed, reflecting a clear trend of higher attrition among low-income employees.
From this analysis, the following hypotheses can be proposed:
- **H0**: The minimum income of 2696 does not affect employees' likelihood of attrition.
- **H1**: The minimum income of 2696 helps reduce employees' likelihood of attrition. **<span style="color: green;">TRUE</span>**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Create income group based on threshold 2696
df_hratt <- df_hratt %>%
mutate(income_group = ifelse(monthly_income < 2696, "< 2696", ">= 2696"))
# Create contingency table between income group and attrition
table_income_attrition <- table(df_hratt$income_group, df_hratt$attrition)
# Display the contingency table in a clean format using kable
knitr::kable(table_income_attrition, caption = "Contingency Table: Income Group vs Attrition")
# Perform Chi-squared test (suitable when expected frequencies are sufficiently large)
test_result <- chisq.test(table_income_attrition)
# Show the test result
test_result
```
The test results show a Chi-squared statistic of 61.145 with 1 degree of freedom, and a p-value of 5.30 × 10⁻¹⁵, which is significantly less than the 0.05 significance level. This allows us to reject the null hypothesis (H₀), providing statistical evidence that income level is associated with attrition rate.
Specifically, employees earning less than 2,696 have a substantially higher attrition rate (31.3%) compared to those earning 2,696 or more (12.3%). This supports the alternative hypothesis (H₁) that a minimum income level of 2,696 contributes to reducing attrition risk.
#### 1.1.2. Percent Salary Hike
```{r, echo=FALSE, fig.width=12, fig.height=6}
# Divide percent_salary_hike into 10 equal-sized groups (deciles)
df_hratt_grouped <- df_hratt %>%
mutate(psh_bin = ntile(percent_salary_hike, 10)) %>%
group_by(psh_bin) %>%
summarise(attrition_rate = mean(attrition == "Yes"))
# Calculate min and max percent_salary_hike for each decile
psh_ranges <- df_hratt %>%
mutate(psh_bin = ntile(percent_salary_hike, 10)) %>%
group_by(psh_bin) %>%
summarise(min_psh = min(percent_salary_hike),
max_psh = max(percent_salary_hike)) %>%
mutate(label = paste0(min_psh, "% – ", max_psh, "%"))
# Merge decile range labels into grouped data
df_hratt_grouped <- df_hratt_grouped %>%
left_join(psh_ranges, by = "psh_bin")
# Bar chart: Attrition rate across psh deciles with range labels
p1 <- ggplot(df_hratt_grouped, aes(x = factor(psh_bin), y = attrition_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Attrition Rate across Percent Salary Hike Levels",
x = "Percent Salary Hike Range (Deciles)",
y = "Attrition Rate") +
scale_x_discrete(labels = df_hratt_grouped$label) +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust = 1)) # rotate x-axis labels for readability
# Boxplot: Distribution of Percent Salary Hike by Attrition
p2 <- ggplot(df_hratt, aes(x = attrition, y = percent_salary_hike, fill = attrition)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Percent Salary Hike Distribution by Attrition",
x = "Attrition",
y = "Percent Salary Hike") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.position = "none")
# Arrange both plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
The variable Percent Salary Hike shows no clear relationship with attrition when examined through two visualizations. In the bar chart, as Percent Salary Hike increases, the attrition rate fluctuates slightly within a range from below 0.125 to nearly 0.2, without exhibiting a consistent upward or downward trend. This suggests that the annual percentage increase in salary does not serve as a strong predictor of employee attrition. The boxplot reinforces this observation, as the median and Q1 values for both the attrition and non-attrition groups are nearly identical, with only a marginally higher Q3 value for the group that stayed. However, this difference is not substantial enough to indicate a meaningful disparity. Overall, the distribution of salary hike percentages between the two groups does not provide strong evidence that higher raises are effective in improving retention.
#### 1.1.3. Stock Option Level
```{r, echo=FALSE, fig.width=12, fig.height=6}
# Calculate attrition rate for each unique stock_option_level
df_hratt_grouped <- df_hratt %>%
group_by(stock_option_level) %>%
summarise(attrition_rate = mean(attrition == "Yes"))
# Bar chart: Attrition rate by stock_option_level
p1 <- ggplot(df_hratt_grouped, aes(x = factor(stock_option_level), y = attrition_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Attrition Rate by Stock Option Level",
x = "Stock Option Level",
y = "Attrition Rate") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14))
# Boxplot: Distribution of stock_option_level by attrition status
p2 <- ggplot(df_hratt, aes(x = attrition, y = stock_option_level, fill = attrition)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Stock Option Level Distribution by Attrition",
x = "Attrition",
y = "Stock Option Level") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.position = "none")
# Display both plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
The Stock Option Level variable shows a clear relationship between the level of stock options and the attrition rate. The bar chart on the left indicates that employees without stock options (Stock Option Level = 0) have the highest attrition rate, approximately 0.24. When the stock option level increases to 1, the attrition rate drops sharply to around 0.1, suggesting that having stock options may help reduce attrition. However, at higher levels such as 2 and 3, the attrition rates fluctuate mildly, about 0.075 and 0.175 respectively, not following a strictly decreasing trend.
The boxplot also shows that the median Stock Option Level for employees who did not leave (Attrition = No) is 1, higher than the median of 0 for those who left (Attrition = Yes), while the Q1 and Q3 values for both groups are relatively similar.
From this, we can propose the following hypotheses:
- **H0**:Attrition rate is the same between employees with no stock option (level 0) and employees with at least one stock option (level ≥1).
- **H1**: Employees with no stock option level have higher attrition rate than those with stock option (one-sided test).**<span style="color: green;">TRUE</span>**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Create a binary variable: No Stock Option (0) vs Has Stock Option (1+)
df_hratt <- df_hratt %>%
mutate(stock_option_bin = ifelse(stock_option_level == 0, "No Stock Option", "Has Stock Option"))
# Create contingency table for Chi-squared test
table_stock_attrition <- table(df_hratt$stock_option_bin, df_hratt$attrition)
# Summarise counts and attrition rates by stock_option_bin
summary_table <- df_hratt %>%
group_by(stock_option_bin) %>%
summarise(
Total = n(),
Attrition_Yes = sum(attrition == "Yes"),
Attrition_No = sum(attrition == "No"),
Attrition_Rate = round(Attrition_Yes / Total, 3)
)
# Show summary table with formatting
summary_table %>%
kable(col.names = c("Stock Option Group", "Total Employees", "Attrition = Yes", "Attrition = No", "Attrition Rate"),
caption = "Attrition Summary by Stock Option Group") %>%
kable_styling(full_width = FALSE, position = "center")
# Perform Chi-squared test
chisq_result <- chisq.test(table_stock_attrition)
# Print contingency table and Chi-squared test results
chisq_result
```
The Chi-squared test results show a statistic of 55.025 with 1 degree of freedom and a p-value of 1.19 × 10⁻¹³, which is significantly less than the 0.05 significance level. Therefore, we reject the null hypothesis (H₀), indicating strong statistical evidence to support the alternative hypothesis (H₁): Employees without stock options have a higher attrition rate than those with stock options.
In practice, the attrition rate among employees without stock options is 24.4%, which is approximately 2.5 times higher than that of employees with stock options (9.9%).
### 1.2 Career Development & Advancement
#### 1.2.1. Job Level
```{r, echo=FALSE, fig.width=12, fig.height=6}
df_hratt$job_level <- factor(df_hratt$job_level,
levels = c("Junior", "Mid", "Senior", "Manager", "Director"))
# This plot shows the proportion of attrition across each job level
ggplot(df_hratt, aes(x = job_level, fill = attrition)) +
geom_bar(position = "fill") + # Show proportions
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Attrition Probabilities by Job Level",
x = "Job Level",
y = "Proportion",
fill = "Attrition") +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(size = 20),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
legend.title = element_text(size = 15),
legend.text = element_text(size = 15))
```
The variable Job Level demonstrates a clear relationship between organizational hierarchy and employee attrition. The bar chart shows that employees at lower job levels tend to have significantly higher attrition rates. Specifically, the Junior level has the highest attrition rate at around 0.25, followed by Senior level employees at just over 0.18, and Mid-level employees at approximately 0.15. In contrast, higher-level positions such as Manager and Director have notably low attrition rates, both hovering around 0.09.
This trend may reflect differences in job stability, compensation, or motivation across job levels within the organization.
From these observations, we can propose the following hypotheses:
- **H0**: The attrition rate of the Junior group is not higher than the attrition rate of the other groups.
- **H1**: The attrition rate of the Junior group is higher than the attrition rate of the other groups.**<span style="color: green;">TRUE</span>**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Create binary variable: Junior vs Others
df_hratt <- df_hratt %>%
mutate(job_level_binary = ifelse(job_level == "Junior", "Junior", "Others"))
# Create contingency table between binary job_level and attrition
table_binary_job_attrition <- table(df_hratt$job_level_binary, df_hratt$attrition)
# Perform Chi-squared test instead of Fisher's Exact Test
chisq_result_binary <- chisq.test(table_binary_job_attrition)
# Summarize counts and attrition rates by binary job_level
summary_table_binary <- df_hratt %>%
group_by(job_level_binary) %>%
summarise(
Total = n(),
Attrition_Yes = sum(attrition == "Yes"),
Attrition_No = sum(attrition == "No"),
Attrition_Rate = round(Attrition_Yes / Total, 3)
)
# Display the summary table with formatted layout
summary_table_binary %>%
kable(col.names = c("Job Level Group", "Total Employees", "Attrition = Yes", "Attrition = No", "Attrition Rate"),
caption = "Attrition Summary: Junior vs Others") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
# Display Chi-squared Test results
chisq_result_binary
```
The one-sided Chi-squared test results show a statistic of X-squared = 65.219 with 1 degree of freedom and a p-value of 6.701 × 10⁻¹⁶, which is much smaller than the significance level of 0.05. Therefore, we reject the null hypothesis (H₀) and have sufficient statistical evidence to confirm that the Junior group has a significantly higher attrition rate compared to other groups.
Specifically, the attrition rate in the Junior group is 26.3%, while in the other groups it is only 10.1%, meaning the attrition rate in the Junior group is nearly 2.6 times higher.
#### 1.2.2. Years Since Last Promotion
```{r, echo=FALSE, fig.width=12, fig.height=6}
# Group data by unique years_since_last_promotion values and calculate attrition rate
df_hratt_grouped <- df_hratt %>%
group_by(years_since_last_promotion) %>%
summarise(attrition_rate = mean(attrition == "Yes")) %>%
arrange(years_since_last_promotion)
# Bar chart: Attrition rate by years_since_last_promotion
p1 <- ggplot(df_hratt_grouped, aes(x = factor(years_since_last_promotion), y = attrition_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Attrition Rate by Years Since Last Promotion",
x = "Years Since Last Promotion",
y = "Attrition Rate") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14))
# Boxplot: Distribution of years_since_last_promotion by attrition status
p2 <- ggplot(df_hratt, aes(x = attrition, y = years_since_last_promotion, fill = attrition)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Years Since Last Promotion Distribution by Attrition",
x = "Attrition",
y = "Years Since Last Promotion") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.position = "none")
# Arrange the two plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
The chart reveals a complex fluctuation in attrition rates based on the number of years since an employee’s last promotion. During the first 3 years after promotion (0–3 years), attrition rates remain elevated — ranging from 13% to over 17%. This suggests that the early post-promotion period does not guarantee retention. Instead, it may involve two subgroups: those who are disengaged due to unmet expectations and those who feel their promotion has not led to meaningful change or support.
This period (0–3 years) can be considered a “fragile engagement phase,” where employees may still be assessing the real value of their advancement. Without proper career guidance, enriched responsibilities, or recognition beyond the title change, some employees may start seeking external opportunities even shortly after being promoted.
From year 4 to year 5, attrition drops significantly to its lowest level (4–8%), indicating a “stability phase” where employees have settled into their new role, possibly benefiting from accumulated experience, increased competence, or greater internal visibility.
However, starting from year 6, attrition begins to rise again sharply, with noticeable spikes in years 7, 9, and 15 — often exceeding 20%. This pattern suggests a potential “career stagnation zone,” where the absence of new development signals or promotion opportunities may drive employees to leave.
Supporting this interpretation, boxplot analysis shows that both attrition groups share a median of 0 years since promotion — indicating many leavers were either recently promoted or have waited too long. However, the attrition group has a narrower spread, reinforcing the view that people tend to leave at both ends of the post-promotion spectrum.
These findings suggest a psychological lifecycle after promotion: early uncertainty (0–3 years), mid-stage stability (4–5 years), and late-stage disengagement (6+ years).
From this, we can propose the following hypotheses:
- **H0**: There is no significant difference in the number of years since the last promotion between employees who left the company and those who stayed.
- **H1**: There is a significant difference in the number of years since the last promotion between employees who left the company and those who stayed.**<span style="color: green;">TRUE</span>**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Group 'years_since_last_promotion' into categories
df_hratt <- df_hratt %>%
mutate(promotion_group = case_when(
years_since_last_promotion <= 3 ~ "0–3 years",
years_since_last_promotion >= 4 & years_since_last_promotion <= 5 ~ "4–5 years",
TRUE ~ "6+ years"
))
# Create contingency table between promotion group and attrition
table_promotion_attrition <- table(df_hratt$promotion_group, df_hratt$attrition)
# Perform Chi-square test
chi_result <- chisq.test(table_promotion_attrition)
# Summarize counts and attrition rates by promotion group
summary_promotion <- df_hratt %>%
group_by(promotion_group) %>%
summarise(
Total = n(),
Attrition_Yes = sum(attrition == "Yes"),
Attrition_No = sum(attrition == "No"),
Attrition_Rate = round(Attrition_Yes / Total, 3)
)
# Display summary table
summary_promotion %>%
kable(col.names = c("Promotion Group", "Total Employees", "Attrition = Yes", "Attrition = No", "Attrition Rate"),
caption = "Attrition by Years Since Last Promotion") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
# Display Chi-square test result
chi_result
```
The Pearson's Chi-squared test yielded a p-value of 0.02109, which is less than the significance level of 0.05. Therefore, we reject the null hypothesis. This suggests a statistically significant relationship between Years Since Last Promotion and Attrition — meaning the number of years since an employee's last promotion is associated with their likelihood of leaving the company.
Further analysis on specific subgroups could help identify the most sensitive periods, enabling the development of targeted retention strategies.
#### 1.2.3. Training Times Last Year
```{r, echo=FALSE, fig.width=12, fig.height=6}
# Group by unique training_times_last_year values and calculate attrition rate
df_hratt_grouped <- df_hratt %>%
group_by(training_times_last_year) %>%
summarise(attrition_rate = mean(attrition == "Yes")) %>%
arrange(training_times_last_year)
# Bar chart: Attrition rate by training_times_last_year
p1 <- ggplot(df_hratt_grouped, aes(x = factor(training_times_last_year), y = attrition_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Attrition Rate by Training Times Last Year",
x = "Training Times Last Year",
y = "Attrition Rate") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14))
# Boxplot: Distribution of training_times_last_year by attrition status
p2 <- ggplot(df_hratt, aes(x = attrition, y = training_times_last_year, fill = attrition)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("lightgray", "blue")) +
labs(title = "Training Times Last Year Distribution by Attrition",
x = "Attrition",
y = "Training Times Last Year") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
legend.position = "none")
# Arrange both plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
Preliminary analysis suggests that employees who did not participate in any training sessions during the year had a higher attrition rate compared to those who attended at least some training. Meanwhile, employees who participated in 1 to 6 training sessions tended to be more likely to stay, although the differences between these groups were not entirely consistent. This may indicate that training participation is related to employee engagement, though the relationship is not clearly linear.
The boxplot further supports this: the quartiles (Q1, median, Q3) of training times in the attrition group are slightly lower than those in the retention group, but the difference is not substantial. This implies that further statistical analysis is needed to draw a definitive conclusion.
From this, we can propose the following hypotheses:
- **H0**: The attrition rate of employees who received no training (0 times) is not significantly different from that of employees who received training (≥1 time).
- **H1**:* Employees who received no training have a significantly higher attrition rate.**<span style="color: green;">TRUE</span>**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Create binary variable: No Training (0) vs. Had Training (1+)
df_hratt <- df_hratt %>%
mutate(training_group = ifelse(training_times_last_year == 0, "No Training", "Had Training"))
# Create contingency table between training group and attrition
table_training_attrition <- table(df_hratt$training_group, df_hratt$attrition)
# Perform Chi-squared Test instead of Fisher's Test
chisq_result <- chisq.test(table_training_attrition)
# Summarize counts and attrition rates
summary_table <- df_hratt %>%
group_by(training_group) %>%
summarise(
Total = n(),
Attrition_Yes = sum(attrition == "Yes"),
Attrition_No = sum(attrition == "No"),
Attrition_Rate = round(Attrition_Yes / Total, 3)
)
# Display summary table in formatted style
summary_table %>%
kable(
col.names = c("Training Group", "Total Employees", "Attrition = Yes", "Attrition = No", "Attrition Rate"),
caption = "Attrition Summary by Training Group"
) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
# Display Chi-squared test results
chisq_result
```
The Chi-squared test with Yates' correction shows a statistic of X-squared = 4.7722, with 1 degree of freedom and a p-value = 0.02892, which is less than the significance level of 0.05. Therefore, we reject the null hypothesis (H₀) and have sufficient statistical evidence to conclude that employees who have not received training have a significantly higher attrition rate compared to those who have received training.