This project explores student behavior using anonymized educational data collected from a survey conducted at IIT Guwahati. The goal is to uncover hidden patterns in academic performance, stress levels, and personal relationships by applying data cleaning, visualization, and machine learning techniques.
The work was completed as part of a multi-level project under the guidance of the Student Wellness and Experience Board, with the final objective of predicting whether a student is likely to be in a romantic relationship based on lifestyle and academic features.
This project follows a structured, level-based approach:
- Feature Interpretation: Understand what the anonymized features represent.
- Data Integrity Audit: Handle missing values and clean the dataset.
- Exploratory Data Analysis (EDA): Visualize trends and relationships between variables.
- Predictive Modeling: Build and evaluate a machine learning model to predict romantic relationships.
Each level builds upon the previous one, forming a complete pipeline from raw data to prediction.
- Interpret the meaning of anonymized features (
Feature_1
,Feature_2
,Feature_3
) - Clean the dataset by handling missing values
- Perform exploratory analysis to uncover meaningful trends
- Build a predictive model to estimate the likelihood of a student being in a romantic relationship
- Evaluate model performance using accuracy, F1-score, and ROC-AUC
The dataset contains 649 student records with 33 features covering:
Category | Examples |
---|---|
Demographics | school, sex, age, address, family size |
Academic Info | grades (G1, G2, G3), failures, absences |
Behavioral | alcohol consumption, free time, going out |
Social | guardian, reason for choosing school |
Psychological | Stress_Level, Year_of_Study |
Three key features were initially labeled as Feature_1
, Feature_2
, and Feature_3
. Based on distribution, correlation, and contextual clues, these were interpreted and renamed:
Original Name | Interpreted Meaning | Justification |
---|---|---|
Feature_1 |
Age |
Values between 15–22, moderate positive correlation with grades |
Feature_2 |
Year_of_Study |
Integer values (1–4), weak negative correlation with grades |
Feature_3 |
Stress_Level |
Scale of 1–5, strong positive correlation with Dalc (daily alcohol use) |
These interpretations helped make the dataset more understandable and usable for further analysis.
Several columns had missing values:
higher 76
Fedu 73
traveltime 73
absences 69
famsize 50
Feature_2 46
freetime 45
Feature_3 39
Feature_1 38
G2 35
Missing values were handled using appropriate strategies:
- Categorical Features: Mode imputation
- Numerical Features: Median imputation
All missing values were successfully addressed without deleting any rows or columns.
After cleaning, the dataset was fully ready for modeling and analysis.
We explored five core questions through visualizations:
- Visualization: Scatterplot with regression line
- Insight: Slight upward trend — older students report higher stress levels
- Visualization: Scatterplot with regression line
- Insight: Weak negative correlation — more free time may lead to lower grades
- Visualization: Box plot of
G3
grouped by absence bins - Insight: Students with >15 absences consistently have lower grades
- Visualization: Bar chart comparing average
G3
for LE3 vs GT3 - Insight: Slightly higher grades in smaller families
- Visualization: Violin plot
- Insight: Parental guardianship correlates with lower stress levels
We built a Random Forest Classifier to predict whether a student is likely to be in a romantic relationship (romantic = yes/no
) using features like Age
, Stress_Level
, Year_of_Study
, and others.
- Handles mixed data types effectively
- Resists overfitting
- Provides feature importance
- No need for scaling
Metric | Score |
---|---|
Accuracy | ~62% |
F1-Score | ~62% |
ROC-AUC Score | ~65% |
These metrics indicate moderate performance, showing that meaningful patterns exist in the data.
A classification report and confusion matrix were used to interpret results in detail, revealing that the model performs slightly better at predicting students not in a romantic relationship than those who are.
- Python Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- Algorithms: Random Forest Classifier
- Techniques: Label Encoding, Imputation, Standardization, EDA
- Improve model accuracy using hyperparameter tuning
- Apply synthetic oversampling (e.g., SMOTE) to balance classes
- Deploy findings in a dashboard or API for board use
If you're interested in this work or would like help extending it, feel free to reach out!