This project applies machine learning and deep learning techniques to predict passenger survival in the RMS Titanic disaster. Using the Titanic dataset from Kaggle, we walk through data cleaning, feature engineering, model training, and evaluation to build predictive models.
- Rows: 891 passengers
- Target: `Survived` (0 = No, 1 = Yes)
- Key Features:
  - `Pclass`: Passenger class (1st, 2nd, 3rd)
  - `Sex`: Gender
  - `Age`: Age in years (with missing values)
  - `SibSp`: Siblings/Spouses aboard
  - `Parch`: Parents/Children aboard
  - `Fare`: Ticket price
  - `Embarked`: Port of embarkation (C/Q/S)
- Filled missing `Age` with the median.
- Filled missing `Embarked` with the mode.
- Dropped `Cabin` (too many missing values).
- Created new features:
  - `FamilySize` = `SibSp` + `Parch` + 1
  - `IsAlone` (binary flag)
  - `Title` extracted from names (Mr, Mrs, Miss, etc.)
- Encoded categorical variables (`Sex`, `Embarked`, `Title`).
- Scaled continuous features (`Age`, `Fare`) for distance-based models.
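The cleaning and feature-engineering steps above can be sketched in pandas. This is a minimal illustration on a three-row toy frame whose columns match the Kaggle schema; the exact imputation, encoding, and scaling choices in the project notebook may differ:

```python
import pandas as pd

# Toy frame standing in for the Kaggle training data (columns match the schema).
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.92],
    "Embarked": ["S", None, "S"],
})

# Imputation: median Age, modal Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Engineered features.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
# Titles sit between the comma and the period in each name.
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

# Encode categoricals as integer codes, then z-score the continuous columns.
for col in ["Sex", "Embarked", "Title"]:
    df[col] = df[col].astype("category").cat.codes
for col in ["Age", "Fare"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```

In a real run, the same transformations fitted on the training split (median, mode, scaling statistics) should be reapplied unchanged to the test split to avoid leakage.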
- Baseline Models:
  - Logistic Regression
  - Random Forest
  - Gradient Boosting
  - SVM
  - KNN
  - Naive Bayes
- Advanced Models:
  - XGBoost (with hyperparameter tuning & class imbalance handling)
  - Stacking Ensemble (Random Forest + Gradient Boosting + XGBoost, meta-learner = Logistic Regression)
  - Neural Network (Keras Sequential with hidden layers)
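The stacking ensemble can be sketched with scikit-learn's `StackingClassifier`. This minimal version uses synthetic data in place of the preprocessed Titanic features and drops the XGBoost base learner to stay dependency-free; in the project it would be added as a third estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic feature matrix.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        # In the project, an XGBClassifier is the third base learner here.
    ],
    final_estimator=LogisticRegression(),  # meta-learner
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

The `cv=5` setting matters: the meta-learner trains on out-of-fold predictions from the base models, which keeps it from simply memorizing their training-set outputs.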
| Model | Accuracy (Test Set) |
|---|---|
| Logistic Regression | 0.782 |
| Random Forest | 0.832 |
| Gradient Boosting | 0.838 |
| SVM | 0.832 |
| KNN | 0.804 |
| Naive Bayes | 0.782 |
| XGBoost (tuned) | 0.855 |
| Stacking Ensemble | 0.844 |
| Neural Network | ~0.81 |
**Best Model:** XGBoost with class imbalance handling (accuracy ≈ 85.5%)
- Sex and Title were among the strongest predictors of survival.
- Scaling significantly improved performance for SVM and KNN.
- Handling class imbalance in XGBoost boosted test accuracy.
- Ensemble methods (Stacking) gave performance close to the best single tuned model.
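On the class-imbalance point: one common approach with XGBoost is to set `scale_pos_weight` to the ratio of negative to positive training examples. A sketch of the computation, using the known class counts of the Kaggle training set (342 survivors out of 891 passengers):

```python
import numpy as np

# Kaggle training set: 549 non-survivors, 342 survivors.
y = np.array([0] * 549 + [1] * 342)

# Ratio of negatives to positives, a standard value for
# XGBoost's scale_pos_weight parameter.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
# Then: XGBClassifier(scale_pos_weight=scale_pos_weight, ...)
```

The resulting ratio (about 1.6) up-weights the minority `Survived = 1` class during boosting; whether this exact scheme was used here is an assumption, as other options (e.g. resampling) also handle imbalance.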
- Python (Pandas, NumPy, Matplotlib, Scikit-learn, XGBoost, TensorFlow/Keras)
- Jupyter Notebook for analysis and visualization
- Try more feature engineering (e.g., ticket grouping, cabin deck extraction).
- Experiment with cross-validation and ensemble blending.
- Deploy the best model with Streamlit/Flask for interactive predictions.
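As a starting point for the cross-validation experiment listed above, a minimal stratified k-fold sketch with scikit-learn (synthetic data stands in for the real feature matrix, and `GradientBoostingClassifier` is an arbitrary example estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in; in the project, X and y come from the cleaned dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Stratified folds preserve the survived/died ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=cv)
mean_acc = scores.mean()
```

Reporting the mean and standard deviation across folds gives a more stable estimate than the single test-set accuracies in the table above.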