This repository contains my solution for the Kaggle Titanic - Machine Learning from Disaster competition. The goal of this challenge is to predict which passengers survived the sinking of the RMS Titanic.
This project explores various feature engineering techniques and machine learning models to predict passenger survival on the Titanic. The current submission achieved a score of 0.7512 on the Kaggle leaderboard.
I plan to continue iterating on this project, incorporating new ideas for feature engineering and fine-tuning algorithms to improve the prediction accuracy.
- `titanic_submission.ipynb`: This Jupyter notebook contains all the code for data loading, preprocessing, feature engineering, model training, and prediction.
- `input/`: This directory is expected to contain the `train.csv` and `test.csv` datasets from the Kaggle competition.
The titanic_submission.ipynb notebook implements a robust Scikit-learn Pipeline to handle data preprocessing and feature engineering steps. Custom transformers were created to encapsulate specific transformations, making the pipeline modular and reproducible.
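For context, a custom scikit-learn transformer only needs `fit` and `transform` methods and can then be chained like any built-in step. The sketch below illustrates the pattern with `HadCabinTransformer`; the body shown is an assumption for illustration, not copied from the notebook:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class HadCabinTransformer(BaseEstimator, TransformerMixin):
    """Add a binary HadCabin flag, then drop the sparse Cabin column."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        X = X.copy()
        X["HadCabin"] = X["Cabin"].notna().astype(int)
        return X.drop(columns=["Cabin"])

# Custom transformers chain like any other pipeline step:
preprocessing = Pipeline(steps=[
    ("had_cabin", HadCabinTransformer()),
    # ... further transformers from the list below ...
])
```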
Key feature engineering steps include:
- `WasAccompainedTransformer`: Creates a binary feature indicating whether a passenger was accompanied by family (siblings, spouses, parents, or children).
- `HadCabinTransformer`: Extracts a binary feature indicating if a passenger had a recorded cabin, and then drops the original `Cabin` column due to its high number of missing values.
- `FareOutlierRemover`: Handles outliers in the `Fare` column by replacing extreme values with the mean of the non-outlier fares.
- `TitleTransformer`: Extracts titles from passenger names (e.g., Mr., Miss, Mrs., Master) and groups less common titles into a "Rare" category. The original `Name` column is then dropped.
- `AgeInputer`: Imputes missing `Age` values based on the median age associated with each passenger's extracted title. This leverages the insight that age distributions often differ significantly across titles (e.g., "Master" typically refers to young boys). A sketch of this pattern follows the list.
- `IsChildTransformer`: Creates a binary feature indicating if a passenger is a minor (age < 18).
- `FamilySizeTransformer`: Calculates the total family size for each passenger (`SibSp` + `Parch` + 1).
- `AgeGroupTransformer`: (Defined in the notebook but not used in the final pipeline for one-hot encoding.) Creates categorical features for different age groups (Baby, Child, Teenager, Adult, OldAdult, Old).
- `OneHotTransformer`: Applies one-hot encoding to categorical features such as `Sex`, `Embarked`, and the newly created `Title`.
- `NaDropper`: Drops any remaining rows with missing values after imputation.
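As an example, here is a minimal sketch of the title-based age imputation described above. It assumes a `Title` column already exists (created earlier by `TitleTransformer`); the internals of the notebook's `AgeInputer` may differ:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class AgeInputer(BaseEstimator, TransformerMixin):
    """Fill missing Age values with the median age of each Title group."""

    def fit(self, X, y=None):
        # Per-title medians are learned on the training split only,
        # so the same values are reused when transforming the test set.
        self.title_medians_ = X.groupby("Title")["Age"].median()
        self.global_median_ = X["Age"].median()  # fallback for unseen titles
        return self

    def transform(self, X):
        X = X.copy()
        fill = X["Title"].map(self.title_medians_).fillna(self.global_median_)
        X["Age"] = X["Age"].fillna(fill)
        return X
```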
The solution explores several machine learning models, including:
- XGBoost Classifier
- LightGBM Classifier
- CatBoost Classifier
- Voting Classifier: An ensemble method combining the predictions of multiple individual classifiers to leverage their strengths.
The models are evaluated using cross-validation to ensure robust performance metrics.
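A hedged sketch of how such an ensemble can be wired up and cross-validated is shown below. The hyperparameters are illustrative placeholders (not the notebook's values), and `X_train`/`y_train` are assumed to be the preprocessed training features and labels:

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200, random_state=42)),
        ("lgbm", LGBMClassifier(n_estimators=200, random_state=42)),
        ("cat", CatBoostClassifier(iterations=200, verbose=0, random_state=42)),
    ],
    voting="soft",  # average the predicted probabilities of all members
)

# 5-fold cross-validation on the preprocessed training data
scores = cross_val_score(ensemble, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```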
The current best submission achieved a score of 0.7512 on the Kaggle public leaderboard. This indicates a good starting point, and there's definitely room for further improvement!
- Advanced Feature Engineering: Explore more complex feature interactions, such as combining `Pclass` with `Fare` or `Age` with `Sex`.
- Hyperparameter Tuning: Conduct more extensive hyperparameter optimization for each model using techniques like `GridSearchCV` or `RandomizedSearchCV` (a minimal sketch follows this list).
- Ensemble Methods: Experiment with different weighting strategies or stacking ensembles to further improve prediction accuracy.
- Error Analysis: Analyze misclassified samples to gain insights into areas where the model struggles and identify potential new features.
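For the hyperparameter-tuning item, a minimal `GridSearchCV` sketch is shown below. The parameter grid values are illustrative assumptions, and `X_train`/`y_train` again denote the preprocessed training data:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200, 400],
}

search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,              # evaluate each combination with 5-fold CV
    scoring="accuracy",
    n_jobs=-1,         # use all available cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```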