This Python-based data science project analyses passenger data from the Titanic disaster to predict survival using data preprocessing, exploratory data analysis (EDA), and machine learning. The code is structured for use in VS Code and consists of modular .py scripts.
The objective is to explore and model the Titanic dataset, identify the key features affecting survival, and build a machine learning model for prediction. The project uses the Kaggle Titanic dataset and covers data cleaning, visualization, and machine learning.
titanic_survival_project/
├── data/
│   ├── titanic.csv              # Place original dataset here
│   └── titanic_cleaned.csv      # Output of cleaning script
├── figures/
│   ├── age_distribution.png
│   └── survival_by_sex.png
├── models/
│   ├── encoder_embarked.pkl
│   ├── encoder_sex.pkl
│   └── logistics_model.pkl
├── src/
│   ├── data_cleaning.py
│   ├── exploratory_analysis.py
│   ├── model_training.py
│   └── utils.py
├── .gitignore
├── power_point_presentation.py
├── README.md
├── requirements.txt
└── Titanic_Survival_Prediction_Project.pptx
- Python 3.8+
- pandas, numpy
- seaborn, matplotlib
- scikit-learn, joblib
- PowerPoint
- CSV file
- VS Code
- Clone the repository
https://github.com/AAdewunmi/titanic_survival_project.git
- Create a virtual environment (macOS/Linux)
# Set up virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run scripts
python src/data_cleaning.py
python src/exploratory_analysis.py
python src/model_training.py
- Dropped columns with excessive missing values (deck, embark_town)
- Filled missing age values with the median
- Encoded categorical variables like sex and embarked
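The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/data_cleaning.py: the function name `clean_titanic` is hypothetical, the toy rows are invented, and integer category codes are an assumed encoding scheme (the encoder_*.pkl files suggest the project stores fitted encoders instead).

```python
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a Titanic-style DataFrame (hypothetical helper)."""
    cleaned = df.copy()
    # Drop the columns named in the text; errors="ignore" tolerates their absence.
    cleaned = cleaned.drop(columns=["deck", "embark_town"], errors="ignore")
    # Fill missing ages with the median of the observed ages.
    cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
    # Encode categoricals as integer codes (assumed scheme; missing values become -1).
    for column in ("sex", "embarked"):
        cleaned[column] = cleaned[column].astype("category").cat.codes
    return cleaned


# Toy rows invented for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({
    "age": [22.0, None, 38.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
    "deck": [None, "C", None],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
})
cleaned_toy = clean_titanic(toy)
print(cleaned_toy)
```

Category codes follow the sorted order of the labels, so in this toy frame female maps to 0 and male to 1.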
- Visualized survival rates by gender and class
- Created a correlation heatmap
- Discovered that sex, pclass, and fare are strong indicators of survival
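The survival-rate comparison behind these findings could look something like the sketch below. It assumes a target column named `survived` (not stated in this README) and uses a handful of invented rows, so the numbers say nothing about the real data.

```python
import pandas as pd

# Invented rows for illustration; the real analysis would load data/titanic_cleaned.csv.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [1, 0, 1, 0, 0, 1],  # assumed column name
})

# Mean of a 0/1 column is the survival rate within each group.
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_sex_class = toy.groupby(["sex", "pclass"])["survived"].mean()

# Pairwise correlations: the numbers a correlation heatmap would display.
corr = toy[["pclass", "survived"]].corr()

print(rate_by_sex)
print(rate_by_sex_class)
print(corr)
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would then produce the heatmap itself.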
- Selected key features: pclass, sex, age, sibsp, parch, fare, embarked
- Used RandomForestClassifier from scikit-learn
- Trained and tested the model using an 80/20 train-test split
- Achieved ~80% accuracy
- Evaluated with precision, recall, and F1 score
- Visualized feature importance
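A training run along the lines described above might be sketched as follows. It is not the code in src/model_training.py: the data is synthetic, the label rule is made up so the model has something to learn, and the `survived` column name, `n_estimators`, and `random_state` values are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]

# Synthetic stand-in data so the sketch runs without titanic_cleaned.csv.
rng = np.random.default_rng(0)
n_rows = 200
data = pd.DataFrame({
    "pclass": rng.integers(1, 4, n_rows),
    "sex": rng.integers(0, 2, n_rows),
    "age": rng.uniform(1, 80, n_rows),
    "sibsp": rng.integers(0, 4, n_rows),
    "parch": rng.integers(0, 3, n_rows),
    "fare": rng.uniform(5, 250, n_rows),
    "embarked": rng.integers(0, 3, n_rows),
})
# Made-up label rule, used only to give the classifier a learnable signal.
data["survived"] = ((data["sex"] == 0) | (data["pclass"] == 1)).astype(int)

# 80/20 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    data[FEATURES], data["survived"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Impurity-based importances, sorted so the strongest features come first.
importances = pd.Series(model.feature_importances_, index=FEATURES).sort_values(ascending=False)
print(f"accuracy: {accuracy:.4f}")
print(importances)
```

Fixing `random_state` in both the split and the classifier keeps repeated runs comparable.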
- Gender (sex) and class (pclass) were the most influential features.
- Model demonstrated good generalization on unseen data.
- The logistic regression model achieved an accuracy of 0.8101 on the test set. This means the model correctly predicted survival for about 81% of passengers.
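To make the reported metrics concrete, here is a small pure-Python sketch of how accuracy, precision, recall, and F1 are derived from true and predicted labels. The helper name `classification_metrics` and the ten example labels are invented for illustration; the project itself presumably relies on scikit-learn's metric functions.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, treating 1 as the positive (survived) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case 8 of 10 predictions are correct, giving an accuracy of 0.8; an accuracy of 0.8101 is read the same way over the full test set.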
The following PowerPoint presentation details the key findings of the project.
It was generated with power_point_presentation.py.
Titanic Survival Prediction Project
- Age Distribution
- Survival by Gender
- Explore other classification models (e.g., Random Forest, Support Vector Machines).
- Perform feature scaling and selection.
- Conduct more in-depth EDA.
- Fine-tune the model hyperparameters.
If you have questions or suggestions, feel free to reach out or open an issue.
Adrian Adewunmi – GitHub

