This project tackles the public Kaggle competition based on the Titanic disaster. The goal is to predict survival for the passengers of a test dataset, learning from a training dataset that contains information about almost 900 passengers along with whether each of them survived. We have to learn from that data in order to make predictions that are as accurate as possible for the test dataset.
- This project is purely a personal challenge for educational purposes, with no other pretension.
- My goal was (obviously) not to reinvent the wheel; you will find nothing really new here. (Almost) everything comes from public Kaggle kernels, and this work is heavily inspired by several good readings.
- In the end, the objective was also to discover, understand and improve my personal skills in data exploration, correlation analysis and data manipulation with the pandas and seaborn packages.
- Kaggle kernel from Manav Sehgal
- PyconUK tutorial notebooks
- this EDA notebook
- and other public Kaggle kernels/tutorials
This project works with Python 3.6.x (not 3.7, as the TensorFlow backend does not yet support versions above 3.6). If not already installed, use pip to install the following packages (the versions used during this work are given for information):
- keras (2.2.4) (deep learning and neural networks made easy)
- tensorflow (1.12.0) (backend used by keras)
- scikit-learn (0.20.2) (machine learning)
- numpy (1.15.4)
- pandas (0.23.4) (data manipulation)
- seaborn (0.9.0) (data visualization)
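For convenience, the pinned versions above can be collected in a `requirements.txt` file (not part of the original repository, just a suggestion) so that a single `pip install -r requirements.txt` reproduces the environment:

```
keras==2.2.4
tensorflow==1.12.0
scikit-learn==0.20.2
numpy==1.15.4
pandas==0.23.4
seaborn==0.9.0
```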
The goal of this project is to:
- clean and prepare the training and test datasets
- build and train an ML model on those cleaned datasets (so you have to clean and prepare the data at least once) and make predictions
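To give an idea of what the "clean and prepare" step involves, here is a minimal sketch using pandas on a tiny synthetic sample. The column names (`Age`, `Sex`, `Embarked`) come from the real Kaggle Titanic dataset, but the `prepare` function and its exact choices (median/mode imputation, 0/1 encoding) are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning sketch: fill missing values and encode categoricals."""
    df = df.copy()
    # Fill missing ages with the median age, missing embarkation port with the mode
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Encode 'Sex' as 0/1 so a model can consume it
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return df

# Tiny synthetic sample mimicking the Kaggle columns
raw = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "S", None],
})
clean = prepare(raw)
print(clean["Age"].tolist())       # [22.0, 30.0, 38.0]
print(clean["Embarked"].tolist())  # ['S', 'S', 'S']
```

The real datasets have more columns (ticket, fare, cabin, family counts), but the pattern is the same: impute, encode, save.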
Under the `notebooks` folder, there are 2 Jupyter notebooks used to:
- get a basic first look at the data
- visualize the datasets and check whether there are some correlations
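As a taste of the kind of correlation check the second notebook performs, here is a small pandas sketch. The column names are real Titanic columns, but the rows are made up for illustration:

```python
import pandas as pd

# Toy sample standing in for the Titanic training set (real columns, made-up rows)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Survival rate per passenger class: the kind of relationship the EDA surfaces
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_class)

# Numeric correlation between class and survival (negative: higher class number,
# lower survival in this toy sample)
print(df["Pclass"].corr(df["Survived"]))
```

In the notebooks, seaborn turns the same aggregations into bar plots and heatmaps.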
Some files & folders are not published in this repository; you will have to create them:
- the train and test datasets, which you can download from Kaggle and save under a `datasets` directory placed directly in the project tree structure. The program assumes the files are stored there (adapt it to fit your needs if necessary)
- a `results` directory, which should be created to store the prediction CSV files generated by the chosen model
For that, `main.py` takes 2 arguments (both are mandatory):
- `-m` (or `--model`): should equal `rf` or `nn`, depending on which model you would like to use. `rf` stands for RandomForest classifier, `nn` stands for Neural Network
- `-o` (or `--objective`): should equal `prepare`, `predict` or `both`, depending on what you would like to do. `prepare` will read the datasets, clean them, fill missing values, perform other operations and, in the end, save the transformed datasets in the same directory (i.e. `datasets`). `predict` will build the chosen model (set with the `-m` argument) and make predictions from the transformed test dataset. `both` will perform `prepare` then `predict`
Example: `main.py -m rf -o predict`
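The two flags described above could be parsed with `argparse` along these lines; this is a sketch of the interface, and the actual `main.py` may differ in its details:

```python
import argparse

def parse_args(argv=None):
    """Sketch of the CLI described above: two mandatory, restricted-choice flags."""
    parser = argparse.ArgumentParser(description="Titanic survival predictions")
    parser.add_argument("-m", "--model", required=True, choices=["rf", "nn"],
                        help="rf = RandomForest classifier, nn = Neural Network")
    parser.add_argument("-o", "--objective", required=True,
                        choices=["prepare", "predict", "both"],
                        help="prepare the datasets, make predictions, or both")
    return parser.parse_args(argv)

args = parse_args(["-m", "rf", "-o", "predict"])
print(args.model, args.objective)  # rf predict
```

Using `choices=` means argparse itself rejects anything other than the documented values, with a usage message.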
With a basic RandomForest, I was able to reach the top 20% of the competition with a 0.79425 accuracy score, which is good but not excellent. The NN does not give better results.
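For reference, a RandomForest baseline of the kind used here takes only a few lines with scikit-learn. The data below is synthetic (a made-up rule where one feature dominates the label, loosely echoing how strongly `Sex` predicts survival); it is a sketch of the technique, not the project's actual training code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cleaned Titanic features (e.g. Pclass, Sex, Age, Fare)
rng = np.random.RandomState(42)
X = rng.rand(400, 4)
# Made-up labelling rule so the task is learnable: feature 1 dominates survival
y = (X[:, 1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # near-perfect on this easy synthetic rule
```

On the real competition data, most of the score comes from the feature preparation rather than the classifier itself.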
Next? Perhaps try another classification model, such as Logistic Regression.