This project tackles the public Kaggle competition based on the Titanic disaster. The goal is to predict survival for the passengers of a test dataset, learning from a training dataset that contains information about almost 900 passengers along with whether each of them survived. We have to learn from that data in order to make predictions that are as accurate as possible for the test dataset.
- This project is purely a personal challenge for educational purposes, with no other pretension.
- My goal was (obviously) not to reinvent the wheel; you will find nothing really new here. (Almost) everything comes from public Kaggle kernels, and this work is heavily inspired by several good readings.
- In the end, the objective was also to discover, understand and improve my personal skills in data exploration, correlation analysis and data manipulation with the pandas and seaborn packages.
- Kaggle kernel from Manav Sehgal
- PyconUK tutorial notebooks
- this EDA notebook
- and other public Kaggle kernels/tutorials
This project works with Python 3.6.x (not 3.7, as the TensorFlow backend does not yet support versions above 3.6). If not already installed, use pip to install the following packages (the versions used during this work are given for information):
- keras (2.2.4) (deep learning and neural networks made easy)
- tensorflow (1.12.0) (backend used by keras)
- scikit-learn (0.20.2) (machine learning)
- numpy (1.15.4)
- pandas (0.23.4) (data manipulation)
- seaborn (0.9.0) (data visualization)
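For convenience, the pinned versions above can be collected in a `requirements.txt` file (not part of the original repository, just a suggestion) so that a single `pip install -r requirements.txt` reproduces the environment:

```
keras==2.2.4
tensorflow==1.12.0
scikit-learn==0.20.2
numpy==1.15.4
pandas==0.23.4
seaborn==0.9.0
```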
The goal of this project is to:
- clean and prepare the training and test datasets
- build and train an ML model on those cleaned datasets (so you have to clean and prepare the data at least once) and make predictions
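To give an idea of what the "clean and prepare" step involves, here is a minimal sketch using pandas on a tiny synthetic sample. The column names (`Age`, `Sex`, `Embarked`) come from the real Kaggle Titanic dataset, but the `prepare` function and its exact choices (median/mode imputation, 0/1 encoding) are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning sketch: fill missing values and encode categoricals."""
    df = df.copy()
    # Fill missing ages with the median age, missing embarkation port with the mode
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Encode 'Sex' as 0/1 so a model can consume it
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return df

# Tiny synthetic sample mimicking the Kaggle columns
raw = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "S", None],
})
clean = prepare(raw)
print(clean["Age"].tolist())       # [22.0, 30.0, 38.0]
print(clean["Embarked"].tolist())  # ['S', 'S', 'S']
```

The real datasets have more columns (ticket, fare, cabin, family counts), but the pattern is the same: impute, encode, save.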
Under the `notebooks` folder, there are 2 Jupyter notebooks used to:
- get a basic first look at the data
- visualize the datasets and check whether there are some correlations
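As a taste of the kind of correlation check the second notebook performs, here is a small pandas sketch. The column names are real Titanic columns, but the rows are made up for illustration:

```python
import pandas as pd

# Toy sample standing in for the Titanic training set (real columns, made-up rows)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Survival rate per passenger class: the kind of relationship the EDA surfaces
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_class)

# Numeric correlation between class and survival (negative: higher class number,
# lower survival in this toy sample)
print(df["Pclass"].corr(df["Survived"]))
```

In the notebooks, seaborn turns the same aggregations into bar plots and heatmaps.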
Some files & folders are not published in this repository; you will have to create them:
- the train and test datasets, which you can download from Kaggle and save under a `datasets` directory placed directly in the project tree structure. The program assumes the files are stored there (adapt it to fit your needs if necessary)
- a `results` directory, which should be created to store the prediction CSV files generated by the chosen model
For that, `main.py` takes 2 arguments (both are mandatory):
- `-m` (or `--model`): should equal `rf` or `nn`, depending on which model you would like to use. `rf` stands for RandomForest classifier, `nn` stands for Neural Network
- `-o` (or `--objective`): should equal `prepare`, `predict` or `both`, depending on what you would like to do. `prepare` will read the datasets, clean them, fill missing values, perform other operations and, in the end, save the transformed datasets in the same directory (i.e. `datasets`). `predict` will build the chosen model (set with the `-m` argument) and make predictions from the transformed test dataset. `both` will perform `prepare` then `predict`
Example: `main.py -m rf -o predict`
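The two flags described above could be parsed with `argparse` along these lines; this is a sketch of the interface, and the actual `main.py` may differ in its details:

```python
import argparse

def parse_args(argv=None):
    """Sketch of the CLI described above: two mandatory, restricted-choice flags."""
    parser = argparse.ArgumentParser(description="Titanic survival predictions")
    parser.add_argument("-m", "--model", required=True, choices=["rf", "nn"],
                        help="rf = RandomForest classifier, nn = Neural Network")
    parser.add_argument("-o", "--objective", required=True,
                        choices=["prepare", "predict", "both"],
                        help="prepare the datasets, make predictions, or both")
    return parser.parse_args(argv)

args = parse_args(["-m", "rf", "-o", "predict"])
print(args.model, args.objective)  # rf predict
```

Using `choices=` means argparse itself rejects anything other than the documented values, with a usage message.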
With a basic RandomForest, I was able to reach the top 20% of the competition with a 0.79425 accuracy score, which is good but not excellent. The NN does not give better results.
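For reference, a RandomForest baseline of the kind used here takes only a few lines with scikit-learn. The data below is synthetic (a made-up rule where one feature dominates the label, loosely echoing how strongly `Sex` predicts survival); it is a sketch of the technique, not the project's actual training code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cleaned Titanic features (e.g. Pclass, Sex, Age, Fare)
rng = np.random.RandomState(42)
X = rng.rand(400, 4)
# Made-up labelling rule so the task is learnable: feature 1 dominates survival
y = (X[:, 1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # near-perfect on this easy synthetic rule
```

On the real competition data, most of the score comes from the feature preparation rather than the classifier itself.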
Next? Perhaps try another classification model, such as Logistic Regression.