nidragedd/datascience-titanic-challenge

What is it for?

This project aims to solve the public Kaggle competition based on the Titanic disaster. The goal is to make survival predictions for passengers, based on a given training dataset that contains information about almost 900 passengers and whether each of them survived. We have to learn from that data in order to make predictions on a test dataset that are as accurate as possible.

Disclaimer

  • This project is only a personal challenge for educational purposes, with no other pretension than that.
  • My goal was obviously not to reinvent the wheel; you will find nothing really new here. (Almost) everything comes from public Kaggle kernels, and this 'work' is highly inspired by several (good) readings.
  • In the end, the objective was also to discover, understand and improve my personal skills in data exploration, correlation analysis and manipulation with the pandas and seaborn packages.

Credits - Good readings before starting

Module dependencies (requirements)

This project works with Python 3.6.x (not 3.7, since the TensorFlow backend used by Keras does not yet support anything newer than 3.6). If they are not already installed, use pip to install the following packages (the versions used during this work are given for information):

  • keras (2.2.4) (deep learning and neural networks made easy)
  • tensorflow (1.12.0) (backend used by keras)
  • scikit-learn (0.20.2) (machine learning)
  • numpy (1.15.4)
  • pandas (0.23.4) (data manipulation)
  • seaborn (0.9.0) (data visualization)
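
For convenience, the list above can be captured in a requirements.txt file (versions taken from the list above) and installed in one go with pip install -r requirements.txt:

    keras==2.2.4
    tensorflow==1.12.0
    scikit-learn==0.20.2
    numpy==1.15.4
    pandas==0.23.4
    seaborn==0.9.0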

How to use it and assumptions

The goal of this project is either:

  • to clean and prepare the training and test datasets
  • to build and train an ML model on those cleaned datasets (so you have to clean and prepare the data at least once) and make predictions

EDA

Under the notebooks folder, there are 2 Jupyter notebooks used to:

  • get a first basic look at the data with this notebook
  • visualize the datasets and check whether there are correlations with this notebook (see the sketch below)
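
As a small illustration, here is the kind of correlation check done in the second notebook. This is a minimal sketch assuming the training dataset was saved as datasets/train.csv (see the next section), not the notebooks' exact code:

    # Minimal sketch: pairwise correlations in the training dataset.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    train = pd.read_csv("datasets/train.csv")  # assumed location, see below

    # Correlation matrix over the numeric columns only, drawn as a heatmap.
    corr = train.select_dtypes(include="number").corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm")
    plt.show()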

Directory structure

Some files & folders are not published in this repository, you will have to create them:

  • train and test datasets that you can download from Kaggle and save under a datasets directory directly in the project tree structure. At least, this program assumes that files are stored here (adapt it to fit your needs if necessary).
  • a results directory should be created to store the predictions CSV files generated by the chosen model
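
Assuming the Kaggle file names (train.csv and test.csv), the expected layout would look like this:

    datascience-titanic-challenge/
    ├── datasets/      <- to create; put train.csv and test.csv here
    ├── notebooks/
    ├── results/       <- to create; prediction CSV files are written here
    └── main.py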

Main script usage

The main.py script takes 2 arguments (both are mandatory):

  • -m (or "--model"): should be rf or nn, depending on which model you would like to use
  • -o (or "--objective"): should be prepare, predict or both, depending on what you would like to do
    • prepare will read the datasets, clean them, fill missing values, perform other operations and, in the end, save the transformed datasets in the same directory (i.e. datasets)
    • predict will build the chosen model (given by the -m argument) and make predictions from the transformed test dataset
    • both will perform the 'prepare' then the 'predict' operations

Example: main.py -m rf -o predict
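
For reference, here is a minimal sketch of how such a command line can be parsed with argparse. It mirrors the description above but is only an illustration, not necessarily the exact code of main.py; prepare_datasets and predict are hypothetical placeholders:

    # Minimal argparse sketch matching the CLI described above.
    import argparse

    def prepare_datasets():
        # Hypothetical placeholder for the real cleaning/preparation step.
        print("preparing datasets...")

    def predict(model_name):
        # Hypothetical placeholder for model building and prediction.
        print("predicting with %s..." % model_name)

    parser = argparse.ArgumentParser(description="Kaggle Titanic challenge runner")
    parser.add_argument("-m", "--model", required=True, choices=["rf", "nn"],
                        help="model to use: rf (RandomForest) or nn (neural network)")
    parser.add_argument("-o", "--objective", required=True,
                        choices=["prepare", "predict", "both"],
                        help="prepare the datasets, make predictions, or both")
    args = parser.parse_args()

    if args.objective in ("prepare", "both"):
        prepare_datasets()
    if args.objective in ("predict", "both"):
        predict(args.model)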

Results

With a basic RandomForest, I have been able to enter the top 20% of the competition with a 0.79425 accuracy score, which is good but not excellent. The NN does not give better results.
Next? Perhaps try another classification model, such as Logistic Regression or another one...
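
For the curious, here is a minimal sketch of such a RandomForest baseline with scikit-learn. File names, feature handling and hyperparameters are assumptions for illustration, not the exact ones used in this project:

    # Minimal RandomForest baseline sketch (assumed file names and parameters).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Assumes the 'prepare' step already produced cleaned, numeric-only datasets.
    train = pd.read_csv("datasets/train_cleaned.csv")  # assumed file name
    test = pd.read_csv("datasets/test_cleaned.csv")    # assumed file name

    X = train.drop(columns=["Survived", "PassengerId"])  # 'Survived' is the label
    y = train["Survived"]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Kaggle submission format: PassengerId + predicted Survived.
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": model.predict(test.drop(columns=["PassengerId"])),
    })
    submission.to_csv("results/predictions.csv", index=False)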
