ml-python-challenge

A simple code challenge in Python applied to a Machine Learning problem

Code Challenge description

The challenge requests to create Machine Learning model, according to dataset provided (file dataset.csv) which contains a list of ships and related features. The model should be able to predict 'crew' when receives input values. dataset.csv file has the fields below:

Ship_name
Cruise_line
Age
Tonnage
passengers
length
cabins
passenger_density
crew

Solution

The solution is a Regression Analysis Model (Linear Regression, Polynomial Regression, Random Forests, etc.) because of continuous variables. We build the solutions on three steps:

data analysis focusing on find features correlated to target variable
build the simplest model (es. Linear Regression) and evaluate it, if the result is not so good
build more complex model (es. Random Forests) and evaluate it

Data Analysis

For this step we can use:

scatterplot matrix, to select correlated variables
covariance matrix, to evaluate the best correlated variable (for a first linear regression model)

Running covariance_matrix.py python code we can plot the two images below:

looking on the first image we see that we have to investigate “Tonnage” and “passengers” variables;

while looking on the second image we see that we have to investigate “cabins” and “length” variables too: the four variable seems to be correlated to crew (target variable).

Using covariance matrix:

we see that passengers variable is the candidate to evaluate a Linear Regression Model.

Linear Regression

Linear Regression Model is the simplest one and it is very similar to Adaptive Linear Neuron Model. We can use the implementation in Scikit-Learn library which has an optimized implementation (file sklearn.linear.py).

For evaluation we use two measures:

mean squared error
coefficient of determination

below the results:

0.93 for test dataset is already a good result.

Random Forests

We investigated also Random Forests. This model is based on decision tree (random_forest.py).

below the resutls:

0.98 for train dataset and 0.95 for test dataset is the best results.

The source code is taken from:

preferred-citation: type: book authors: "Raschka Sebastian" website: "https://sebastianraschka.com/news/2019-news/"

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dataset		dataset
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ml-python-challenge

Code Challenge description

Solution

Data Analysis

Linear Regression

Random Forests

About

Uh oh!

Releases

Packages

Languages

gcapuzzi/ml-python-challenge

Folders and files

Latest commit

History

Repository files navigation

ml-python-challenge

Code Challenge description

Solution

Data Analysis

Linear Regression

Random Forests

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages