|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "7c7f9ca5", |
| 5 | + "id": "2433c099", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | 8 | "---\n", |
|
34 | 34 | "## The penguins dataset\n", |
35 | 35 | "We're going to use the penguins dataset of Allison Horst, published [here](https://github.com/allisonhorst/palmerpenguins). The dataset contains 344 observations of size measurements for three penguin species (Chinstrap, Gentoo and Adélie) observed on three islands in the Palmer Archipelago, Antarctica.\n", |
36 | 36 | "\n", |
37 | | - "\n", |
| 37 | + "\n", |
38 | 38 | "\n", |
39 | 39 | "The physical attributes measured are flipper length, beak length, beak width, body mass, and sex.\n", |
40 | | - "\n", |
| 40 | + "\n", |
41 | 41 | "\n", |
42 | 42 | "In other words, the dataset contains 344 rows with 7 features, i.e. 5 physical attributes, the species, and the island where the observations were made.\n", |
43 | 43 | "\n", |
|
126 | 126 | "~~~\n", |
127 | 127 | "{: .language-python}\n", |
128 | 128 | "\n", |
129 | | - "\n", |
| 129 | + "\n", |
130 | 130 | "\n", |
131 | 131 | "As there are four measurements for each penguin, we need quite a few plots to visualise all four dimensions against each other. Here is a handy Seaborn function to do so:\n", |
132 | 132 | "\n", |
|
136 | 136 | "~~~\n", |
137 | 137 | "{: .language-python}\n", |
138 | 138 | "\n", |
139 | | - "\n", |
| 139 | + "\n", |
140 | 140 | "\n", |
141 | 141 | "We can see that penguins from each species form fairly distinct spatial clusters in these plots, so you could draw lines between those clusters to delineate each species. This is effectively what many classification algorithms do: they use the training data to partition the observation space, in this case the 4 measurement dimensions, into classes. When given a new observation, the model finds which of those class regions the new observation falls into.\n", |
142 | 142 | "\n", |
143 | 143 | "\n", |
144 | 144 | "## Classification using a decision tree\n", |
145 | 145 | "We'll first apply a decision tree classifier to the data. Decision trees are conceptually similar to flow diagrams (or more precisely for the biologists: dichotomous keys). They split the classification problem into a binary tree of comparisons, at each step comparing a measurement to a value, and moving left or right down the tree until a classification is reached.\n", |
146 | 146 | "\n", |
147 | | - "\n", |
| 147 | + "\n", |
148 | 148 | "\n", |
149 | 149 | "\n", |
150 | 150 | "Training and using a decision tree in Scikit-Learn is straightforward:\n", |
|
183 | 183 | "~~~\n", |
184 | 184 | "{: .language-python}\n", |
185 | 185 | "\n", |
186 | | - "\n", |
| 186 | + "\n", |
187 | 187 | "\n", |
188 | 188 | "The first question (`depth=1`) splits the training data into \"Adelie\" and \"Gentoo\" categories using the criterion `flipper_length_mm <= 206.5`, and the next two questions (`depth=2`) split the \"Adelie\" and \"Gentoo\" categories into \"Adelie & Chinstrap\" and \"Gentoo & Chinstrap\" predictions.\n", |
189 | 189 | "\n", |
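To make the "binary tree of comparisons" concrete, a depth-2 tree like the one described above can be written out by hand as nested `if` statements. This is only a sketch: the first threshold (`flipper_length_mm <= 206.5`) comes from the text, but the choice of `bill_length_mm` for the second-level questions and both of its threshold values are hypothetical placeholders, not the values the fitted tree actually learned.

```python
# Hand-written stand-in for a depth-2 decision tree.
# ASSUMPTION: the second-level feature (bill_length_mm) and both of its
# thresholds are hypothetical; only the 206.5 mm split is from the lesson.
HYPOTHETICAL_SPLIT_SHORT_FLIPPER = 43.0  # placeholder value
HYPOTHETICAL_SPLIT_LONG_FLIPPER = 48.0   # placeholder value


def classify_penguin(flipper_length_mm, bill_length_mm):
    """Walk the tree from the root to a leaf and return the leaf's label."""
    if flipper_length_mm <= 206.5:
        # depth=1 answered "yes": the mostly-Adelie branch
        if bill_length_mm <= HYPOTHETICAL_SPLIT_SHORT_FLIPPER:
            return "Adelie"
        return "Chinstrap"
    # depth=1 answered "no": the mostly-Gentoo branch
    if bill_length_mm <= HYPOTHETICAL_SPLIT_LONG_FLIPPER:
        return "Gentoo"
    return "Chinstrap"


print(classify_penguin(flipper_length_mm=190.0, bill_length_mm=38.0))
print(classify_penguin(flipper_length_mm=220.0, bill_length_mm=46.0))
```

Each call answers at most two questions before reaching a prediction, which is exactly what `max_depth=2` limits the fitted tree to.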
|
214 | 214 | "~~~\n", |
215 | 215 | "{: .language-python}\n", |
216 | 216 | "\n", |
217 | | - "\n", |
| 217 | + "\n", |
218 | 218 | "\n", |
219 | 219 | "## Tuning the `max_depth` hyperparameter\n", |
220 | 220 | "\n", |
|
244 | 244 | "~~~\n", |
245 | 245 | "{: .language-python}\n", |
246 | 246 | "\n", |
247 | | - "\n", |
| 247 | + "\n", |
248 | 248 | "\n", |
249 | 249 | "Here we can see that the model with `max_depth=2` performs slightly better on the test data than those with `max_depth > 2`. This can seem counterintuitive: surely more questions should split up our categories better and thus give better predictions?\n", |
250 | 250 | "\n", |
|
260 | 260 | "~~~\n", |
261 | 261 | "{: .language-python}\n", |
262 | 262 | "\n", |
263 | | - "\n", |
| 263 | + "\n", |
264 | 264 | "\n", |
265 | 265 | "It looks like our decision tree has split up the training data into the correct penguin categories, and more accurately than the `max_depth=2` model did. However, it used some very specific questions to do so. Let's try visualising the classification space for a more intuitive understanding:\n", |
266 | 266 | "~~~\n", |
|
277 | 277 | "~~~\n", |
278 | 278 | "{: .language-python}\n", |
279 | 279 | "\n", |
280 | | - "\n", |
| 280 | + "\n", |
281 | 281 | "\n", |
282 | 282 | "Earlier we saw that the `max_depth=2` model split the data into 3 simple bounding boxes, whereas for `max_depth=5` we see the model has created some very specific classification boundaries to correctly classify every point in the training data.\n", |
283 | 283 | "\n", |
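One way to check whether a tree is memorising its training data is to compare training and test accuracy as `max_depth` grows. The sketch below is not the lesson's own code: it assumes a synthetic, deliberately noisy 4-feature, 3-class dataset from `make_classification` in place of the penguins data, and the depths tried are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data, with 10% of labels flipped as noise.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit one tree per depth; None lets the tree grow until every leaf is pure.
scores = {}
for max_depth in (1, 2, 5, None):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[max_depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for max_depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={max_depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A training accuracy that keeps climbing while the test accuracy stalls or drops is the usual sign that the extra questions are fitting noise rather than real structure.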
|
454 | 454 | "- **`C`**: Balances smoothness of the decision boundary and misclassifications; start with `C=1`, increase for tighter boundaries, decrease to prevent overfitting.\n", |
455 | 455 | "\n", |
456 | 456 | "\n", |
457 | | - "\n", |
| 457 | + "\n", |
458 | 458 | "\n", |
459 | 459 | "While this SVM model performs slightly worse than our decision tree (95.6% vs. 98.5%), it's likely that the non-linear boundaries will perform better when exposed to more real data, as decision trees are prone to overfitting and require complex linear models to reproduce simple non-linear boundaries. It's important to pick a model that is appropriate for your problem and data trends!\n", |
460 | 460 | "\n", |
|