|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "e53f64ca", |
| 5 | + "id": "a29779d5", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | 8 | "---\n", |
|
60 | 60 | { |
61 | 61 | "cell_type": "code", |
62 | 62 | "execution_count": null, |
63 | | - "id": "ac76ffb1", |
| 63 | + "id": "b4b5673e", |
64 | 64 | "metadata": {}, |
65 | 65 | "outputs": [], |
66 | 66 | "source": [ |
|
74 | 74 | }, |
75 | 75 | { |
76 | 76 | "cell_type": "markdown", |
77 | | - "id": "2d62170e", |
| 77 | + "id": "b5b55fea", |
78 | 78 | "metadata": {}, |
79 | 79 | "source": [ |
80 | 80 | "We can see that we have seven columns in total: 4 continuous (numerical) columns named `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`; and 3 discrete (categorical) columns named `species`, `island`, and `sex`. We can also see from a quick inspection of the first 5 samples that we have some missing data in the form of `NaN` values. Missing data is a fairly common occurrence in real-life data, so let's go ahead and remove any rows that contain `NaN` values:" |
|
83 | 83 | { |
84 | 84 | "cell_type": "code", |
85 | 85 | "execution_count": null, |
86 | | - "id": "e4fad2f2", |
| 86 | + "id": "869fd9b9", |
87 | 87 | "metadata": {}, |
88 | 88 | "outputs": [], |
89 | 89 | "source": [ |
|
93 | 93 | }, |
94 | 94 | { |
95 | 95 | "cell_type": "markdown", |
96 | | - "id": "516f6450", |
| 96 | + "id": "497591f9", |
97 | 97 | "metadata": {}, |
98 | 98 | "source": [ |
99 | 99 | "In this scenario we will train a linear regression model using `body_mass_g` as our feature data and `bill_depth_mm` as our label data. We will train our model on a subset of the data by slicing the first 146 samples of our cleaned data. \n", |
|
104 | 104 | { |
105 | 105 | "cell_type": "code", |
106 | 106 | "execution_count": null, |
107 | | - "id": "b8e72ebc", |
| 107 | + "id": "b2651838", |
108 | 108 | "metadata": {}, |
109 | 109 | "outputs": [], |
110 | 110 | "source": [ |
|
123 | 123 | }, |
124 | 124 | { |
125 | 125 | "cell_type": "markdown", |
126 | | - "id": "2f319a23", |
| 126 | + "id": "ab6258f7", |
127 | 127 | "metadata": {}, |
128 | 128 | "source": [ |
129 | 129 | "In this regression example we will create a Linear Regression model that will try to predict `y` values based upon `x` values.\n", |
|
148 | 148 | { |
149 | 149 | "cell_type": "code", |
150 | 150 | "execution_count": null, |
151 | | - "id": "2c68391f", |
| 151 | + "id": "2e440925", |
152 | 152 | "metadata": {}, |
153 | 153 | "outputs": [], |
154 | 154 | "source": [ |
|
161 | 161 | { |
162 | 162 | "cell_type": "code", |
163 | 163 | "execution_count": null, |
164 | | - "id": "e6d87d33", |
| 164 | + "id": "418b3305", |
165 | 165 | "metadata": {}, |
166 | 166 | "outputs": [], |
167 | 167 | "source": [ |
|
176 | 176 | }, |
177 | 177 | { |
178 | 178 | "cell_type": "markdown", |
179 | | - "id": "0e467680", |
| 179 | + "id": "eb3c0e63", |
180 | 180 | "metadata": {}, |
181 | 181 | "source": [ |
182 | 182 | "Next we’ll define a model, and train it on the pre-processed data. We’ll also inspect the trained model parameters `m` and `c`:" |
|
185 | 185 | { |
186 | 186 | "cell_type": "code", |
187 | 187 | "execution_count": null, |
188 | | - "id": "0d82bbd9", |
| 188 | + "id": "291e8505", |
189 | 189 | "metadata": {}, |
190 | 190 | "outputs": [], |
191 | 191 | "source": [ |
|
205 | 205 | }, |
206 | 206 | { |
207 | 207 | "cell_type": "markdown", |
208 | | - "id": "56bf0227", |
| 208 | + "id": "bbb96a1e", |
209 | 209 | "metadata": {}, |
210 | 210 | "source": [ |
211 | 211 | "Now we can make predictions using our trained model, and calculate the Root Mean Squared Error (RMSE) of our predictions:" |
|
214 | 214 | { |
215 | 215 | "cell_type": "code", |
216 | 216 | "execution_count": null, |
217 | | - "id": "c3b9346a", |
| 217 | + "id": "10412803", |
218 | 218 | "metadata": {}, |
219 | 219 | "outputs": [], |
220 | 220 | "source": [ |
|
232 | 232 | }, |
233 | 233 | { |
234 | 234 | "cell_type": "markdown", |
235 | | - "id": "e931b1b1", |
| 235 | + "id": "80063cb0", |
236 | 236 | "metadata": {}, |
237 | 237 | "source": [ |
238 | 238 | "Finally, we’ll plot our input data, our linear fit, and our predictions:" |
|
241 | 241 | { |
242 | 242 | "cell_type": "code", |
243 | 243 | "execution_count": null, |
244 | | - "id": "2c2449db", |
| 244 | + "id": "19c3008b", |
245 | 245 | "metadata": {}, |
246 | 246 | "outputs": [], |
247 | 247 | "source": [ |
|
256 | 256 | }, |
257 | 257 | { |
258 | 258 | "cell_type": "markdown", |
259 | | - "id": "17cdb92e", |
| 259 | + "id": "c30792e3", |
260 | 260 | "metadata": {}, |
261 | 261 | "source": [ |
262 | 262 | "Congratulations! We've now created our first machine-learning model of the lesson and we can now make predictions of `bill_depth_mm` for any `body_mass_g` values that we pass into our model.\n", |
|
267 | 267 | { |
268 | 268 | "cell_type": "code", |
269 | 269 | "execution_count": null, |
270 | | - "id": "70ea12e1", |
| 270 | + "id": "83916985", |
271 | 271 | "metadata": {}, |
272 | 272 | "outputs": [], |
273 | 273 | "source": [ |
|
291 | 291 | }, |
292 | 292 | { |
293 | 293 | "cell_type": "markdown", |
294 | | - "id": "3182040f", |
| 294 | + "id": "d8ff91d4", |
295 | 295 | "metadata": {}, |
296 | 296 | "source": [ |
297 | 297 | "Our RMSE for predictions on all penguin samples is far larger than before, so let's visually inspect the situation:" |
|
300 | 300 | { |
301 | 301 | "cell_type": "code", |
302 | 302 | "execution_count": null, |
303 | | - "id": "eb1dc01d", |
| 303 | + "id": "e52a9e7b", |
304 | 304 | "metadata": {}, |
305 | 305 | "outputs": [], |
306 | 306 | "source": [ |
|
316 | 316 | }, |
317 | 317 | { |
318 | 318 | "cell_type": "markdown", |
319 | | - "id": "acc4b15c", |
| 319 | + "id": "cdf58529", |
320 | 320 | "metadata": {}, |
321 | 321 | "source": [ |
322 | 322 | "Oh dear. It looks like our linear regression fits okay for our subset of the penguin data, and a few additional samples, but there appears to be a cluster of points that are poorly predicted by our model. Even if we re-trained our model using all samples it looks unlikely that our model would perform much better due to the two-cluster nature of our dataset.\n", |
|
344 | 344 | { |
345 | 345 | "cell_type": "code", |
346 | 346 | "execution_count": null, |
347 | | - "id": "303f77b2", |
| 347 | + "id": "edd5be58", |
348 | 348 | "metadata": {}, |
349 | 349 | "outputs": [], |
350 | 350 | "source": [ |
|
361 | 361 | }, |
362 | 362 | { |
363 | 363 | "cell_type": "markdown", |
364 | | - "id": "6ff788fe", |
| 364 | + "id": "b3822c3b", |
365 | 365 | "metadata": {}, |
366 | 366 | "source": [ |
367 | 367 | "### Exercise: Try to re-implement our univariate regression model using these new train/test sets.\n", |
|
377 | 377 | { |
378 | 378 | "cell_type": "code", |
379 | 379 | "execution_count": null, |
380 | | - "id": "05b2da13", |
| 380 | + "id": "94b5a534", |
381 | 381 | "metadata": {}, |
382 | 382 | "outputs": [], |
383 | 383 | "source": [ |
|
412 | 412 | }, |
413 | 413 | { |
414 | 414 | "cell_type": "markdown", |
415 | | - "id": "8b324b9d", |
| 415 | + "id": "c3245252", |
416 | 416 | "metadata": {}, |
417 | 417 | "source": [ |
418 | 418 | "**Quick follow-up**: Interpret the results of your model. Is it accurate? What does it say about the relationship between body mass and bill depth? Is this a \"good\" model?\n", |
|
436 | 436 | { |
437 | 437 | "cell_type": "code", |
438 | 438 | "execution_count": null, |
439 | | - "id": "af68b978", |
| 439 | + "id": "8ee2f060", |
440 | 440 | "metadata": {}, |
441 | 441 | "outputs": [], |
442 | 442 | "source": [ |
|
450 | 450 | }, |
451 | 451 | { |
452 | 452 | "cell_type": "markdown", |
453 | | - "id": "efe5a33a", |
| 453 | + "id": "eb3cdc4e", |
454 | 454 | "metadata": {}, |
455 | 455 | "source": [ |
456 | 456 | "::::::::::::::::::::::::::::::::::::: callout\n", |
|
467 | 467 | { |
468 | 468 | "cell_type": "code", |
469 | 469 | "execution_count": null, |
470 | | - "id": "5fea25bb", |
| 470 | + "id": "d6c84c57", |
471 | 471 | "metadata": {}, |
472 | 472 | "outputs": [], |
473 | 473 | "source": [ |
|
478 | 478 | }, |
479 | 479 | { |
480 | 480 | "cell_type": "markdown", |
481 | | - "id": "2f970876", |
| 481 | + "id": "e5de71e8", |
482 | 482 | "metadata": {}, |
483 | 483 | "source": [ |
484 | 484 | "We can now make predictions on train/test sets, and calculate RMSE" |
|
487 | 487 | { |
488 | 488 | "cell_type": "code", |
489 | 489 | "execution_count": null, |
490 | | - "id": "3cdfe16c", |
| 490 | + "id": "97ba38c8", |
491 | 491 | "metadata": {}, |
492 | 492 | "outputs": [], |
493 | 493 | "source": [ |
|
504 | 504 | }, |
505 | 505 | { |
506 | 506 | "cell_type": "markdown", |
507 | | - "id": "9c5d6b58", |
| 507 | + "id": "0fb8087e", |
508 | 508 | "metadata": {}, |
509 | 509 | "source": [ |
510 | 510 | "Finally, let's visualise our model fit on our training data and full dataset." |
|
513 | 513 | { |
514 | 514 | "cell_type": "code", |
515 | 515 | "execution_count": null, |
516 | | - "id": "dd01d4da", |
| 516 | + "id": "08fdd0c4", |
517 | 517 | "metadata": {}, |
518 | 518 | "outputs": [], |
519 | 519 | "source": [ |
|
535 | 535 | }, |
536 | 536 | { |
537 | 537 | "cell_type": "markdown", |
538 | | - "id": "63bfcdb8", |
| 538 | + "id": "fb14059c", |
539 | 539 | "metadata": {}, |
540 | 540 | "source": [ |
541 | 541 | "::::::::::::::::::::::::::::::::::::: challenge\n", |
|
560 | 560 | { |
561 | 561 | "cell_type": "code", |
562 | 562 | "execution_count": null, |
563 | | - "id": "30f94be2", |
| 563 | + "id": "c9a408e5", |
564 | 564 | "metadata": {}, |
565 | 565 | "outputs": [], |
566 | 566 | "source": [ |
|
571 | 571 | }, |
572 | 572 | { |
573 | 573 | "cell_type": "markdown", |
574 | | - "id": "468d965a", |
| 574 | + "id": "36844336", |
575 | 575 | "metadata": {}, |
576 | 576 | "source": [ |
577 | 577 | "Let's try a model that includes penguin species as a predictor." |
|
580 | 580 | { |
581 | 581 | "cell_type": "code", |
582 | 582 | "execution_count": null, |
583 | | - "id": "10e2f20d", |
| 583 | + "id": "da7036d4", |
584 | 584 | "metadata": {}, |
585 | 585 | "outputs": [], |
586 | 586 | "source": [ |
|
605 | 605 | }, |
606 | 606 | { |
607 | 607 | "cell_type": "markdown", |
608 | | - "id": "86679459", |
| 608 | + "id": "af9afb9a", |
609 | 609 | "metadata": {}, |
610 | 610 | "source": [ |
611 | 611 | "Since the species column is coded as a string, we need to convert it into a numerical format before we can use it in a machine learning model. To do this, we apply dummy coding (also called one-hot encoding), which creates new binary columns for each species category (e.g., species_Adelie, species_Chinstrap, species_Gentoo). Each row gets a 1 in the column that matches its species and 0 in the others.\n", |
|
616 | 616 | { |
617 | 617 | "cell_type": "code", |
618 | 618 | "execution_count": null, |
619 | | - "id": "969dd4a8", |
| 619 | + "id": "e0c9f919", |
620 | 620 | "metadata": {}, |
621 | 621 | "outputs": [], |
622 | 622 | "source": [ |
|
626 | 626 | }, |
627 | 627 | { |
628 | 628 | "cell_type": "markdown", |
629 | | - "id": "db3fe19a", |
| 629 | + "id": "e4c157cc", |
630 | 630 | "metadata": {}, |
631 | 631 | "source": [ |
632 | 632 | "We can then train/fit and evaluate our model as usual." |
|
635 | 635 | { |
636 | 636 | "cell_type": "code", |
637 | 637 | "execution_count": null, |
638 | | - "id": "3684bf88", |
| 638 | + "id": "83c4d36f", |
639 | 639 | "metadata": {}, |
640 | 640 | "outputs": [], |
641 | 641 | "source": [ |
|
659 | 659 | }, |
660 | 660 | { |
661 | 661 | "cell_type": "markdown", |
662 | | - "id": "72701918", |
| 662 | + "id": "6c2d548e", |
663 | 663 | "metadata": {}, |
664 | 664 | "source": [ |
665 | 665 | "{% include links.md %}\n", |
|