
Commit ed5cd11

2 parents 0b78d29 + 705b84f commit ed5cd11

9 files changed

Lines changed: 205 additions & 212 deletions

notebooks/01-introduction.ipynb

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "b29bfe77",
+"id": "d79f6321",
 "metadata": {},
 "source": [
 "---\n",
@@ -111,7 +111,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "b41e957e",
+"id": "bc069c6f",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -121,7 +121,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "f6bd1eca",
+"id": "2a2e5875",
 "metadata": {},
 "source": [
 "### Representation of Data in Scikit-learn\n",

notebooks/02-regression.ipynb

Lines changed: 40 additions & 40 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "e53f64ca",
+"id": "a29779d5",
 "metadata": {},
 "source": [
 "---\n",
@@ -60,7 +60,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "ac76ffb1",
+"id": "b4b5673e",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -74,7 +74,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "2d62170e",
+"id": "b5b55fea",
 "metadata": {},
 "source": [
 "We can see that we have seven columns in total: 4 continuous (numerical) columns named `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`; and 3 discrete (categorical) columns named `species`, `island`, and `sex`. We can also see from a quick inspection of the first 5 samples that we have some missing data in the form of `NaN` values. Missing data is a fairly common occurrence in real-life data, so let's go ahead and remove any rows that contain `NaN` values:"
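The row-dropping step described in the cell above can be sketched without any library. This is a minimal illustration under stated assumptions, not the notebook's own code (which is not visible in this diff): the `is_missing` and `drop_missing` helpers are hypothetical, and the sample values are made up.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(rows):
    """Keep only the rows in which no value is missing (hypothetical helper)."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Made-up rows that reuse column names from the lesson text.
rows = [
    {"species": "Adelie", "bill_depth_mm": 18.7, "body_mass_g": 3750.0},
    {"species": "Adelie", "bill_depth_mm": float("nan"), "body_mass_g": float("nan")},
    {"species": "Gentoo", "bill_depth_mm": 13.2, "body_mass_g": None},
]

cleaned = drop_missing(rows)
print(len(cleaned))
```

Only the first row survives, because each of the other two contains at least one missing value.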
@@ -83,7 +83,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "e4fad2f2",
+"id": "869fd9b9",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -93,7 +93,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "516f6450",
+"id": "497591f9",
 "metadata": {},
 "source": [
 "In this scenario we will train a linear regression model using `body_mass_g` as our feature data and `bill_depth_mm` as our label data. We will train our model on a subset of the data by slicing the first 146 samples of our cleaned data. \n",
@@ -104,7 +104,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "b8e72ebc",
+"id": "b2651838",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -123,7 +123,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "2f319a23",
+"id": "ab6258f7",
 "metadata": {},
 "source": [
 "In this regression example we will create a Linear Regression model that will try to predict `y` values based upon `x` values.\n",
@@ -148,7 +148,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "2c68391f",
+"id": "2e440925",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -161,7 +161,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "e6d87d33",
+"id": "418b3305",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -176,7 +176,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "0e467680",
+"id": "eb3c0e63",
 "metadata": {},
 "source": [
 "Next we’ll define a model, and train it on the pre-processed data. We’ll also inspect the trained model parameters m and c:"
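For a single feature, the parameters m (slope) and c (intercept) mentioned above have a closed-form ordinary least squares solution. The sketch below is an assumption-laden illustration rather than the notebook's code: `fit_line` is a hypothetical helper, and the data are synthetic points chosen to lie exactly on y = 2x + 1.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = m * x + c for one feature."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Synthetic data lying exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(x, y)
print(m, c)
```

Because the synthetic points are noise-free, the fit recovers the slope and intercept used to generate them.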
@@ -185,7 +185,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "0d82bbd9",
+"id": "291e8505",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -205,7 +205,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "56bf0227",
+"id": "bbb96a1e",
 "metadata": {},
 "source": [
 "Now we can make predictions using our trained model, and calculate the Root Mean Squared Error (RMSE) of our predictions:"
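RMSE is the square root of the mean of the squared differences between true and predicted values. The following is a minimal sketch of that definition, assuming plain Python lists; the `rmse` helper and the numbers are illustrative, and the notebook may compute the metric differently.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: the errors are 1, 0 and -2, so the mean squared error is 5/3.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))
```

RMSE is expressed in the same units as the label, so for this lesson it would read as an error in millimetres of bill depth.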
@@ -214,7 +214,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "c3b9346a",
+"id": "10412803",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -232,7 +232,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "e931b1b1",
+"id": "80063cb0",
 "metadata": {},
 "source": [
 "Finally, we’ll plot our input data, our linear fit, and our predictions:"
@@ -241,7 +241,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "2c2449db",
+"id": "19c3008b",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -256,7 +256,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "17cdb92e",
+"id": "c30792e3",
 "metadata": {},
 "source": [
 "Congratulations! We've now created our first machine-learning model of the lesson and we can now make predictions of `bill_depth_mm` for any `body_mass_g` values that we pass into our model.\n",
@@ -267,7 +267,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "70ea12e1",
+"id": "83916985",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -291,7 +291,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "3182040f",
+"id": "d8ff91d4",
 "metadata": {},
 "source": [
 "Our RMSE for predictions on all penguin samples is far larger than before, so let's visually inspect the situation:"
@@ -300,7 +300,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "eb1dc01d",
+"id": "e52a9e7b",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -316,7 +316,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "acc4b15c",
+"id": "cdf58529",
 "metadata": {},
 "source": [
 "Oh dear. It looks like our linear regression fits okay for our subset of the penguin data, and a few additional samples, but there appears to be a cluster of points that are poorly predicted by our model. Even if we re-trained our model using all samples it looks unlikely that our model would perform much better due to the two-cluster nature of our dataset.\n",
@@ -344,7 +344,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "303f77b2",
+"id": "edd5be58",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -361,7 +361,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "6ff788fe",
+"id": "b3822c3b",
 "metadata": {},
 "source": [
 "### Exercise: Try to re-implement our univariate regression model using these new train/test sets.\n",
@@ -377,7 +377,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "05b2da13",
+"id": "94b5a534",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -412,7 +412,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "8b324b9d",
+"id": "c3245252",
 "metadata": {},
 "source": [
 "**Quick follow-up**: Interpret the results of your model. Is it accurate? What does it say about the relationship between body mass and bill depth? Is this a \"good\" model?\n",
@@ -436,7 +436,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "af68b978",
+"id": "8ee2f060",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -450,7 +450,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "efe5a33a",
+"id": "eb3cdc4e",
 "metadata": {},
 "source": [
 "::::::::::::::::::::::::::::::::::::: callout\n",
@@ -467,7 +467,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "5fea25bb",
+"id": "d6c84c57",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -478,7 +478,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "2f970876",
+"id": "e5de71e8",
 "metadata": {},
 "source": [
 "We can now make predictions on train/test sets, and calculate RMSE"
@@ -487,7 +487,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "3cdfe16c",
+"id": "97ba38c8",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -504,7 +504,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "9c5d6b58",
+"id": "0fb8087e",
 "metadata": {},
 "source": [
 "Finally, let's visualise our model fit on our training data and full dataset."
@@ -513,7 +513,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "dd01d4da",
+"id": "08fdd0c4",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -535,7 +535,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "63bfcdb8",
+"id": "fb14059c",
 "metadata": {},
 "source": [
 "::::::::::::::::::::::::::::::::::::: challenge\n",
@@ -560,7 +560,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "30f94be2",
+"id": "c9a408e5",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -571,7 +571,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "468d965a",
+"id": "36844336",
 "metadata": {},
 "source": [
 "Let's try a model that includes penguin species as a predictor."
@@ -580,7 +580,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "10e2f20d",
+"id": "da7036d4",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -605,7 +605,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "86679459",
+"id": "af9afb9a",
 "metadata": {},
 "source": [
 "Since the species column is coded as a string, we need to convert it into a numerical format before we can use it in a machine learning model. To do this, we apply dummy coding (also called one-hot encoding), which creates new binary columns for each species category (e.g., species_Adelie, species_Chinstrap, species_Gentoo). Each row gets a 1 in the column that matches its species and 0 in the others.\n",
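The encoding described above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's implementation: the `one_hot` helper is hypothetical, and it keeps one column per category, as the cell text describes, rather than dropping a reference category.

```python
def one_hot(values, prefix):
    """Map each string to a dict of binary indicator columns (hypothetical helper)."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Made-up species column using the category names from the lesson text.
species = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]
encoded = one_hot(species, "species")
print(encoded[0])
```

Each row holds exactly one 1, in the column whose name matches that row's species.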
@@ -616,7 +616,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "969dd4a8",
+"id": "e0c9f919",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -626,7 +626,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "db3fe19a",
+"id": "e4c157cc",
 "metadata": {},
 "source": [
 "We can than train/fit and evaluate our model as usual."
@@ -635,7 +635,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "3684bf88",
+"id": "83c4d36f",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -659,7 +659,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "72701918",
+"id": "6c2d548e",
 "metadata": {},
 "source": [
 "{% include links.md %}\n",

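The lesson text quoted in this diff moves from training on the first 146 rows to using new train/test sets, after the sliced subset turned out to miss a cluster of points. One way to avoid that kind of ordering bias is to shuffle before splitting. The sketch below assumes a simple list of samples; `shuffled_split`, `test_fraction`, and `seed` are hypothetical names, and the notebook's own splitting code is not visible in this diff.

```python
import random


def shuffled_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of samples and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up sample indices stand in for rows of the cleaned dataset.
samples = list(range(10))
train, test = shuffled_split(samples, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Taking the first N rows of data that is sorted by some attribute would put only part of its range into the training set; shuffling first gives both sets a chance of covering all clusters.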