
Commit f7d7153

Fix refs (#59)
* authorship
* fixing \ref and \eqref
* fixing sections in ipynb
* fixing pagerefs
1 parent d51c56b commit f7d7153

24 files changed: +2280 −1967 lines changed

Ch02-statlearn-lab.ipynb

Lines changed: 1295 additions & 1291 deletions
Large diffs are not rendered by default.

Ch03-linreg-lab.Rmd

Lines changed: 2 additions & 2 deletions
@@ -343,7 +343,7 @@ As mentioned above, there is an existing function to add a line to a plot --- `a
 
 
 Next we examine some diagnostic plots, several of which were discussed
-in Section~\ref{Ch3:problems.sec}.
+in Section 3.3.3.
 We can find the fitted values and residuals
 of the fit as attributes of the `results` object.
 Various influence measures describing the regression model
@@ -440,7 +440,7 @@ We can access the individual components of `results` by name
 and
 `np.sqrt(results.scale)` gives us the RSE.
 
-Variance inflation factors (section~\ref{Ch3:problems.sec}) are sometimes useful
+Variance inflation factors (section 3.3.3) are sometimes useful
 to assess the effect of collinearity in the model matrix of a regression model.
 We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.
 
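For context on the passage above: a minimal sketch of computing VIFs via a list comprehension, using `statsmodels`' `variance_inflation_factor`. The helper name `vif_table` and the shape of `X` (a model matrix with an intercept in column 0) are assumptions, not the lab's exact code.

```python
# Sketch: VIFs via list comprehension; X is assumed to be a pandas
# model matrix whose first column is the intercept.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF

def vif_table(X):
    # one VIF per non-intercept column of the model matrix
    vals = [VIF(np.asarray(X), i) for i in range(1, X.shape[1])]
    return pd.DataFrame({'vif': vals}, index=X.columns[1:])
```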

Ch03-linreg-lab.ipynb

Lines changed: 2 additions & 2 deletions
@@ -1533,7 +1533,7 @@
 "metadata": {},
 "source": [
 "Next we examine some diagnostic plots, several of which were discussed\n",
-"in Section~\\ref{Ch3:problems.sec}.\n",
+"in Section 3.3.3.\n",
 "We can find the fitted values and residuals\n",
 "of the fit as attributes of the `results` object.\n",
 "Various influence measures describing the regression model\n",
@@ -2142,7 +2142,7 @@
 "and\n",
 "`np.sqrt(results.scale)` gives us the RSE.\n",
 "\n",
-"Variance inflation factors (section~\\ref{Ch3:problems.sec}) are sometimes useful\n",
+"Variance inflation factors (section 3.3.3) are sometimes useful\n",
 "to assess the effect of collinearity in the model matrix of a regression model.\n",
 "We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.\n",
 "\n",

Ch04-classification-lab.Rmd

Lines changed: 12 additions & 12 deletions
@@ -405,7 +405,7 @@ lda.fit(X_train, L_train)
 
 ```
 Here we have used the list comprehensions introduced
-in Section~\ref{Ch3-linreg-lab:multivariate-goodness-of-fit}. Looking at our first line above, we see that the right-hand side is a list
+in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list
 of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
 of length two. While here we loop over a list,
 the list comprehension method works when looping over any iterable object.
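The pattern the hunk refers to can be shown with stand-in data (a sketch, not the lab's verbatim line):

```python
import numpy as np

X_train = [[1.0, 2.0], [3.0, 4.0]]   # stand-ins for the lab's dataframes
X_test = [[5.0, 6.0]]
# The right-hand side is a list of length two because [X_train, X_test]
# is a list of length two; unpacking assigns one element to each name.
X_train, X_test = [np.asarray(M) for M in [X_train, X_test]]
```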
@@ -454,7 +454,7 @@ lda.scalings_
 
 ```
 
-These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (\ref{Ch4:bayes.multi}).
+These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).
 If $-0.64\times `Lag1` - 0.51 \times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
 
 ```{python}
@@ -463,7 +463,7 @@ lda_pred = lda.predict(X_test)
 ```
 
 As we observed in our comparison of classification methods
-(Section~\ref{Ch4:comparison.sec}), the LDA and logistic
+(Section 4.5), the LDA and logistic
 regression predictions are almost identical.
 
 ```{python}
@@ -522,7 +522,7 @@ The LDA classifier above is the first classifier from the
 `sklearn` library. We will use several other objects
 from this library. The objects
 follow a common structure that simplifies tasks such as cross-validation,
-which we will see in Chapter~\ref{Ch5:resample}. Specifically,
+which we will see in Chapter 5. Specifically,
 the methods first create a generic classifier without
 referring to any data. This classifier is then fit
 to data with the `fit()` method and predictions are
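A minimal sketch of the generic `sklearn` pattern described above, with made-up data in place of the lab's variables:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 2)), rng.integers(0, 2, 100)
X_test = rng.normal(size=(10, 2))

lda = LDA()                   # 1. create a generic classifier, no data yet
lda.fit(X_train, y_train)     # 2. fit it to training data with fit()
labels = lda.predict(X_test)  # 3. produce predictions with predict()
```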
@@ -808,7 +808,7 @@ feature_std.std()
 
 ```
 
-Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page~\pageref{Ch4-varformula}.
+Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 183.
 In this case it does not matter, as long as the variables are all on the same scale.
 
 Using the function `train_test_split()` we now split the observations into a test set,
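The $1/n$ versus $1/(n-1)$ point is easy to see directly:

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])
print(np.std(x))  # numpy default ddof=0: the 1/n convention
print(x.std())    # pandas default ddof=1: the 1/(n-1) convention
```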
@@ -875,7 +875,7 @@ This is double the rate that one would obtain from random guessing.
 The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.
 We do not know *a priori* what value to use. It is therefore of interest
 to see how the classifier performs on test data as we vary these
-parameters. This can be achieved with a `for` loop, described in Section~\ref{Ch2-statlearn-lab:for-loops}.
+parameters. This can be achieved with a `for` loop, described in Section 2.3.8.
 Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
 insurance as we vary the number of neighbors from 1 to 5:
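Such a loop might look roughly like this (a hedged sketch; the variable names and the "Yes" label are assumed from the lab's context, not its verbatim code):

```python
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test assumed already defined as in the lab
for K in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    pred = knn.predict(X_test)
    mask = pred == "Yes"                       # predicted purchasers
    accuracy = (y_test[mask] == "Yes").mean()  # hit rate in that group
    print(K, accuracy)
```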

@@ -902,7 +902,7 @@ As a comparison, we can also fit a logistic regression model to the
 data. This can also be done
 with `sklearn`, though by default it fits
 something like the *ridge regression* version
-of logistic regression, which we introduce in Chapter~\ref{Ch6:varselect}. This can
+of logistic regression, which we introduce in Chapter 6. This can
 be modified by appropriately setting the argument `C` below. Its default
 value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
 logistic regression estimator discussed above.
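A sketch of the `C` trick described above (training data assumed from the lab; the solver choice is an assumption):

```python
from sklearn.linear_model import LogisticRegression

# very large C makes the ridge penalty negligible, so the fit
# converges to essentially unregularized logistic regression
logit = LogisticRegression(C=1e10, solver='liblinear')
logit.fit(X_train, y_train)
logit_labels = logit.predict(X_test)
```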
@@ -946,7 +946,7 @@ confusion_table(logit_labels, y_test)
 
 ```
 ## Linear and Poisson Regression on the Bikeshare Data
-Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section~\ref{Ch4:sec:pois}.
+Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6.
 The response `bikers` measures the number of bike rentals per hour
 in Washington, DC in the period 2010--2012.
 
@@ -987,7 +987,7 @@ variables constant, there are on average about 7 more riders in
 February than in January. Similarly there are about 16.5 more riders
 in March than in January.
 
-The results seen in Section~\ref{sec:bikeshare.linear}
+The results seen in Section 4.6.1
 used a slightly different coding of the variables `hr` and `mnth`, as follows:
 
 ```{python}
@@ -1041,7 +1041,7 @@ np.allclose(M_lm.fittedvalues, M2_lm.fittedvalues)
 ```
 
 
-To reproduce the left-hand side of Figure~\ref{Ch4:bikeshare}
+To reproduce the left-hand side of Figure 4.13
 we must first obtain the coefficient estimates associated with
 `mnth`. The coefficients for January through November can be obtained
 directly from the `M2_lm` object. The coefficient for December
@@ -1081,7 +1081,7 @@ ax_month.set_ylabel('Coefficient', fontsize=20);
 
 ```
 
-Reproducing the right-hand plot in Figure~\ref{Ch4:bikeshare} follows a similar process.
+Reproducing the right-hand plot in Figure 4.13 follows a similar process.
 
 ```{python}
 coef_hr = S2[S2.index.str.contains('hr')]['coef']
@@ -1116,7 +1116,7 @@ M_pois = sm.GLM(Y, X2, family=sm.families.Poisson()).fit()
 
 ```
 
-We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure~\ref{Ch4:bikeshare.pois}. We first complete these coefficients as before.
+We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before.
 
 ```{python}
 S_pois = summarize(M_pois)

Ch04-classification-lab.ipynb

Lines changed: 12 additions & 12 deletions
@@ -2007,7 +2007,7 @@
 "metadata": {},
 "source": [
 "Here we have used the list comprehensions introduced\n",
-"in Section~\\ref{Ch3-linreg-lab:multivariate-goodness-of-fit}. Looking at our first line above, we see that the right-hand side is a list\n",
+"in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list\n",
 "of length two. This is because the code `for M in [X_train, X_test]` iterates over a list\n",
 "of length two. While here we loop over a list,\n",
 "the list comprehension method works when looping over any iterable object.\n",
@@ -2173,7 +2173,7 @@
 "id": "f0a4abaf",
 "metadata": {},
 "source": [
-"These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (\\ref{Ch4:bayes.multi}).\n",
+"These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).\n",
 " If $-0.64\\times `Lag1` - 0.51 \\times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline."
 ]
 },
@@ -2200,7 +2200,7 @@
 "metadata": {},
 "source": [
 "As we observed in our comparison of classification methods\n",
-" (Section~\\ref{Ch4:comparison.sec}), the LDA and logistic\n",
+" (Section 4.5), the LDA and logistic\n",
 "regression predictions are almost identical."
 ]
 },
@@ -2421,7 +2421,7 @@
 "`sklearn` library. We will use several other objects\n",
 "from this library. The objects\n",
 "follow a common structure that simplifies tasks such as cross-validation,\n",
-"which we will see in Chapter~\\ref{Ch5:resample}. Specifically,\n",
+"which we will see in Chapter 5. Specifically,\n",
 "the methods first create a generic classifier without\n",
 "referring to any data. This classifier is then fit\n",
 "to data with the `fit()` method and predictions are\n",
@@ -4349,7 +4349,7 @@
 "id": "c225f2b2",
 "metadata": {},
 "source": [
-"Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page~\\pageref{Ch4-varformula}.\n",
+"Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 183.\n",
 "In this case it does not matter, as long as the variables are all on the same scale.\n",
 "\n",
 "Using the function `train_test_split()` we now split the observations into a test set,\n",
@@ -4570,7 +4570,7 @@
 "The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.\n",
 "We do not know *a priori* what value to use. It is therefore of interest\n",
 "to see how the classifier performs on test data as we vary these\n",
-"parameters. This can be achieved with a `for` loop, described in Section~\\ref{Ch2-statlearn-lab:for-loops}.\n",
+"parameters. This can be achieved with a `for` loop, described in Section 2.3.8.\n",
 "Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase\n",
 "insurance as we vary the number of neighbors from 1 to 5:"
 ]
@@ -4629,7 +4629,7 @@
 "data. This can also be done\n",
 "with `sklearn`, though by default it fits\n",
 "something like the *ridge regression* version\n",
-"of logistic regression, which we introduce in Chapter~\\ref{Ch6:varselect}. This can\n",
+"of logistic regression, which we introduce in Chapter 6. This can\n",
 "be modified by appropriately setting the argument `C` below. Its default\n",
 "value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)\n",
 "logistic regression estimator discussed above.\n",
@@ -4849,7 +4849,7 @@
 "metadata": {},
 "source": [
 "## Linear and Poisson Regression on the Bikeshare Data\n",
-"Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section~\\ref{Ch4:sec:pois}.\n",
+"Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6.\n",
 "The response `bikers` measures the number of bike rentals per hour\n",
 "in Washington, DC in the period 2010--2012."
 ]
@@ -5322,7 +5322,7 @@
 "February than in January. Similarly there are about 16.5 more riders\n",
 "in March than in January.\n",
 "\n",
-"The results seen in Section~\\ref{sec:bikeshare.linear}\n",
+"The results seen in Section 4.6.1\n",
 "used a slightly different coding of the variables `hr` and `mnth`, as follows:"
 ]
 },
@@ -5834,7 +5834,7 @@
 "id": "41fb2787",
 "metadata": {},
 "source": [
-"To reproduce the left-hand side of Figure~\\ref{Ch4:bikeshare}\n",
+"To reproduce the left-hand side of Figure 4.13\n",
 "we must first obtain the coefficient estimates associated with\n",
 "`mnth`. The coefficients for January through November can be obtained\n",
 "directly from the `M2_lm` object. The coefficient for December\n",
@@ -5988,7 +5988,7 @@
 "id": "6c68761a",
 "metadata": {},
 "source": [
-"Reproducing the right-hand plot in Figure~\\ref{Ch4:bikeshare} follows a similar process."
+"Reproducing the right-hand plot in Figure 4.13 follows a similar process."
 ]
 },
 {
@@ -6088,7 +6088,7 @@
 "id": "8552fb8b",
 "metadata": {},
 "source": [
-"We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure~\\ref{Ch4:bikeshare.pois}. We first complete these coefficients as before."
+"We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before."
 ]
 },
 {

Ch05-resample-lab.Rmd

Lines changed: 12 additions & 12 deletions
@@ -237,7 +237,7 @@ for i, d in enumerate(range(1,6)):
 cv_error
 
 ```
-As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
+As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
 quadratic fits, but then no clear improvement from using higher-degree polynomials.
 
 Above we introduced the `outer()` method of the `np.power()`
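The `outer()` method mentioned here is a `numpy` ufunc method; a self-contained sketch:

```python
import numpy as np

# every element of the first array raised to every power in the second
H = np.power.outer(np.array([2.0, 3.0]), np.arange(3))
# H == [[1., 2., 4.],
#       [1., 3., 9.]]
```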
@@ -278,7 +278,7 @@ cv_error
 Notice that the computation time is much shorter than that of LOOCV.
 (In principle, the computation time for LOOCV for a least squares
 linear model should be faster than for $k$-fold CV, due to the
-availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
+availability of the formula~(5.2) for LOOCV;
 however, the generic `cross_validate()` function does not make
 use of this formula.) We still see little evidence that using cubic
 or higher-degree polynomial terms leads to a lower test error than simply
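For reference, the LOOCV shortcut for least squares that formula (5.2) refers to is, in the book's notation,

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1-h_i}\right)^2,$$

where $\hat{y}_i$ is the $i$th fitted value from the full least squares fit and $h_i$ is the leverage of observation $i$.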
@@ -325,7 +325,7 @@ incurred by picking different random folds.
 
 ## The Bootstrap
 We illustrate the use of the bootstrap in the simple example
-{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
+{of Section 5.2,} as well as on an example involving
 estimating the accuracy of the linear regression model on the `Auto`
 data set.
 ### Estimating the Accuracy of a Statistic of Interest
@@ -340,8 +340,8 @@ in a dataframe.
 To illustrate the bootstrap, we
 start with a simple example.
 The `Portfolio` data set in the `ISLP` package is described
-in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
-sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
+in Section 5.2. The goal is to estimate the
+sampling variance of the parameter $\alpha$ given in formula~(5.7). We will
 create a function
 `alpha_func()`, which takes as input a dataframe `D` assumed
 to have columns `X` and `Y`, as well as a
@@ -360,7 +360,7 @@ def alpha_func(D, idx):
 ```
 This function returns an estimate for $\alpha$
 based on applying the minimum
-variance formula (\ref{Ch5:min.var}) to the observations indexed by
+variance formula (5.7) to the observations indexed by
 the argument `idx`. For instance, the following command
 estimates $\alpha$ using all 100 observations.
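The minimum-variance weight in question has the form $\alpha = (\sigma_Y^2 - \sigma_{XY})/(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY})$; a sketch of an `alpha_func()` of the shape the text describes (the implementation details are assumptions, not the lab's verbatim code):

```python
import numpy as np

def alpha_func(D, idx):
    # estimate alpha from the rows of D indexed by idx;
    # D is assumed to have columns 'X' and 'Y'
    cov_ = np.cov(D[['X', 'Y']].loc[idx], rowvar=False)
    return (cov_[1, 1] - cov_[0, 1]) / (cov_[0, 0] + cov_[1, 1] - 2 * cov_[0, 1])
```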

@@ -430,7 +430,7 @@ intercept and slope terms for the linear regression model that uses
 `horsepower` to predict `mpg` in the `Auto` data set. We
 will compare the estimates obtained using the bootstrap to those
 obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
-${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
+${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
 
 To use our `boot_SE()` function, we must write a function (its
 first argument)
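A generic bootstrap-SE helper of the kind `boot_SE()` names might be sketched as follows (the lab's actual signature and implementation are assumptions here):

```python
import numpy as np

def boot_SE(func, D, n=None, B=1000, seed=0):
    # resample rows of D with replacement B times, apply func,
    # and return the standard deviation of the resulting statistic
    rng = np.random.default_rng(seed)
    n = n or D.shape[0]
    first_, second_ = 0, 0
    for _ in range(B):
        idx = rng.choice(D.index, n, replace=True)
        value = func(D, idx)
        first_ += value
        second_ += value**2
    return np.sqrt(second_ / B - (first_ / B)**2)
```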
@@ -499,7 +499,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
 0.85, and that the bootstrap
 estimate for ${\rm SE}(\hat{\beta}_1)$ is
 0.0074. As discussed in
-Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
+Section 3.1.2, standard formulas can be used to compute
 the standard errors for the regression coefficients in a linear
 model. These can be obtained using the `summarize()` function
 from `ISLP.sm`.
@@ -513,21 +513,21 @@ model_se
 
 
 The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
-obtained using the formulas from Section~\ref{Ch3:secoefsec} are
+obtained using the formulas from Section 3.1.2 are
 0.717 for the
 intercept and
 0.006 for the
 slope. Interestingly, these are somewhat different from the estimates
 obtained using the bootstrap. Does this indicate a problem with the
 bootstrap? In fact, it suggests the opposite. Recall that the
 standard formulas given in
-{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
+{Equation 3.8 on page 75}
 rely on certain assumptions. For example,
 they depend on the unknown parameter $\sigma^2$, the noise
 variance. We then estimate $\sigma^2$ using the RSS. Now although the
 formulas for the standard errors do not rely on the linear model being
 correct, the estimate for $\sigma^2$ does. We see
-{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
+{in Figure 3.8 on page 99} that there is
 a non-linear relationship in the data, and so the residuals from a
 linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
 the standard formulas assume (somewhat unrealistically) that the $x_i$
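For reference, the standard-error formulas that Equation 3.8 refers to are, in the book's notation,

$${\rm SE}(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right], \qquad {\rm SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$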
@@ -540,7 +540,7 @@ the results from `sm.OLS`.
 Below we compute the bootstrap standard error estimates and the
 standard linear regression estimates that result from fitting the
 quadratic model to the data. Since this model provides a good fit to
-the data (Figure~\ref{Ch3:polyplot}), there is now a better
+the data (Figure 3.8), there is now a better
 correspondence between the bootstrap estimates and the standard
 estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
 ${\rm SE}(\hat{\beta}_2)$.
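A hedged sketch of that quadratic-model bootstrap, reusing the hypothetical `boot_SE()` above (`Auto` is assumed loaded as a dataframe with `mpg` and `horsepower` columns; this is not the lab's verbatim code):

```python
import numpy as np
import statsmodels.api as sm

def quad_func(D, idx):
    # refit mpg ~ horsepower + horsepower^2 on the bootstrap sample
    d = D.loc[idx]
    X = sm.add_constant(np.column_stack([d['horsepower'], d['horsepower']**2]))
    return sm.OLS(np.asarray(d['mpg']), X).fit().params

quad_se = boot_SE(quad_func, Auto, B=1000)
```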
