% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
%
\documentclass[
]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\hypersetup{
pdftitle={R Module 2},
pdfauthor={Connor Gibbs},
hidelinks,
pdfcreator={LaTeX via pandoc}}
\urlstyle{same} % disable monospaced font for URLs
\usepackage[margin=1in]{geometry}
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.77,0.63,0.00}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\usepackage{longtable,booktabs}
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
\usepackage{booktabs}
\definecolor{output}{HTML}{fffbcf}
% add a background color to the verbatim environment
\let\oldv\verbatim
\let\oldendv\endverbatim
\def\verbatim{\par\setbox0\vbox\bgroup\oldv}
\def\endverbatim{\oldendv\egroup\fboxsep0pt \noindent\colorbox{output}{\usebox0}}
% png images should be 72x72 pixels
\usepackage{xcolor}
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=blue!50!red,
urlcolor=red!70!black
}
% define colors:
\definecolor{bonus}{HTML}{81c9a8}
\definecolor{reflect}{HTML}{ffdb80}
\definecolor{assessment}{HTML}{93b6ed}
\definecolor{progress}{HTML}{bba3cc}
\definecolor{video}{HTML}{d98780}
\definecolor{caution}{HTML}{ff6700}
\definecolor{feedback}{HTML}{cccccc}
% template block for all environments
\newenvironment{specialblock}[3]
{
\begin{center}
\begin{tabular}
{|>{\columncolor{#1}}p{0.9\textwidth}|}\hline\\
\includegraphics[scale=0.1]{src/images/#2}
\textbf{#3}
}
{\\\\\hline
\end{tabular}
\end{center}
}
% styling for all special blocks
\newenvironment{bonus}{
\specialblock{bonus}{sun-fill.png}{Bonus}
}{\endspecialblock}
\newenvironment{reflect}{
\specialblock{reflect}{lightbulb-fill.png}{Reflect}
}{\endspecialblock}
\newenvironment{assessment}{
\specialblock{assessment}{pencil-fill.png}{Assessment}
}{\endspecialblock}
\newenvironment{progress}{
\specialblock{progress}{pulse-line.png}{Progress Check}
}{\endspecialblock}
\newenvironment{video}{
\specialblock{video}{vidicon-fill.png}{Video}
}{\endspecialblock}
\newenvironment{caution}{
\specialblock{caution}{alarm-warning-fill.png}{Caution}
}{\endspecialblock}
\newenvironment{feedback}{
\specialblock{feedback}{chat-1-fill.png}{Feedback}
}{\endspecialblock}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage[]{natbib}
\bibliographystyle{apalike}
\title{R Module 2}
\author{Connor Gibbs\footnote{Department of Statistics, Colorado State University, \href{mailto:connor.gibbs@colostate.edu}{\nolinkurl{connor.gibbs@colostate.edu}}}}
\date{12 Oct, 2020, 12:57 PM}
\begin{document}
\maketitle
{
\setcounter{tocdepth}{2}
\tableofcontents
}
\hypertarget{welcome}{%
\section{Welcome!}\label{welcome}}
Hi, and welcome to the R Module 2 (AKA STAT 158) course at Colorado State University!
This course is the second of three 1-credit courses intended to introduce the R programming language, specifically the Tidyverse.
Through these Modules (courses), we'll explore how R can be used to do the following:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Access data via files or web application programming interfaces (APIs)
\item
Scrape data from web
\item
Wrangle and clean complicated data structures
\item
Create graphics with an eye for quality and aesthetics
\item
Understand data using basic modeling
\end{enumerate}
In addition, you'll also be exposed to broader concepts, including:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Data organization and storage
\item
Hypertext Markup Language (HTML)
\item
Tidyverse principles
\end{enumerate}
More detail will be provided in the Course Topics laid out in the next chapter.
\hypertarget{how-to-navigate-this-book}{%
\subsubsection{How To Navigate This Book}\label{how-to-navigate-this-book}}
To move quickly to different portions of the book, click on the appropriate chapter or section in the table of contents on the left.
The buttons at the top of the page allow you to show/hide the table of contents, search the book, change font settings, download a pdf or ebook copy of this book, or get hints on various sections of the book.
The faint left and right arrows at the sides of each page (or bottom of the page if it's narrow enough) allow you to step to the next/previous section.
Here's what they look like:
\begin{figure}
{\centering \includegraphics[width=0.93in]{src/images/left_arrow} \includegraphics[width=0.74in]{src/images/right_arrow}
}
\caption{Left and right navigation arrows}\label{fig:unnamed-chunk-1}
\end{figure}
\hypertarget{associated-csu-course}{%
\subsection{Associated CSU Course}\label{associated-csu-course}}
This bookdown book is intended to accompany the associated course at Colorado State University, but the curriculum is free for anyone to access and use.
If you're reading the PDF or EPUB version of this book, you can find the ``live'' version at \url{https://csu-r.github.io/Module2/}, and all of the source files for this book can be found at \url{https://github.com/CSU-R/Module2}.
If you're not taking the CSU course, you will periodically encounter instructions and references which are not relevant to you. For example, we will make reference to the Canvas website, which only CSU students enrolled in the course have access to.
\hypertarget{AccessingData}{%
\section{Accessing Data}\label{AccessingData}}
\begin{quote}
``Data is the new oil.'' ---Clive Humby, Chief Data Scientist, Starcount
\end{quote}
In this chapter, we'll cover how to access data given in various forms and provided from various sources.
\hypertarget{rectangular-vs.-non-rectangular-data}{%
\subsection{Rectangular vs.~Non-rectangular Data}\label{rectangular-vs.-non-rectangular-data}}
Data present themselves in many forms, but at a basic level, all data can be categorized into two structures: \textbf{rectangular data} and \textbf{non-rectangular data}. Intuitively, rectangular data are shaped like a rectangle, where every value corresponds to some row and column. Non-rectangular data, on the other hand, are not so neatly arranged in rows and columns. Instead, they are often a collection of separate data structures, with some similarity among members of the same structure.
To motivate this idea, let's consider a basic grocery list which consists of ten items: black beans, milk, pasta, cheese, bananas, peanut butter, bread, apples, tomato sauce, and mayonnaise. Notice, there is little organization to this list, and more involved shoppers may find this list inadequate or unhelpful. We may wish to group these items by sections in which we're likely to find them. We may also want to include prices, so we know in-store whether the items are on sale. Let's consider two distinct (but legitimate) ways to organize these data.
\begin{caution}
To illustrate the idea of rectangular vs.~non-rectangular data, we will
consider how these data can be structured in both ways using \texttt{R}.
You may not have seen some of these functions yet. No worries! The
objective is not to understand \textbf{how} to utilize these functions
but to comprehend the difference between rectangular and non-rectangular
data.
\end{caution}
One may first consider grouping these items by section. For example, apples and bananas can be found in the produce section, whereas black beans and tomato sauce can be found in the canned goods. If we were to continue to group these items by section, we may arrive at a data set which looks something like this:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{groc <-}\StringTok{ }\KeywordTok{list}\NormalTok{(}\DataTypeTok{produce =} \KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'apples'}\NormalTok{, }\StringTok{'bananas'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{3.99}\NormalTok{, }\FloatTok{0.49}\NormalTok{)),}
\DataTypeTok{condiments =} \KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'peanut_butter'}\NormalTok{, }\StringTok{'mayonnaise'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{2.18}\NormalTok{, }\FloatTok{3.89}\NormalTok{)),}
\DataTypeTok{canned_goods =} \KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'black_beans'}\NormalTok{, }\StringTok{'tomato_sauce'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{0.99}\NormalTok{, }\FloatTok{0.69}\NormalTok{)),}
\DataTypeTok{grains =} \KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'bread'}\NormalTok{, }\StringTok{'pasta'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{2.99}\NormalTok{, }\FloatTok{1.99}\NormalTok{)),}
\DataTypeTok{dairy =} \KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'milk'}\NormalTok{, }\StringTok{'butter'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{2.73}\NormalTok{, }\FloatTok{2.57}\NormalTok{)))}
\NormalTok{groc}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
$produce
item price
1 apples 3.99
2 bananas 0.49
$condiments
item price
1 peanut_butter 2.18
2 mayonnaise 3.89
$canned_goods
item price
1 black_beans 0.99
2 tomato_sauce 0.69
$grains
item price
1 bread 2.99
2 pasta 1.99
$dairy
item price
1 milk 2.73
2 butter 2.57
\end{verbatim}
Here, we use lists and data frames to create a data set of our grocery list. This list can be traversed depending on the section of the store in which we find ourselves. For example, suppose we are in the produce section and need to recall what items to buy. We could utilize the following code to remind ourselves.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{groc}\OperatorTok{$}\NormalTok{produce}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
item price
1 apples 3.99
2 bananas 0.49
\end{verbatim}
\begin{reflect}
Is this grocery list an example of rectangular or non-rectangular data?
Are there examples of rectangular data contained within the grocery
list? How could we restructure the data to \textbf{rectangularize} the
grocery list?
\end{reflect}
As constructed, this grocery list is an example of non-rectangular data. As a whole, the grocery list is not shaped like a rectangle, but rather, consists of sets of rectangular data, where the sets are defined by the section of the store. Within a section of the store, the items and prices are given in rectangular form since every value is defined by a row and column.
While non-rectangular data are often useful as return objects for user-defined functions, they are often troublesome to work with. If a data set can be restructured or created in rectangular form, it should be. Rectangular data is especially important within the Tidyverse, a self-described `opinionated collection of R packages designed for data science'. All packages within the Tidyverse rely on the principle of \emph{tidy data}, data structures where observations are given by rows and variables are given by columns. As defined, tidy data are rectangular, so as we embark on wrangling, visualizing, and modeling data in future chapters, it is important to ponder the nature of our data and whether it can be rectangularized.
Let's consider how we can rectangularize the grocery list. Instead of creating a list of named data frames, where the name represents the section of the store, let's create a grocery list where each row represents an item and columns specify the section and price. Because the Tidyverse requires rectangular data, there are several functions which are handy for converting data structures to rectangular form. We could utilize one of these functions to rectangularize the data set.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{library}\NormalTok{(tidyverse, }\DataTypeTok{quietly =} \OtherTok{TRUE}\NormalTok{)}
\NormalTok{groc_rec <-}\StringTok{ }\NormalTok{groc }\OperatorTok{%>%}
\StringTok{ }\KeywordTok{bind_rows}\NormalTok{(., }\DataTypeTok{.id =} \StringTok{'section'}\NormalTok{)}
\NormalTok{groc_rec}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
section item price
1 produce apples 3.99
2 produce bananas 0.49
3 condiments peanut_butter 2.18
4 condiments mayonnaise 3.89
5 canned_goods black_beans 0.99
6 canned_goods tomato_sauce 0.69
7 grains bread 2.99
8 grains pasta 1.99
9 dairy milk 2.73
10 dairy butter 2.57
\end{verbatim}
Or, we can simply create the grocery list in rectangular form to begin with.
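For instance, a direct rectangular construction might look like the following sketch (only three of the ten items are shown to keep the example short):
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{groc_rec2 <-}\StringTok{ }\KeywordTok{data.frame}\NormalTok{(}\DataTypeTok{section =} \KeywordTok{c}\NormalTok{(}\StringTok{'produce'}\NormalTok{, }\StringTok{'produce'}\NormalTok{, }\StringTok{'dairy'}\NormalTok{),}
\DataTypeTok{item =} \KeywordTok{c}\NormalTok{(}\StringTok{'apples'}\NormalTok{, }\StringTok{'bananas'}\NormalTok{, }\StringTok{'milk'}\NormalTok{),}
\DataTypeTok{price =} \KeywordTok{c}\NormalTok{(}\FloatTok{3.99}\NormalTok{, }\FloatTok{0.49}\NormalTok{, }\FloatTok{2.73}\NormalTok{))}
\end{Highlighting}
\end{Shaded}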
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{reading-and-writing-rectangular-data}{%
\subsubsection{Reading and Writing Rectangular Data}\label{reading-and-writing-rectangular-data}}
Rectangular data are often stored locally using text files (.txt), comma separated value files (.csv), and Excel files (.xlsx). When data are written to these file types, they are easy to view across devices, without the need for \texttt{R}. Since most grocery store trips obviate the need for \texttt{R}, let's consider how to write our grocery list to each of these file types. To write and read data to and from text files or comma separated value files, the \texttt{readr} package will come in handy, whereas the \texttt{xlsx} package will allow us to write and read to and from Excel files. To write data from \texttt{R} to a file, we will leverage commands beginning with \texttt{write}.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{# text file}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{write_delim}\NormalTok{(groc_rec, }\DataTypeTok{path =} \StringTok{'./data_raw/groceries-rectangular.txt'}\NormalTok{)}
\CommentTok{# csv file}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{write_csv}\NormalTok{(groc_rec, }\DataTypeTok{path =} \StringTok{'./data_raw/groceries-rectangular.csv'}\NormalTok{)}
\CommentTok{# Excel file}
\NormalTok{xlsx}\OperatorTok{::}\KeywordTok{write.xlsx}\NormalTok{(groc_rec, }\DataTypeTok{file =} \StringTok{'./data_raw/groceries-rectangular.xlsx'}\NormalTok{, }\DataTypeTok{row.names =} \OtherTok{FALSE}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
To read data from a file to \texttt{R}, we will leverage commands beginning with \texttt{read}. Before reading data into \texttt{R}, you will need to look at the file and file extension to better understand which function to use.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{# text file}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{read_delim}\NormalTok{(}\StringTok{'./data_raw/groceries-rectangular.txt'}\NormalTok{, }\DataTypeTok{delim =} \StringTok{' '}\NormalTok{)}
\CommentTok{# csv file}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{read_csv}\NormalTok{(}\StringTok{'./data_raw/groceries-rectangular.csv'}\NormalTok{)}
\CommentTok{# Excel file}
\NormalTok{xlsx}\OperatorTok{::}\KeywordTok{read.xlsx}\NormalTok{(}\StringTok{'./data_raw/groceries-rectangular.xlsx'}\NormalTok{, }\DataTypeTok{sheetName =} \StringTok{'Sheet1'}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
# A tibble: 10 x 3
section item price
<chr> <chr> <dbl>
1 produce apples 3.99
2 produce bananas 0.49
3 condiments peanut_butter 2.18
4 condiments mayonnaise 3.89
5 canned_goods black_beans 0.99
6 canned_goods tomato_sauce 0.69
7 grains bread 2.99
8 grains pasta 1.99
9 dairy milk 2.73
10 dairy butter 2.57
# A tibble: 10 x 3
section item price
<chr> <chr> <dbl>
1 produce apples 3.99
2 produce bananas 0.49
3 condiments peanut_butter 2.18
4 condiments mayonnaise 3.89
5 canned_goods black_beans 0.99
6 canned_goods tomato_sauce 0.69
7 grains bread 2.99
8 grains pasta 1.99
9 dairy milk 2.73
10 dairy butter 2.57
section item price
1 produce apples 3.99
2 produce bananas 0.49
3 condiments peanut_butter 2.18
4 condiments mayonnaise 3.89
5 canned_goods black_beans 0.99
6 canned_goods tomato_sauce 0.69
7 grains bread 2.99
8 grains pasta 1.99
9 dairy milk 2.73
10 dairy butter 2.57
\end{verbatim}
\begin{caution}
Reading files into \texttt{R} can sometimes be frustrating. Always look
at the data to see if there are column headers and row names. Text files
can have different \textbf{delimiters}, characters which separate values
in a data set. The default delimiter for \texttt{readr::write\_delim()}
is a space, but other common text delimiters are tabs, colons,
semi-colons, or vertical bars. Commas are so commonly used as a
delimiter that they get a function of their own. Always ensure that data from
an Excel spreadsheet are rectangular. Lastly, the \texttt{readr} package
will guess the data type of each column. Check these data types are
correct using \texttt{str()}.
\end{caution}
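As a quick sketch of that last check, \texttt{str()} displays each column's inferred type after reading:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{# confirm section and item are character and price is numeric}
\NormalTok{groc_chk <-}\StringTok{ }\NormalTok{readr}\OperatorTok{::}\KeywordTok{read_csv}\NormalTok{(}\StringTok{'./data_raw/groceries-rectangular.csv'}\NormalTok{)}
\KeywordTok{str}\NormalTok{(groc_chk)}
\end{Highlighting}
\end{Shaded}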
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{reading-and-writing-non-rectangular-data}{%
\subsubsection{Reading and Writing Non-rectangular Data}\label{reading-and-writing-non-rectangular-data}}
Writing non-rectangular data from R to your local machine is easy with the help of \texttt{write\_rds()} from the \texttt{readr} package. While the origin of `RDS' is unclear, some believe it stands for R data serialization. Nonetheless, RDS files store single R objects, regardless of the structure. This means that RDS files are a great choice for data which cannot be written to rectangular file formats such as text, csv, and Excel files.
The sister function entitled \texttt{read\_rds()} allows you to read any RDS file directly into your current \texttt{R} environment, assuming the file already exists.
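As a minimal sketch (the file path is illustrative), a round trip to and from an RDS file looks like this:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{# write the non-rectangular grocery list to an RDS file}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{write_rds}\NormalTok{(groc, }\StringTok{'./data_raw/groceries.rds'}\NormalTok{)}
\CommentTok{# read it back into the current environment}
\NormalTok{groc2 <-}\StringTok{ }\NormalTok{readr}\OperatorTok{::}\KeywordTok{read_rds}\NormalTok{(}\StringTok{'./data_raw/groceries.rds'}\NormalTok{)}
\KeywordTok{identical}\NormalTok{(groc, groc2)}
\end{Highlighting}
\end{Shaded}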
\begin{bonus}
Similar to RDS files, there are also RData files which can store
multiple \texttt{R} objects. These files can be written from \texttt{R}
to your local machine using \texttt{save()} and read from your local
machine to R using \texttt{load()}. We recommend avoiding RData files,
and instead, storing multiple \texttt{R} objects in one named list which
is then saved as an RDS file.
\end{bonus}
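For example, rather than saving two objects to an RData file, we could bundle them into one named list and save a single RDS file (the object and file names below are illustrative):
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{bundle <-}\StringTok{ }\KeywordTok{list}\NormalTok{(}\DataTypeTok{groceries =}\NormalTok{ groc, }\DataTypeTok{groceries_rec =}\NormalTok{ groc_rec)}
\NormalTok{readr}\OperatorTok{::}\KeywordTok{write_rds}\NormalTok{(bundle, }\StringTok{'./data_raw/bundle.rds'}\NormalTok{)}
\end{Highlighting}
\end{Shaded}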
When you inevitably encounter non-rectangular data which you would like to load into \texttt{R}, you are in for a treat. The rest of this module can loosely be viewed as a guide to managing and curating data. We will leverage many tools to tackle this problem, but in the next two sections, we will address two specific, common instances of non-rectangular data: data from APIs and from scraped sources.
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{APIs}{%
\subsection{APIs: Clean and Curated}\label{APIs}}
An application programming interface (API) is a set of functions and procedures which allows one computer program to interact with another. To simplify the concept remarkably, we will consider web-APIs where there is a server (computer waiting to provide data) and a client (computer making a request for data).
The benefit of APIs is the result: clean and curated data from the host. The pre-processing needed to get the data in a workable form is entirely done on the server side. We, however, are responsible for making the request. Web-APIs often utilize JavaScript Object Notation (JSON), another example of non-rectangular data. We will utilize the \texttt{httr} and the \texttt{jsonlite} packages to retrieve the latest sports lines from Bovada, an online sportsbook.
Before we start, we'll need to download the \texttt{httr} and \texttt{jsonlite} packages and load them into our current environment. Furthermore, we will need to find the address of the server to which we will send the request.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{library}\NormalTok{(httr, }\DataTypeTok{quietly =} \OtherTok{TRUE}\NormalTok{)}
\KeywordTok{library}\NormalTok{(jsonlite, }\DataTypeTok{quietly =} \OtherTok{TRUE}\NormalTok{)}
\NormalTok{bov_nfl_api <-}\StringTok{ "https://www.bovada.lv/services/sports/event/v2/events/A/description/football/nfl"}
\end{Highlighting}
\end{Shaded}
To ask for data through a web-API, we will need to make a \texttt{GET} request with the \texttt{httr} package's \texttt{GET()} function. After making the request, we can read about the server's response.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{bov_req <-}\StringTok{ }\NormalTok{httr}\OperatorTok{::}\KeywordTok{GET}\NormalTok{(}\DataTypeTok{url =}\NormalTok{ bov_nfl_api)}
\NormalTok{bov_req}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
Response [https://www.bovada.lv/services/sports/event/v2/events/A/description/football/nfl]
Date: 2020-10-12 18:58
Status: 200
Content-Type: application/json;charset=utf-8
Size: 960 kB
\end{verbatim}
If the request was successful, then the status of the request will read 200. Otherwise, there was some error with your request. For a list of HTTP status codes and their respective definitions, follow this \href{https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}{link}. Since the response indicates that the content is indeed JSON, we will utilize the \texttt{jsonlite} package to read the JSON-structured data. A handy function here is \texttt{fromJSON()}, which converts a character vector containing data in JSON structure to native structures in \texttt{R} like lists. So, in order, we will
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Extract the content from the server's response
\item
Convert the content to a character vector, maintaining the JSON structure
\item
Restructure the data into native \texttt{R} structures, using \texttt{fromJSON()}.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{content <-}\StringTok{ }\NormalTok{bov_req}\OperatorTok{$}\NormalTok{content}
\NormalTok{content_char <-}\StringTok{ }\KeywordTok{rawToChar}\NormalTok{(content)}
\NormalTok{bov_res <-}\StringTok{ }\NormalTok{jsonlite}\OperatorTok{::}\KeywordTok{fromJSON}\NormalTok{(content_char)}
\end{Highlighting}
\end{Shaded}
Of course, we could also create a function which takes the server's response and converts the content to native \texttt{R} structures. We will want to code in a force stop if the response status is not 200. We will also want to load the \texttt{httr} and \texttt{jsonlite} packages with \texttt{require()}, which issues a warning rather than an error if a user calls the function without having the packages installed.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{convert_JSON <-}\StringTok{ }\ControlFlowTok{function}\NormalTok{(resp)\{}
\CommentTok{# call needed packages}
\KeywordTok{require}\NormalTok{(httr)}
\KeywordTok{require}\NormalTok{(jsonlite)}
\CommentTok{# stop if the server returned an error}
\NormalTok{ httr}\OperatorTok{::}\KeywordTok{stop_for_status}\NormalTok{(resp)}
\CommentTok{# return JSON content in native R structures}
\KeywordTok{return}\NormalTok{(jsonlite}\OperatorTok{::}\KeywordTok{fromJSON}\NormalTok{(}\KeywordTok{rawToChar}\NormalTok{(resp}\OperatorTok{$}\NormalTok{content)))}
\NormalTok{\}}
\end{Highlighting}
\end{Shaded}
Finally, we can get the same output by simply calling the function.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{identical}\NormalTok{(}\KeywordTok{convert_JSON}\NormalTok{(bov_req), bov_res)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
[1] TRUE
\end{verbatim}
\begin{bonus}
Some web-APIs require additional information from us, as outlined in
the documentation for the API. In this case, the user would need to
provide additional query parameters in their GET request. Thankfully,
this functionality is built into the \texttt{httr} package's
\texttt{GET()} function. For more information on how to include query
parameters, type \texttt{??GET} into your \texttt{R} console.
\end{bonus}
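As a sketch of what such a request might look like, consider the following; the endpoint and parameter names here are hypothetical, so consult the API's documentation for the real ones.

```r
library(httr)

# hypothetical endpoint and query parameters -- replace these with the
# ones documented by the web-API you are using
resp <- GET(
  "https://api.example.com/v1/events",
  query = list(sport = "football", limit = 10)
)
```

Behind the scenes, \texttt{GET()} encodes the parameters into the request URL for you, producing a request to \texttt{https://api.example.com/v1/events?sport=football\&limit=10}.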
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{scraping-messy-and-mangled}{%
\subsection{Scraping: Messy and Mangled}\label{scraping-messy-and-mangled}}
If you are reading this textbook, at some point in your career you are likely to want or need data which exists on the web. You have looked for downloadable sources and Google searched for an API, but alas, no luck. The last resort for importing data into \texttt{R} is \textbf{web scraping}. Web scraping is a technique for harvesting data that is displayed on the web and stored in hypertext markup language (HTML), the language of web browser documents.
\hypertarget{scraping-vs-apis}{%
\subsubsection{Scraping vs APIs}\label{scraping-vs-apis}}
The benefit of using an API is clean data. For example, we can traverse the result to find the latest NFL events.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{head}\NormalTok{(bov_res[[}\DecValTok{2}\NormalTok{]][[}\DecValTok{1}\NormalTok{]][,}\DecValTok{2}\NormalTok{])}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
[1] "Los Angeles Chargers @ New Orleans Saints"
[2] "Buffalo Bills @ Tennessee Titans"
[3] "Atlanta Falcons @ Minnesota Vikings"
[4] "Baltimore Ravens @ Philadelphia Eagles"
[5] "Chicago Bears @ Carolina Panthers"
[6] "Cincinnati Bengals @ Indianapolis Colts"
\end{verbatim}
With more digging, we can find which teams are playing at home.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{head}\NormalTok{(bov_res[[}\DecValTok{2}\NormalTok{]][[}\DecValTok{1}\NormalTok{]][[}\DecValTok{16}\NormalTok{]])}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
[[1]]
id name home
1 7759933-11904221 New Orleans Saints TRUE
2 7759933-11904248 Los Angeles Chargers FALSE
[[2]]
id name home
1 7800334-11904229 Tennessee Titans TRUE
2 7800334-11904215 Buffalo Bills FALSE
[[3]]
id name home
1 7783039-11904246 Minnesota Vikings TRUE
2 7783039-11904242 Atlanta Falcons FALSE
[[4]]
id name home
1 7783032-11904222 Philadelphia Eagles TRUE
2 7783032-11903831 Baltimore Ravens FALSE
[[5]]
id name home
1 7783035-11904216 Carolina Panthers TRUE
2 7783035-11903832 Chicago Bears FALSE
[[6]]
id name home
1 7783033-11904232 Indianapolis Colts TRUE
2 7783033-11904217 Cincinnati Bengals FALSE
\end{verbatim}
We can also find the current line of each of these games. Here, I have created a function called \texttt{get\_bovada\_lines()} which traverses this complicated (yet clean) JSON object using methods explored in Chapter 3 and combines the information into a rectangular data set.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{bov_res }\OperatorTok{%>%}
\StringTok{ }\KeywordTok{get_bovada_lines}\NormalTok{() }\OperatorTok{%>%}
\StringTok{ }\KeywordTok{print}\NormalTok{(}\DataTypeTok{n =} \DecValTok{10}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
# A tibble: 24 x 16
id link description startTime live type lastModified
<chr> <chr> <chr> <dttm> <lgl> <chr> <dttm>
1 7759~ /foo~ Los Angele~ 2020-10-12 17:15:00 FALSE GAME~ 2020-10-12 11:57:27
2 7759~ /foo~ Los Angele~ 2020-10-12 17:15:00 FALSE GAME~ 2020-10-12 11:57:27
3 7800~ /foo~ Buffalo Bi~ 2020-10-13 16:00:00 FALSE GAME~ 2020-10-12 11:30:44
4 7800~ /foo~ Buffalo Bi~ 2020-10-13 16:00:00 FALSE GAME~ 2020-10-12 11:30:44
5 7783~ /foo~ Atlanta Fa~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 11:44:26
6 7783~ /foo~ Atlanta Fa~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 11:44:26
7 7783~ /foo~ Baltimore ~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 09:31:47
8 7783~ /foo~ Baltimore ~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 09:31:47
9 7783~ /foo~ Chicago Be~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 11:43:55
10 7783~ /foo~ Chicago Be~ 2020-10-18 10:00:00 FALSE GAME~ 2020-10-12 11:43:55
# ... with 14 more rows, and 9 more variables: team_id <chr>, name <chr>,
# home <lgl>, juice_money <dbl>, handicap_spread <dbl>, juice_spread <dbl>,
# handicap_total <dbl>, juice_over <dbl>, juice_under <dbl>
\end{verbatim}
While traversing these sometimes complicated lists may seem intimidating, working with data from an API will become easier once we discuss mapping functions in Chapter 3, which are useful for traversing complicated lists. Hopefully, after the scraping section, you will find working with APIs a walk in the park compared to scraping data directly from the web.
\hypertarget{LessonsLearnedFromScraping}{%
\subsubsection{Lessons Learned from Scraping}\label{LessonsLearnedFromScraping}}
Scraping is a necessary evil that requires patience. While some tasks may prove easy, you will quickly find others seem insurmountable. In this section, we will outline a few tips to help you become a web scraper.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
\textbf{Brainstorm}! Before jumping into your scraping project, ask yourself \emph{what data do I need} and \emph{where can I find it}? If you discover you need data from various sources, \emph{what is the unique identifier}, the link which ties these data together? Taking the time to explore different websites can save you a vast amount of time in the long run. As a general rule, simplistic-looking websites are easier to scrape and often contain the same information as more complicated websites with several bells and whistles.
\item
\textbf{Start small}! Sometimes a scraping task can feel daunting, but it is important to \emph{view your project as a war, splitting it up into small battles}. If you are interested in the racial demographics of each of the United States, consider how you can first scrape this information for one state. In this process, don't forget tip 1!
\item
\textbf{Hyperlinks are your friend}! They can lead to websites with more detailed information or serve as the unique identifier you need between different data sources. Sometimes you won't even need to scrape the hyperlinks to navigate between webpages; making minor adjustments to the web address will sometimes do.
\item
\textbf{Data is everywhere}! Text color, font, or highlighting may serve as valuable data that you need. If these features exist on the webpage, then they exist within the HTML code which generated the document. Sometimes these features are well hidden or even inaccessible, leading to the final tip.
\item
\textbf{Ready your search engine}! Just like coding in \texttt{R}, web development is an art. If you asked distinct developers to create the same website with the same functionality, the final results might look similar, but the underlying HTML code could be drastically different. Why does this matter? You will run into an issue that hasn't been addressed in this text. Thankfully, if you've run into an issue, someone else probably has too. We cannot recommend websites like \href{https://stackoverflow.com/}{Stack Overflow} enough.
\end{enumerate}
\hypertarget{tools-for-scraping}{%
\subsubsection{Tools for Scraping}\label{tools-for-scraping}}
Before we can scrape information from a webpage, we need a bit of background on how this information is stored and presented. The goal of this subsection is to briefly introduce the language of the web, hypertext markup language (HTML). When we talk about scraping the web, what we really mean is gathering bits of information from the HTML code used to build a webpage. Like \texttt{R} code, HTML can be overwhelming. The goal is not to teach HTML but to introduce its components, so you have a much more intuitive sense of what we are doing when we scrape the web.
\hypertarget{hypertext-markup-language-html}{%
\paragraph{Hypertext Markup Language (HTML)}\label{hypertext-markup-language-html}}
Websites are written in hypertext markup language. All content displayed on a web page is structured through HTML with the help of HTML \textbf{elements}. An HTML element consists of a tag and contents. The \textbf{tag} defines how the web browser should format and display the content. Aptly, the \textbf{content} is what should be displayed.
For example, if we wished to format text as a paragraph within the web document, then we could use the paragraph tag, \texttt{\textless{}p\textgreater{}}, to indicate the beginning of a paragraph. After opening a tag, we then specify the content to display before closing the tag. A complete paragraph may read:
\texttt{\textless{}p\textgreater{}\ This\ is\ the\ paragraph\ you\ want\ to\ scrape.\ \textless{}/p\textgreater{}}
\textbf{Attributes} are optional parameters which provide additional information about the element in which the attribute is included. For example, within the paragraph tag, you can define a class attribute which formats the text in a specific way, such as bolding, coloring, or aligning the text. To extend our example, the element may read:
\texttt{\textless{}p\ class\ =\ "fancy"\textgreater{}\ This\ is\ the\ paragraph\ you\ want\ to\ scrape\ which\ has\ been\ formatted\ in\ a\ fancy\ script.\ \textless{}/p\textgreater{}}
The type of attribute, being class, is the attribute \textbf{name}, whereas the quantity assigned to the attribute, being fancy, is the attribute \textbf{value}. The general decomposition of an HTML element is characterized by the following figure:
\begin{figure}
{\centering \includegraphics[width=12.29in]{src/images/element_decomp}
}
\caption{the lingo of an HTML element}\label{fig:unnamed-chunk-24}
\end{figure}
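To connect this vocabulary to \texttt{R}, here is a small sketch, assuming the \texttt{rvest} package (which we use for scraping later), that parses the example paragraph above and pulls out its tag name, content, and attribute value:

```r
library(rvest)

# build a tiny HTML document containing the example element
doc <- minimal_html('<p class="fancy"> This is the paragraph you want to scrape. </p>')

# isolate the paragraph element (node)
node <- html_node(doc, "p")

html_name(node)          # the tag name: "p"
html_text(node)          # the content of the element
html_attr(node, "class") # the attribute value: "fancy"
```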
\begin{bonus}
The class attribute is a flexible one. Many web developers use the class
attribute to point to a class name in a style sheet or to access and
manipulate elements with the specific class name with a JavaScript. For
more information of the class attribute, see this
\href{https://www.w3schools.com/html/html_classes.asp}{link}. For more
information on cascading style sheets which are used to decorate HTML
pages, see this \href{https://www.w3schools.com/css/}{link}.
\end{bonus}
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{selector-gadgets}{%
\paragraph{Selector Gadgets}\label{selector-gadgets}}
While all web pages are composed of HTML elements, the elements themselves can be structured in complicated ways. Elements are often nested inside one another or make use of elements in other documents. These complicated structures can make scraping data difficult. Thankfully, we can circumvent exploring these complicated structures with the help of selector gadgets.
A \textbf{selector gadget} allows you to determine which CSS selector you need to extract the desired information from a webpage. These JavaScript bookmarklets allow you to determine where the information you desire belongs within the complicated structure of elements that make up a webpage. To follow along in Chapter 3, you will need to download one of these gadgets from this \href{https://selectorgadget.com/}{link}. If you use Google Chrome, you can download the bookmark extension directly from this \href{https://selectorgadget.com/}{link}.
If the selector gadget fails us, we can always view the structure of the elements directly by viewing the page source. This can be done by right-clicking on the webpage and selecting `View Page Source'. For Google Chrome, you can also use the keyboard shortcut `CTRL-U'.
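You can also pull the raw page source into \texttt{R} itself. A quick sketch using base \texttt{R} (any page will do; here we use the selector gadget site mentioned above):

```r
# read a page's raw HTML source, much like 'View Page Source' in the
# browser; readLines() treats the URL as a text connection
src <- readLines("https://selectorgadget.com/", warn = FALSE)

# peek at the first few lines of the document
head(src)
```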
\hypertarget{scraping-nfl-data}{%
\subsubsection{Scraping NFL Data}\label{scraping-nfl-data}}
In Chapter \ref{APIs}, we gathered some betting data pertaining to the NFL through a web-API. We may wish to supplement these betting data with data pertaining to NFL teams, players, or even playing conditions. The goal in this subsection is to introduce you to scraping by heeding the advice given in Chapter \ref{LessonsLearnedFromScraping}. Further examples are given in the supplemental material.
Following our own advice, let's brainstorm. When you think of NFL data, you probably think of \href{https://www.nfl.com/stats}{NFL.com} or \href{https://www.espn.com/nfl/stats}{ESPN}. These sites obviously have reliable data, but the webpages are pretty involved. While the filters, dropdown menus, and graphics lend great experiences for web browsers, they create headaches for web scrapers. After further digging, we will explore \href{https://www.pro-football-reference.com/}{Pro Football Reference}, a reliable archive for football statistics (with a reasonably simple webpage). This is an exhaustive source which boasts team statistics, player statistics, and playing conditions for various seasons. Let's now start small by focusing on team statistics, but further, let's limit our scope to the \href{https://www.pro-football-reference.com/teams/den/2020.htm}{2020 Denver Broncos}. Notice, there are hyperlinks for each \href{https://www.pro-football-reference.com/players/G/GordMe00.htm}{player} documented in any of the categories, as well as hyperlinks for each game's \href{https://www.pro-football-reference.com/boxscores/202009140den.htm}{boxscore} where there is information about playing conditions and outcomes. Hence, we have a common thread between team statistics, players, and boxscores. If, for example, we chose to scrape team statistics from one website and player statistics from another website, we may have to worry about a unique identifier (being team) if the websites have different naming conventions.
\hypertarget{html-tables-team-statistics}{%
\paragraph{HTML Tables: Team Statistics}\label{html-tables-team-statistics}}
We'll start with the team statistics for the 2020 Denver Broncos, which can be found in a table entitled `Team Stats and Rankings'. We'll need to figure out in which element, or \textbf{node}, the table lives within the underlying HTML. To do this, we will utilize the CSS selector gadget. If we hover over and click the table with the selector gadget, we will see that the desired table lives in an element called `\#team\_stats'.
\begin{figure}
{\centering \includegraphics[width=22.22in]{src/images/broncos_selector_gadget}
}
\caption{finding the team statistics element using the selector gadget}\label{fig:unnamed-chunk-27}
\end{figure}
Alternatively, we could view the page source and search for the table name. I've highlighted the information identified by the selector gadget with the cursor.
\begin{figure}
{\centering \includegraphics[width=22.22in]{src/images/broncos_page_source}
}
\caption{finding the team statistics element using the page source}\label{fig:unnamed-chunk-28}
\end{figure}
\begin{caution}
While the selector gadget is always a great first option, it is not
always reliable. There are instances when the selector gadget identifies
a node that is hidden or inaccessible without JavaScript. In these
situations, it is best to view the page source directly for more
guidance on how to proceed. Practice with both the selector gadget and
the page source.
\end{caution}
Once we have found the name of the element containing the desired data, we can utilize the \texttt{rvest} package to scrape the table. The general process for scraping an HTML table is
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Read the HTML identified by the web address.
\item
Isolate the node containing the data we desire.
\item
Parse the HTML table.
\item
Take a look at the data to ensure the columns have appropriate labels.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{library}\NormalTok{(rvest)}
\KeywordTok{library}\NormalTok{(janitor)}
\KeywordTok{library}\NormalTok{(stringr) }\CommentTok{# provides str_c()}
\NormalTok{pfr_url <-}\StringTok{ "https://www.pro-football-reference.com"}
\NormalTok{broncos_url <-}\StringTok{ }\KeywordTok{str_c}\NormalTok{(pfr_url, }\StringTok{'/teams/den/2020.htm'}\NormalTok{)}
\NormalTok{broncos_url }\OperatorTok{%>%}
\StringTok{ }\CommentTok{# read the HTML}
\StringTok{ }\KeywordTok{read_html}\NormalTok{(.) }\OperatorTok{%>%}
\StringTok{ }\CommentTok{# isolate the node containing the HTML table}
\StringTok{ }\KeywordTok{html_node}\NormalTok{(., }\DataTypeTok{css =} \StringTok{'#team_conversions'}\NormalTok{) }\OperatorTok{%>%}
\StringTok{ }\CommentTok{# parse the html table}
\StringTok{ }\KeywordTok{html_table}\NormalTok{(.) }\OperatorTok{%>%}
\StringTok{ }\CommentTok{# make the first row of the table column headers and clean up column names}
\StringTok{ }\KeywordTok{row_to_names}\NormalTok{(., }\DataTypeTok{row_number =} \DecValTok{1}\NormalTok{) }\OperatorTok{%>%}
\StringTok{ }\KeywordTok{clean_names}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
player x3d_att x3d_conv x3d_percent x4d_att x4d_conv x4d_percent
2 Team Stats 50 19 38.0 4 0 0.0
3 Opp. Stats 63 25 39.7 6 3 50.0
4 Lg Rank Offense 25 30
5 Lg Rank Defense 12 12
rz_att rztd rz_pct
2 13 6 46.2
3 13 6 46.2
4 28
5 4
\end{verbatim}
While these data need cleaning up before they can be used in practice, we will defer these responsibilities to Chapter \ref{Wrangling}.
\begin{progress}
Take this time to scrape the `Team Conversions' table on your own.
\end{progress}
\begin{feedback}
Any feedback for this section? Click
\href{https://docs.google.com/forms/d/e/1FAIpQLSePQZ3lIaCIPo9J2owXImHZ_9wBEgTo21A0s-A1ty28u4yfvw/viewform?entry.1684471501=The\%20R\%20Community}{here}
\end{feedback}
\hypertarget{ScrapingInTheWild}{%
\section{Scraping in the Wild}\label{ScrapingInTheWild}}
\begin{quote}
``You can have data without information, but you cannot have information without data.'' --- Daniel Keys Moran, Computer Scientist and Author
\end{quote}
In Chapter \ref{AccessingData}, we introduced the idea of rectangular data vs.~non-rectangular data, providing examples for each and demonstrating the process of rectangularization. We outlined how to use a web-API before introducing the concept of web scraping by illustrating the language of the web: HTML. Since webpages can be complicated, scraping can be complicated. In this chapter, we will leverage the Selector Gadget and our knowledge of HTML elements to scrape data from various sources. It is our belief that the only way to teach web scraping is through examples. Each example will become slightly more difficult than the previous.