Fossil Location HTML Scraping - processing tables into maps
Don Kelly (http://donaldkenney.x10.mx/FOSSINDX.HTM) has created a well-curated website that contains, among other things, a list of roughly 16,000 fossil sites in the United States and Canada. The list is formatted as a series of HTML tables, with one webpage per state or province. Each table records information such as lat-long, formation, and fossils for each locality.
The structure and content of the website make it an interesting dataset for practicing web scraping and producing geospatial visualizations. The goal of this project is to scrape the data from the multiple webpages and HTML tables, combine it into a single table representing all of the fossil localities, and then produce some visualizations from the dataset.
Required libraries
library(dplyr)
library(rvest)
library(leaflet)
library(qdapRegex)
This is done by first gathering a list of the URLs of the individual tables (listed on http://donaldkenney.x10.mx/FOSSINDX.HTM). We then loop through that list, scraping each HTML table and appending it to a master table. Once we have a master table, we reformat and tidy the data. Because the HTML tables vary in structure and completeness, this first section coerces many variables.
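The link-gathering step described above can also be sketched without regular expressions, by reading the `href` attribute of each link directly with rvest. This is a minimal sketch, not the method used in this post: the inline HTML and the file names `AB.HTM` and `BC.HTM` are hypothetical stand-ins for the index page, and the `"li a"` selector assumes each list item wraps a single link, which may not match the real markup.

```r
library(rvest)

# Hypothetical stand-in for the index page; the real page's markup may differ.
index_html <- '<ul>
  <li><a href="AB.HTM">Alberta</a></li>
  <li><a href="BC.HTM">British Columbia</a></li>
</ul>'

url_prefix <- "http://donaldkenney.x10.mx/"

# Select the link inside each list item and read its href attribute
hrefs <- read_html(index_html) %>%
  html_nodes("li a") %>%
  html_attr("href")

# Build the full URL for each table page
table_urls <- paste0(url_prefix, hrefs)
table_urls
```

Reading the attribute avoids depending on where quotation marks happen to fall in the list item's HTML, at the cost of assuming a particular tag structure.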
#Get structure of main website and create list of links to individual tables
website <- "http://donaldkenney.x10.mx/FOSSINDX.HTM"
website_structure <- read_html(website)
website_structure <- website_structure %>% html_nodes("li")
#Set up some constants prior to entering the for loop
url_prefix <- "http://donaldkenney.x10.mx/"
fossil_table <- NULL
#Start the for loop that will scrape the individual tables on separate URLs
for (i in seq_along(website_structure)){
  #assemble the entire URL that we need to navigate to (the suffix is the quoted text in the list item's HTML)
  url_suffix <- qdapRegex::ex_between(website_structure[i], '"', '"')
  url <- paste(url_prefix,url_suffix,sep="")
  #scrape the individual table
  fossils <- read_html(url)
  fossildf <- fossils %>% html_nodes("table") %>% html_table(fill=TRUE)
  fossildf <- fossildf[[1]]
  #bind the individual table to the master table
  fossil_table <- rbind(fossil_table,fossildf)
}
#rename columns 10 to 15 so that the lat/long column can be referenced by name
col_names <- colnames(fossil_table)
col_names[10:15] <- c("10","LatLong","12","13","14","15")
colnames(fossil_table) <- col_names
#Mutate in lat/long columns and select the required columns
fossil_table <- fossil_table %>%
mutate(Latitude = substr(LatLong,1,7), Longitude = substr(LatLong,9,17)) %>%
select(Location,County,`State/Province`,`Directions,Notes`,Age,Formation,Fossils,Latitude,Longitude)
#Change lat/long to numeric from string.
fossil_table$Latitude <- as.numeric(fossil_table$Latitude)
fossil_table$Longitude <- as.numeric(fossil_table$Longitude)
#view final table
str(fossil_table)
## 'data.frame': 16882 obs. of 9 variables:
## $ Location : chr "Central Alberta*" "Mount Dawson Creek[?]" "Southern Alberta" "Southern Alberta" ...
## $ County : chr "[?]" "[?]" "[?]" "[?]" ...
## $ State/Province : chr "AB" "AB" "AB" "AB" ...
## $ Directions,Notes: chr "" "" "Alberta in" "" ...
## $ Age : chr "K" "Kl" "Ku" "Ku" ...
## $ Formation : chr "Edmonton Group" "Commotion" "Oldman" "Judith River" ...
## $ Fossils : chr "vertebrates-reptilia-dinosauria-Leptoceratops,ornithischia-Pachyrhinosaurus" "plants" "vertebrates-reptilia-dinosauria-Edmontosaurus,hadrosauroidea-Corythosaurus,Saurolophus;ornithischia-Monoclonius"| __truncated__ "vertebrates-reptilia-dinosauria-ornithischia-ankylosauria-Panoplosaurus;nodosauridae-Edmontonia,Euoplocephalus" ...
## $ Latitude : num 53.5 51.2 49.6 49.6 49.1 ...
## $ Longitude : num -114 -117 -113 -113 -114 ...
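The fixed-width `substr()` calls above depend on every `LatLong` value having the same character layout. As a hedged alternative, the sketch below pulls the first two signed decimal numbers out of each string and returns `NA` where fewer than two are found. The helper name `parse_latlong` and the sample strings are hypothetical; the exact format of the scraped `LatLong` column is an assumption here.

```r
# Hypothetical helper: split a combined lat/long string into two numeric columns.
parse_latlong <- function(x) {
  # Pull every signed decimal number out of each string
  nums <- regmatches(x, gregexpr("-?[0-9]+\\.?[0-9]*", x))
  # Take the n-th number from each string, or NA when fewer than two were found
  pick <- function(n, i) if (length(n) >= 2) as.numeric(n[i]) else NA_real_
  data.frame(
    Latitude  = vapply(nums, pick, numeric(1), i = 1),
    Longitude = vapply(nums, pick, numeric(1), i = 2)
  )
}

# Assumed sample values; missing coordinates become NA instead of a warning
coords <- parse_latlong(c("53.5000,-114.0000", "49.6 -113.25", "[?]", ""))
coords
```

Rows that come back as `NA` can then be dropped before mapping, for example with `filter(!is.na(Latitude), !is.na(Longitude))`.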
It should be emphasized that this data frame contains 16,882 observations, so plotting every fossil locality in North America at once will be computationally taxing. It is recommended that you filter by state or province, as done below.
#Filter data to include one or more states/provinces. State/province codes are listed below:
#"AB" "AK" "AL" "AR" "AZ" "BA" "BC" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MB" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NB" "NC"
#"ND" "NE" "NH" "NJ" "NL" "NM" "NS" "NT" "NU" "NV" "NY" "OH" "OK" "ON" "OR" "PE" "PA" "QC" "RI" "SC" "SD" "SK" "TN" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY" "YT"
SearchLocation <- c("BC")
#Filter the master table into a dataframe for mapping (fossil_map); %in% lets SearchLocation hold more than one code
fossil_map <- fossil_table %>% filter(`State/Province` %in% SearchLocation)
#Use the leaflet package to make a map of the fossil localities within the fossil_map dataframe
m <- leaflet() %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addMarkers(lng=fossil_map$Longitude, lat=fossil_map$Latitude,
             popup=paste("Fossils: ", fossil_map$Fossils, "<br/>",
                         "Formation: ", fossil_map$Formation))
m
[Interactive leaflet map: fossil localities in British Columbia, with each marker's popup listing the fossils and formation.]
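Since plotting many localities at once is expensive, one option beyond filtering is marker clustering, which leaflet supports through the `clusterOptions` argument of `addMarkers()`. The sketch below is illustrative only: `toy_sites` is a made-up stand-in for `fossil_table`, and `m_clustered` is a hypothetical name. Whether clustering keeps the full dataset responsive has not been tested here.

```r
library(leaflet)

# Toy stand-in for fossil_table; the values are illustrative only
toy_sites <- data.frame(
  Latitude  = c(49.10, 49.60, 53.50),
  Longitude = c(-114.00, -113.00, -114.00),
  Fossils   = c("plants", "invertebrates-mollusks", "vertebrates-fish"),
  Formation = c("", "Oldman", "Edmonton Group")
)

m_clustered <- leaflet(toy_sites) %>%
  addTiles() %>%
  addMarkers(
    lng = ~Longitude, lat = ~Latitude,
    popup = ~paste("Fossils: ", Fossils, "<br/>", "Formation: ", Formation),
    # Group nearby markers into clusters that expand as the user zooms in
    clusterOptions = markerClusterOptions()
  )
m_clustered
```

Passing the data frame to `leaflet()` and using formulas such as `~Longitude` keeps the column references tied to a single data source, so swapping in `fossil_map` should only require changing the first argument.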