Commit f96bece

feddelegrand7 committed: update metadata
1 parent ef4766e, commit f96bece

40 files changed: +4128 additions, -4070 deletions

R/files_scrap.R

Lines changed: 6 additions & 2 deletions

@@ -48,17 +48,21 @@
     contain = ext
   )
 
+  if (length(urls_containing_files) == 1 && is.na(urls_containing_files)) {
+    message("No file has been found. Returning NULL.")
+    return(NULL)
+  }
+
   files_to_consider <- urls_containing_files %>%
     purrr::keep(function(x) {
       tolower(tools::file_ext(x)) == ext
     })
 
   if (length(files_to_consider) == 0) {
     message("No file has been found. Returning NULL.")
-    return(invisible(NULL))
+    return(NULL)
   }
 
-
   files_to_consider <- purrr::map_chr(
     files_to_consider,
     .format_url,
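The guard added in this hunk can be exercised on its own. A minimal sketch (a standalone function mirroring the new check, not the package's actual internals):

```r
# Standalone sketch of the new guard: when the upstream URL search
# returns a single NA, emit a message and return NULL instead of
# letting the extension filter fail downstream.
check_urls <- function(urls_containing_files) {
  if (length(urls_containing_files) == 1 && is.na(urls_containing_files)) {
    message("No file has been found. Returning NULL.")
    return(NULL)
  }
  urls_containing_files
}

check_urls(NA)                  # message, then NULL
check_urls(c("a.pdf", "b.pdf")) # passes through unchanged
```

Returning a plain `NULL` (rather than `invisible(NULL)`, as the hunk also changes) makes the empty result visible at the console.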

R/weblink_scrap.R

Lines changed: 0 additions & 1 deletion

@@ -67,7 +67,6 @@ weblink_scrap <- function(link,
 
   links <- unlist(links)
 
-
   if (is.null(contain)) {
     return(links)
   } else {

README.Rmd

Lines changed: 65 additions & 40 deletions

@@ -72,19 +72,13 @@ head(best_uni, 10)
 Thanks to the [robotstxt](https://github.com/ropensci/robotstxt), you can set `askRobot = TRUE` to ask the `robots.txt` file if it's permitted to scrape a specific web page.
 
 If you want to scrap multiple list pages, just use `scrap()` in conjunction with `paste0()`.
-Suppose that you want to scrape all `RStudio::conf 2021` speakers:
 
 ```{r}
+base_link <- "http://quotes.toscrape.com/page/"
+links <- paste0(base_link, 1:3)
+node <- ".text"
 
-base_link <- "https://global.rstudio.com/student/catalog/list?category_ids=1796-speakers&page="
-
-links <- paste0(base_link, 1:3) # the speakers are listed from page 1 to 3
-
-node <- ".pr-1"
-
-
-head(scrap(links, node), 10) # printing the first 10 speakers
-
+head(scrap(links, node), 10)
 ```
 
 ## `attribute_scrap()`
@@ -104,7 +98,7 @@ attributes <- attribute_scrap(link = "https://ropensci.org/",
 head(attributes, 10) # NA values are a tags without a class attribute
 ```
 
-Another example, let's we want to get all javascript dependencies within the same web page:
+Another example, let's say we want to get all javascript dependencies within the same web page:
 
 ```{r}
@@ -145,26 +139,19 @@ Sometimes you'll find some useful information on the internet that you want to e
 
 ### Example
 
-We'll work on the famous [IMDb website](https://www.imdb.com/). Let's say we need a data frame composed of:
-
-- The title of the 50 best ranked movies of all time
-- Their release year
-- Their rating
-
 We will need to use the `tidy_scrap()` function as follows:
 
 ```{r example3, message=FALSE, warning=FALSE}
 
-my_link <- "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"
+my_link <- "http://books.toscrape.com/catalogue/page-1.html"
 
 my_nodes <- c(
-  ".lister-item-header a", # The title
-  ".text-muted.unbold", # The year of release
-  ".ratings-imdb-rating strong" # The rating)
-)
-
-names <- c("title", "year", "rating") # respect the nodes order
+  "h3 > a", # Title
+  ".price_color", # Price
+  ".availability" # Availability
+)
 
+names <- c("title", "price", "availability") # respect the order
 
 tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)
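The surrounding README notes that `tidy_scrap()` returns every column as *character*. A hedged sketch of the conversion step, using invented rows shaped like the books.toscrape example:

```r
# tidy_scrap() output is all-character; numeric columns must be
# converted by hand. These example rows are made up for illustration.
df <- data.frame(
  title = c("A Light in the Attic", "Tipping the Velvet"),
  price = c("£51.77", "£53.74"),
  stringsAsFactors = FALSE
)

# Strip the currency symbol, then parse as numeric:
df$price <- as.numeric(gsub("[^0-9.]", "", df$price))
df$price # 51.77 53.74
```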
@@ -179,19 +166,16 @@ Note that all columns will be of *character* class. you'll have to convert them
 
 Using `titles_scrap()`, one can efficiently scrape titles which correspond to the _h1, h2 & h3_ HTML tags.
 
-
-
 ### Example
 
 If we go to the [New York Times](https://www.nytimes.com/), we can easily extract the titles displayed within a specific web page :
 
 
 ```{r example4}
 
+titles <- titles_scrap(link = "https://www.nytimes.com/")
 
-titles_scrap(link = "https://www.nytimes.com/")
-
-
+head(titles)
 
 ```
 

@@ -200,9 +184,9 @@ Further, it's possible to filter the results using the `contain` argument:
 
 ```{r}
 
-titles_scrap(link = "https://www.nytimes.com/", contain = "TrUMp", case_sensitive = FALSE)
-
+titles <- titles_scrap(link = "https://www.nytimes.com/", contain = "TrUMp", case_sensitive = FALSE)
 
+head(titles)
 
 ```
 
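The `contain`/`case_sensitive` pair shown above can be approximated in base R. A hypothetical standalone filter (not the package's internal code):

```r
# Hypothetical helper mimicking titles_scrap()'s contain filtering:
# keep only titles matching `contain`, optionally ignoring case.
filter_titles <- function(titles, contain, case_sensitive = TRUE) {
  titles[grepl(contain, titles, ignore.case = !case_sensitive)]
}

headlines <- c("Trump addresses rally", "Markets rise", "A trumpet recital")
filter_titles(headlines, "TrUMp", case_sensitive = FALSE)
# matches both "Trump addresses rally" and "A trumpet recital"
```

Note that a case-insensitive substring match also catches words like "trumpet"; the package's `contain` argument is a substring filter, not a whole-word one.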

@@ -217,8 +201,9 @@ Let's get some paragraphs from the lovely [ropensci.org](https://ropensci.org/)
 
 ```{r}
 
-paragraphs_scrap(link = "https://ropensci.org/")
+pgs <- paragraphs_scrap(link = "https://ropensci.org/")
 
+head(pgs)
 ```
 
 If needed, it's possible to collapse the paragraphs into one bag of words:
@@ -238,11 +223,11 @@ paragraphs_scrap(link = "https://ropensci.org/", collapse = TRUE)
 
 ```{r}
 
-weblink_scrap(link = "https://www.worldbank.org/en/access-to-information/reports/",
+links <- weblink_scrap(link = "https://www.worldbank.org/en/access-to-information/reports/",
               contain = "PDF",
               case_sensitive = FALSE)
 
-
+head(links)
 ```
 
 ## `images_scrap() ` and `images_preview()`
@@ -254,8 +239,9 @@ Let's say we want to list all the images from the official [RStudio](https://rst
 
 ```{r}
 
-images_preview(link = "https://rstudio.com/")
+imgs <- images_preview(link = "https://posit.co/")
 
+head(imgs)
 ```
 
 `images_scrap()` on the other hand download the images. It takes the following arguments:
@@ -273,22 +259,61 @@ In the following example we extract all the `png` images from [RStudio](https://
 
 
 ```{r, eval=FALSE}
-
 # Suppose we're in a project which has a folder called my_images:
+images_scrap(
+  link = "http://books.toscrape.com/",
+  imgpath = here::here("my_images"),
+  extn = "jpg" # images here use .jpg
+)
+```
+
+## `pdf_scrap`
+
+The function can be used to download `PDF` documents from a particular website, note that the `PDFs` need to be hosted within the website statically. Also, the access should not be restricted:
 
-images_scrap(link = "https://rstudio.com/",
-             imgpath = here::here("my_images"),
-             extn = "png") # without the .
+```{r, eval=FALSE}
+pdf_scrap(
+  link = "https://www.make-it-in-germany.com/en/visa-residence/types/eu-blue-card",
+  path = here::here("my_pdfs")
+)
+```
+
+## `csv_scrap`
+
+```{r, eval=FALSE}
+csv_scrap(
+  link = "https://sample-files.com/data/csv/",
+  path = here::here("my_csvs")
+)
+```
+
+
+## `xlsx_scrap`
 
+```{r, eval=FALSE}
+xlsx_scrap(
+  link = "https://file-examples.com/index.php/sample-documents-download/sample-xls-download/",
+  path = here::here("my_xlsx")
+)
+```
+
+
+## `xls_scrap`
+
+```{r, eval=FALSE}
+xls_scrap(
+  link = "https://file-examples.com/index.php/sample-documents-download/sample-xls-download/",
+  path = here::here("my_xls")
+)
 ```
 
 
+
 # Accessibility related functions
 
 
 ## `images_noalt_scrap()`
 
-
 `images_noalt_scrap()` can be used to get the images within a specific web page that don't have an `alt` attribute which can be annoying for people using a screen reader:
 
 
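The new `pdf_scrap`/`csv_scrap`/`xlsx_scrap`/`xls_scrap` sections all follow one pattern: collect the page's links, keep those with the target extension, download each file. A hedged sketch of that shared shape (an assumed structure for illustration, not the package's actual implementation; `links` is assumed to already hold the scraped `<a href>` URLs):

```r
# Hedged sketch of the shared *_scrap download pattern.
download_by_ext <- function(links, ext, path) {
  # Keep only links whose file extension matches (case-insensitive):
  files <- links[tolower(tools::file_ext(links)) == ext]
  if (length(files) == 0) {
    message("No file has been found. Returning NULL.")
    return(NULL)
  }
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    # mode = "wb" matters on Windows for binary formats like PDF/XLSX:
    download.file(f, destfile = file.path(path, basename(f)), mode = "wb")
  }
  invisible(files)
}
```

With no matching links the helper messages and returns `NULL`, mirroring the empty-result behavior added to `R/files_scrap.R` in this commit.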
