README.Rmd
65 additions & 40 deletions
@@ -72,19 +72,13 @@ head(best_uni, 10)
 Thanks to the [robotstxt](https://github.com/ropensci/robotstxt), you can set `askRobot = TRUE` to ask the `robots.txt` file if it's permitted to scrape a specific web page.
 
 If you want to scrap multiple list pages, just use `scrap()` in conjunction with `paste0()`.
-Suppose that you want to scrape all `RStudio::conf 2021` speakers:
`images_scrap()` on the other hand download the images. It takes the following arguments:
@@ -273,22 +259,61 @@ In the following example we extract all the `png` images from [RStudio](https://
 
 
 ```{r, eval=FALSE}
-
 # Suppose we're in a project which has a folder called my_images:
+images_scrap(
+  link = "http://books.toscrape.com/",
+  imgpath = here::here("my_images"),
+  extn = "jpg" # images here use .jpg
+)
+```
+
+## `pdf_scrap`
+
+The function can be used to download `PDF` documents from a particular website, note that the `PDFs` need to be hosted within the website statically. Also, the access should not be restricted:
 
-images_scrap(link = "https://rstudio.com/",
-             imgpath = here::here("my_images"),
-             extn = "png") # without the .
+```{r, eval=FALSE}
+pdf_scrap(
+  link = "https://www.make-it-in-germany.com/en/visa-residence/types/eu-blue-card",
+  path = here::here("my_pdfs")
+)
+```
+
+## `csv_scrap`
+
+```{r, eval=FALSE}
+csv_scrap(
+  link = "https://sample-files.com/data/csv/",
+  path = here::here("my_csvs")
+)
+```
+
+
+## `xlsx_scrap`
 
+```{r, eval=FALSE}
+xlsx_scrap(
+  link = "https://file-examples.com/index.php/sample-documents-download/sample-xls-download/",
+  path = here::here("my_xlsx")
+)
+```
+
+
+## `xls_scrap`
+
+```{r, eval=FALSE}
+xls_scrap(
+  link = "https://file-examples.com/index.php/sample-documents-download/sample-xls-download/",
+  path = here::here("my_xls")
+)
 ```
 
 
+
 # Accessibility related functions
 
 
 ## `images_noalt_scrap()`
 
-
 `images_noalt_scrap()` can be used to get the images within a specific web page that don't have an `alt` attribute which can be annoying for people using a screen reader:
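
Illustrative note, not part of the diff: the sketch below shows what "images without an `alt` attribute" means, using the `xml2` package on an inline HTML string. It is not how `images_noalt_scrap()` is implemented. The HTML fragment and the variable names are made up for illustration, and `xml2` is assumed to be installed.

```r
library(xml2)

# Hypothetical page fragment: the second <img> has no alt attribute.
html <- '<html><body>
  <img src="logo.png" alt="Company logo">
  <img src="banner.png">
</body></html>'

page <- read_html(html)

# XPath: select every <img> element that lacks an alt attribute.
no_alt <- xml_find_all(page, "//img[not(@alt)]")

# Collect the src of each image that a screen reader cannot describe.
missing_alt_src <- xml_attr(no_alt, "src")
print(missing_alt_src)
#> [1] "banner.png"
```

An image with an empty `alt=""` would not be matched by this XPath, since the attribute is present. Whether the package treats that case the same way is not stated in the diff.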