
Commit 5981423
Author: feddelegrand7
Commit message: new comments_scrap function
Parent: dd03a62

38 files changed: +479, -207 lines

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 Package: ralger
 Type: Package
 Title: Easy Web Scraping
-Version: 2.2.4
+Version: 2.3.0
 Authors@R: c(
     person("Mohamed El Fodil", "Ihaddaden", email = "ihaddaden.fodeil@gmail.com", role = c("aut", "cre")),
     person("Ezekiel", "Ogundepo", role = c("ctb")),

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # Generated by roxygen2: do not edit by hand

 export(attribute_scrap)
+export(comments_scrap)
 export(csv_scrap)
 export(images_noalt_scrap)
 export(images_preview)

NEWS.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 - `xls_scrap`
 - `xlsx_scrap`
 - `csv_scrap`
+- `comments_scrap`

 # ralger 2.2.4


R/comments_scrap.R

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+#' Scrape HTML comments from a web page
+#'
+#' @description Extracts HTML comments (<!-- comment -->) from a webpage. Useful for detecting hidden notes, debug info, or developer messages.
+#'
+#' @param link Character. The URL of the web page to scrape.
+#' @param askRobot Logical. Should the function check robots.txt before scraping? Default is FALSE.
+#' @return A character vector of HTML comments found on the page.
+#'
+#' @examples
+#' \donttest{
+#' link <- "https://example.com"
+#' comments_scrap(link)
+#' }
+#'
+#' @export
+#' @importFrom xml2 read_html
+#' @importFrom rvest html_nodes
+#' @importFrom robotstxt paths_allowed
+#' @importFrom curl has_internet
+#' @importFrom crayon green bgRed
+comments_scrap <- function(link, askRobot = FALSE) {
+
+  ###################### Ask Robot part ######################################################
+
+  if (askRobot) {
+    if (paths_allowed(link)) {
+      message(green("robots.txt allows scraping this web page"))
+    } else {
+      message(bgRed("WARNING: robots.txt prohibits scraping this web page"))
+      return(NA)
+    }
+  }
+
+  ############################################################################################
+
+  tryCatch(
+    expr = {
+      if (!has_internet()) {
+        stop("No internet connection.")
+      }
+
+      html_content <- read_html(link)
+
+      raw_content <- as.character(html_content)
+
+      comments <- regmatches(
+        raw_content,
+        gregexpr("<!--(.*?)-->", raw_content, perl = TRUE)
+      )[[1]]
+
+      comments <- trimws(comments)
+
+      if (length(comments) == 0) {
+        message("No HTML comments found.")
+        return(NA)
+      }
+
+      return(comments)
+    },
+    error = function(cond) {
+      message("Error while scraping comments: ", cond$message)
+      return(NA)
+    }
+  )
+}
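The extraction above is plain regex matching on the serialized HTML, not DOM traversal. A minimal offline sketch of the same logic (the sample HTML string here is invented for illustration) also shows one caveat: without the `(?s)` modifier, `.` in a PCRE pattern does not match newlines, so a comment spanning several lines is silently skipped:

```r
html <- '<html><!-- Start header --><body><p>Hi</p><!-- TODO:
spans two lines --><!-- End footer --></body></html>'

# Same logic as comments_scrap(): lazy match between <!-- and -->
comments <- trimws(regmatches(html, gregexpr("<!--(.*?)-->", html, perl = TRUE))[[1]])
print(comments)
#> [1] "<!-- Start header -->" "<!-- End footer -->"

# With (?s), . also matches newlines, so the multi-line comment is captured too
all_comments <- regmatches(html, gregexpr("(?s)<!--(.*?)-->", html, perl = TRUE))[[1]]
print(length(all_comments))
#> [1] 3
```

Whether the committed pattern should use `(?s)` depends on whether multi-line comments matter for the intended use case; the behavior above is the one the commit ships.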

R/table_scrap.R

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 #' @examples \donttest{
 #' # Extracting premier ligue 2019/2020 top scorers
 #'
-#' link <- "https://www.topscorersfootball.com/premier-league"
+#' link <- "https://www.topscorersfootball.com/premier-league"
 #' table_scrap(link)
 #'
 #' }

README.Rmd

Lines changed: 9 additions & 0 deletions
@@ -305,6 +305,15 @@ xls_scrap(
 ```


+## `comments_scrap()`
+
+Useful when you want to extract the `HTML` comments within a webpage:
+
+```{r}
+head(comments_scrap("https://posit.co"))
+```
+
+

 # Accessibility related functions


README.md

Lines changed: 21 additions & 10 deletions
@@ -15,12 +15,10 @@ downloads](https://cranlogs.r-pkg.org/badges/grand-total/ralger)](https://cran.r
 <!-- [![license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://choosealicense.com/licenses/mit/) -->
 [![R
 badge](https://img.shields.io/badge/Build%20with-♥%20and%20R-blue)](https://github.com/feddelegrand7/ralger)
-[![R
-badge](https://img.shields.io/badge/-Sponsor-brightgreen)](https://www.buymeacoffee.com/Fodil)
+
 [![R build
 status](https://github.com/feddelegrand7/ralger/workflows/R-CMD-check/badge.svg)](https://github.com/feddelegrand7/ralger/actions)
-[![Codecov test
-coverage](https://codecov.io/gh/feddelegrand7/ralger/branch/master/graph/badge.svg)](https://codecov.io/gh/feddelegrand7/ralger?branch=master)
+
 <!-- badges: end -->

 The goal of **ralger** is to facilitate web scraping in R. For a quick
@@ -248,9 +246,9 @@ easily extract the titles displayed within a specific web page :
 titles <- titles_scrap(link = "https://www.nytimes.com/")

 head(titles)
-#> [1] "New York Times - Top Stories" "More News"
-#> [3] "The AthleticSports coverage" "Well"
-#> [5] "Culture and Lifestyle" "AudioPodcasts and narrated articles"
+#> [1] "New York Times - Top Stories" "What to Watch and Read"
+#> [3] "More News" "The AthleticSports coverage"
+#> [5] "Well" "Culture and Lifestyle"
 ```

 Further, it’s possible to filter the results using the `contain`
@@ -399,6 +397,20 @@ xls_scrap(
 )
 ```

+## `comments_scrap()`
+
+Useful when you want to extract the `HTML` comments within a webpage:
+
+``` r
+head(comments_scrap("https://posit.co"))
+#> [1] "<!-- Start VWO Common Smartcode -->"
+#> [2] "<!-- End VWO Common Smartcode -->"
+#> [3] "<!-- Start VWO Async SmartCode -->"
+#> [4] "<!-- End VWO Async SmartCode -->"
+#> [5] "<!-- This site is optimized with the Yoast SEO plugin v25.2 - https://yoast.com/wordpress/plugins/seo/ -->"
+#> [6] "<!-- / Yoast SEO plugin. -->"
+```
+
 # Accessibility related functions

 ## `images_noalt_scrap()`
@@ -410,9 +422,8 @@ people using a screen reader:
 ``` r

 images_noalt_scrap(link = "https://www.r-consortium.org/")
-#> [1] <img loading="lazy" src="./posts/r-consortium-awards-first-round-of-2025-isc-grants/isc-grantees-2025-1.png" class="thumbnail-image card-img" style="height: 150px;">
-#> [2] <img loading="lazy" src="./posts/exploring-kuzco-making-computer-vision-for-r-easily-accessible/frankthull.png" class="thumbnail-image card-img" style="height: 150px;">
-#> [3] <img loading="lazy" src="./posts/quantifying-participation-risk-with-r-and-r-shiny-a-new-frontier-in-financial-risk-modeling/demo.png" class="thumbnail-image card-img" style="height: 150px;">
+#> No images without 'alt' attribute found at: https://www.r-consortium.org/
+#> NULL
 ```

 If no images without `alt` attributes are found, the function returns
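The `askRobot` branch of the new `comments_scrap()` relies on `robotstxt::paths_allowed()`. A small standalone sketch of the same politeness check (the wrapper name `check_robot` is invented here; the messages mirror the committed code):

```r
library(robotstxt)

# Hypothetical helper mirroring the askRobot branch of comments_scrap()
check_robot <- function(link) {
  if (isTRUE(paths_allowed(link))) {
    message("robots.txt allows scraping this web page")
    TRUE
  } else {
    message("robots.txt prohibits scraping this web page")
    FALSE
  }
}

# Requires internet access, so not run here:
# check_robot("https://example.com")
```

Calling it before any scraping function keeps the robots.txt check reusable across the package's `*_scrap()` helpers.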

docs/404.html
Lines changed: 1 addition & 1 deletion (generated file, not rendered)

docs/CODE_OF_CONDUCT.html
Lines changed: 1 addition & 1 deletion (generated file, not rendered)

docs/LICENSE-text.html
Lines changed: 1 addition & 1 deletion (generated file, not rendered)

0 commit comments