Hello and thank you for maintaining this package.
I'm using RSelenium in R to scrape metadata from Google Scholar profiles (e.g., citations, h-index, publication count). My goal is to process a list of ~700 profile URLs, visiting each one to extract this public data.
However, the automation is blocked immediately: the CAPTCHA ("I'm not a robot") challenge appears as soon as the first profile is opened, preventing any further interaction via Selenium.
My questions are:
- Are there any recommended approaches in `RSelenium` to deal with CAPTCHA challenges like this?
- Are techniques like rotating proxies, randomized delays, or custom user-agents compatible with `RSelenium` in R?
- Do you recommend any browser configuration (e.g., Chrome vs. Firefox) or headless mode settings to reduce the risk of blocks when scraping public pages like Scholar profiles?
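For context, this is the kind of configuration I had in mind for the user-agent question; it's only a sketch (the user-agent string is a placeholder, and I haven't verified it helps against Scholar):

```r
# Hypothetical sketch: override Firefox's user-agent via a profile and
# hand it to rsDriver() through extraCapabilities.
library(RSelenium)

fprof <- makeFirefoxProfile(list(
  # Placeholder UA string -- would be swapped for a current, realistic one
  "general.useragent.override" =
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
))

driver <- rsDriver(
  port = httpuv::randomPort(),
  browser = "firefox",
  extraCapabilities = fprof
)
remote_driver <- driver[["client"]]
```

Is this the recommended way to set a custom user-agent, or is there a better mechanism?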
I’m currently using Firefox with this setup:
```r
# Clear environment
rm(list = ls())

# Load packages
library(pacman)
p_load("tidyverse", "readxl", "RSelenium", "httpuv", "wdman")

# Read list of URLs
investigadores <- read_excel("0- Datos brutos/Investigadores.xlsx")
total_perfiles <- nrow(investigadores)

# Start RSelenium
puerto_libre <- httpuv::randomPort()
driver <- RSelenium::rsDriver(
  port = puerto_libre,
  browser = "firefox",
  version = "latest"
)
remote_driver <- driver[["client"]]
```
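The loop that gets blocked is essentially the following (simplified; I'm assuming here that the Excel file has a column named `url` with the profile URLs):

```r
# Simplified sketch of the scraping loop; the column name `url` and the
# delay bounds are illustrative, not my exact values.
for (i in seq_len(total_perfiles)) {
  remote_driver$navigate(investigadores$url[i])
  Sys.sleep(runif(1, min = 5, max = 15))  # randomized delay between profiles

  # The CAPTCHA already appears on the very first navigate(), so the
  # extraction step below is never reached, e.g.:
  # citas <- remote_driver$findElement(
  #   using = "css selector", value = "td.gsc_rsb_std"
  # )$getElementText()
}
```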
Any suggestions or best practices to avoid immediate CAPTCHA blocks would be highly appreciated. If useful, I can share the full code I'm using.
Thanks in advance!