
Google Scholar CAPTCHA appears on first profile using RSelenium #296

@avilavictor997

Description


Hello and thank you for maintaining this package.

I'm using RSelenium in R to scrape metadata from Google Scholar profiles (e.g., citations, h-index, publication count). My goal is to process a list of ~700 profile URLs, visiting each one to extract this public data.

However, the automation is blocked immediately: the CAPTCHA ("I'm not a robot") challenge appears as soon as the first profile is opened, preventing any further interaction via Selenium.

My questions are:

  1. Are there any recommended approaches in RSelenium to deal with CAPTCHA challenges like this?
  2. Are techniques like rotating proxies, randomized delays, or custom user-agents compatible with RSelenium in R?
  3. Do you recommend any browser configuration (e.g., Chrome vs. Firefox) or headless mode settings to reduce the risk of blocks when scraping public pages like Scholar profiles?
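Regarding question 2, the pattern I was considering is overriding the user-agent through a Firefox profile and spacing page loads with randomized delays. A minimal sketch of what I mean (the user-agent string is only an example, and I don't know whether this is enough to avoid the CAPTCHA):

```r
library(RSelenium)

# Firefox profile with an overridden user-agent
# (the UA string below is just an illustrative example)
fprof <- makeFirefoxProfile(list(
  "general.useragent.override" =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
))

driver <- rsDriver(
  port = httpuv::randomPort(),
  browser = "firefox",
  extraCapabilities = fprof
)
remote_driver <- driver[["client"]]

# Randomized pause between page loads (3 to 8 seconds)
Sys.sleep(runif(1, min = 3, max = 8))
```

Is this the recommended way to set a custom user-agent with RSelenium, or is there a better-supported mechanism?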

I’m currently using Firefox with this setup:

```r
# Clear environment
rm(list = ls())

# Load packages
library(pacman)
p_load("tidyverse", "readxl", "RSelenium", "httpuv", "wdman")

# Read list of URLs
investigadores <- read_excel("0- Datos brutos/Investigadores.xlsx")
total_perfiles <- nrow(investigadores)

# Start RSelenium
puerto_libre <- httpuv::randomPort()
driver <- RSelenium::rsDriver(
  port = puerto_libre,
  browser = "firefox",
  version = "latest"
)
remote_driver <- driver[["client"]]
```
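The per-profile loop I had in mind looks roughly like this (a sketch: the `URL` column name is an assumption about my own spreadsheet, and the CSS selector for the citation stats table reflects Google Scholar's current markup, which may change):

```r
for (i in seq_len(total_perfiles)) {
  # Open the i-th profile (column name `URL` assumed)
  remote_driver$navigate(investigadores$URL[i])

  # Randomized delay between visits
  Sys.sleep(runif(1, min = 3, max = 8))

  # Citation stats table cells (Citations / h-index / i10-index);
  # selector is illustrative, not guaranteed stable
  celdas <- remote_driver$findElements(
    using = "css selector",
    value = "td.gsc_rsb_std"
  )
  valores <- vapply(celdas, function(e) e$getElementText()[[1]], character(1))
}
```

This never gets past the first `navigate()` call before the CAPTCHA appears.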

Any suggestions or best practices to avoid immediate CAPTCHA blocks would be highly appreciated. If useful, I can share the full code I'm using.

Thanks in advance!
