
SEO bots whitelist doesn't cover Google's special-case crawlers (AdsBot, Mediapartners-Google, etc.) #1787

@nokimaro

Description


I was looking into how crowdsecurity/seo-bots-whitelist verifies Google crawlers and noticed it doesn't cover AdsBot-Google, Mediapartners-Google (the AdSense crawler), AdSense-RestrictedContent, or the other special-case Google crawlers. I wanted to ask whether this is intentional or a gap before opening a PR.

Current state

The whitelist resolves Google traffic via three sources:

  • rdns_seo_bots.txt: .googlebot.com. only
  • rdns_seo_bots.regex: crawl-N.N.N.N.googlebot.com.$, rate-limited-proxy-N.N.N.N.google.com.$, google-proxy-N.N.N.N.google.com.$
  • ip_seo_bots.txt: nothing Google-related (only DuckDuckGo and Pinterest)
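For context, the rDNS entries above model the verification scheme Google documents: forward-confirmed reverse DNS. A minimal sketch of that check, assuming the two suffixes from Google's docs (this is illustrative, not the whitelist's actual code, which works on pre-resolved rDNS fields):

```python
import socket

# rDNS zones Google documents for its common crawlers
# (developers.google.com/search/docs/crawling-indexing/verifying-googlebot).
GOOGLE_RDNS_SUFFIXES = (".googlebot.com", ".google.com")

def has_google_suffix(hostname: str) -> bool:
    """Check a PTR hostname against Google's documented rDNS zones."""
    return hostname.rstrip(".").endswith(GOOGLE_RDNS_SUFFIXES)

def verify_google_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then the
    forward lookup of the PTR name must resolve back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not has_google_suffix(hostname):
        return False
    try:
        # forward-confirm: PTR name must map back to the claimed IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Note that the suffix check alone is spoofable (anyone can create a PTR record ending in googlebot.com in their own zone); the forward confirmation is what makes it trustworthy.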

What Google itself documents

Per https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot, Google splits crawlers into three groups, each with a different reverse DNS zone and a different IP range file:

  • common crawlers (Googlebot etc.): googlebot.com or google.com, ranges in googlebot.json
  • special-case crawlers (AdsBot, Mediapartners-Google, AdSense-RestrictedContent, APIs-Google): google.com, ranges in special-crawlers.json
  • user-triggered fetchers: google.com or googleusercontent.com, ranges in user-triggered-fetchers.json

The current rDNS list fully covers only the first group. The rate-limited-proxy-*.google.com regex partially catches Mediapartners as a side effect of #25, but that's it. AdsBot-Google and the rest aren't whitelisted via either rDNS or IP.

Why this looks like a gap rather than a design choice

I searched issues and PRs across crowdsecurity/hub, crowdsecurity/sec-lists and crowdsecurity/crowdsec for adsense, adsbot, and mediapartners. Zero hits: no prior discussion, no rationale in the YAML or data files, no comment explaining the exclusion. The Google rDNS coverage that does exist was added reactively (#25 + sec-lists#9), based on what individual users reported being blocked rather than on a review of Google's documented crawler list. I could be wrong about the intent, which is why I'm asking before doing the work.

Architectural note

special-crawlers.json is a nested JSON object, so the existing data: directive can't consume it directly (the supported types are string, regexp, and map for JSON-lines files). It would need a small converter step: a GitHub Action that pulls the JSON daily and commits a flat CIDR list to sec-lists. Straightforward to add.
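The converter could be as small as this sketch. The source URL is the one Google currently documents for special-crawlers.json, and the ipv4Prefix/ipv6Prefix keys match the published file's shape at the time of writing; both should be re-checked before relying on this:

```python
import json
import urllib.request

# URL as currently documented by Google; verify before wiring into CI.
SOURCE = "https://developers.google.com/static/search/apis/ipranges/special-crawlers.json"

def flatten_prefixes(doc: dict) -> list[str]:
    """Flatten Google's nested range file into one CIDR per line,
    i.e. the flat format the data: directive can already consume.

    Expected input shape:
    {"creationTime": "...", "prefixes": [{"ipv4Prefix": "..."} or
                                         {"ipv6Prefix": "..."}, ...]}
    """
    cidrs: list[str] = []
    for entry in doc.get("prefixes", []):
        # Each entry carries exactly one key, ipv4Prefix or ipv6Prefix.
        cidrs.extend(entry.values())
    return cidrs

if __name__ == "__main__":
    with urllib.request.urlopen(SOURCE) as resp:
        print("\n".join(flatten_prefixes(json.load(resp))))
```

The GH Action would run this on a schedule and commit the output as the flat list in sec-lists.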

Proposal

Two options, depending on what you prefer:

  1. Extend seo-bots-whitelist with an ip_google_special_crawlers.txt data source, auto-generated from special-crawlers.json. Pro: one knob, opt-in via the existing collection. Con: pulls in commercial AdsBot for everyone using the SEO whitelist, even sites without AdSense.
  2. New collection crowdsecurity/google-ads-bots-whitelist, opt-in for AdSense publishers. Pro: cleaner separation, no surprise for non-AdSense users. Con: more moving parts.
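For option 1, the wiring would presumably mirror the whitelist's existing data: entries. A hypothetical sketch, assuming the usual sec-lists layout (the source_url path and dest_file name are placeholders, not agreed-upon paths):

```yaml
data:
  - source_url: https://raw.githubusercontent.com/crowdsecurity/sec-lists/master/whitelists/seo-bots/ip_google_special_crawlers.txt
    dest_file: ip_google_special_crawlers.txt
    type: string
```

Option 2 would use the same stanza, just in a new parser shipped as its own collection.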

The same converter pipeline works for both. Happy to do the PR either way; I just want a quick read on which direction makes sense, or on whether I'm missing the actual reason this isn't already there.
