
SEO bots whitelist doesn't cover Google's special-case crawlers (AdsBot, Mediapartners-Google, etc.) #1787

@nokimaro

Description


I was looking into how crowdsecurity/seo-bots-whitelist verifies Google crawlers and noticed it doesn't cover AdsBot-Google, Mediapartners-Google (the AdSense crawler), AdSense-RestrictedContent, or the other special-case Google crawlers. I wanted to ask whether this is intentional or a gap before opening a PR.

Current state

The whitelist resolves Google traffic via three sources:

  • rdns_seo_bots.txt: .googlebot.com. only
  • rdns_seo_bots.regex: crawl-N.N.N.N.googlebot.com.$, rate-limited-proxy-N.N.N.N.google.com.$, google-proxy-N.N.N.N.google.com.$
  • ip_seo_bots.txt: nothing Google-related (only DuckDuckGo and Pinterest)
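For context, the rDNS entries above model the verification scheme Google documents: forward-confirmed reverse DNS. A minimal sketch of that check, assuming the two suffixes from Google's docs (this is illustrative, not the whitelist's actual code, which works on pre-resolved rDNS fields):

```python
import socket

# rDNS zones Google documents for its common crawlers
# (developers.google.com/search/docs/crawling-indexing/verifying-googlebot).
GOOGLE_RDNS_SUFFIXES = (".googlebot.com", ".google.com")

def has_google_suffix(hostname: str) -> bool:
    """Check a PTR hostname against Google's documented rDNS zones."""
    return hostname.rstrip(".").endswith(GOOGLE_RDNS_SUFFIXES)

def verify_google_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then the
    forward lookup of the PTR name must resolve back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not has_google_suffix(hostname):
        return False
    try:
        # forward-confirm: PTR name must map back to the claimed IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Note that the suffix check alone is spoofable (anyone can create a PTR record ending in googlebot.com in their own zone); the forward confirmation is what makes it trustworthy.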

What Google itself documents

Per https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot, Google splits crawlers into three groups, each with a different reverse DNS zone and a different IP range file:

  • common crawlers (Googlebot etc.): googlebot.com or google.com, ranges in googlebot.json
  • special-case crawlers (AdsBot, Mediapartners-Google, AdSense-RestrictedContent, APIs-Google): google.com, ranges in special-crawlers.json
  • user-triggered fetchers: google.com or googleusercontent.com, ranges in user-triggered-fetchers.json

The current rDNS list fully covers only the first group. The rate-limited-proxy-*.google.com regex partially catches Mediapartners as a side effect of #25, but that's it. AdsBot-Google and the rest aren't whitelisted via either rDNS or IP.

Why this looks like a gap rather than a design choice

I searched issues and PRs across crowdsecurity/hub, crowdsecurity/sec-lists and crowdsecurity/crowdsec for adsense, adsbot, and mediapartners. Zero hits: no prior discussion, no rationale in the YAML or data files, no comment explaining the exclusion. The Google rDNS coverage that does exist was added reactively (#25 + sec-lists#9), based on what individual users reported being blocked rather than on a review of Google's documented crawler list. I could be wrong about the intent, which is why I'm asking before doing the work.

Architectural note

special-crawlers.json is a nested JSON object, so the existing data: directive can't consume it directly (the supported types are string, regexp, and map for JSON-lines files). It would need a small converter step: a GitHub Action that pulls the JSON daily and commits a flat CIDR list to sec-lists. Straightforward to add.
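The converter could be as small as this sketch. The source URL is the one Google currently documents for special-crawlers.json, and the ipv4Prefix/ipv6Prefix keys match the published file's shape at the time of writing; both should be re-checked before relying on this:

```python
import json
import urllib.request

# URL as currently documented by Google; verify before wiring into CI.
SOURCE = "https://developers.google.com/static/search/apis/ipranges/special-crawlers.json"

def flatten_prefixes(doc: dict) -> list[str]:
    """Flatten Google's nested range file into one CIDR per line,
    i.e. the flat format the data: directive can already consume.

    Expected input shape:
    {"creationTime": "...", "prefixes": [{"ipv4Prefix": "..."} or
                                         {"ipv6Prefix": "..."}, ...]}
    """
    cidrs: list[str] = []
    for entry in doc.get("prefixes", []):
        # Each entry carries exactly one key, ipv4Prefix or ipv6Prefix.
        cidrs.extend(entry.values())
    return cidrs

if __name__ == "__main__":
    with urllib.request.urlopen(SOURCE) as resp:
        print("\n".join(flatten_prefixes(json.load(resp))))
```

The GH Action would run this on a schedule and commit the output as the flat list in sec-lists.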

Proposal

Two options, depending on what you prefer:

  1. Extend seo-bots-whitelist with an ip_google_special_crawlers.txt data source, auto-generated from special-crawlers.json. Pro: one knob, opt-in via the existing collection. Con: pulls in commercial AdsBot for everyone using the SEO whitelist, even sites without AdSense.
  2. New collection crowdsecurity/google-ads-bots-whitelist, opt-in for AdSense publishers. Pro: cleaner separation, no surprise for non-AdSense users. Con: more moving parts.
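For option 1, the wiring would presumably mirror the whitelist's existing data: entries. A hypothetical sketch, assuming the usual sec-lists layout (the source_url path and dest_file name are placeholders, not agreed-upon paths):

```yaml
data:
  - source_url: https://raw.githubusercontent.com/crowdsecurity/sec-lists/master/whitelists/seo-bots/ip_google_special_crawlers.txt
    dest_file: ip_google_special_crawlers.txt
    type: string
```

Option 2 would use the same stanza, just in a new parser shipped as its own collection.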

The same converter pipeline works for both. Happy to do the PR either way; I just want a quick read on which direction makes sense, or on whether I'm missing the actual reason this isn't already there.
