Was looking into how crowdsecurity/seo-bots-whitelist verifies Google crawlers and noticed it doesn't really cover AdsBot-Google, Mediapartners-Google (the AdSense crawler), AdSense-RestrictedContent and other special-case Google crawlers. Wanted to ask whether this is intentional or a gap before opening a PR.
Current state
The whitelist identifies Google traffic via three data sources:
- rdns_seo_bots.txt: .googlebot.com. only
- rdns_seo_bots.regex: crawl-N.N.N.N.googlebot.com.$, rate-limited-proxy-N.N.N.N.google.com.$, google-proxy-N.N.N.N.google.com.$ (quick demo below)
- ip_seo_bots.txt: nothing Google-related (only DuckDuckGo and Pinterest)
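To make the gap concrete, here's a quick check of what those patterns catch. The regexes are paraphrased from the shorthand above (the shipped file spells the octet part differently, so treat them as an approximation), and the last hostname is purely hypothetical, just something under google.com that isn't shaped like the two proxy patterns:

```python
import re

# Approximations of the shipped rdns_seo_bots.regex patterns,
# reconstructed from the shorthand above, not copied from the file.
CURRENT_PATTERNS = [
    re.compile(r"^crawl-\d+-\d+-\d+-\d+\.googlebot\.com\.$"),
    re.compile(r"^rate-limited-proxy-\d+-\d+-\d+-\d+\.google\.com\.$"),
    re.compile(r"^google-proxy-\d+-\d+-\d+-\d+\.google\.com\.$"),
]

SAMPLES = [
    "crawl-66-249-66-1.googlebot.com.",             # common-crawler PTR shape
    "rate-limited-proxy-66-249-90-77.google.com.",  # the Mediapartners side effect
    "ads-crawler-66-249-79-1.google.com.",          # hypothetical special-crawler PTR
]

for name in SAMPLES:
    matched = any(p.match(name) for p in CURRENT_PATTERNS)
    print(f"{name:48} matched={matched}")
```

Anything under google.com that doesn't look like one of the two proxy shapes falls through, which is exactly the special-case group.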
What Google itself documents
Per https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot, Google splits crawlers into three groups, each with a different reverse DNS zone and a different IP range file:
- common crawlers (Googlebot etc.): googlebot.com or google.com, ranges in googlebot.json
- special-case crawlers (AdsBot, Mediapartners-Google, AdSense-RestrictedContent, APIs-Google): google.com, ranges in special-crawlers.json
- user-triggered fetchers: google.com or googleusercontent.com, ranges in user-triggered-fetchers.json
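For reference, the three range files live under a common path linked from that page (URLs as of writing; worth re-checking before wiring anything to them):

```python
import json
import urllib.request

# Range-file URLs as linked from the verifying-googlebot page.
BASE = "https://developers.google.com/static/search/apis/ipranges/"

for name in ("googlebot.json", "special-crawlers.json", "user-triggered-fetchers.json"):
    with urllib.request.urlopen(BASE + name) as resp:
        data = json.load(resp)
    print(f"{name}: {len(data['prefixes'])} prefixes")
```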
The current rDNS list fully covers only the first group. The rate-limited-proxy-*.google.com regex partially catches Mediapartners as a side effect of #25, but that's it. AdsBot-Google and the rest aren't whitelisted via either rDNS or IP.
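For context, the verification Google documents for all three groups is plain forward-confirmed reverse DNS. A minimal sketch of that procedure (the documented check itself, not the collection's expression logic):

```python
import socket

# Per-group rDNS suffixes from the doc above; the current list
# effectively ships only the .googlebot.com one.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def is_verified_google_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check,
    then forward resolution of the PTR name back to the same IP."""
    try:
        ptr_name, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not ptr_name.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(ptr_name, None)}
    except OSError:
        return False
    return ip in forward_ips

# 66.249.66.1 is the example Googlebot address from Google's own docs.
print(is_verified_google_crawler("66.249.66.1"))
```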
Why this looks like a gap rather than a design choice
I searched issues and PRs across crowdsecurity/hub, crowdsecurity/sec-lists, and crowdsecurity/crowdsec for adsense, adsbot, and mediapartners. Zero hits. No prior discussion, no rationale in the YAML or data files, no comment explaining the exclusion. The Google rDNS coverage that does exist was added reactively (#25 + sec-lists#9), based on what individual users reported being blocked, not from a review of Google's documented crawler list. I could be wrong about the intent, though, which is why I'm asking before doing the work.
Architectural note
special-crawlers.json is a nested JSON object, so the existing data: directive can't consume it directly (the supported types are string, regexp, and map for JSON-lines). It would need a small converter step: a GitHub Action that pulls the JSON daily and commits a flat CIDR list to sec-lists. Straightforward to add; a sketch follows.
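A minimal sketch of that converter, assuming the range-file URL above; the output filename is my placeholder for the proposal below, not an existing sec-lists file:

```python
import json
import urllib.request

# Source URL as linked from the verifying-googlebot docs; DEST is a
# placeholder name, not an existing sec-lists convention.
SOURCE = "https://developers.google.com/static/search/apis/ipranges/special-crawlers.json"
DEST = "ip_google_special_crawlers.txt"

def flatten() -> None:
    with urllib.request.urlopen(SOURCE) as resp:
        data = json.load(resp)
    # The file nests ranges as {"prefixes": [{"ipv4Prefix": ...} or
    # {"ipv6Prefix": ...}]}, which data: can't ingest directly.
    cidrs = [p.get("ipv4Prefix") or p.get("ipv6Prefix") for p in data["prefixes"]]
    with open(DEST, "w") as f:
        f.writelines(c + "\n" for c in cidrs if c)

if __name__ == "__main__":
    flatten()
```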
Proposal
Two options, depending on what you prefer:
- Extend seo-bots-whitelist with an ip_google_special_crawlers.txt data source, auto-generated from special-crawlers.json. Pro: one knob, opt-in via the existing collection. Con: pulls in commercial AdsBot for everyone using the SEO whitelist, even sites without AdSense.
- New collection crowdsecurity/google-ads-bots-whitelist, opt-in for AdSense publishers. Pro: cleaner separation, no surprise for non-AdSense users. Con: more moving parts.
The same converter pipeline works for both. Happy to do the PR either way; I just want a quick read on which direction makes sense, or whether I'm missing the actual reason this isn't already there.