What
Web crawlers can currently index every possible country code/language code pair:
For example, fr-es.openfoodfacts.org currently has 120k webpages indexed: URL
There are 242 country codes (+ the world subdomain) and 183 language codes = 34,749 subdomains to crawl.
This is obviously not the right approach to index everything.
Proposed solution
- Only allow crawler access to subdomains whose (country code, language code) pair makes sense (the language is an official language of the country). This can be done by (we can implement both approaches):
  - having a dynamic robots.txt that is generated on the fly by Product Opener (`Disallow: /`)
  - returning a `noindex` page to web crawlers
- Deny bot access to all world-{lc} subdomains where {lc} != 'en', to avoid allowing the indexation of 530 million webpages (2.9M products * 183 language codes).
By proceeding this way, we reduce the number of product pages to index to ~7.07M: 2.9M on world + 3.07M (other country codes) + 1.1M (6M - 4.9M, remaining cc-lc combinations)
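The filtering described above could be sketched as follows. This is a minimal illustration, not Product Opener code: `OFFICIAL_LANGUAGES` is a hypothetical mapping (the real data would come from the Open Food Facts countries taxonomy), and the function names are assumptions.

```python
# Hypothetical mapping from country code to its official language codes.
# In Product Opener this would come from the countries taxonomy.
OFFICIAL_LANGUAGES = {
    "fr": {"fr"},               # France -> French
    "es": {"es"},               # Spain -> Spanish
    "be": {"fr", "nl", "de"},   # Belgium -> French, Dutch, German
    "world": {"en"},            # only world (English) stays crawlable
}

def is_crawlable(subdomain: str) -> bool:
    """Return True if crawlers should be allowed on this subdomain."""
    if "-" in subdomain:
        cc, lc = subdomain.split("-", 1)
    else:
        cc, lc = subdomain, None
    allowed = OFFICIAL_LANGUAGES.get(cc)
    if allowed is None:
        return False
    # cc alone (default language) is fine; cc-lc only if lc is official
    return lc is None or lc in allowed

def robots_txt(subdomain: str) -> str:
    """Generate robots.txt on the fly: deny everything on bad pairs."""
    if is_crawlable(subdomain):
        return "User-agent: *\nAllow: /\n"
    return "User-agent: *\nDisallow: /\n"
```

With this sketch, `fr` and `be-nl` stay indexable, while `fr-es` and `world-fr` get a blanket `Disallow: /` (the same decision could drive the `noindex` response instead).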
(Previous proposal)
on https://es.openfoodfacts.org, have https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page. Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexation of 530 million webpages (2.9M products * 183 language codes). This could also be implemented in robots.txt (and avoid many unnecessary queries to the OFF server) if we only used English words in URLs in all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).
By proceeding this way, we only index a product in a given language if there is at least one country (with that language as a supported language) where the product is available.
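If URLs used only English words, the rules for a world-{lc} subdomain could live in a static robots.txt rather than being served dynamically. A sketch (the path names here are assumptions for illustration, not the actual OFF URL scheme):

```text
# robots.txt served on world-{lc}.openfoodfacts.org (sketch)
User-agent: *
Allow: /product/
Disallow: /brand/
Disallow: /category/
Disallow: /
```

With French words in URLs (/marque/, /categorie/, ...) each language would need its own rule set, which is why the proposal suggests English-only URL segments.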
edit: I've updated the proposal after feedback from Stéphane.