Completely disable indexing of subdomains with unusual cc-lc pairs (e.g. fr-it, uk-es, ...) #8779

@raphael0202

What

Web crawlers can currently index every possible (country code, language code) pair.
For example, fr-es.openfoodfacts.org currently has 120k web pages indexed: URL

There are 242 country codes (plus the world subdomain) and 183 language codes, i.e. 243 * 183 = 44,469 subdomains to crawl.
Indexing everything is clearly not the right approach.

Proposed solution

  1. Only allow crawlers to access subdomains whose (country code, language code) pair makes sense (the language is an official language of the country). This can be done in two ways (we can implement both approaches):
    1. having a dynamic robots.txt generated on the fly by Product Opener (Disallow: /)
    2. returning a noindex page to web crawlers
  2. Deny bots access to all world-{lc} subdomains where {lc} != 'en', to avoid allowing the indexing of 530 million web pages (2.9M products * 183 language codes).
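To make the two rules above concrete, here is a minimal Python sketch of how a per-subdomain robots.txt could be chosen. It is only an illustration, not Product Opener code: the `OFFICIAL_LANGUAGES` table is a tiny hypothetical sample, and the function names `is_indexable` and `robots_txt` are made up for this example. It also assumes that a bare country subdomain (e.g. `fr`) and the bare `world` subdomain stay indexable.

```python
# Hypothetical sample data: a real implementation would use the full
# country -> official languages mapping.
OFFICIAL_LANGUAGES = {
    "fr": {"fr"},
    "es": {"es"},
    "be": {"fr", "nl", "de"},
    "ch": {"de", "fr", "it", "rm"},
}


def is_indexable(subdomain: str) -> bool:
    """Return True if crawlers should be allowed on this subdomain.

    `subdomain` is the part before `.openfoodfacts.org`,
    e.g. "world", "world-fr", "fr" or "fr-es".
    """
    cc, _, lc = subdomain.partition("-")
    if cc == "world":
        # Rule 2: only world (and world-en) stay open to bots.
        return lc in ("", "en")
    if cc not in OFFICIAL_LANGUAGES:
        # Country missing from the sample table: deny by default.
        return False
    if lc == "":
        # Assumption: the bare country subdomain is always indexable.
        return True
    # Rule 1: the language must be an official language of the country.
    return lc in OFFICIAL_LANGUAGES[cc]


def robots_txt(subdomain: str) -> str:
    """Build the robots.txt body for a given subdomain."""
    if is_indexable(subdomain):
        # An empty Disallow value allows everything.
        return "User-agent: *\nDisallow:\n"
    return "User-agent: *\nDisallow: /\n"


if __name__ == "__main__":
    for name in ("world", "world-fr", "fr", "fr-es", "be-nl"):
        print(name, "->", "index" if is_indexable(name) else "block")
```

The same `is_indexable` check could also drive the noindex variant (option 2 in the list), by marking pages as noindex whenever it returns False.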

By proceeding this way, we reduce the number of product pages to index to 7.07M: 2.9M on world + 3.07M (other cc) + 1.1M (6M - 4.9M, remaining cc-lc combinations).
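As a quick sanity check, the page counts quoted in this issue can be recomputed from its own figures. The variable names below are only illustrative; the numbers are the ones given above.

```python
# Figures taken from the issue text (integers, to avoid float rounding).
PRODUCTS_ON_WORLD = 2_900_000
LANGUAGE_CODES = 183

# Pages that would be indexable if every world-{lc} subdomain stayed open.
world_all_languages = PRODUCTS_ON_WORLD * LANGUAGE_CODES

# Pages left to index under the proposal.
other_cc_pages = 3_070_000
remaining_cc_lc_pages = 6_000_000 - 4_900_000
total_after_proposal = PRODUCTS_ON_WORLD + other_cc_pages + remaining_cc_lc_pages

print(f"{world_all_languages:,}")   # 530,700,000
print(f"{total_after_proposal:,}")  # 7,070,000
```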

(Previous proposal)

  • On https://es.openfoodfacts.org, use https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page.
  • Disallow (with a noindex page) all non-product pages for world-{lc}.openfoodfacts.org, to avoid allowing the indexing of 530 million web pages (2.9M products * 183 language codes). This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in URLs in all web page links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).

By proceeding this way, we only index a product in a given language if the product is available in at least one country where that language is supported.

Edit: I've updated the proposal after feedback from Stephane.
