What
Web crawlers can currently index every possible country code/language code pair:
For example, fr-es.openfoodfacts.org currently has 120k webpages indexed: URL
There are 242 country codes (+ the world subdomain) and 183 language codes = 34,749 subdomains to crawl.
This is obviously not the right approach to index everything.
Proposed solution
- Only allow crawler access to subdomains whose (country code, language code) pair makes sense (the language is an official language of the country). This can be done by (we can implement both approaches):
  - having a dynamic robots.txt that is generated on the fly by Product Opener (`Disallow: /`)
  - returning a `noindex` page to web crawlers
- Deny bot access to all world-{lc} subdomains where {lc} != 'en', to avoid allowing the indexation of 530 million webpages (2.9M products * 183 language codes).
By proceeding this way, we reduce the number of product pages to index to ~7.07M: 2.9M on world + 3.07M (other country codes) + 1.1M (6M - 4.9M, remaining cc-lc combinations)
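The filtering described above could be sketched as follows. This is a minimal illustration, not Product Opener code: `OFFICIAL_LANGUAGES` is a hypothetical mapping (the real data would come from the Open Food Facts countries taxonomy), and the function names are assumptions.

```python
# Hypothetical mapping from country code to its official language codes.
# In Product Opener this would come from the countries taxonomy.
OFFICIAL_LANGUAGES = {
    "fr": {"fr"},               # France -> French
    "es": {"es"},               # Spain -> Spanish
    "be": {"fr", "nl", "de"},   # Belgium -> French, Dutch, German
    "world": {"en"},            # only world (English) stays crawlable
}

def is_crawlable(subdomain: str) -> bool:
    """Return True if crawlers should be allowed on this subdomain."""
    if "-" in subdomain:
        cc, lc = subdomain.split("-", 1)
    else:
        cc, lc = subdomain, None
    allowed = OFFICIAL_LANGUAGES.get(cc)
    if allowed is None:
        return False
    # cc alone (default language) is fine; cc-lc only if lc is official
    return lc is None or lc in allowed

def robots_txt(subdomain: str) -> str:
    """Generate robots.txt on the fly: deny everything on bad pairs."""
    if is_crawlable(subdomain):
        return "User-agent: *\nAllow: /\n"
    return "User-agent: *\nDisallow: /\n"
```

With this sketch, `fr` and `be-nl` stay indexable, while `fr-es` and `world-fr` get a blanket `Disallow: /` (the same decision could drive the `noindex` response instead).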
(Previous proposal)
on https://es.openfoodfacts.org, have https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page. Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexation of 530 million webpages (2.9M products * 183 language codes). This could also be implemented in robots.txt (and avoid many unnecessary queries to the OFF server) if we only used English words in URLs in all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).
By proceeding this way, we only index a product in a given language if there is at least one country (with that language as a supported language) where the product is available.
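If URLs used only English words, the rules for a world-{lc} subdomain could live in a static robots.txt rather than being served dynamically. A sketch (the path names here are assumptions for illustration, not the actual OFF URL scheme):

```text
# robots.txt served on world-{lc}.openfoodfacts.org (sketch)
User-agent: *
Allow: /product/
Disallow: /brand/
Disallow: /category/
Disallow: /
```

With French words in URLs (/marque/, /categorie/, ...) each language would need its own rule set, which is why the proposal suggests English-only URL segments.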
edit: I've updated the proposal after feedback from Stéphane.