Skip to content

Python script to generate a .json.gz file per each locale#5

Open
abaevbog wants to merge 11 commits into
zotero:masterfrom
abaevbog:master
Open

Python script to generate a .json.gz file per each locale#5
abaevbog wants to merge 11 commits into
zotero:masterfrom
abaevbog:master

Conversation

@abaevbog
Copy link
Copy Markdown

@abaevbog abaevbog commented May 2, 2023

No description provided.

Copy link
Copy Markdown
Member

@dstillman dstillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also have this output the necessary lines to copy into .htaccess to map locale parameter values to the files between written. We mostly won't need to update them, but good to keep them coupled to the available locales, and the logic might need tweaking.

And for that logic, we basically want to implement this code:

https://github.com/zotero/zotero/blob/3e12f3f20b097d56006c040ea466fb422ba78308/chrome/content/zotero/xpcom/utilities_internal.js#L1581-L1620

…but entirely with rewrite rules. (I haven't totally though through how possible this is, but I think it's mostly doable by writing out a lot of rules in a given order. Let's see what we can do.)

en-US → en-US (exact match)
ar → ar (exact match)
de-DE → de (matching language part)
ca → ca-AD (matching language part)
en-NZ → en-US (prefer en-US over other available en locales for inexact match)
pt → pt-PT (prefer country code matching language code if unspecified — this one's debatable, and 217 million Brazilians would presumably disagree, but it's what we do elsewhere)
zh → zh-CN (this we just get from sorting the country codes, but it makes sense for our userbase)
zz → full file (ignore locale parameter if language code is unknown)

Comment thread scripts/update-gz.py Outdated
schema_text = f.read()
schema = json.loads(schema_text)

if not os.path.exists('../locales'):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Script should be runnable from any folder. You're already getting the parent folder above, so you should just use that for other paths.

You should also wipe locales to remove any existing files. (Locales will pretty much never be removed, but just to fix possible bugs, etc.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, this has been done.

Comment thread scripts/update-gz.py Outdated
for creator_type in item_type['creatorTypes']:
if creator_type['creatorType'] in current_locale["itemTypes"]:
creator_type['creatorType'] = current_locale["itemTypes"][creator_type['creatorType']]
del schema_with_one_locale['locales']
Copy link
Copy Markdown
Member

@dstillman dstillman May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely don't need all of this — sorry for not specifying. Can just keep the one locale keyed properly in locales, as below.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of it!

@abaevbog
Copy link
Copy Markdown
Author

abaevbog commented May 2, 2023

Added the generation of the .htaccess file. The example output is as below:

RewriteRule ^schema/(af-ZA|ar|bg-BG|br|ca-AD|cs-CZ|da-DK|de|el-GR|en-GB|en-US|es-ES|et-EE|eu-ES|fa|fi-FI|fr-FR|gl-ES|he-IL|hr-HR|hu-HU|id-ID|is-IS|it-IT|ja-JP|km|ko-KR|lt-LT|mn-MN|nb-NO|nl-NL|nn-NO|pl-PL|pt-BR|pt-PT|ro-RO|ru-RU|sk-SK|sl-SI|sr-RS|sv-SE|th-TH|tr-TR|uk-UA|vi-VN|zh-CN|zh-TW)$ /zotero-schema/locales/$1.gz [L]
...
RewriteRule ^schema/he(-.*)?$ /zotero-schema/locales/he-IL.gz [L]
RewriteRule ^schema/hr(-.*)?$ /zotero-schema/locales/hr-HR.gz [L]
RewriteRule ^schema/hu(-.*)?$ /zotero-schema/locales/hu-HU.gz [L]
RewriteRule ^schema/id(-.*)?$ /zotero-schema/locales/id-ID.gz [L]
...
RewriteRule ^schema/* /zotero-schema/schema.json.gz [L]

Each of these rules matches /schema/{country_code}-... or /schema/{country_code} but not /schema/{country_code}something_else

@abaevbog abaevbog requested a review from dstillman May 2, 2023 19:58
Comment thread scripts/update-gz.py Outdated
os.mkdir(locales_folder)

# String that accumulates the rules to paste into htaccess
htaccess_rules = f"RewriteRule ^schema/({'|'.join(schema['locales'].keys())})$ /zotero-schema/locales/$1.gz [L]"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a locale query string parameter, not a path component.

Comment thread scripts/update-gz.py Outdated
# Catch all for default schema with all locales
htaccess_rules += f'\nRewriteRule ^schema/* /zotero-schema/schema.json.gz [L]'

print("--- .htacess rules --- \n" + htaccess_rules + "\n--- ---")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can skip the header and footer

@abaevbog abaevbog requested a review from dstillman May 3, 2023 14:59
@dstillman
Copy link
Copy Markdown
Member

Can replace update-gz with update-gz.py, no extension, mode 755

@dstillman
Copy link
Copy Markdown
Member

dstillman commented May 4, 2023

I think we'll need some additional conditions/rules for the locale gz files, like we already have for schema.json.gz:

https://github.com/zotero/dataserver/blob/5fba6e4c4a9df97a305764a69ebf70a872d38cd8/htdocs/.htaccess#L22-L38

E.g., checking Accept-Encoding

Let's just generate the whole block in this script. And we don't need newlines — better to keep this as a single block.

@dstillman
Copy link
Copy Markdown
Member

Not sure it matters, but since we're generating this in a script anyway and know the total count, we can probably add a skip flag (S) to skip all of these rules unless the URL begins with schema. These will be checked for every single dataserver request, so we do want to avoid any slowdown.

(Will this all be meaningfully faster than just having a single schema.php file that generate a locale-specific file and dumped it in memcached? Unclear. But this is certainly the more fun way to do it…)

@abaevbog
Copy link
Copy Markdown
Author

abaevbog commented May 4, 2023

After testing it out with a version of dataserver I ran locally, I had to add one more condition RewriteCond %{SCRIPT_FILENAME} !/zotero-schema/ at the end for everything to work properly. This is a gist of the .htaccess file I ended up with

https://gist.github.com/abaevbog/23a986e6966000325f609652cd25e6ce

@dstillman
Copy link
Copy Markdown
Member

I think we can do a slightly simpler block:

https://gist.githubusercontent.com/dstillman/82db17dca11e2bda0bd87b4d27c02c89/raw/2398339bc434c5dfef7d44e8824b58df09122381/gistfile1.txt

Specifically:

  • Last for everything
  • Removed the lines with E=no-gzip:1 and move those things (setting correct content type, preventing double-gzip) into the FilesMatch
  • No need for zotero-schema in the Skip RewriteCond, since we're setting Last
  • No need for RewriteCond %{SCRIPT_FILENAME} !/zotero-schema/ at the end, since we're setting Last

@abaevbog
Copy link
Copy Markdown
Author

abaevbog commented May 5, 2023

Yes, you are right...
Adding the [L] flag solved whatever issues I had

Comment thread scripts/update-gz Outdated
htaccess_rules = f'''RewriteCond %{{REQUEST_URI}} !^/(schema|zotero-schema)
RewriteRule ".?" "-" [S=LINES_TO_SKIP]
htaccess_rules = f'''RewriteCond %{{REQUEST_URI}} !^/schema
RewriteRule ".?" "-" [S=LINES_TO_SKIP,L]
Copy link
Copy Markdown
Member

@dstillman dstillman May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L isn't right here, though — that breaks the dataserver completely by skipping the main redirect at the bottom of the file.

Copy link
Copy Markdown
Author

@abaevbog abaevbog May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean - I removed it.

For some reason when I was testing it on my local dataserver setup, all non /schema requests reached index.php as they were supposed to even with that L flag. Probably can blame it on my local dataserver or apache setup

Comment thread scripts/update-gz Outdated
RewriteCond %{{HTTP:Accept-Encoding}} !gzip
RewriteRule ^schema(/.*)?$ /zotero-schema/schema.json [QSD,L]
RewriteCond %{{QUERY_STRING}} (?:^|&)locale=({'|'.join(schema['locales'].keys())})(?:&|$)
RewriteRule ^schema(/.*)?$ /zotero-schema/locales/%1.json.gz [QSD]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing L here, so exact matches (e.g., locale=en-US) don't work

Comment thread scripts/update-gz Outdated
# For every country code, sort locale candidates and add rule to htacecss
for country_code in htaccess_mapings.keys():
htaccess_mapings[country_code].sort(key=locale_sort_key)
# Each rule is only applid is gzip encoding is accepted
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applid isapplied if

@abaevbog abaevbog requested a review from dstillman May 8, 2023 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants