Add script to detect and fix HTML-escaped unicode in OL dumps (dry-run)#12224
Chisomnwa wants to merge 22 commits into internetarchive:master
Conversation
The cause of CI failures is:
We'd either want or
jimchamp left a comment
Thanks for looking into this, @Chisomnwa!
As mentioned here, we should split this processing into two phases:
Phase 1
Read the data dump file, identify any affected records, then output the keys of affected records.
Phase 2
Read the keys from the output file that was generated in phase 1. Fetch and update the corresponding records, keeping track of the number of records processed. If the script ends prematurely, the count of records previously processed can be used as an offset from which to begin reading the input file.
We can probably determine the phase by the arguments that are passed when the script is executed. If a --dump file is passed, execute phase one. If a --keys file is passed, then execute phase two.
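A minimal sketch of that argument handling with `argparse` (the `--dump`/`--keys` flag names come from the comment above; the parser description and sample paths are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Exactly one of --dump / --keys is required and selects the phase.
    parser = argparse.ArgumentParser(
        description="Detect (phase 1) or fix (phase 2) HTML-escaped unicode"
    )
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dump", help="OL dump file: run phase 1 (detect)")
    group.add_argument("--keys", help="keys file from phase 1: run phase 2 (fix)")
    return parser.parse_args(argv)

args = parse_args(["--dump", "/tmp/ol_dump_latest.txt.gz"])
phase = 1 if args.dump else 2
```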
Phase 1 need only output keys to stdout, one per line. A person can then run ./fix_unicode_html_entities.py --dump /path/to/dump > /path/to/output/file to write the keys to the file. Important: Nothing other than keys should be written to stdout during this phase.
During phase 2, the records will need to be fetched from Open Library, modified, then saved. To do so, you'll have to set up an instance of Open Library and Infogami in the script. You can refer to this code, which sets up a different script in the appropriate manner. The init_signal_handler() call should also be included, as that will allow us to stop the script using ctrl-c (more on this later).
With that setup complete, you should be able to fetch a record by calling web.ctx.site.get(key). Calling the dict() method of the fetched object will give you a dict representation of the object's state. You can update this dict, then pass it to web.ctx.site.save() to save the changes.
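Under those assumptions, the per-record update step might look like the sketch below. The `web.ctx` calls are shown only as comments because they require a configured Infogami instance; the field list and save comment are illustrative, not the script's actual values:

```python
import html

def fix_record(record, fields=("name", "title", "subtitle")):
    # `record` is the dict() representation of a fetched Thing.
    # Returns (record, changed) after unescaping the listed string fields.
    changed = False
    for field in fields:
        value = record.get(field)
        if isinstance(value, str):
            fixed = html.unescape(value)
            if fixed != value:
                record[field] = fixed
                changed = True
    return record, changed

# In the real script (needs a configured OL/Infogami instance):
#   record = web.ctx.site.get(key).dict()
#   record, changed = fix_record(record)
#   if changed:
#       web.ctx.site.save(record, comment="Fix HTML-escaped unicode")

record, changed = fix_record(
    {"key": "/authors/OL1A", "name": "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;"}
)
```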
As you're iterating over these keys and updating objects, be sure to periodically check if was_shutdown_requested() returns True. This signifies that a shutdown signal was received, and the script should exit when this occurs. Both init_signal_handler and was_shutdown_requested must be imported from scripts.utils.graceful_shutdown.
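For illustration, here is a stdlib-only sketch of the pattern those two helpers presumably implement; the real implementations live in scripts.utils.graceful_shutdown, so this is an assumption about their behavior, not their code:

```python
import signal

_shutdown_requested = False

def init_signal_handler():
    # Convert SIGINT (ctrl-c) into a flag so the loop can exit cleanly.
    def _handler(signum, frame):
        global _shutdown_requested
        _shutdown_requested = True
    signal.signal(signal.SIGINT, _handler)

def was_shutdown_requested():
    return _shutdown_requested

init_signal_handler()
processed = 0
for key in ["/works/OL1W", "/works/OL2W"]:
    if was_shutdown_requested():
        break  # a signal arrived; stop between records, not mid-save
    processed += 1  # fetch, fix, and save would happen here
```

Checking the flag between records (rather than exiting from inside the handler) is what makes it safe to stop the script without leaving a half-saved record.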
I think that's it. Please let me know if you need additional information or clarifications.
Thanks for the detailed feedback, @mekarpeles and @jimchamp! I'll restructure the script into two phases as described.
I'll also fix the mypy error, move the file to scripts/migrations/, and remove the --limit argument. I'll push the updated version shortly.
Force-pushed 9f983fb to 23f97f6
Hi @mekarpeles and @jimchamp. I have updated the PR description to reflect the two-phase redesign based on the feedback received. Phase 1 has been tested locally. Phase 2 follows the pattern from scripts/migrations/write_prefs_to_store.py.
Force-pushed 89ee1c4 to a1a79a6
tfmorris left a comment
I recommend that the script check all string fields in the record rather than using a fixed list of fields. That way everything will get fixed in a single pass.
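One way to implement that suggestion, as a sketch (the helper name and the pre-check regex here are illustrative, not the script's actual code):

```python
import html
import re

# Matches numeric (&#1057;), hex (&#x421;), and named (&eacute;) references.
ENTITY_RE = re.compile(r"&#\d+;|&#[xX][0-9a-fA-F]+;|&[a-zA-Z][a-zA-Z0-9]*;")

def unescape_all_strings(value):
    # Recursively unescape every string in a JSON-like record, so no
    # per-type field list has to be maintained. Dict keys are schema
    # names and are left untouched.
    if isinstance(value, str):
        return html.unescape(value) if ENTITY_RE.search(value) else value
    if isinstance(value, list):
        return [unescape_all_strings(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_all_strings(v) for k, v in value.items()}
    return value

record = {
    "title": "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;",
    "authors": [{"name": "Caf&eacute;"}],
}
fixed = unescape_all_strings(record)
```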
I had a quick look at the results from the modified script running against the March dumps, and it found the following fields to fix:
16142 authors
[('name', 16050), ('personal_name', 241), ('bio', 3), ('title', 2), ('death_date', 1)]
72332 works
[('title', 72254), ('description', 62), ('subtitle', 22)]
143359 editions
[('subtitle', 90067), ('title', 85479), ('full_title', 77451), ('edition_name', 400), ('by_statement', 305), ('description', 89), ('pagination', 63), ('first_sentence', 45), ('notes', 14), ('copyright_date', 14), ('ocaid', 1)]
The OCAID seemed weird, so I had a look and it's from
/books/OL24271075M godwenttobeautys00ryla_0<ScRiPt>SENj(9613)<
which kind of looks like a failed HTML injection attack.
I spot checked a few of the other fields for sanity. The copyright dates are systemically bad metadata from Pressbooks which apparently added copyright holder names to the copyright date field (some of which include HTML encoded ampersands). The pagination field appears to mostly be daggers and double dagger symbols (as well as some replacement symbols which may have started off life as daggers that couldn't be interpreted). The edition name field is almost all the "feminine ordinal indicator" https://unicodeplus.com/U+00AA which looks like a superscript "a".
Overall everything looks good. The only weird thing I saw was an "&c;" (old-school "etc.") which passed the regex but isn't a valid entity, so it didn't get changed.
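That behavior is easy to confirm with Python's html module: a permissive entity regex matches "&c;", but html.unescape() only rewrites references that are actually in the HTML5 entity table, so the text is left untouched:

```python
import html
import re

entity_like = re.compile(r"&[a-zA-Z]+;")

assert entity_like.search("&c; and so forth")  # looks like an entity...
assert html.unescape("&c;") == "&c;"           # ...but isn't one, so it's left alone
assert html.unescape("&amp;c.") == "&c."       # a real named entity is converted
```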
Hi @tfmorris, thank you for the detailed review and for running the script against the March dumps. That breakdown of affected fields is really helpful. I'll implement both suggestions: checking all string fields instead of a fixed list, and adding the early exit before JSON parsing. I'll push the updated version shortly.
Hi @tfmorris, I've implemented both suggestions: the all-string-fields check and the early exit before JSON parsing.
@Chisomnwa Happy to help. You'll need to wait for the assigned reviewer to move this forward. One thing that I failed to notice is that there are a couple of fields, like description, that are not always plain strings.
Hi @jimchamp, while waiting for guidance on the description field dichotomy, I am going to work on adding unit tests for Phase 1 and improving the docstrings. I will also look at how other scripts handle the description object form and propose an approach. Will update the PR shortly.
I am unable to find any convenience methods for the values that are objects. The description on this page is an object, and is being rendered properly. I believe that this is somehow handled in Infogami when pages are rendered (perhaps by calling a Thing's method). I queried the production database and found that these properties may be saved as objects with a type field. You may want to check the type of these fields before unescaping them.
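A sketch of handling both forms; the `{"type": "/type/text", "value": ...}` shape is an assumption based on the comment above, so verify it against real records before relying on it:

```python
import html

def unescape_text_field(value):
    # Fields like description may be a plain string or an object such as
    # {"type": "/type/text", "value": "..."} (assumed shape; verify first).
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, dict) and isinstance(value.get("value"), str):
        value = dict(value)  # avoid mutating the caller's record
        value["value"] = html.unescape(value["value"])
        return value
    return value  # anything else is passed through untouched

plain = unescape_text_field("A &amp; B")
obj = unescape_text_field({"type": "/type/text", "value": "A &amp; B"})
```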
Thanks for querying the production database and sharing that list, @jimchamp. That is really helpful. I will use a type check to handle both the string and object forms.
for more information, see https://pre-commit.ci
918833d to
6b29ab8
Compare
Hi @jimchamp, I've updated the script.
Testing the fix locally
I opened a Python shell inside the web container and fetched the record from the local database to inspect the description field. The output confirmed that the field is stored in object form.
To confirm the fix actually works, I constructed a fake record with HTML entities injected into both a plain string field and an object-form description field.
Unit tests
All 12 tests pass.
Docstrings
Add two-phase script to detect and fix HTML-escaped Unicode in OL dumps
Part of #10909
Problem
During imports, Unicode text was incorrectly stored as HTML numeric character references. For example:
&#1057;&#1077;&#1088;&#1075;&#1077;&#1081; → Сергей
This affects approximately 15,000 author names, 72,000 work titles, and 148,500 edition titles.
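The transformation above is exactly what Python's standard library provides, which is a quick way to check the example:

```python
import html

stored = "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;"  # corrupted form found in the dumps
print(html.unescape(stored))  # Сергей
```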
What this PR does
Adds a two-phase migration script at scripts/migrations/fix_unicode_html_entities.py to detect and fix HTML entity encoding errors in author names, edition titles, and work titles.
Phase 1 — scans the OL dump file and outputs only the keys of affected records to stdout, one per line, so the output can be piped directly to a file:
Preview affected keys in the terminal:
Save affected keys to a file:
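For reference, phase 1 was invoked earlier in the thread as ./fix_unicode_html_entities.py --dump /path/to/dump > /path/to/output/file. The core of the scan might look like this sketch (the tab-separated dump layout of type, key, revision, last-modified, and JSON is the standard OL dump format; the helper name, pre-check regex, and sample data are illustrative):

```python
import json
import re

RAW_ENTITY_RE = re.compile(rb"&#\d+;")  # cheap early exit before JSON parsing

def affected_keys(lines):
    # Each dump line is tab-separated: type, key, revision, last_modified, JSON.
    for line in lines:
        if not RAW_ENTITY_RE.search(line):
            continue  # nothing entity-like on this line; skip the parse
        record = json.loads(line.split(b"\t", 4)[4])
        yield record["key"]

sample = [
    b'/type/author\t/authors/OL1A\t1\t2020-01-01\t{"key": "/authors/OL1A", "name": "&#1057;"}',
    b'/type/author\t/authors/OL2A\t1\t2020-01-01\t{"key": "/authors/OL2A", "name": "ok"}',
]
keys = list(affected_keys(sample))  # only keys are emitted, per the review note
```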
Phase 2 — reads the keys file from Phase 1, connects to the live OL database, fetches each record, applies html.unescape() to affected fields, and saves it back.
Phase 2 includes:
- Graceful shutdown on ctrl-c using init_signal_handler and was_shutdown_requested from scripts.utils.graceful_shutdown
- A --dry-run flag to preview changes without saving
Testing
Phase 1 has been tested locally against ol_dump_authors_latest.txt.gz and correctly outputs only affected keys to stdout. And here is what it looks like:
Phase 2 follows the pattern from scripts/migrations/write_prefs_to_store.py and is ready for review and testing against a live OL instance with maintainer guidance.
Notes
--dump triggers Phase 1, --keys triggers Phase 2.
Stakeholders
@mekarpeles @jimchamp @cdrini