
Add script to detect and fix HTML-escaped unicode in OL dumps (dry-run) #12224

Open
Chisomnwa wants to merge 22 commits into internetarchive:master from Chisomnwa:fix-html-entity-unescape

Conversation


@Chisomnwa Chisomnwa commented Mar 30, 2026

Add two-phase script to detect and fix HTML-escaped Unicode in OL dumps

Part of #10909

Problem

During imports, Unicode text was incorrectly stored as HTML numeric character references. For example:
&#1057;&#1077;&#1088;&#1075;&#1077;&#1081; → Сергей

This affects approximately 15,000 author names, 72,000 work titles, and 148,500 edition titles.

What this PR does

Adds a two-phase migration script at scripts/migrations/fix_unicode_html_entities.py to detect and fix HTML entity encoding errors in author names, edition titles, and work titles.

Phase 1 — scans the OL dump file and outputs only the keys of affected records to stdout, one per line, so the output can be piped directly to a file:

Preview affected keys in the terminal:

python3 scripts/migrations/fix_unicode_html_entities.py \
  --dump ol_dump_authors_latest.txt.gz

Save affected keys to a file:

python3 scripts/migrations/fix_unicode_html_entities.py \
  --dump ol_dump_authors_latest.txt.gz > author_keys.txt
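
For illustration, the Phase 1 scan described above could look roughly like the sketch below. It is not the script from this PR. It assumes the usual tab-separated OL dump layout (type, key, revision, last modified, JSON), only inspects top-level string fields, and the names `ENTITY_RE`, `find_affected_keys`, and `main` are hypothetical.

```python
import gzip
import html
import json
import re
from collections.abc import Iterator

# Cheap pre-filter for anything shaped like "&name;", "&#1057;" or "&#x421;".
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def find_affected_keys(dump_path: str) -> Iterator[str]:
    """Yield the key of each dump row that has an HTML-escaped string field."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            # Skip rows with nothing entity-shaped before paying for JSON parsing.
            if not ENTITY_RE.search(line):
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 5:
                continue  # malformed row (assumed 5-column layout)
            key, record = columns[1], json.loads(columns[4])
            # Only top-level strings are checked in this sketch.
            if any(
                isinstance(value, str) and html.unescape(value) != value
                for value in record.values()
            ):
                yield key


def main(dump_path: str) -> None:
    # Keys only on stdout, one per line, so the output can be redirected.
    for key in find_affected_keys(dump_path):
        print(key)
```

Comparing each value with its `html.unescape()` result, rather than trusting the regex alone, keeps entity-shaped text that is not a real entity out of the key list.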

Phase 2 — reads the keys file from Phase 1, connects to the live OL database, fetches each record, applies html.unescape() to affected fields, and saves it back:

python3 scripts/migrations/fix_unicode_html_entities.py \
  --keys author_keys.txt \
  --config /olsystem/etc/openlibrary.yml

Phase 2 includes:

  • Graceful shutdown support via ctrl-c using init_signal_handler and was_shutdown_requested from scripts.utils.graceful_shutdown
  • Progress tracking so the script can resume from where it stopped if interrupted
  • --dry-run flag to preview without saving

Testing

Phase 1 has been tested locally against ol_dump_authors_latest.txt.gz and correctly outputs only affected keys to stdout.

And here is what it looks like:

[Image: phase_1_output]

Phase 2 follows the pattern from scripts/migrations/write_prefs_to_store.py and is ready for review and testing against a live OL instance with maintainer guidance.

Notes

  • The script determines the phase from the argument passed — --dump triggers Phase 1, --keys triggers Phase 2.
  • Progress is tracked in a .progress file alongside the keys file, enabling resumability.
  • The edit comment for saved records is: Fix HTML entity encoding in Unicode fields.

Stakeholders
@mekarpeles @jimchamp @cdrini

@Chisomnwa Chisomnwa marked this pull request as draft March 30, 2026 17:42
@mekarpeles
Member

The cause of CI failures is:

mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

scripts/utils/scheduler.py: note: In member "add_job" of class "OlAsyncIOScheduler":
scripts/utils/scheduler.py:48: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
scripts/fix_unicode_html_entities.py: note: In function "process_dump":
scripts/fix_unicode_html_entities.py:61: error: Incompatible default for
argument "limit" (default has type "None", argument has type "int") 
[assignment]
    ...p_path: str, record_type: str, dry_run: bool = True, limit: int = None
                                                                         ^~~~
scripts/fix_unicode_html_entities.py:61: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
scripts/fix_unicode_html_entities.py:61: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase

limit: int = None is invalid under no_implicit_optional

We'd either want

def process_dump(..., limit: int | None = None):

or

from typing import Optional

def process_dump(
    dump_path: str,
    record_type: str,
    dry_run: bool = True,
    limit: Optional[int] = None,
):

Collaborator

@jimchamp jimchamp left a comment


Thanks for looking into this, @Chisomnwa!

As mentioned here, we should split this processing into two phases:

Phase 1
Read the data dump file, identify any affected records, then output the keys of affected records.

Phase 2
Read the keys from the output file that was generated in phase 1. Fetch and update the corresponding records, keeping track of the number of records processed. If the script ends prematurely, the count of records previously processed can be used as an offset from which to begin reading the input file.
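
The offset-based resume described for Phase 2 could be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: the progress file name (keys file name plus `.progress`) and the helpers `read_offset` and `process_keys` are hypothetical, and `handle_key` stands in for the fetch-and-save step.

```python
from collections.abc import Callable
from pathlib import Path


def read_offset(progress_path: Path) -> int:
    """Return how many keys were already processed (0 on a fresh run)."""
    try:
        return int(progress_path.read_text().strip() or 0)
    except FileNotFoundError:
        return 0


def process_keys(keys_path: Path, handle_key: Callable[[str], None]) -> int:
    """Process keys not yet handled and return the total processed so far."""
    # Assumed naming: "author_keys.txt" -> "author_keys.txt.progress".
    progress_path = keys_path.with_name(keys_path.name + ".progress")
    offset = read_offset(progress_path)
    processed = offset
    with keys_path.open(encoding="utf-8") as keys:
        for index, line in enumerate(keys):
            if index < offset:
                continue  # handled in an earlier run
            key = line.strip()
            if key:
                handle_key(key)
            # Record the count only after the key was handled successfully.
            processed = index + 1
            progress_path.write_text(str(processed))
    return processed
```

Because the count is written only after a key has been handled, a run that stops partway restarts at the first key that was not completed.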

We can probably determine the phase by the arguments that are passed when the script is executed. If a --dump file is passed, execute phase one. If a --keys file is passed, then execute phase two.

Phase 1 need only output keys to stdout, one per line. A person can then run ./fix_unicode_html_entities.py --dump /path/to/dump > /path/to/output/file to write the keys to the file. Important: Nothing other than keys should be written to stdout during this phase.

During phase 2, the records will need to be fetched from Open Library, modified, then saved. To do so, you'll have to set up an instance of Open Library and Infogami in the script. You can refer to this code, which sets up a different script in the appropriate manner. The init_signal_handler() call should also be included, as that will allow us to stop the script using ctrl-c (more on this later).

With that setup complete, you should be able to fetch a record by calling web.ctx.site.get(key). Calling the dict() method of the fetched object will give you a dict representation of the object's state. You can update this dict, then pass it to web.ctx.site.save() to save the changes.
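
A rough sketch of that fetch, modify, save cycle is shown below. It takes the site object as a parameter (in the real script this would be `web.ctx.site` after setup) so that it can be exercised without a running instance. The function name `fix_record`, the handling of a missing record, and passing the edit message through a `comment` keyword are assumptions; only top-level string fields are handled here.

```python
import html

EDIT_COMMENT = "Fix HTML entity encoding in Unicode fields"


def fix_record(site, key: str, dry_run: bool = True) -> bool:
    """Unescape top-level string fields of one record; return True if changed."""
    thing = site.get(key)
    if thing is None:
        return False  # assumed: a missing record is simply skipped
    doc = thing.dict()
    changed = False
    for field, value in list(doc.items()):
        if isinstance(value, str):
            fixed = html.unescape(value)
            if fixed != value:
                doc[field] = fixed
                changed = True
    if changed and not dry_run:
        # Assumption: save() accepts the edit message as a `comment` keyword.
        site.save(doc, comment=EDIT_COMMENT)
    return changed
```

Keeping the save behind the `dry_run` check means the same code path can be used to preview which records would change.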

As you're iterating over these keys and updating objects, be sure to periodically check if was_shutdown_requested() returns True. This signifies that a shutdown signal was received, and the script should exit when this occurs. Both init_signal_handler and was_shutdown_requested must be imported from scripts.utils.graceful_shutdown.

I think that's it. Please let me know if you need additional information or clarifications.

Comment thread scripts/fix_unicode_html_entities.py
Comment thread scripts/fix_unicode_html_entities.py Outdated
@Chisomnwa
Author

Thanks for the detailed feedback, @mekarpeles and @jimchamp!

From what I already understand, I'll restructure the script into two phases as described:

  • Phase 1 will output only affected keys to stdout, and

  • Phase 2 will handle fetching, fixing, and saving with graceful shutdown support and progress tracking so the script can continue from where it stopped.

I'll also fix the mypy error, move the file to scripts/migrations/, and remove the --limit argument. I'll push the updated version shortly.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Apr 2, 2026
@Chisomnwa Chisomnwa force-pushed the fix-html-entity-unescape branch from 9f983fb to 23f97f6 Compare April 2, 2026 20:27
@Chisomnwa
Author

Hi @mekarpeles and @jimchamp. I have updated the PR description to reflect the two-phase redesign based on feedback received. Phase 1 has been tested locally. Phase 2 follows the write_prefs_to_store.py pattern and is ready for review.

@Chisomnwa Chisomnwa force-pushed the fix-html-entity-unescape branch from 89ee1c4 to a1a79a6 Compare April 6, 2026 10:54
Contributor

@tfmorris tfmorris left a comment


I recommend that the script check all string fields in the record rather than using a fixed list of fields. That way everything will get fixed in a single pass.

I had a quick look at the results from a modified script running against the March dumps, and it found the following fields to fix:

16142 authors
[('name', 16050), ('personal_name', 241), ('bio', 3), ('title', 2), ('death_date', 1)]

72332 works
[('title', 72254), ('description', 62), ('subtitle', 22)]

143359 Editions
[('subtitle', 90067), ('title', 85479), ('full_title', 77451), ('edition_name', 400), ('by_statement', 305), ('description', 89), ('pagination', 63), ('first_sentence', 45), ('notes', 14), ('copyright_date', 14), ('ocaid', 1)]

The OCAID seemed weird, so I had a look and it's from

/books/OL24271075M godwenttobeautys00ryla_0<ScRiPt>SENj(9613)<

which kind of looks like a failed HTML injection attack.

I spot checked a few of the other fields for sanity. The copyright dates are systemically bad metadata from Pressbooks which apparently added copyright holder names to the copyright date field (some of which include HTML encoded ampersands). The pagination field appears to mostly be daggers and double dagger symbols (as well as some replacement symbols which may have started off life as daggers that couldn't be interpreted). The edition name field is almost all the "feminine ordinal indicator" https://unicodeplus.com/U+00AA which looks like a superscript "a".

Overall everything looks good. The only weird thing I saw was an "&c;" (old school etc) which passed the regex, but isn't a valid entity, so didn't get changed.
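
The "&c;" case can be reproduced with the standard library alone. The helper name `needs_fix` below is hypothetical; the point of the sketch is that comparing a string with its `html.unescape()` result flags only text that would actually change, whereas a pattern match also flags entity-shaped text that is not a real entity.

```python
import html


def needs_fix(text: str) -> bool:
    """True only when html.unescape() would actually change the text."""
    return html.unescape(text) != text


# Numeric and named references are decoded.
assert needs_fix("&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;")
assert html.unescape("Tom &amp; Jerry") == "Tom & Jerry"

# "&c;" is entity-shaped but not a defined entity, so it is left alone.
assert html.unescape("pens, ink, &c;") == "pens, ink, &c;"
assert not needs_fix("pens, ink, &c;")
```

One caveat worth keeping in mind: `html.unescape()` follows HTML5 parsing rules, which also accept a small set of legacy named references without the trailing semicolon (for example `&amp`), so it can change text that a stricter regex would not match.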

Comment thread scripts/migrations/fix_unicode_html_entities.py Outdated
Comment thread scripts/migrations/fix_unicode_html_entities.py
@Chisomnwa
Author

Hi @tfmorris, thank you for the detailed review and for running the script against the March dumps. That breakdown of affected fields is really helpful. I'll implement both suggestions: checking all string fields instead of a fixed list, and adding the early exit before JSON parsing. I'll push the updated version shortly.

@Chisomnwa
Author

Hi @tfmorris, I've implemented both suggestions: the get_field_updates() now checks all string fields in the record instead of a fixed list, and I've added an early exit before JSON parsing to skip records that don't contain any HTML entities. I also removed the --type argument and FIELDS_BY_TYPE dictionary since they're no longer needed. Thanks for running the script against the March dumps too. That breakdown was really useful.

@tfmorris
Contributor

tfmorris commented Apr 7, 2026

@Chisomnwa Happy to help. You'll need to wait for the assigned reviewer to move this forward.

One thing that I failed to notice is that there are a couple of fields like description which are kind of messy in the database and can occur in one of two different forms: a plain string or an object with a value property containing the string. This seems wrong to me, but it's the current state of play. I'm not sure if there are convenience methods available to make this dichotomy easier to handle.

@Chisomnwa
Author

Hi @tfmorris, thanks for the additional insight. I'll look into how the description field appears in the dumps on my end, and I'll wait for @jimchamp to weigh in on whether there are existing convenience methods for handling both forms before implementing anything.

@Chisomnwa
Author

Hi @jimchamp, while waiting for guidance on the description field dichotomy, I am going to work on adding unit tests for Phase 1 and improving the docstrings. I will also look at how other scripts handle the description object form and propose an approach. Will update the PR shortly.

@jimchamp
Collaborator

I am unable to find any convenience methods for the values that are objects. The description on this page is an object, and is being rendered properly. I believe that this is somehow handled in Infogami when pages are rendered (perhaps by calling a Thing's __str__ method?). Regardless, you can import this record to your local environment for testing purposes by running the following:

docker exec -it -e PYTHONPATH=. openlibrary-web-1 ./scripts/copydocs.py /books/OL20590179M

I queried the production database and found that these properties may be saved as objects with a type and value:

      key      |          name
---------------+-------------------------
 /type/author  | bio
 /type/author  | body
 /type/edition | classifications
 /type/edition | description
 /type/edition | first_sentence
 /type/edition | notes
 /type/edition | table_of_contents
 /type/work    | description
 /type/work    | exerpts
 /type/work    | first_sentence
 /type/work    | notes
 /type/work    | subject_people

You may want to check the type of these fields by using isinstance, then modify the value of the field when it is an object. If you're fixing all fields on works, editions, and authors, you'll likely want to check the type anyway (some of these may be lists as well).
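
The `isinstance` dispatch suggested here could be sketched as follows. This is an illustrative outline, not the PR's `get_field_updates()`: the names `unescape_value` and `get_field_updates_sketch` are hypothetical, and the assumption is that object-form fields keep their text under a `value` key, as in `{"type": "/type/text", "value": "..."}`.

```python
import html
from typing import Any


def unescape_value(value: Any) -> Any:
    """Return value with HTML entities decoded, whatever shape it has."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, dict):
        # Object form: fix only the text, keep "type" and other keys as-is.
        if isinstance(value.get("value"), str):
            return {**value, "value": html.unescape(value["value"])}
        return value
    if isinstance(value, list):
        return [unescape_value(item) for item in value]
    return value  # numbers, booleans, None, ...


def get_field_updates_sketch(record: dict) -> dict:
    """Map each affected field name to its corrected value."""
    updates = {}
    for field, value in record.items():
        fixed = unescape_value(value)
        if fixed != value:
            updates[field] = fixed
    return updates
```

A record with nothing to fix yields an empty dict, which makes it easy to skip the save for clean records.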

@jimchamp jimchamp removed the Needs: Response Issues which require feedback from lead label Apr 16, 2026
@Chisomnwa
Author

Thanks for querying the production database and sharing that list, @jimchamp. That is really helpful.

From what I understand, I will use isinstance to check the type of each field before applying html.unescape(): if the field is a plain string, fix it directly; if it is a dict with a value key, fix the value while preserving the rest of the object; and if it is a list, iterate through its items. I will also import the /books/OL20590179M record locally using the copydocs.py command you shared to test the description object form end-to-end.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Apr 16, 2026
@Chisomnwa Chisomnwa force-pushed the fix-html-entity-unescape branch from 918833d to 6b29ab8 Compare April 16, 2026 21:25
@Chisomnwa Chisomnwa marked this pull request as ready for review April 16, 2026 23:22
@Chisomnwa
Author

Hi @jimchamp, I've updated fix_unicode_html_entities.py to handle HTML entity errors in string fields regardless of the field's data type, whether it's a plain string, an object with a value key, or a list of such objects.

Testing the fix locally
To test this locally, I imported the sample record you provided using:

docker exec -it -e PYTHONPATH=. openlibrary-web-1 ./scripts/copydocs.py /books/OL20590179M

Then I opened a Python shell inside the web container:

docker exec -it -e PYTHONPATH=. openlibrary-web-1 python3

And fetched the record from the local database to inspect the description field:

import web
import infogami
from openlibrary.config import load_config

load_config('/openlibrary/conf/openlibrary.yml')
infogami._setup()

record = web.ctx.site.get('/books/OL20590179M')
data = record.dict()
print(data.get('description'))

The output confirmed that the description field in that record is stored in the object form ({"type": "/type/text", "value": "..."}) with no HTML entities present:

[Image: print_description_field]

Running get_field_updates on that record returned an empty dict {}, which is the correct behaviour when there is nothing to fix:

[Image: returned _nothing]

To confirm the fix actually works, I constructed a fake record with HTML entities injected into both a plain string field (title) and the object form (description), then ran get_field_updates on it. The entities were correctly unescaped in both cases, and the type key in the description object was preserved:

[Image: unicode_errors_fixed]

Unit tests
I added unit tests covering all three field shapes (plain string, dict object, and list of objects), as well as edge cases like clean records with no entities, lists with no entities, and records with multiple affected fields. The tests live in scripts/tests/test_fix_unicode_html_entities.py.

All 12 tests pass:

docker exec -it -e PYTHONPATH=. openlibrary-web-1 python3 -m pytest scripts/tests/test_fix_unicode_html_entities.py -v

[Image: ran_unit_tests]

Docstrings
The docstrings for all functions were also updated to make it easier to understand what each function does at a glance.


Labels

Needs: Response Issues which require feedback from lead


4 participants