Add script to detect and fix HTML-escaped unicode in OL dumps (dry-run)#12224
Chisomnwa wants to merge 22 commits into internetarchive:master
Conversation
The cause of CI failures is:
We'd either want or
jimchamp left a comment
Thanks for looking into this, @Chisomnwa!
As mentioned here, we should split this processing into two phases:
Phase 1
Read the data dump file, identify any affected records, then output the keys of affected records.
Phase 2
Read the keys from the output file that was generated in phase 1. Fetch and update the corresponding records, keeping track of the number of records processed. If the script ends prematurely, the count of records previously processed can be used as an offset from which to begin reading the input file.
We can probably determine the phase by the arguments that are passed when the script is executed. If a --dump file is passed, execute phase one. If a --keys file is passed, then execute phase two.
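A minimal sketch of that argument handling with `argparse` (the `--dump`/`--keys` flag names come from the comment above; the parser description and sample paths are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Exactly one of --dump / --keys is required and selects the phase.
    parser = argparse.ArgumentParser(
        description="Detect (phase 1) or fix (phase 2) HTML-escaped unicode"
    )
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dump", help="OL dump file: run phase 1 (detect)")
    group.add_argument("--keys", help="keys file from phase 1: run phase 2 (fix)")
    return parser.parse_args(argv)

args = parse_args(["--dump", "/tmp/ol_dump_latest.txt.gz"])
phase = 1 if args.dump else 2
```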
Phase 1 need only output keys to stdout, one per line. A person can then run ./fix_unicode_html_entities.py --dump /path/to/dump > /path/to/output/file to write the keys to the file. Important: Nothing other than keys should be written to stdout during this phase.
During phase 2, the records will need to be fetched from Open Library, modified, then saved. To do so, you'll have to set up an instance of Open Library and Infogami in the script. You can refer to this code, which sets up a different script in the appropriate manner. The init_signal_handler() call should also be included, as that will allow us to stop the script using ctrl-c (more on this later).
With that setup complete, you should be able to fetch a record by calling web.ctx.site.get(key). Calling the dict() method of the fetched object will give you a dict representation of the object's state. You can update this dict, then pass it to web.ctx.site.save() to save the changes.
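Under those assumptions, the per-record update step might look like the sketch below. The `web.ctx` calls are shown only as comments because they require a configured Infogami instance; the field list and save comment are illustrative, not the script's actual values:

```python
import html

def fix_record(record, fields=("name", "title", "subtitle")):
    # `record` is the dict() representation of a fetched Thing.
    # Returns (record, changed) after unescaping the listed string fields.
    changed = False
    for field in fields:
        value = record.get(field)
        if isinstance(value, str):
            fixed = html.unescape(value)
            if fixed != value:
                record[field] = fixed
                changed = True
    return record, changed

# In the real script (needs a configured OL/Infogami instance):
#   record = web.ctx.site.get(key).dict()
#   record, changed = fix_record(record)
#   if changed:
#       web.ctx.site.save(record, comment="Fix HTML-escaped unicode")

record, changed = fix_record(
    {"key": "/authors/OL1A", "name": "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;"}
)
```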
As you're iterating over these keys and updating objects, be sure to periodically check if was_shutdown_requested() returns True. This signifies that a shutdown signal was received, and the script should exit when this occurs. Both init_signal_handler and was_shutdown_requested must be imported from scripts.utils.graceful_shutdown.
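For illustration, here is a stdlib-only sketch of the pattern those two helpers presumably implement; the real implementations live in scripts.utils.graceful_shutdown, so this is an assumption about their behavior, not their code:

```python
import signal

_shutdown_requested = False

def init_signal_handler():
    # Convert SIGINT (ctrl-c) into a flag so the loop can exit cleanly.
    def _handler(signum, frame):
        global _shutdown_requested
        _shutdown_requested = True
    signal.signal(signal.SIGINT, _handler)

def was_shutdown_requested():
    return _shutdown_requested

init_signal_handler()
processed = 0
for key in ["/works/OL1W", "/works/OL2W"]:
    if was_shutdown_requested():
        break  # a signal arrived; stop between records, not mid-save
    processed += 1  # fetch, fix, and save would happen here
```

Checking the flag between records (rather than exiting from inside the handler) is what makes it safe to stop the script without leaving a half-saved record.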
I think that's it. Please let me know if you need additional information or clarifications.
Thanks for the detailed feedback, @mekarpeles and @jimchamp! I'll restructure the script into two phases as described.
I'll also fix the mypy error, move the file to scripts/migrations/, and remove the --limit argument. I'll push the updated version shortly.
Force-pushed 9f983fb to 23f97f6
Hi @mekarpeles and @jimchamp. I have updated the PR description to reflect the two-phase redesign based on the feedback received. Phase 1 has been tested locally. Phase 2 follows the pattern from scripts/migrations/write_prefs_to_store.py.
Force-pushed 89ee1c4 to a1a79a6
tfmorris left a comment
I recommend that the script check all string fields in the record rather than using a fixed list of fields. That way everything will get fixed in a single pass.
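One way to implement that suggestion, as a sketch (the helper name and the pre-check regex here are illustrative, not the script's actual code):

```python
import html
import re

# Matches numeric (&#1057;), hex (&#x421;), and named (&eacute;) references.
ENTITY_RE = re.compile(r"&#\d+;|&#[xX][0-9a-fA-F]+;|&[a-zA-Z][a-zA-Z0-9]*;")

def unescape_all_strings(value):
    # Recursively unescape every string in a JSON-like record, so no
    # per-type field list has to be maintained. Dict keys are schema
    # names and are left untouched.
    if isinstance(value, str):
        return html.unescape(value) if ENTITY_RE.search(value) else value
    if isinstance(value, list):
        return [unescape_all_strings(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_all_strings(v) for k, v in value.items()}
    return value

record = {
    "title": "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;",
    "authors": [{"name": "Caf&eacute;"}],
}
fixed = unescape_all_strings(record)
```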
I had a quick look at the results from the modified script running against the March dumps, and it found the following fields to fix:
16142 authors
[('name', 16050), ('personal_name', 241), ('bio', 3), ('title', 2), ('death_date', 1)]
72332 works
[('title', 72254), ('description', 62), ('subtitle', 22)]
143359 editions
[('subtitle', 90067), ('title', 85479), ('full_title', 77451), ('edition_name', 400), ('by_statement', 305), ('description', 89), ('pagination', 63), ('first_sentence', 45), ('notes', 14), ('copyright_date', 14), ('ocaid', 1)]
The OCAID seemed weird, so I had a look and it's from
/books/OL24271075M godwenttobeautys00ryla_0<ScRiPt>SENj(9613)<
which kind of looks like a failed HTML injection attack.
I spot checked a few of the other fields for sanity. The copyright dates are systemically bad metadata from Pressbooks which apparently added copyright holder names to the copyright date field (some of which include HTML encoded ampersands). The pagination field appears to mostly be daggers and double dagger symbols (as well as some replacement symbols which may have started off life as daggers that couldn't be interpreted). The edition name field is almost all the "feminine ordinal indicator" https://unicodeplus.com/U+00AA which looks like a superscript "a".
Overall everything looks good. The only weird thing I saw was an "&c;" (old-school "etc.") which passed the regex but isn't a valid entity, so it didn't get changed.
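That behavior is easy to confirm with Python's html module: a permissive entity regex matches "&c;", but html.unescape() only rewrites references that are actually in the HTML5 entity table, so the text is left untouched:

```python
import html
import re

entity_like = re.compile(r"&[a-zA-Z]+;")

assert entity_like.search("&c; and so forth")  # looks like an entity...
assert html.unescape("&c;") == "&c;"           # ...but isn't one, so it's left alone
assert html.unescape("&amp;c.") == "&c."       # a real named entity is converted
```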
Hi @tfmorris, thank you for the detailed review and for running the script against the March dumps. That breakdown of affected fields is really helpful. I'll implement both suggestions: checking all string fields instead of a fixed list, and adding the early exit before JSON parsing. I'll push the updated version shortly.
Hi @tfmorris, I've implemented both suggestions: the all-string-fields check and the early exit before JSON parsing.
@Chisomnwa Happy to help. You'll need to wait for the assigned reviewer to move this forward. One thing that I failed to notice is that there are a couple of fields, like description, that are not always plain strings.
Hi @jimchamp, while waiting for guidance on the description field dichotomy, I am going to work on adding unit tests for Phase 1 and improving the docstrings. I will also look at how other scripts handle the description object form and propose an approach. Will update the PR shortly.
I am unable to find any convenience methods for the values that are objects. The description on this page is an object, and is being rendered properly. I believe that this is somehow handled in Infogami when pages are rendered (perhaps by calling a Thing's method). I queried the production database and found that these properties may be saved as objects with a type field. You may want to check the type of these fields before unescaping them.
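A sketch of handling both forms; the `{"type": "/type/text", "value": ...}` shape is an assumption based on the comment above, so verify it against real records before relying on it:

```python
import html

def unescape_text_field(value):
    # Fields like description may be a plain string or an object such as
    # {"type": "/type/text", "value": "..."} (assumed shape; verify first).
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, dict) and isinstance(value.get("value"), str):
        value = dict(value)  # avoid mutating the caller's record
        value["value"] = html.unescape(value["value"])
        return value
    return value  # anything else is passed through untouched

plain = unescape_text_field("A &amp; B")
obj = unescape_text_field({"type": "/type/text", "value": "A &amp; B"})
```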
Thanks for querying the production database and sharing that list, @jimchamp. That is really helpful. I will use a type check to handle both the string and object forms.
for more information, see https://pre-commit.ci
918833d to
6b29ab8
Compare
Hi @jimchamp, I've updated the script.
Testing the fix locally
I opened a Python shell inside the web container and fetched the record from the local database to inspect the description field. The output confirmed that the field is stored in object form.
To confirm the fix actually works, I constructed a fake record with HTML entities injected into both a plain string field and an object-form description field.
Unit tests
All 12 tests pass.
Docstrings
Add two-phase script to detect and fix HTML-escaped Unicode in OL dumps
Part of #10909
Problem
During imports, Unicode text was incorrectly stored as HTML numeric character references. For example:
&#1057;&#1077;&#1088;&#1075;&#1077;&#1081; → Сергей
This affects approximately 15,000 author names, 72,000 work titles, and 148,500 edition titles.
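The transformation above is exactly what Python's standard library provides, which is a quick way to check the example:

```python
import html

stored = "&#1057;&#1077;&#1088;&#1075;&#1077;&#1081;"  # corrupted form found in the dumps
print(html.unescape(stored))  # Сергей
```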
What this PR does
Adds a two-phase migration script at scripts/migrations/fix_unicode_html_entities.py to detect and fix HTML entity encoding errors in author names, edition titles, and work titles.
Phase 1 — scans the OL dump file and outputs only the keys of affected records to stdout, one per line, so the output can be piped directly to a file:
Preview affected keys in the terminal:
Save affected keys to a file:
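For reference, phase 1 was invoked earlier in the thread as ./fix_unicode_html_entities.py --dump /path/to/dump > /path/to/output/file. The core of the scan might look like this sketch (the tab-separated dump layout of type, key, revision, last-modified, and JSON is the standard OL dump format; the helper name, pre-check regex, and sample data are illustrative):

```python
import json
import re

RAW_ENTITY_RE = re.compile(rb"&#\d+;")  # cheap early exit before JSON parsing

def affected_keys(lines):
    # Each dump line is tab-separated: type, key, revision, last_modified, JSON.
    for line in lines:
        if not RAW_ENTITY_RE.search(line):
            continue  # nothing entity-like on this line; skip the parse
        record = json.loads(line.split(b"\t", 4)[4])
        yield record["key"]

sample = [
    b'/type/author\t/authors/OL1A\t1\t2020-01-01\t{"key": "/authors/OL1A", "name": "&#1057;"}',
    b'/type/author\t/authors/OL2A\t1\t2020-01-01\t{"key": "/authors/OL2A", "name": "ok"}',
]
keys = list(affected_keys(sample))  # only keys are emitted, per the review note
```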
Phase 2 — reads the keys file from Phase 1, connects to the live OL database, fetches each record, applies html.unescape() to affected fields, and saves it back.
Phase 2 includes:
- Graceful shutdown on ctrl-c using init_signal_handler and was_shutdown_requested from scripts.utils.graceful_shutdown
- A --dry-run flag to preview changes without saving
Testing
Phase 1 has been tested locally against ol_dump_authors_latest.txt.gz and correctly outputs only affected keys to stdout. And here is what it looks like:
Phase 2 follows the pattern from scripts/migrations/write_prefs_to_store.py and is ready for review and testing against a live OL instance with maintainer guidance.
Notes
--dump triggers Phase 1, --keys triggers Phase 2.
Stakeholders
@mekarpeles @jimchamp @cdrini