
feat: use author identifiers in import API #10110

Merged
cdrini merged 47 commits into internetarchive:master from pidgezero-one:9448/feat/use-known-author-identifiers-in-import
Mar 27, 2025

Conversation

@pidgezero-one
Contributor

@pidgezero-one pidgezero-one commented Dec 3, 2024

This should be squash merged

Corresponding model update PR: internetarchive/openlibrary-client#419

This strictly expands the import schema.
It is not a breaking change.
Import records that don't include author IDs will continue to work as they currently do.

Closes #9448
Closes #9411

Technical

  • Adds support for author identifiers in import records.
    • If the import record for a book includes an author that has an "ol_id" property, the import API will attempt to find an author that matches that OL ID.
    • If the import record has no ol_id field, or has one that doesn't match any existing author, and the author entry has a "remote_ids" property, the import API will attempt to find the existing author that matches the most remote IDs in the record.
      • Q: Should specifying an ol_id that doesn't exist in our DB be an error that rejects the import?
    • If the record doesn't include or match any of the above, the author will continue to be matched by name and birth/death date, which is how the import API already operates in production.
  • Wikisource script updates:
    • Fixes incorrect birth/death date parsing.
    • Books with no identified author, title, or publish date will not be included in the jsonl output.
    • The name formatting helper function is only used when the author's name came specifically from wikisource and not from wikidata.
      • Most of the time, WS import records produced by this script will strictly use author info from WD. However, not every WD item corresponding to a WS book is properly linked to an author. In those cases, the script falls back to getting author information from the WS API response instead. WS data is highly unstructured, so the name formatter is used only in those cases.
    • Moved dependencies specific to the wikisource script into a separate requirements.txt file that is intended to be installed only temporarily, since they're not required for OL to run. Instructions are included for how to run the script with this consideration.
    • Adds author identifiers to its output records, since it uses the Wikidata API, which includes OL IDs and most other remote_ids.
      • WS was the easiest source for me to use for generating records that had enough information to test these additions with. Nothing in the updated author matching logic is actually specific to WS, except for the next bullet point:
    • Wikisource records are exempt from being rejected for having a 1900 publish date. I'm not sure whether this is a good idea and would like feedback on it.
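
The author-matching order described under "Technical" can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Author`, `find_author`, and the in-memory `known` list are hypothetical stand-ins, the key format is illustrative, and the name fallback omits the birth/death date comparison the real importer performs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Author:
    """Hypothetical stand-in for a stored author record."""
    key: str
    name: str
    remote_ids: dict[str, str] = field(default_factory=dict)


def find_author(record: dict, known: list[Author]) -> Author | None:
    """Resolve an imported author: OL ID first, then remote IDs, then name."""
    # 1. An explicit Open Library ID wins if it matches an existing author.
    if ol_id := record.get("ol_id"):
        for author in known:
            if author.key == ol_id:
                return author

    # 2. Otherwise pick the author sharing the most remote IDs with the record.
    incoming = record.get("remote_ids") or {}
    best, best_count = None, 0
    for author in known:
        count = sum(
            1
            for id_name, value in incoming.items()
            if author.remote_ids.get(id_name) == value
        )
        if count > best_count:
            best, best_count = author, count
    if best is not None:
        return best

    # 3. Fall back to name matching (date comparison omitted in this sketch).
    for author in known:
        if author.name == record.get("name"):
            return author
    return None
```

A record whose ol_id is unknown but whose remote_ids overlap with a stored author resolves at step 2; a record with neither field behaves exactly as it did before this change.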

Issues:

  • Importing books succeeds, and matching authors are found and used as expected. However, navigating to the author's page from the new book's page did not initially show the new book there. This was Solr updater delay: the book appeared after a while!

Testing

I put the entire output of the wikisource script into /import/batch/new.

Stakeholders

@cdrini @Freso

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

@pidgezero-one marked this pull request as ready for review February 11, 2025 17:40
@pidgezero-one changed the title from "[WIP] feat: use author known identifiers in import API" to "feat: use author identifiers in import API" Feb 11, 2025
Collaborator

@cdrini cdrini left a comment

Open questions:

  • key vs ol_id in author import record
  • remote_ids vs identifiers in author import record
    • ^ For both of these, since there are subtle differences between e.g. remote_ids (authors, Dict[str, str]) and identifiers (works/editions, Dict[str, list[str]]), I think it might be easiest if we reuse the shape of our existing Open Library records. So remote_ids: dict[str, str] for authors, and key to hold the Open Library key.
  • Should any identifier conflicts cause an import error?
    • As a first stab, let's err on the side of caution and raise an error on any identifier conflict.
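
To make the shape difference concrete, here is a small typed sketch of an import record using the proposed key / remote_ids naming. The class names `AuthorImport` and `EditionImport` and all field values are hypothetical placeholders chosen for illustration; they are not the actual import schema.

```python
from typing import TypedDict


class AuthorImport(TypedDict, total=False):
    name: str
    key: str                    # Open Library author key (placeholder value below)
    remote_ids: dict[str, str]  # authors: exactly one value per identifier type


class EditionImport(TypedDict, total=False):
    title: str
    identifiers: dict[str, list[str]]  # editions: several values per identifier type
    authors: list[AuthorImport]


record: EditionImport = {
    "title": "An Example Book",
    "identifiers": {"example_source": ["123", "456"]},
    "authors": [
        {
            "name": "Jane Doe",
            "key": "/authors/OL1A",
            "remote_ids": {"wikidata": "Q1"},
        }
    ],
}
```

The point of the sketch is only the nesting: an author's remote_ids maps each identifier name to a single string, while an edition's identifiers maps each name to a list.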

return authors

# Look for OL ID first.
if (key := author.get("ol_id")) and (
Collaborator

We might want to name this one as key to be consistent with our book/thing records. Having the import endpoint mirror the shape of our core book records is convenient.

Suggested change
- if (key := author.get("ol_id")) and (
+ if (key := author.get("key")) and (

Contributor Author

@Freso do you have any strong opinions on this? ^

(see also Drini's comment above about remote_ids vs identifiers!)

@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Feb 21, 2025
Co-authored-by: Drini Cami <cdrini@gmail.com>
@github-actions github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Mar 4, 2025
pidgezero-one and others added 2 commits March 4, 2025 08:37
Co-authored-by: Drini Cami <cdrini@gmail.com>
Co-authored-by: Drini Cami <cdrini@gmail.com>
@pidgezero-one pidgezero-one requested a review from cdrini March 4, 2025 13:57
@pidgezero-one
Contributor Author

I've made the requested changes, but importing now fails with TypeError("cannot pickle 'Client' object"). I have no idea what this means. 😵‍💫 Will do some investigation later as time permits...

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Mar 5, 2025
"""Returns the author's remote IDs merged with a given remote IDs object, as well as a count for how many IDs had conflicts.
If incoming_ids is empty, or if there are more conflicts than matches, no merge will be attempted, and the output will be (author.remote_ids, -1).
"""
output = {**self.remote_ids}
Contributor Author

Ended up having to revert to this dict unpacking: self.remote_ids is being treated as a Thing rather than a dict for some reason (even though every other operation on it in this codebase suggests it should be a dict), so deepcopy fails. I'm stumped on why that's happening.
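
For readers following along, a rough sketch of a merge built on that shallow copy is below. It is one possible reading, not the method in this PR: `ThingLike` and `RemoteIdConflictError` are hypothetical names, raising on a conflict and returning the number of matching identifiers are both assumptions, and only the `{**...}` copy and the `-1` result for empty input are taken from the diff and its docstring.

```python
class RemoteIdConflictError(ValueError):
    """Hypothetical error raised when an identifier has two different values."""


class ThingLike:
    """Hypothetical mapping-like wrapper; NOT Open Library's Thing class."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, name):
        return self._data[name]


def merge_remote_ids(existing, incoming):
    """Return (merged IDs, number of matching IDs), or (copy, -1) if incoming is empty."""
    # Dict unpacking only needs keys() and __getitem__, so it works on
    # mapping-like objects as well as plain dicts, and yields a plain dict.
    output = {**existing}
    if not incoming:
        return output, -1

    matches = 0
    for id_name, value in incoming.items():
        if id_name in output:
            if output[id_name] != value:
                # Assumption: treat any disagreement as an error.
                raise RemoteIdConflictError(
                    f"{id_name}: {output[id_name]!r} != {value!r}"
                )
            matches += 1
        else:
            # New identifier type: add it to the merged result.
            output[id_name] = value
    return output, matches
```

Because the copy is shallow it is only safe here on the assumption that every value is an immutable string, which is the dict[str, str] shape discussed earlier in the review.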

@cdrini cdrini force-pushed the 9448/feat/use-known-author-identifiers-in-import branch from 51705ac to 5bfaada Compare March 27, 2025 15:57
@cdrini cdrini force-pushed the 9448/feat/use-known-author-identifiers-in-import branch 2 times, most recently from 878b527 to e54bc88 Compare March 27, 2025 16:13
@cdrini cdrini force-pushed the 9448/feat/use-known-author-identifiers-in-import branch from e54bc88 to 4895023 Compare March 27, 2025 16:17
Collaborator

@cdrini cdrini left a comment

Lgtm! We tested on a call and importing is working like a charm! We've got some tweaks/fixes to the wikisource import script which we'll push up in a separate PR. Great work + perseverance on this one @pidgezero-one .

Note I decided to run with using the remote_ids / key for consistency with our type scheme, but we can always revisit.
Also decided to have match_remote_ids throw an error if there's a conflict for now, but this could be a mistake we'll want to change later 😁

@cdrini cdrini merged commit 4b7ea29 into internetarchive:master Mar 27, 2025
4 checks passed

Labels

Needs: Response Issues which require feedback from lead

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import endpoint should allow for any (known) author identifiers
Import endpoint should allow for Open Library author identifiers

2 participants