bibliographic post-processing improvement #49

@alliyya

Suggestion from Susan to reduce the number of duplicate Works/Expressions:

  • If after processing we end up with
    • Work Title A and Author B with Date C and URI D, plus associated Expression (Record 1)
    • Work Title A and Author B with Date D, which is later than Date C, and URI E, plus associated Expression (Record 2)
  • Then replace URI E with URI D in all triples
  • And delete the Work with URI E
  • And so on for any additional Works whose author/title match those of URI D
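The steps above can be sketched roughly as follows. This is a minimal illustration only, not the project's conversion code: it assumes Works are available as hypothetical `(uri, title, author, date)` tuples with ISO-style date strings, and that triples are plain `(subject, predicate, object)` tuples rather than a real RDF graph. The function names are made up for this example.

```python
from collections import defaultdict


def plan_merges(works):
    """Map each duplicate Work URI to the earliest-dated Work URI
    that shares its (title, author).

    `works` is a hypothetical iterable of (uri, title, author, date).
    """
    groups = defaultdict(list)
    for uri, title, author, date in works:
        groups[(title, author)].append((date, uri))

    replacements = {}
    for members in groups.values():
        # Earliest date first; identical dates fall back to URI order.
        members.sort()
        _, keep = members[0]
        for _, uri in members[1:]:
            replacements[uri] = keep
    return replacements


def apply_merges(triples, replacements):
    """Delete duplicate Works and repoint references to the kept Work."""
    merged = set()
    for s, p, o in triples:
        if s in replacements:
            # Assumption: deleting a Work means dropping every triple
            # it is the subject of (title, timespan, genre, ...).
            continue
        merged.add((s, p, replacements.get(o, o)))
    return merged
```

With this reading, an Expression that realized URI E ends up realizing URI D, and everything stated directly about URI E is dropped, which is exactly where the genre question below comes in.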

This will likely cause too few Works to be created in some cases (e.g. poets who repeatedly published volumes titled Poems, which can only be distinguished by year). So we might want to exclude certain titles, such as Poems, Collected Poems, Works, Complete Works, Collected Works, Essays, Collected Essays, Prose Works, Collected Prose. Grabbing the most frequently recurring words in titles may help us decide on additional ones; this list is just off the top of my head.
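A rough sketch of how the exclusion list and the word-frequency check might look. The set of titles is taken from the list above; the normalization rule (lowercase, strip punctuation) and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Titles from the discussion above that should never trigger a merge.
GENERIC_TITLES = {
    "poems", "collected poems", "works", "complete works",
    "collected works", "essays", "collected essays",
    "prose works", "collected prose",
}


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", title.lower()).split())


def is_generic(title):
    """True if the title is on the exclusion list."""
    return normalize_title(title) in GENERIC_TITLES


def frequent_title_words(titles, top=20):
    """Most common words across titles, as candidates for the list."""
    counts = Counter()
    for title in titles:
        counts.update(normalize_title(title).split())
    return counts.most_common(top)
```

The frequency output is only a starting point for a human decision; common words such as articles would need to be filtered out by eye or with a stop list.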

Does this seem feasible? (I have to admit that one thing I can't get my head around is how we deal with diffs as the files for these things change.)

This looks more feasible than altering the conversion process, since the current scripts are a placeholder until CWRC has its new schema in place instead of MODS.

This will likely be a fairly slow process, since we have so many records and there would be quite a few triples to delete: every work/expression has title and timespan triples associated with it.

Potentially, I think we could keep the expression from Record 2 and have it realize the work from Record 1. But would that be too much duplication, as you'd still have fairly similar expressions? Expressions might have different edition information associated, which might make a difference.

Related questions:

  • What if the date is identical?
  • What if multiple authors are listed in one record but not the other?
  • What happens to genres attached to Record 2? Will they get merged?
  • How will this impact Writing extraction, when an entry references a merged record? Doing lookups with every mention of a work would be expensive.
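On the last question, one possible way to keep lookups cheap, assuming the merge step records a mapping from each deleted Work URI to the URI that replaced it: Writing extraction could consult that mapping as an in-memory dictionary rather than querying the graph for every mention. The `replacements` mapping and `resolve` function are hypothetical names for this sketch.

```python
def resolve(uri, replacements):
    """Follow the merge mapping until reaching a URI that was kept.

    `replacements` is a hypothetical dict of deleted URI -> kept URI.
    Chains (E -> D, D -> C) are followed; a cycle raises an error.
    """
    seen = set()
    while uri in replacements:
        if uri in seen:
            raise ValueError(f"cycle in merge mapping at {uri}")
        seen.add(uri)
        uri = replacements[uri]
    return uri
```

Each dictionary lookup is constant time on average, so the cost per mention stays small even with many records; whether the mapping fits the extraction workflow is a separate question.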

Next steps:

  • sample queries to get at similar works (determine how many records this could reduce)
  • further discussion about the above questions and results
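As a starting point for the first step, the potential reduction could be estimated before touching any triples. This sketch assumes the Works have already been pulled out of the graph as hypothetical `(uri, title, author, date)` tuples; the real query against the dataset is not shown here.

```python
from collections import Counter


def count_reducible(works):
    """Return (groups_with_duplicates, works_that_would_be_removed).

    `works` is a hypothetical iterable of (uri, title, author, date).
    A group is every Work sharing the same (title, author).
    """
    sizes = Counter((title, author) for _, title, author, _ in works)
    dup_groups = sum(1 for n in sizes.values() if n > 1)
    # One Work per group survives; the rest would be merged away.
    removable = sum(n - 1 for n in sizes.values() if n > 1)
    return dup_groups, removable
```

Running this with and without generic titles such as Poems would show how much the exclusion list changes the result.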

Labels

  • Conversion: CWRC - This is related to the conversion process using the CWRC ontologies. (Classic Branch)
  • project:bibliography extraction - related to extraction of bibliography entries
  • type:idea - Idea that should be discussed