Releases: kingsdigitallab/eb-pre
1.1
May 2025
Main changes in this new release.
New seeds for domains
The embedding of the domains has been recalculated to reflect the changes to the seed keywords received on 2025-04-14.
All entries have been reclassified based on the proximity to those new domain embeddings.
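The reclassification step can be pictured with a minimal sketch. It assumes a domain embedding is the mean of its seed embeddings and that each entry goes to the single domain with the highest cosine similarity; neither detail is stated in these notes. The function names, the toy vectors and the `hypothetical_domain` label are illustrative only.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors (assumed way to build a domain embedding)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_entry(entry_embedding, domain_embeddings):
    """Return the name of the domain whose embedding is closest to the entry."""
    return max(
        domain_embeddings,
        key=lambda name: cosine_similarity(entry_embedding, domain_embeddings[name]),
    )


# Toy 3-dimensional vectors; real embeddings would come from a language model.
domain_embeddings = {
    "fine_arts": mean_vector([[1.0, 0.1, 0.0], [0.8, 0.1, 0.0]]),
    "hypothetical_domain": mean_vector([[0.0, 0.2, 0.9]]),
}
closest_domain = classify_entry([0.8, 0.3, 0.1], domain_embeddings)
```

Under these assumptions, changing the seeds only changes `domain_embeddings`, so every entry has to be compared again against the new vectors.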
Broken references to the TEI and TXT files
The user interface now points to the correct locations of the TEI and text files for each entry. The links had fallen out of sync after folders were renamed in the TEI corpus repository. The folders used by this release are eb07/TXT (and eb07/XML) and eb09/TXT_v1 (and eb09/TEI_v1).
Before re-indexing the search we also applied a few fixes on a local copy of the corpus:
- eb07/XML/l12/ was renamed eb07/XML/l13/ to match the metadata in the TEI, which refers to the 13th volume, and to match the corresponding files under eb07/TXT/l13/
- In eb07/XML/p16/kp-eb0716-069301-5876.xml, line 6, we manually added the missing title <title level="a" type="main">PAINTING</title> because PAINTING is used as a seed for the fine_arts domain.
- We also manually corrected the seed keyword "DANCE" to "DANCE, or Dancing" to match the title of the entry in the 7th edition. (The title is "DANCE" in the 9th edition.)
Seeds were treated as words/ngrams
We fixed a bug where all the domain seeds were treated as ngrams or words rather than documents. This means that the previous classification of the domains, done on 2024-07-09, was not valid. In that classification PAINTING (the document) was treated as painting (the ngram) when computing the embedding for fine_arts.
This affected the domains assigned to each entry in the keyword search interface.
How this has been fixed:
Seed terms spelled entirely in lowercase now refer to the ngram, whereas other terms (with at least one uppercase letter) refer to the document.
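The letter-case rule can be written as a small check. This is a sketch of the rule as described here, not the project's actual code; the function name `seed_kind` and the returned labels are hypothetical.

```python
def seed_kind(term):
    """Return "document" if the seed has any uppercase letter, else "ngram"."""
    # Hypothetical helper: mirrors the rule in these notes, not the real implementation.
    return "document" if any(ch.isupper() for ch in term) else "ngram"


# Seeds mentioned in these notes, plus their lowercase counterparts.
kinds = {term: seed_kind(term) for term in ["PAINTING", "painting", "DANCE, or Dancing"]}
```

With this rule, PAINTING resolves to the entry titled PAINTING, while painting is still matched as an ngram.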
The keyword search interface has a drop-down in the bottom left to select the domain definitions:
- 2025-04-30: the classification based on the new seed terms
- 2024-07-09-bugged: the wrong classification for the seed terms received in 2024, where all terms are treated as ngrams
- 2024-07-09-fixed: the correct classification for the seed terms received in 2024, where each term is treated as a document or an ngram depending on its letter case
- 2023: the classification from the 2023 seed words. This one hasn't changed, as its terms were not meant to point to documents, only to ngrams.
Software dependencies
To make the prototype less dependent on the availability of external cloud services, all resources imported by the user interface are now copied into the microsite hosted on GitHub. This means that a failure of a Content Delivery Network (CDN) to respond no longer affects the loading of the prototype.
v1.0
feat(ui): added max length filter to the semantic search UI.
