
Commit 022a558

2024 Meeting Notes

1 parent c3ae71c, commit 022a558
11 files changed: +559 −88 lines

_toc.yml (+8)

@@ -7,6 +7,14 @@ parts:
 - caption: Meetings
   chapters:
   - file: meetings/about
+  - file: meetings/2024-12-10
+  - file: meetings/2024-11-12
+  - file: meetings/2024-10-08
+  - file: meetings/2024-09-10
+  - file: meetings/2024-08-13
+  - file: meetings/2024-06-11
+  - file: meetings/2024-05-07
+  - file: meetings/2024-04-09
   - file: meetings/2024-03-12
   - file: meetings/2024-02-13
   - file: meetings/2024-01-09

meetings/2021-09-14.md (-52)

@@ -4,60 +4,8 @@ title: ai4lam Metadata/Discovery WG Monthly Meeting
 
 9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
 
-**Connection Information**
-
-
-Topic: AI-LAM Metadata Working Group
-
-
-Time: This is a recurring meeting. Meet anytime
-
-
-Join from PC, Mac, Linux, iOS or Android: [https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09)
-
-
-Password: 306295
-
-
-Or iPhone one-tap (US Toll): +18333021536,,91421044393# or +16507249799,,91421044393#
-
-
-Or Telephone:
-
-
-Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free)
-
-
-
-
-
-Meeting ID: 914 2104 4393
-
-
-Password: 306295
-
-
-International numbers available: https://stanford.zoom.us/u/aeoeCDrpd
-
-
-Meeting ID: 914 2104 4393
-
-
-Password: 306295
-
-
-
-
-
-Password: 306295
-
-
-Zoom recording: https://stanford.zoom.us/rec/share/QaghrbGoKoStuPm36LciC8tXv_41vQQWD8ZnfsqMAbPo3mkzjICl02KM8tqjC-5l.AfmJLKlWbkyPksgP
-
 **Attending**
 
-
-
 * Tim Thompson (Yale)
 * Jeremy Nelson (Stanford)
 * Erik Radio (CU Boulder)

meetings/2021-10-12.md (-36)

@@ -4,42 +4,6 @@ title: ai4lam Metadata/Discovery WG Monthly Meeting
 
 9 AM California \| 12 PM Washington DC \| 5 PM UK \| 6 PM Oslo & Paris
 
-<<<<<<< HEAD
-=======
-**Connection Information**
-
-Topic: AI-LAM Metadata Working Group
-
-Time: This is a recurring meeting. Meet anytime
-
-Join from PC, Mac, Linux, iOS or Android:
-[*https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09*](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09)
-
-Password: 306295
-
-Or iPhone one-tap (US Toll): +18333021536,,91421044393# or
-+16507249799,,91421044393#
-
-Or Telephone:
-
-Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536
-(US, Canada, Caribbean Toll Free)
-
-Meeting ID: 914 2104 4393
-
-Password: 306295
-
-International numbers available: https://stanford.zoom.us/u/aeoeCDrpd
-
-Meeting ID: 914 2104 4393
-
-Password: 306295
-
-SIP: 91421044393\@zoomcrc.com
-
-Password: 306295
->>>>>>> 9c0c03e (Adds remaining 2021 meetings)
-
 **Attending**
 
 - Jeremy Nelson (Stanford)

meetings/2024-04-09.md (new file, +57)

title: ai4lam Metadata/Discovery WG Monthly Meeting

# Apr 9, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

**Attending**

* Jeremy Nelson (Stanford)
* Andrew Elliot
* Sara Amato
* Joy Panigabutra-Roberts (University of Tennessee)
* Sarah Mann
* Erik Radio (University of Colorado)
* Ian Bogus (ReCAP)
* Craig Rosenbeck

## Helpful Links

* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library)

## Project Documents and Data

* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing)
* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing)

## Agenda

* Announcements
  * Next meeting will be on May 7, 2024, for Abigail Potter's presentation on AI at the Library of Congress
* Joy Panigabutra-Roberts' presentation on AI authors and performers in the context of identity management
  * Head of Cataloging at the University of Tennessee Libraries
  * What about AI Authors and a Robot Comedian?
    * Beta Writer and Steffen Pauly, *Lithium-Ion Batteries: A Machine-Generated Summary of Current Research*, 2022. From Artificial Intelligence in Libraries and Publishing
    * Brent Katz, Josh Morgenthau, and Simon Rich, *I Am Code: An Artificial Intelligence Speaks*. Poems by code-davinci-002. 2023. New York, NY: Back Bay Books
    * Jon the Robot (comedian)
  * PCC does not consider AIs to be authors
    * Consider a named AI or generative computer program used to create a resource to be a related work, not an agent…
  * AstroLLaMA-Chat - [https://huggingface.co/universeTBD](https://huggingface.co/universeTBD)
    * The first open-source conversational AI tool tailored for the astronomy community: [https://doi.org/10.48550/arXiv.2401.01916](https://doi.org/10.48550/arXiv.2401.01916)
  * Excerpt from YouTube (https://www.youtube.com/watch?v=OkCoTixo-MM) - "A Bot and Costello - Let's Power the Whole Thing Off"
  * The philosophical and legal implications are greater than cataloging these AI-generated works
  * Questions:
    * Amazon-generated books?
    * A case of Wikipedia articles from a predatory publisher: declined to catalog. Wikipedia articles have varied quality
      * It comes down to collection development to screen out poor-quality works
    * Attribution, but include a disclaimer that the work was AI-assisted or AI-generated
      * Publishers and authors may not claim attribution if the work was generated by AI
    * How do you screen for these works? Be up front about how you are using generated content
      * It is not too hard now to tell if a work is AI-generated; aesthetic judgments by the cataloger. What will cataloger training look like in 5 years?
      * Poor quality: don't catalog
      * It is the responsibility of the author and publisher to disclose up front; transparency is needed
    * Have you run into this issue at your institutions?
      * The OCLC RLP Metadata Managers Focus Group will discuss a topic related to AI and cataloging/metadata with both domestic and international institutions next week
      * Creating cataloging records for ebooks by converting publisher data; Ex Libris is training a model on PDFs to extract publisher information for records
      * Another presentation Joy attended recently: a library ran an experiment using ChatGPT with OCLC records; once you feed the data to the model, the data is in OpenAI's training set

meetings/2024-05-07.md (new file, +174)

title: ai4lam Metadata/Discovery WG May Monthly Meeting

# May 7, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

**Attending**

* Jeremy Nelson, Stanford
* Abigail Potter, Library of Congress
* Caroline Saccucci, Library of Congress
* Julia Kim, Library of Congress
* Erik Radio, Colorado
* Sara Amato, Eastern Academic Scholars' Trust
* Ian Bogus, ReCAP

## Helpful Links

* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library)

## Project Documents and Data

* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing)
* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing)

## Agenda

* Announcements
* Presentation: "Exploring Computational Description: An update on Library of Congress experiments to automatically create MARC metadata from ebooks" by Abigail Potter and Caroline Saccucci
  * Update on the FF 2023 presentation
  * Cataloging and LC Labs are collaborating on this experiment
  * LC Labs AI Planning Framework
    * Understand, Experiment, Implement - Governance and Policy
    * Come in with a quality baseline and then implement; have robust auditing and shared quality standards
    * Tools
      * Articulating principles
      * Use-case risks & benefits
      * Domain profiles
      * Data assessment
      * Acquisitions
  * Goal of exploring machine learning
    * First task order in the Digital Innovation IDIQ, which is scoped for experiments in AI and ML
    * What are examples, benefits, risks, costs, and quality benchmarks?
    * What technologies and workflow models are most promising to support metadata creation and assist with cataloging workflows?
    * Similar activities being employed by other organizations
    * This is an experiment, not building toward production
  * E-books with cataloging records used to train models
    * Ground truth for testing
      * CIP: 13,802 items
      * Open Access: 5,825 items
      * E-Deposit ebooks: 403 items
      * Legal reports: 3,750 items
    * Test data
      * Existing catalog records for the ebooks
      * Test models against the test data
      * Generate performance reports
    * Target data
      * Uncataloged ebooks
      * Run most models
  * What are we testing?
    * Token classification - extracting specific bibliographic metadata from the text, such as title or author name
    * Text classification - characterizing the text as a whole, for example into subject headings or genres
    * Models: BERT, spaCy, GPTs with variations (NLP, NER, LLM, transformer and non-transformer models); vendor-picked models; 2022 was before ChatGPT
  * Results: token classification for all fields
    * F1 score for each of the models, ranked from highest (best) to lowest (worst)
    * ~80% for some fields
  * Results: token classification for one field
    * Fields: 700, 655, 264, 245
    * Expectation ~80%; quality standard ~95%
    * Exceeded ~80% for some fields (ISBN, author, title; the models did pretty well identifying these fields)
    * Date was not as easily identified; it may be the vendor's setting for the date parameter, not sure if it is the model
    * LCCN: 100%; the list of records almost always had the LCCN, and the copyright page in an e-book would have the LCCN
  * Early metrics analysis: matches and non-matches
    * Results after applying Annif
    * Green-shaded areas are exact (1-1) matches to the MARC XML
    * White rows are ML model output that wasn't an exact match
      * Didn't put in "fiction"; nothing in the text indicated that it is fiction
      * Ability to get a URL with freeform subdivisions
      * The ones that have fiction are established with a URL; some, maybe, have no URL
    * The combination of positive and negative hits raises questions about relevancy and accuracy: how well does the model do?
  * Assisted cataloging HITL (human-in-the-loop) prototype
    * The vendor met with a group of catalogers for a workshop conversation on how this work benefits catalogers
    * Not just the name of the author, but the name of the author in the authority file
      * Authorized form of name or concept
    * Two low-fidelity prototypes
      * A number of tabs for the cataloger to go through
      * Abstracts and summary
        * Extracted summary taken directly from the text of the book; the machine picks the main sentences
        * Abstract or summary from scanning through the text
      * The cataloger selects model suggestions: an opportunity for the cataloger to react to what the machine is suggesting, and to see how well the ML workflow suggests subjects and names in the authority records
      * In both cases, narrower and broader terms from LCSH, Wikidata, and other linked-data resources
      * How well did catalogers appreciate the suggestions? What is and is not beneficial? Provide feedback to the model
  * Assessing quality
    * How to assess quality for these tools in different ways
    * F1 score: highest-performing scores by field
    * Human-in-the-loop prototypes to increase the quality of the records
    * Contractor qualitative scoring of the models
      * Reliability
      * Compute cost
      * Training data
      * Activity
      * Documentation
      * Developer
      * Compliance with security and privacy considerations
    * Overall quality of the service; program evaluation factors
      * Likelihood of maintaining quality over time
      * Reasonable cost
      * Benefits to staff/catalogers
      * Benefits to users
      * Benefits to the organization
      * Fair and equitable risks
      * Risks to users
      * Risks to the organization
      * Security risks
      * Privacy risks
      * Authenticity risks
      * Reputation risks
      * Compliance risks
    * Want to collaborate with other organizations
  * Challenges
    * Unbalanced data: a long tail of subject terms
      * Create well-balanced training data to train and test models
      * Exploring correcting over-representation of the English language and other bias
    * Short-term timeline: several NLP tools have been in development for over 20 years; can't reach the state of the art
    * Need to develop quality standards and policies for these approaches
    * Stability of tools, unknown costs, tooling lock-in
  * What did I learn?
    * There are two types of ML classification: text and token
    * One of the models, Annif, had some success with text classification, predicting subjects
    * Some models were very successful at token classification, predicting authors, titles, and identifiers such as ISBN and LCCN
    * ML requires lots of training data to improve results
      * ½ of the training data contained similar patterns of LCSH
      * ½ of the training data contained unique LCSH
    * Catalogers reacted more positively to the results than expected
    * Cataloger-assisted prototypes were really cool and have potential
    * Catalogers are interested in ML and seem less afraid of it than expected
    * LLMs (ChatGPT) show promise but need more experimentation
    * Room for both HITL and LLM approaches
  * What do I still want to learn?
    * Would faceted subject headings (post-coordinated) be more successful than subject strings à la LCSH (pre-coordinated) in ML processes?
    * Are subject categories more successfully cataloged using ML?
    * Could a model be trained to accurately predict LC Classification and/or Dewey Decimal Classification?
    * What other metadata elements can be extracted from ebooks?
    * Can LLMs like ChatGPT be trained to predict accurate bibliographic descriptions?
    * What are the ML policies/decisions that the Library needs to make? E.g.:
      * Copyright concerns
      * Accuracy vs. relevance
      * Training data biases
  * Next steps
    * ECD2: Toward Piloting Computational Description
      * Where are the most effective combinations of automation and human intervention in generating high-quality catalog records that will be usable at the Library of Congress?
      * What are the benefits, risks, and requirements for building a pilot application for ML-assisted cataloging workflows?
    * ECD3: Extending Experiments to Explore Computational Description
      * How can ML methods support the CIP cataloging workflow?
      * How can CIP metadata generated through ML be ingested and used in BFDB?
      * How can additional elements added to BIBFRAME descriptions improve the quality and usefulness of the metadata compared to ECD1 and ECD2?
      * Experiment with 3 ML models
      * Use prepublication galleys, all in PDF, some with minimal text provided
      * Create BIBFRAME descriptions that can be loaded into the test BFDB
      * Require more metadata beyond the 6 fields required in task order 1
      * Allow for cataloger review in the BIBFRAME Editor
      * Extension of the cataloger-assistant prototypes
  * Questions:
    * Other LLMs? Currently using ChatGPT, maybe Claude; tested Llama, which was almost as good as ChatGPT 3.5; more permissive with the later GPT-4.0
    * Timeline for next steps and BIBFRAME work? The current task order, working on prototypes, ends in August; the BIBFRAME work begins in August 2024
    * Can you expand on faceted subjects as potentially more successful? Extreme text classification breaks strings down; going to use linked data, and URLs are necessary for BIBFRAME. There is controversy about whether pre-coordinated vs. post-coordinated strings are useful for users. If there are only one or two patterns of data across the training data, that is an interesting question for policy
    * AI metadata clean-up? At the National Library of Medicine, Alvin Stockdale used ChatGPT to format a TOC properly for a MARC record, giving prompts to format the TOC. Running a script to fix things is more automation than ML: in an automated find/replace nothing is inferred, while ML makes choices about what it thinks the value should be. The difference is in the application of the ML method
    * Prototype integrated with FOLIO? Ebook cataloging happens in a number of ways; will the same ML method produce the same quality? Getting results, and very open to plans for integrating with FOLIO. Not serialized in MARC format, in a weird space next to the string. The BIBFRAME description is going into a BIBFRAME test instance, then migrating to a FOLIO instance. Not near a pilot phase, but experiments: learning about requirements and what is possible, creating future systems or creating requirements for future systems. Not dealing with production systems; experiments with prototypes. Fortunate to work with LC Labs and cataloging on experiments with ML methods
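The evaluation described in these notes (comparing model-extracted fields against existing catalog records, then ranking per-field F1 scores from best to worst) can be sketched roughly as follows. This is a minimal illustration under assumed inputs, not the LC Labs/vendor implementation; the sample records and the exact-match scoring rule are hypothetical.

```python
# Sketch of per-field precision/recall/F1 scoring for extracted MARC
# fields, in the spirit of the token-classification evaluation above.
# Sample data and exact-string matching are illustrative assumptions.

def field_scores(truth: dict, predicted: dict) -> dict:
    """Score predicted field values against ground-truth catalog records.

    truth/predicted map a MARC tag (e.g. "245") to the values extracted
    across items; a value counts only as an exact string match.
    """
    scores = {}
    for tag in truth:
        true_vals = set(truth[tag])
        pred_vals = set(predicted.get(tag, []))
        tp = len(true_vals & pred_vals)  # exact matches
        precision = tp / len(pred_vals) if pred_vals else 0.0
        recall = tp / len(true_vals) if true_vals else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = round(f1, 3)
    return scores

truth = {"245": ["I Am Code", "Lithium-Ion Batteries"],
         "100": ["Katz, Brent", "Beta Writer"]}
predicted = {"245": ["I Am Code", "Lithium Ion Batteries"],  # one near-miss
             "100": ["Katz, Brent", "Beta Writer"]}

# Rank fields from highest (best) to lowest (worst) F1, as in the report.
ranked = sorted(field_scores(truth, predicted).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('100', 1.0), ('245', 0.5)]
```

Exact string matching is the simplest possible rule; the notes suggest the real experiments also had to handle near-misses (e.g. dates), where a stricter or looser matcher changes the score considerably.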
