
Commit 022a558

2024 Meeting Notes

1 parent c3ae71c, commit 022a558
11 files changed: +559 −88 lines

_toc.yml (+8)

@@ -7,6 +7,14 @@ parts:
 - caption: Meetings
   chapters:
   - file: meetings/about
+  - file: meetings/2024-12-10
+  - file: meetings/2024-11-12
+  - file: meetings/2024-10-08
+  - file: meetings/2024-09-10
+  - file: meetings/2024-08-13
+  - file: meetings/2024-06-11
+  - file: meetings/2024-05-07
+  - file: meetings/2024-04-09
   - file: meetings/2024-03-12
   - file: meetings/2024-02-13
   - file: meetings/2024-01-09

meetings/2021-09-14.md (-52)

@@ -4,60 +4,8 @@ title: ai4lam Metadata/Discovery WG Monthly Meeting
 
 9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
 
-**Connection Information**
-
-
-Topic: AI-LAM Metadata Working Group
-
-
-Time: This is a recurring meeting. Meet anytime
-
-
-Join from PC, Mac, Linux, iOS or Android: [https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09)
-
-
-Password: 306295
-
-
-Or iPhone one-tap (US Toll): +18333021536,,91421044393# or +16507249799,,91421044393#
-
-
-Or Telephone:
-
-
-Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free)
-
-
-
-
-
-Meeting ID: 914 2104 4393
-
-
-Password: 306295
-
-
-International numbers available: https://stanford.zoom.us/u/aeoeCDrpd
-
-
-Meeting ID: 914 2104 4393
-
-
-Password: 306295
-
-
-
-
-
-Password: 306295
-
-
-Zoom recording: https://stanford.zoom.us/rec/share/QaghrbGoKoStuPm36LciC8tXv_41vQQWD8ZnfsqMAbPo3mkzjICl02KM8tqjC-5l.AfmJLKlWbkyPksgP
-
 **Attending**
 
-
-
 * Tim Thompson (Yale)
 * Jeremy Nelson (Stanford)
 * Erik Radio (CU Boulder)

meetings/2021-10-12.md (-36)

@@ -4,42 +4,6 @@ title: ai4lam Metadata/Discovery WG Monthly Meeting
 
 9 AM California \| 12 PM Washington DC \| 5 PM UK \| 6 PM Oslo & Paris
 
-<<<<<<< HEAD
-=======
-**Connection Information**
-
-Topic: AI-LAM Metadata Working Group
-
-Time: This is a recurring meeting. Meet anytime
-
-Join from PC, Mac, Linux, iOS or Android:
-[*https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09*](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09)
-
-Password: 306295
-
-Or iPhone one-tap (US Toll): +18333021536,,91421044393# or
-+16507249799,,91421044393#
-
-Or Telephone:
-
-Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536
-(US, Canada, Caribbean Toll Free)
-
-Meeting ID: 914 2104 4393
-
-Password: 306295
-
-International numbers available: https://stanford.zoom.us/u/aeoeCDrpd
-
-Meeting ID: 914 2104 4393
-
-Password: 306295
-
-SIP: 91421044393\@zoomcrc.com
-
-Password: 306295
->>>>>>> 9c0c03e (Adds remaining 2021 meetings)
-
 **Attending**
 
 - Jeremy Nelson (Stanford)

meetings/2024-04-09.md (new file, +57)

title: ai4lam Metadata/Discovery WG Monthly Meeting

# Apr 9, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

**Attending**

* Jeremy Nelson (Stanford)
* Andrew Elliot
* Sara Amato
* Joy Panigabutra-Roberts (University of Tennessee)
* Sarah Mann
* Erik Radio (University of Colorado)
* Ian Bogus (ReCAP)
* Craig Rosenbeck

## Helpful Links

* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library)

## Project Documents and Data

* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing)
* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing)

## Agenda

* Announcements
  * Next meeting will be on May 7, 2024, for Abigail Potter's presentation on AI at the Library of Congress
* Joy Panigabutra-Roberts' presentation on AI authors and performers in the context of identity management
  * Head of Cataloging at the University of Tennessee Libraries
  * What about AI Authors and a Robot Comedian?
    * Beta Writer and Steffen Pauly, *Lithium-Ion Batteries: A Machine-Generated Summary of Current Research*, 2022. From Artificial Intelligence in Libraries and Publishing
    * Brent Katz, Josh Morgenthau, and Simon Rich, *I Am Code: An Artificial Intelligence Speaks*. Poems by code-davinci-002. 2023. New York, NY: Back Bay Books
    * Jon the Robot (comedian)
  * PCC does not consider AIs to be authors
    * Consider a named AI or generative computer program used to create a resource to be a related work, not an agent…
  * AstroLLaMA-Chat - [https://huggingface.co/universeTBD](https://huggingface.co/universeTBD)
    * The first open-source conversational AI tool tailored for the astronomy community: [https://doi.org/10.48550/arXiv.2401.01916](https://doi.org/10.48550/arXiv.2401.01916)
  * Excerpt from YouTube (https://www.youtube.com/watch?v=OkCoTixo-MM) - "A Bot and Costello - Let's Power the Whole Thing Off"
  * The philosophical and legal implications are greater than cataloging these AI-generated works
  * Questions:
    * Amazon-generated books?
    * A case of Wikipedia articles from a predatory publisher: declined to catalog. Wikipedia articles have varied quality
      * It comes down to collection development to screen out poor-quality works
    * Attribution, but include a disclaimer that the work was AI-assisted or AI-generated
      * Publishers and authors may not claim attribution if the work was generated by AI
    * How do you screen for these works? Be up front about how you are using generated content
      * It is not too hard now to tell if a work is AI-generated; aesthetic judgments by the cataloger. What will cataloger training look like in 5 years?
      * Poor quality: don't catalog
      * It is the responsibility of the author and publisher to disclose up front; transparency is needed
    * Have you run into this issue at your institutions?
      * The OCLC RLP Metadata Managers Focus Group will discuss a topic related to AI and cataloging/metadata with both domestic and international institutions next week
      * Creating cataloging records for ebooks by converting publisher data; Ex Libris is training a model on PDFs to extract publisher information for records
      * Another presentation Joy attended recently: a library ran an experiment using ChatGPT with OCLC records; once you feed the data to the model, the data is in OpenAI's training set

meetings/2024-05-07.md (new file, +174)

title: ai4lam Metadata/Discovery WG May Monthly Meeting

# May 7, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

**Attending**

* Jeremy Nelson, Stanford
* Abigail Potter, Library of Congress
* Caroline Saccucci, Library of Congress
* Julia Kim, Library of Congress
* Erik Radio, Colorado
* Sara Amato, Eastern Academic Scholars' Trust
* Ian Bogus, ReCAP

## Helpful Links

* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library)

## Project Documents and Data

* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing)
* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing)

## Agenda

* Announcements
* Presentation: "Exploring Computational Description: An update on Library of Congress experiments to automatically create MARC metadata from ebooks" by Abigail Potter and Caroline Saccucci
  * Update on the FF 2023 presentation
  * Cataloging and LC Labs are collaborating on this experiment
  * LC Labs AI Planning Framework
    * Understand, Experiment, Implement - Governance and Policy
    * Come in with a quality baseline and then implement; have robust auditing and shared quality standards
    * Tools
      * Articulating principles
      * Use-case risks & benefits
      * Domain profiles
      * Data assessment
      * Acquisitions
  * Goal of exploring machine learning
    * First task order in the Digital Innovation IDIQ, which is scoped for experiments in AI and ML
    * What are examples, benefits, risks, costs, and quality benchmarks?
    * What technologies and workflow models are most promising to support metadata creation and assist with cataloging workflows?
    * Similar activities being employed by other organizations
    * This is an experiment, not building toward production
  * E-books with cataloging records used to train models
    * Ground truth for testing
      * CIP: 13,802 items
      * Open Access: 5,825 items
      * E-Deposit ebooks: 403 items
      * Legal reports: 3,750 items
    * Test data
      * Existing catalog records for the ebooks
      * Test models against the test data
      * Generate performance reports
    * Target data
      * Uncataloged ebooks
      * Run most models
  * What are we testing?
    * Token classification - extracting specific bibliographic metadata from the text, such as title or author name
    * Text classification - characterizing the text as a whole, for example into subject headings or genres
    * Models: BERT, spaCy, GPTs with variations (NLP, NER, LLM, transformer and non-transformer models); vendor-picked models; 2022 was before ChatGPT
  * Results: token classification for all fields
    * F1 score for each of the models, ranked from highest (best) to lowest (worst)
    * ~80% for some fields
  * Results: token classification for one field
    * Fields: 700, 655, 264, 245
    * Expectation ~80%; quality standard ~95%
    * Exceeded ~80% for some fields (ISBN, author, title; the models did pretty well identifying these fields)
    * Date was not as easily identified; it may be the vendor's setting for the date parameter, not sure if it is the model
    * LCCN: 100%; the list of records almost always had the LCCN, and the copyright page in an e-book would have the LCCN
  * Early metrics analysis: matches and non-matches
    * Results after applying Annif
    * Green-shaded areas are exact (1-1) matches to the MARC XML
    * White rows are ML model output that wasn't an exact match
      * Didn't put in "fiction"; nothing in the text indicated that it is fiction
      * Ability to get a URL with freeform subdivisions
      * The ones that have fiction are established with a URL; some, maybe, have no URL
    * The combination of positive and negative hits raises questions about relevancy and accuracy: how well does the model do?
  * Assisted cataloging HITL (human-in-the-loop) prototype
    * The vendor met with a group of catalogers for a workshop conversation on how this work benefits catalogers
    * Not just the name of the author, but the name of the author in the authority file
      * Authorized form of name or concept
    * Two low-fidelity prototypes
      * A number of tabs for the cataloger to go through
      * Abstracts and summary
        * Extracted summary taken directly from the text of the book; the machine picks the main sentences
        * Abstract or summary from scanning through the text
      * The cataloger selects model suggestions: an opportunity for the cataloger to react to what the machine is suggesting, and to see how well the ML workflow suggests subjects and names in the authority records
      * In both cases, narrower and broader terms from LCSH, Wikidata, and other linked-data resources
      * How well did catalogers appreciate the suggestions? What is and is not beneficial? Provide feedback to the model
  * Assessing quality
    * How to assess quality for these tools in different ways
    * F1 score: highest-performing scores by field
    * Human-in-the-loop prototypes to increase the quality of the records
    * Contractor qualitative scoring of the models
      * Reliability
      * Compute cost
      * Training data
      * Activity
      * Documentation
      * Developer
      * Compliance with security and privacy considerations
    * Overall quality of the service; program evaluation factors
      * Likelihood of maintaining quality over time
      * Reasonable cost
      * Benefits to staff/catalogers
      * Benefits to users
      * Benefits to the organization
      * Fair and equitable risks
      * Risks to users
      * Risks to the organization
      * Security risks
      * Privacy risks
      * Authenticity risks
      * Reputation risks
      * Compliance risks
    * Want to collaborate with other organizations
  * Challenges
    * Unbalanced data: a long tail of subject terms
      * Create well-balanced training data to train and test models
      * Exploring correcting over-representation of the English language and other bias
    * Short-term timeline: several NLP tools have been in development for over 20 years; can't reach the state of the art
    * Need to develop quality standards and policies for these approaches
    * Stability of tools, unknown costs, tooling lock-in
  * What did I learn?
    * There are two types of ML classification: text and token
    * One of the models, Annif, had some success with text classification, predicting subjects
    * Some models were very successful at token classification, predicting authors, titles, and identifiers such as ISBN and LCCN
    * ML requires lots of training data to improve results
      * ½ of the training data contained similar patterns of LCSH
      * ½ of the training data contained unique LCSH
    * Catalogers reacted more positively to the results than expected
    * Cataloger-assisted prototypes were really cool and have potential
    * Catalogers are interested in ML and seem less afraid of it than expected
    * LLMs (ChatGPT) show promise but need more experimentation
    * Room for both HITL and LLM approaches
  * What do I still want to learn?
    * Would faceted subject headings (post-coordinated) be more successful than subject strings à la LCSH (pre-coordinated) in ML processes?
    * Are subject categories more successfully cataloged using ML?
    * Could a model be trained to accurately predict LC Classification and/or Dewey Decimal Classification?
    * What other metadata elements can be extracted from ebooks?
    * Can LLMs like ChatGPT be trained to predict accurate bibliographic descriptions?
    * What are the ML policies/decisions that the Library needs to make? E.g.:
      * Copyright concerns
      * Accuracy vs. relevance
      * Training data biases
  * Next steps
    * ECD2: Toward Piloting Computational Description
      * Where are the most effective combinations of automation and human intervention in generating high-quality catalog records that will be usable at the Library of Congress?
      * What are the benefits, risks, and requirements for building a pilot application for ML-assisted cataloging workflows?
    * ECD3: Extending Experiments to Explore Computational Description
      * How can ML methods support the CIP cataloging workflow?
      * How can CIP metadata generated through ML be ingested and used in BFDB?
      * How can additional elements added to BIBFRAME descriptions improve the quality and usefulness of the metadata compared to ECD1 and ECD2?
      * Experiment with 3 ML models
      * Use prepublication galleys, all in PDF, some with minimal text provided
      * Create BIBFRAME descriptions that can be loaded into the test BFDB
      * Require more metadata beyond the 6 fields required in task order 1
      * Allow for cataloger review in the BIBFRAME Editor
      * Extension of the cataloger-assistant prototypes
  * Questions:
    * Other LLMs? Currently using ChatGPT, maybe Claude; tested Llama, which was almost as good as ChatGPT 3.5; more permissive with the later GPT-4.0
    * Timeline for next steps and BIBFRAME work? The current task order, working on prototypes, ends in August; the BIBFRAME work begins in August 2024
    * Can you expand on faceted subjects as potentially more successful? Extreme text classification breaks strings down; going to use linked data, and URLs are necessary for BIBFRAME. There is controversy about whether pre-coordinated vs. post-coordinated strings are useful for users. If there are only one or two patterns of data across the training data, that is an interesting question for policy
    * AI metadata clean-up? At the National Library of Medicine, Alvin Stockdale used ChatGPT to format a TOC properly for a MARC record, giving prompts to format the TOC. Running a script to fix things is more automation than ML: in an automated find/replace nothing is inferred, while ML makes choices about what it thinks the value should be. The difference is in the application of the ML method
    * Prototype integrated with FOLIO? Ebook cataloging happens in a number of ways; will the same ML method produce the same quality? Getting results, and very open to plans for integrating with FOLIO. Not serialized in MARC format, in a weird space next to the string. The BIBFRAME description is going into a BIBFRAME test instance, then migrating to a FOLIO instance. Not near a pilot phase, but experiments: learning about requirements and what is possible, creating future systems or creating requirements for future systems. Not dealing with production systems; experiments with prototypes. Fortunate to work with LC Labs and cataloging on experiments with ML methods
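The evaluation described in these notes (comparing model-extracted fields against existing catalog records, then ranking per-field F1 scores from best to worst) can be sketched roughly as follows. This is a minimal illustration under assumed inputs, not the LC Labs/vendor implementation; the sample records and the exact-match scoring rule are hypothetical.

```python
# Sketch of per-field precision/recall/F1 scoring for extracted MARC
# fields, in the spirit of the token-classification evaluation above.
# Sample data and exact-string matching are illustrative assumptions.

def field_scores(truth: dict, predicted: dict) -> dict:
    """Score predicted field values against ground-truth catalog records.

    truth/predicted map a MARC tag (e.g. "245") to the values extracted
    across items; a value counts only as an exact string match.
    """
    scores = {}
    for tag in truth:
        true_vals = set(truth[tag])
        pred_vals = set(predicted.get(tag, []))
        tp = len(true_vals & pred_vals)  # exact matches
        precision = tp / len(pred_vals) if pred_vals else 0.0
        recall = tp / len(true_vals) if true_vals else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = round(f1, 3)
    return scores

truth = {"245": ["I Am Code", "Lithium-Ion Batteries"],
         "100": ["Katz, Brent", "Beta Writer"]}
predicted = {"245": ["I Am Code", "Lithium Ion Batteries"],  # one near-miss
             "100": ["Katz, Brent", "Beta Writer"]}

# Rank fields from highest (best) to lowest (worst) F1, as in the report.
ranked = sorted(field_scores(truth, predicted).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('100', 1.0), ('245', 0.5)]
```

Exact string matching is the simplest possible rule; the notes suggest the real experiments also had to handle near-misses (e.g. dates), where a stricter or looser matcher changes the score considerably.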
