
Presenter: Laura McGuiness
The National Security Research Center is a library within the Los Alamos National Laboratory (LANL) – a US laboratory responsible for solving national security challenges. In a field where reliable author attribution and access to research is critical, Laura McGuiness talks about the challenges of name disambiguation stemming from inconsistent sources, their investigation into utilising machine learning to create valid name authority records and the complexities of implementing a robust system.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
Want to learn more about this event?
Visit the Fantastic Futures 2024 Hub
This transcript was generated by NFSA Bowerbird and may contain errors.
Hi everyone. So my name is Laura McGuiness. I'm a metadata librarian at Los Alamos National Laboratory. I don't know how popular the Oppenheimer movie was outside of the United States, so for those of you who haven't seen it, I'll introduce Los Alamos just a little bit. Let's see. Okay, so Los Alamos National Laboratory is a United States national laboratory responsible for solving national security challenges. LANL's National Security Research Center is the lab's library, housing tens of millions of scientific, technical, and historical materials related to the United States' nuclear history.
Authority records, for those of you who are not familiar with them, are used to disambiguate author names within databases. Frequently created in MARC 21, authority records are highly structured data that authorize a preferred name format and then cross-reference variant or related names to the correct author. These authority records ensure standardized access points for users, creating a linking framework for related or identical names in a database. As you could imagine, at LANL a researcher's work is often critical to ensuring national security, making it imperative that all possible access points are available for use in a database. Currently, LANL has several large institutional databases lacking name authority records. So this lack of access points is just contributing to information noise, producing overwhelming and ambiguous search results.
I believe that we can solve LANL's name authority problem with three steps: the first being the generation of valid name authority records, the second being the disambiguation of similar names, and the last being the implementation of automatic authority control within our institutional repositories. The automatic authority control is my utopia, and I think anyone who's spent a significant amount of time doing authority control would probably agree with that.
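Editor's illustration: the idea of an authority record, one preferred name with variant forms cross-referenced to it, can be sketched in a few lines of Python. This is a toy structure, not real MARC 21 and not LANL's system; the `AuthorityRecord` class, the helper functions and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    """Toy stand-in for a name authority record (hypothetical, not MARC 21)."""
    preferred_name: str                               # the authorized form of the name
    variants: set[str] = field(default_factory=set)   # cross-referenced variant forms


def normalize(name: str) -> str:
    """Lowercase, drop periods and collapse whitespace so trivial differences compare equal."""
    return " ".join(name.replace(".", " ").lower().split())


def build_index(records):
    """Map every normalized name form to the record(s) it could refer to."""
    index = {}
    for rec in records:
        # Deduplicate on the normalized form so one record is listed once per form.
        forms = {normalize(f) for f in (rec.preferred_name, *rec.variants)}
        for form in forms:
            index.setdefault(form, []).append(rec)
    return index


def resolve(name, index):
    """Return the preferred name if exactly one record matches, else None."""
    matches = index.get(normalize(name), [])
    return matches[0].preferred_name if len(matches) == 1 else None


records = [
    AuthorityRecord("Doe, Jane Q.", {"J. Q. Doe", "Doe, J."}),
    AuthorityRecord("Doe, John", {"Doe, J."}),
]
index = build_index(records)
print(resolve("j. q. doe", index))  # variant resolves to the preferred form
print(resolve("Doe, J.", index))    # shared variant is ambiguous, so None
```

The second lookup shows why disambiguation is a separate step: a variant shared by two records cannot be resolved by cross-references alone.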
So far, we've pursued a couple of different options for authority record generation, including the use of an LLM. At this time, we've found success with using named entity recognition to identify parts of names and then employing a transformer embedding model to create a vector representation of those names. This embedding output has allowed us to query the records with the provided author name and obtain an inferred corresponding name in the database. We're using several API calls to grab identifiers such as ORCIDs, Scopus IDs, Library of Congress and VIAF authority records, et cetera, just to increase the fullness of the record and ensure it's adhering to applicable standards so it can be used in our library. To aid in disambiguation and ensure adherence to authority record standards, we've also chosen to include author affiliations if present.
For authority control within the repositories, automatic authority control that is, we're considering an LLM or RAG architecture to compare new author data to existing authority records. At first, manual inspection of the authority records is going to be key. But after that, we're hoping to create a harmonic score where we can compare our authority files to ground truth authority files pulled from VIAF, Library of Congress, created by me, et cetera. Then we'll use that to drive iterative changes to the LLM, if that's what we choose to use, essentially engaging in reinforcement learning from human feedback.
All right, and just a side note that if anyone has the same particular interest as I do in ethical questions in name authority control, there's a really great book. So accurate authority records are especially important for traditionally underrepresented groups who've been disproportionately affected by name ambiguity in databases. This includes individuals who are more likely to change their name over time, and names that have been transliterated from a foreign language into English or remain in non-Latin script.
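Editor's illustration: the matching step mentioned earlier in this part of the talk, turning names into vectors and querying for the closest stored name, might look roughly like the sketch below. The talk describes a transformer embedding model; to stay self-contained, this sketch swaps in character trigram counts as a stand-in embedding, so it shows the shape of the lookup rather than the team's method. The function names, the 0.5 threshold and the sample names are assumptions.

```python
import math
from collections import Counter


def embed(name: str, n: int = 3) -> Counter:
    """Stand-in 'embedding': counts of character n-grams in the padded, lowercased name."""
    text = f"  {name.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, candidates, threshold=0.5):
    """Return (name, score) for the most similar candidate, or (None, score) below threshold."""
    if not candidates:
        return None, 0.0
    q = embed(query)
    score, name = max((cosine(q, embed(c)), c) for c in candidates)
    return (name if score >= threshold else None), score


# Hypothetical names already present in a database.
known_names = ["Doe, Jane Q.", "Roe, Richard", "Smith, Alex"]
print(best_match("Doe, Jane", known_names))
print(best_match("Zhang, Wei", known_names))
```

A threshold keeps the lookup from forcing a match when no stored name is close, which leaves those cases for manual inspection.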
The last couple of talks were really interesting to me for that reason. Double-barreled surnames are affected as well. Name disambiguation contributes to reliable bibliometrics, and it helps ensure scientific discoveries at LANL are correctly attributed. In a field where gender and racial minorities often experience lower authorship attributions, authority control is giving recognition to scientists whose contributions have been overlooked in scientific databases for several decades.
While the benefits of name authority control are many, there are a lot of ethical concerns to consider in the project surrounding both AI use and authority control. For the former, and we talked a lot about it here, we're going to be mindful of ensuring transparency of the training set, of the outputs, of the decision-making processes. If we do use an LLM at any point, our team is going to be documenting criteria chosen for inclusion or exclusion of certain data sets to provide a clearer understanding of the LLM's learning processes. This includes ethical concerns with the training data, such as the incorporation of RDA's Rule 9.7 or MARC 21's 375 field, which required assigning gender to authority records, a practice that scholars have identified as responsible for outing gender identities through authority records. This is one piece of training data that we're going to exclude. While LANL librarians didn't abide by these cataloging rules to begin with, the possibility of using external authority records from VIAF or Library of Congress, for instance, may introduce this concept, so that's something that we have to keep an eye out for. To mitigate these concerns, our project team is currently establishing internal accountability measures that are going to assess potential impact and biases.
That's probably the end of my five minutes, so I'm going to turn it over to the next speaker, but thank you so much.
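Editor's illustration: excluding gender data from externally sourced authority records, as described in the talk, could be done by filtering on the field tag before the records are used for training. The sketch below assumes a simplified record shape, a list of `(tag, value)` pairs, rather than a real MARC parser; the function name and sample record are hypothetical. Tag 375 is the MARC 21 authority field for gender.

```python
EXCLUDED_TAGS = frozenset({"375"})  # MARC 21 authority field 375 records gender


def strip_excluded_fields(record, excluded=EXCLUDED_TAGS):
    """Return a copy of the record without excluded tags; the input is left unmodified."""
    return [(tag, value) for tag, value in record if tag not in excluded]


# Hypothetical external record, simplified to (tag, value) pairs.
external_record = [
    ("100", "Doe, Jane Q."),   # authorized personal name heading
    ("375", "female"),         # gender field to be dropped
    ("400", "Doe, J. Q."),     # variant name cross-reference
]

cleaned = strip_excluded_fields(external_record)
print(cleaned)
```

Filtering at ingestion time, rather than after training, keeps the excluded field out of every downstream copy of the data.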
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.