
Presenter: Jon Dunn
Beginning in 2015, Indiana University (IU) undertook a massive project to digitise rare and unique audio, video and film assets from across IU’s collections which resulted in over 350,000 digitised items. With accessibility needs in mind, IU Libraries is exploring the feasibility of using the open source Whisper automated speech recognition tool in conjunction with IU’s HPE Cray EX supercomputer to generate captions or transcripts for this collection. Jon Dunn shares successes and failures, experiences processing files using a shared research-oriented high-performance computing environment and future plans for using the output to support accessibility and discovery.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
Thank you, Katrina, and thank you all. So this is another talk on Whisper, which you may be tired of hearing about by now, but I'm going to talk a bit about some experiments we have conducted with Whisper, applied to collections, and how that might feed into larger efforts within the community around exploring how Whisper and similar automated speech recognition technologies can be used in cultural heritage settings. So my name is Jon Dunn. Again, I'm from Indiana University. I'm presenting work that was done in collaboration with, and actually in large part by, my colleagues Emily and John Cameron, who selected the content that was used in these experiments and helped to design the experiments, and Brian Wheeler, who engineered a lot of the workflow and pipeline that you'll see and the data that has come out, which I'll discuss.

So I'm going to first just do some background and context for this work. I am, as I mentioned, from Indiana University. This is a large public university in the Midwestern United States, so sort of in the middle of the country. We have collections that are distributed across the organization. And the work that my group does, a lot of it is being a service provider for archival repositories and libraries across the university in terms of digitization, digital collections, online access and discovery, digital preservation, digital exhibits, and related services. So we try to provide services to support workflows used by librarians, archivists, and others across the university.

We have a particular history of work on audiovisual media, online digitization and access projects going back into the 1990s, in part because we have an unusually large number of audiovisual collections for a university of our sort. We have a very large music school, so we have a large performance archive there. We have a moving image archive that has significant research collections in educational films and television, and a research collection of television commercials. We have some filmmaker collections around Orson Welles, John Ford, Peter Bogdanovich, among others. We also have a large ethnographic audio archive that, among other things, contains what is arguably the first stereo sound recording in the world, made in China in 1901. And along with that, all of the kinds of things that American public universities typically have in their collections, including university archives and lots of athletics and sport video and imagery. So it is a diverse range of collections we're dealing with.

And luckily, over the period 2015 to 2021, we carried out a project to digitize the vast majority of our archival audio, video, and moving image collections. We worked in a public-private partnership with Memnon, based in Brussels, who set up shop on our campus, and we collaborated with them to digitize over 350,000 pieces of media, again coming from 70 different libraries and archival repositories across the university, ultimately producing 20 petabytes of data to be preserved, with a goal of providing as much access as possible subject to copyright and ethical restrictions.

Now, one of the challenges in providing that access is that one needs to have metadata, or something to be able to search on, in order to discover it. And these collections were quite varied in terms of the level of metadata they came with into this digitization effort. And metadata work and rights work were not funded as part of this project, unfortunately.
And so we started looking at this: are there ways that AI could, not do the cataloging and description itself, but help professionals to do this work more efficiently? I talked last year at Fantastic Futures in Vancouver about the AMP project that we have undertaken with a number of partners and funding from the Mellon Foundation, a tool to help build workflows that collection managers and archivists can use to help create metadata, or information to inform metadata, for AV collections. And the work I'm talking about here today on Whisper will, we hope, ultimately fold into this. You can find out more about AMP, if you're interested, at that URL.

But I will shift over from talking about context to talking about what we actually have done. The goals of this work were to test Whisper, which, as you've heard, is a very promising emerging automated speech recognition technology, against a variety of content that represents the scope of our collections: different collections, different content genres, different original media formats in terms of what things have been digitized from. And to think about, or at least begin to think about, how Whisper would fit into a captioning and transcription and metadata workflow, both to improve discovery and to enhance accessibility for persons with disabilities. We and other universities, public universities at least in the U.S., have a new set of rules we're going to have to follow in terms of accessibility come April of 2026, and there's a lot of work we need to do to prepare for that. So we're looking at how this could help us in that way as well.

I'm not going to talk a lot about Whisper, because you've heard a lot about Whisper. But it's an open source automated speech recognition system from OpenAI. The degree of openness can be debated, as Kathy mentioned yesterday. It is important to note that Whisper is not just one model, but an emerging and growing family of models. And again, more info on Whisper there.

In doing this work, we wanted to build on previous work in the community. Mike Trisna and colleagues from the Smithsonian provided a very good overview of Whisper and talked about their experience last year at Fantastic Futures in Vancouver. In the AI4LAM community call in June, Alan Lungard from Stanford talked about experiments that Stanford had started to do with Whisper. And since we proposed this talk, in fact, a speech-to-text working group has formed in the AI4LAM community, and we hope to feed this work into discussions in that group and continue to learn from that group as well. If you're interested in any of these things, you can find links from AI4LAM.org.

But for the experiment we conducted, we first needed to select a test corpus. We have, as I mentioned multiple times, a very wide range of collections. For this initial testing, we did decide to focus on English, even though the majority, or at least a significant portion, of our spoken word recordings are in languages other than English, and actually a significant portion of our collections are music rather than spoken word. So this is just getting at a piece of things, but, for better or for worse, it was the easiest place to start. And we've heard all the limitations of Whisper on accents and languages and so on yesterday, and we're very much concerned with that as well.
But this is where we were able to start. We also wanted to have, as I mentioned, a diversity of physical formats and a diversity of content genres. And even though we were running this internally within computing resources at the university, we didn't want to presume that our collection managers and archivists were okay with their materials being run through an AI-based service. So we wanted to obtain permission from the collection managers to do that, which is challenging when you're talking about 70 different collections that are independently managed. We had to contact everyone from the university archivist to the chair of the physics department to get permission for the things that we worked with.

In terms of formats, we had 59 items that we ultimately selected for the corpus, and you can see how they distribute across formats: everything from wax cylinders from the early 1900s to audio cassette tapes from the 1990s. We created some categories of content that we wanted to try to represent, and you can see some are maybe overrepresented, broadcast TV and radio compared to, say, field recordings or home recordings or home movies, but we wanted to select a range of content genres as well. I see this as initial testing that will feed into more testing and work that we will have to do down the road.

As I mentioned, we had 59 files. We created ground truth by sending these to a commercial transcription vendor, 3Play Media, for human transcription and then validated those transcriptions ourselves, at least to the extent that we had time for. We then ran all of the A/V files through Whisper on Indiana University's high-performance computing cluster, on GPU nodes. We first normalized all of the file formats and removed trailing silence using FFmpeg. We applied some different audio filters based on some of Stanford's experience that I'll talk about in a minute, and then ran them through multiple Whisper models.

We then normalized both the human transcript and Whisper transcript outputs for comparison, and I could give a whole other talk on what it means to normalize and compare. We really went to the lowest common denominator and essentially stripped it down to just words, which is not fully representative of what we ideally need out of automated speech recognition in terms of punctuation, style, et cetera, but for purposes of initial comparison, that was the most effective approach. The scripts that we used to compare are available on GitHub, and I will post our data there as well; it's not there yet. And I'm happy to chat afterwards if you're interested in getting access to that.

So we constructed multiple test scenarios. We tested four different Whisper models: small, medium, and then large version 2 and version 3. We tested each of those using three different audio pre-filtering settings suggested by some of Stanford's results that I mentioned earlier: an FFT-based noise reduction approach, a volume boosting approach, and then no pre-filtering. And then we also tested with both true and false values for Whisper's condition_on_previous_text parameter, which influences the model based on text that comes before the current chunk being evaluated. So that was a total of 59 items and 24 permutations of settings, about 1,400 Whisper runs in our test result set. And so I'll talk about the results here.
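For readers who want a concrete picture, here is a minimal sketch of what one permutation of a pipeline like this might look like, using FFmpeg for preprocessing and the openai-whisper Python package for transcription. The filter settings, file names, and helper functions are illustrative assumptions, not IU's actual pipeline.

```python
# A minimal sketch of one permutation of the kind of test pipeline described above.
# Assumes ffmpeg and the openai-whisper package are installed; filter settings,
# paths, and helper names are illustrative, not the actual IU workflow.
import subprocess
import whisper


def preprocess(src, dst, prefilter="none"):
    """Normalize the audio, trim trailing silence, and optionally pre-filter."""
    # Reverse, strip leading silence, reverse back = trim trailing silence.
    filters = ["areverse",
               "silenceremove=start_periods=1:start_threshold=-50dB",
               "areverse"]
    if prefilter == "noise_reduction":
        filters.append("afftdn")        # FFmpeg's FFT-based denoiser (example setting)
    elif prefilter == "volume_boost":
        filters.append("volume=6dB")    # simple gain boost (example setting)
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", ",".join(filters),
         "-ar", "16000", "-ac", "1", dst],
        check=True,
    )


def transcribe(wav_path, model_name="large-v3", condition=True):
    """Run one Whisper model over a preprocessed file."""
    model = whisper.load_model(model_name)   # "small", "medium", "large-v2", "large-v3"
    return model.transcribe(
        wav_path,
        language="en",
        condition_on_previous_text=condition,
        word_timestamps=True,                # used later for zero-duration filtering
    )


if __name__ == "__main__":
    preprocess("item_0001.mov", "item_0001_16k.wav", prefilter="none")
    result = transcribe("item_0001_16k.wav", model_name="large-v3", condition=False)
    print(result["text"])
```

In a batch setting, the same two functions would simply be looped over the 59 items and the 24 setting permutations, with each run submitted as a separate GPU job.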
So this shows, and this probably could be better cast as how many times faster than real time it was, but this shows what we call processing ratio, which is the fraction of real time that our Whisper runs took. So you can see the small model took about 4% of real time, up to the large model taking about 10% of play time. This is significantly slower than what NFSA presented yesterday, so I need to talk more with them about what the differences are. We are using the Python version of Whisper, and I think there's some speculation the C++ version could offer faster performance.

But in terms of the quality of output, we looked at a couple of things. First, we looked at word error rate. The color coding, sorry, I should have explained because this is a lot of data and kind of small, is small, medium, large version 2, large version 3, in terms of the four colors, and then each of the different permutations of noise reduction and condition-on-previous within those. And we can start to see some patterns here: the large models do perform better than the small model, and a little bit better than the medium model. The FFT-based noise reduction setting actually hurt rather than helped in terms of word error rate, and this is across all of these different formats and genres. We also looked at word information preserved, so words retained, and we saw similar patterns there.

And so we saw large version 2 and version 3 nearly identical, and medium a little bit worse. We didn't see that these additional filtering settings or parameters made much of a difference, and as I mentioned, FFT-based noise reduction actually made things worse with our particular corpus. We did see hallucinations. I'm not going to have a lot of time to really get into that issue in detail, but we eliminated some of those by removing words with a duration of zero, which came in particular out of the larger models and were typically hallucinations. We did see more hallucinations with the larger models; they were trying to do more to fill in the gaps, it seemed.

And we did see results that varied by content genre, so I did want to talk a little bit about that before I wrap up. This shows the different content genres and the percent word error rate of each test run, each of those 1,400 or so test runs. And we can see some clustering here. Field recordings had a worse word error rate, which is probably to be expected because these are not professionally recorded: the audio quality is not great, and they're not consistent in terms of volume and noise. We saw, though, that educational recordings, which includes educational films, perform much better. Again, these are professionally produced, typically for a school or university audience, with very clear diction and very clear presentations. So that's to be expected. We saw similar patterns with word information preserved.

By format, we didn't see much clustering, and the clustering that we did see by format is, I think, more related to the content genre than to the actual format, at least that's our speculation. So you can see motion picture film, the purple in the center at the bottom, does do better than, say, some of the Betacam or VHS videotapes, but that's probably due to that mostly being instructional content.

So to wrap up, we need to do a lot more exploration of these results we have so far. We need to work more with our collection managers to figure out when Whisper is good enough. Is Whisper ever good enough?
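As a rough illustration of the comparison step, the sketch below uses the jiwer library to compute word error rate and word information preserved over words-only normalized text, along with the zero-duration hallucination filter and the processing ratio mentioned above. The normalization chain is an approximation of the "just words" lowest-common-denominator comparison described in the talk, not the actual scripts on GitHub.

```python
# Sketch of the transcript comparison and hallucination-filtering steps described
# above. Uses the jiwer library; the normalization chain approximates the
# words-only comparison, not IU's exact scripts.
import jiwer

# Strip the transcripts down to "just words": lowercase, no punctuation.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])


def score(reference_text, hypothesis_text):
    """Return (word error rate, word information preserved) for one test run."""
    ref = normalize(reference_text)
    hyp = normalize(hypothesis_text)
    return jiwer.wer(ref, hyp), jiwer.wip(ref, hyp)


def drop_zero_duration_words(whisper_result):
    """Rebuild the hypothesis text, dropping words whose start == end,
    which were typically hallucinations (requires word_timestamps=True)."""
    kept = []
    for segment in whisper_result["segments"]:
        for word in segment.get("words", []):
            if word["end"] > word["start"]:
                kept.append(word["word"].strip())
    return " ".join(kept)


def processing_ratio(wall_clock_seconds, media_duration_seconds):
    """Fraction of real time a run took, e.g. 0.04 for roughly 4% of play time."""
    return wall_clock_seconds / media_duration_seconds
```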
What does human review look like? How do we fit this into a workflow? We want to do some testing of other models. We've recently done some testing with the Whisper turbo model, where we've seen about a 5x performance increase and actually a slight improvement in word error rate across our test corpus. And then we really want to engage with the speech-to-text working group in the AI4LAM community to talk about how we start to share these results, share these experiences, and share our data, and perhaps create a shared test corpus for this type of work. And then look at where we take things with revision. So we're super interested in the work that NFSA presented around their editor and crowdsourcing approaches, and in how this fits into our goals of providing improved access and accessibility to our collections. Thank you very much.
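As a hedged sketch of the turbo-model comparison mentioned in the wrap-up above, the snippet below times the same file against large-v3 and the turbo model using the openai-whisper package. The input file name is a placeholder, and the roughly 5x speedup figure comes from the talk, not from this code.

```python
# Quick timing harness along the lines of the turbo-model follow-up testing
# mentioned above. Model names follow openai-whisper's naming ("turbo" is the
# package's shorthand for its large-v3-turbo model in recent releases).
import time
import whisper


def timed_transcription(model_name, wav_path):
    """Transcribe one file and return the text plus wall-clock seconds taken."""
    model = whisper.load_model(model_name)
    start = time.perf_counter()
    result = model.transcribe(wav_path, language="en")
    return result["text"], time.perf_counter() - start


if __name__ == "__main__":
    for name in ("large-v3", "turbo"):
        text, seconds = timed_transcription(name, "sample_item_16k.wav")
        print(f"{name}: {seconds:.1f}s, {len(text.split())} words")
```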
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.