
Presenter: Grant Heinrich
With machine learning, AI and transcription being largely driven by US companies, the NFSA set itself a challenge: turning an American transcription engine into one that can understand Australian accents and vernacular speech. Grant Heinrich discusses how to influence large models with small collections of training data, a novel approach to accuracy, crowdsourcing corrections via the Bowerbird UX, managing collection entities, and security around sensitive material.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
Hi, everybody. Welcome to the ARC cinema, and welcome to the NFSA. It's good to have you all here. When we were first putting this session together, it was literally 140 days to the conference, and the plan was to talk about all of the work we're doing to teach what is basically the Whisper transcription engine to speak Australian. Almost immediately, we came up with a better title, but first let me give you a little bit of background.

We are trying to build an engine that will transcribe the entirety of the NFSA's audio, video and spoken word collection. We have about 30 linear years of speech, and about 20 linear years are accessible in some way, where we have access copies. We also have an awful lot of metadata about all of these things. For example, we have a lot of news programs, and we have metadata that says item one is something about Scott Morrison and item two is about something else, but it doesn't really tell you what happens at 40 minutes and 33 seconds. It just gives you a very top-level description. So, like a lot of people here, we are looking at how we can transcribe our entire collection so we can find everything that's in it.

It's a first step towards what we call a conversational archive. We are working here with the idea that we're going to test machine learning and AI, not just talk about it. Keir, who was on earlier, has spoken about this conversational archive idea at great length, and there's a link to a presentation of him talking about it at the end of this one. The key things I'm working on, at least, are: can we transcribe everything and make it searchable, particularly for industry and researchers, but also for the public? And there's a whole bunch of work going on in the background on whether, once we've transcribed this material, we can pull out entities and relationships. I'll let you look at Keir's presentation if you want to know more about the conversational archive; I'm mostly talking about our approach to the technology here.

As mentioned, we have three guidelines in the NFSA's principles for ML and AI creation. I used to work in the music industry, and I used to work for the BBC, so the idea that we are maintaining trust matters: we're on the side of creators, we know creators make money out of copyright, and we're not just trying to bust through everybody's copyright. We're trying to be respectful in everything we do, to build effectively and transparently, and to create public value, which is where this project, Bowerbird, comes in. One of our strategic directions for 2022 to 2025 is to create better and new uses of digital tools to increase discoverability and usage. Basically, can we give people better access to more of the collection?

And this brings us to the first big problem with trying to build our own transcription engine: in this context we are a tiny, tiny, tiny little dot, and this isn't even to scale, compared to OpenAI's 680,000 hours of training data for Whisper version 2. Kathy, thanks for the update earlier; I hadn't realised version 3 was so much more.
They have spent $9 billion this year alone on this work, and they expect to lose $44 billion by 2028. There is a lot of money going into what they're doing. We don't have quite that amount of money, maybe a little bit less, so we've really had to work at this.

The five questions we were trying to answer are these. Silicon Valley monoliths aside, how hard can it be to identify a transcription engine that gives us a great basis for speaking Australian? (We are eventually looking at First Nations languages as well as Australian English.) How do we build an interface that lets our internal curators, and eventually the public, correct the transcriptions that are made? How do we enrich the engine's vocabulary with Australian English and First Nations slang and place names? How do we verify, as we were discussing earlier around benchmarking, that its understanding of Australian is getting better? And how do we share this and let others leverage our work? This presentation walks through those five issues pretty quickly. There are a number of very detailed slides, and there's a QR code at the end; if anybody wants the slides, please get in contact via that QR code.

So we have now built an engine called Bowerbird. It is ready to transcribe all collection material that is currently accessible, which is 20 linear years of speech, and it's going to take about 50 days to do that, give or take. It could be 60 or 70 days by the time we shove all of these files from one system to another. The interesting thing is that it's then going to take 18 months to do the remaining 10 linear years, because we don't necessarily have access copies; they might not be digitised, they might not be accessible, or they might be material that is currently marked as restricted in our catalogue that maybe doesn't need to be. So we are very excited about the 50 days, and we're going to try to stay very excited over the 18 months it takes to get everything else done.

The first question, then, is how we identified the best open source transcription engine. As you might be able to tell, I'm a product manager rather than a super technical person. In the corner here, Michael, if you could just stand up for a second: this is Michael Smith, our lead engineer on the system, and if anybody wants to ask about the deep technical side of this, he's the person to ask. When we were looking at this, we initially thought, as you might, how difficult can it be? You build a frontend, it tells a transcription engine what you want transcribed and sends it the file, and the transcription engine sends back some sort of beautifully annotated segments, entities, summaries and that kind of thing. Even that is not particularly simple. You've got your automated speech recognition; you've got a bunch of software working on finding where the word boundaries are, so you can actually turn things into segments; and you've got what's called speaker diarisation and speaker identification, which is who is speaking and whether it has changed. We have had a lot of technical difficulties with that one. We also wanted to work with named entity recognition, finding people, places and things in the catalogue.
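To make that pipeline concrete, here is a minimal sketch of how those components, ASR with word timestamps, speaker diarisation and named entity recognition, could be chained with open source tools (openai-whisper, pyannote.audio and spaCy). These are illustrative library choices under stated assumptions, not necessarily what Bowerbird uses under the hood.

```python
# Minimal sketch of an ASR + diarisation + NER pipeline.
# Illustrative library choices only; not necessarily the Bowerbird stack.
import whisper
import spacy
from pyannote.audio import Pipeline

AUDIO = "oral_history.wav"  # hypothetical input file

# 1. Automated speech recognition with word-level timestamps.
asr_model = whisper.load_model("large-v2")
asr = asr_model.transcribe(AUDIO, word_timestamps=True)

# 2. Speaker diarisation: who is speaking, and when does it change?
diariser = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1"  # may require a Hugging Face access token
)
diarisation = diariser(AUDIO)

def speaker_at(t: float) -> str:
    """Return the diarised speaker label active at time t (or 'UNKNOWN')."""
    for turn, _, label in diarisation.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return label
    return "UNKNOWN"

# 3. Named entity recognition over the transcribed text.
nlp = spacy.load("en_core_web_sm")

segments = []
for seg in asr["segments"]:
    doc = nlp(seg["text"])
    segments.append({
        "start": seg["start"],
        "end": seg["end"],
        "speaker": speaker_at(seg["start"]),
        "text": seg["text"].strip(),
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    })

for s in segments[:5]:
    print(f"[{s['start']:7.2f}s] {s['speaker']}: {s['text']}  {s['entities']}")
```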
We have kind of forked named entity recognition off into another project, because it was a lot on top of what we were already trying to do, and we have looked only very basically at summarisation. It's not a core part of this, but we do generate very quick summaries of things.

Even then, you're dealing with more complication. You can see there that we've actually got two user interfaces for this thing. As we were building it, we asked an external software house to build a nice, beautifully designed, very pleasant user interface for the public and for curators. It was designed for production, and I didn't want to add a whole bunch of code I had written to this beautifully written, very elegant thing; I didn't want to put my messy handprints all over it. So we have a separate admin UI, which we pretty much had to build to manage the benchmarks and the directed training process we wanted to work on later. There's an MLOps section which we feed training and benchmarking data to, so we can see if things get better. Eventually, at the very bottom right there, we want to make this material available via search, so there has had to be a lot of work on getting things into a data lake, and particularly on what we're calling the restriction manager: is this oral history okay to show to the public? Is it okay to show to researchers? Is it embargoed? Is it culturally sensitive? Does it contain one noted Australian filmmaker being extremely rude about another noted Australian filmmaker, which is quite a common thing in our oral histories? Is it embargoed until somebody is dead? That happens a lot. I'm just giving you the top-level list of things to think about. In our situation, this is what we ended up with; again, you can get the slides later. I'm leaving this up for people who want to point at these things and say, oh, you've made the wrong decision. It works for us.

The next issue is how we build an interface that will allow curators to put their oral histories, or any sort of collection item, into this and be able to correct it. We originally thought we'd have something quite straightforward. You request a transcript, you give it a whole bunch of metadata, you give it an ID from our DAM so it can pull the file, or you upload a file. You can play the audio and correct the errors. You can mark transcripts as complete. And then we would have an API, so we've got Bowerbird and the Bower. A bowerbird, for people who are not local, is an Australian bird that is very well known for grabbing lots of things, putting them in its nest and not letting anybody else share. The Bower is the nest, and we thought the Bower would just have to transcribe files, pull files from our DAM, identify entities and summarise. It turned out, unsurprisingly, to be a great deal more complicated than that.

I'd like to say at this point, there are a couple of people here from the New South Wales State Library: thank you very much for the work that you've done on a system called Amplify, which was originally created by the New York Public Library. We originally looked at just extending what Amplify did, or maybe just using it outright. It was written in, I think, Ruby on Rails 3 or 4, a very early version, and it had been updated, but in the end we elected to rewrite the whole thing from scratch to use what you can do with Rails 7 and 8 front-end work.
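As a rough illustration of the kind of decision the restriction manager has to make, here is a small hypothetical sketch in Python. The fields and rules are invented for the example; the talk only names the categories (public, researcher-only, embargoed, culturally sensitive, embargoed until somebody's death), not the NFSA's actual data model or logic.

```python
# Hypothetical sketch of a restriction-manager check. Field names and rules
# are illustrative only; they are not the NFSA's actual data model.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Restrictions:
    public: bool = True                      # cleared for general access?
    researchers_only: bool = False           # limited to approved researchers
    culturally_sensitive: bool = False       # cultural protocols apply
    embargo_until: Optional[date] = None     # fixed-date embargo
    embargoed_until_death_of: list[str] = field(default_factory=list)

def can_view(item: Restrictions, audience: str, deceased: set[str]) -> bool:
    """Decide whether an item may be shown to 'public' or 'researcher'."""
    # Date-based embargoes block everyone until they lapse.
    if item.embargo_until and date.today() < item.embargo_until:
        return False
    # 'Embargoed until somebody is dead' blocks access while any named
    # person is still living.
    if any(person not in deceased for person in item.embargoed_until_death_of):
        return False
    # Culturally sensitive material needs a curated pathway, not open access.
    if item.culturally_sensitive and audience == "public":
        return False
    if item.researchers_only and audience != "researcher":
        return False
    return item.public or audience == "researcher"

# Example: an oral history embargoed while a named filmmaker is alive.
oh = Restrictions(public=True, embargoed_until_death_of=["Noted Filmmaker"])
print(can_view(oh, "public", deceased=set()))                 # False: embargoed
print(can_view(oh, "public", deceased={"Noted Filmmaker"}))   # True
```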
So I'm going to do a super quick demo and show you some of the things this system can do. The number one thing, and we borrowed this idea from the Australian War Memorial, which has an absolutely terrific transcription site called Transcribe that I really recommend people look at: one of the key things they've been able to do is run campaigns that give people a great reason to get involved in transcription. So they will say, here is a whole bunch of World War II love letters, and we're going to announce this on Valentine's Day. The last time I looked, they'd had something like six million corrections made across a whole bunch of transcriptions. We have taken that idea of editorialising what we want to point people towards. In this case it's the Nagami Collection, Priscilla and Radio 100; this is not actually live to the public, so these are just things we've mocked up. But deciding what we can direct people to, and what we actually need transcribed or corrected urgently, is a big part of this.

I'm going to run you very quickly through correction and setup. In terms of adding new transcripts, it's exactly the kind of thing you would expect, with two important changes. You can see at the bottom of the screen that it has access control. One of the things we found very early on was that the curators were extremely keen to be able to upload a transcript and correct it privately, with no one else knowing it was on the system. They could invite contributors to work on it, but they could have very sensitive transcripts in the system without having to worry about other people accidentally being able to see them. As we move out into asking groups like the Australian Media Oral History Group to get involved, it's very important to have this sort of thing in there. We also allow, as Amplify does, different institutions in the system; I think the State Library of New South Wales works with about 30 other libraries through their Amplify instance. So you can put things into different institutions and different collections.

If I were to press save here, it would take... this particular transcription, I think, is about two hours long, and the engine works at about 50 times normal speed, so an hour can be transcribed in roughly a minute, give or take a little back and forth. We'll talk a bit more about that later. The key things you can do here: you can see all the lines, you can listen to the individual lines, and you can see it's got 'suggest a change' and 'mark as correct', so you can say, I'm listening to this line and it's not actually correct. We've also added what's called autoplay here, though we originally called it lean-back mode. The idea is that you press the button and the thing rolls through on its own. Are we able to get sound out of this? Okay, cool. [Playback:] 'It is the 9th of May, 2016, and this interview is being recorded at Cremorne, New South Wales. This will be file number one.' So it just walks through on its own.
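Pulling together the 'add a new transcript' flow just shown (metadata, a DAM identifier or an uploaded file, access control, institution and collection, then save kicks off transcription at roughly 50x real time), here is a purely hypothetical sketch of what a client call to a Bower-style API might look like. The endpoint, credentials and field names are invented for illustration; the talk does not document Bowerbird's actual API.

```python
# Purely hypothetical sketch of requesting a transcript from a Bower-style
# API. The endpoint and field names are invented; not Bowerbird's real API.
import requests

BOWER_URL = "https://bower.example.org/api/v1"   # hypothetical host
API_TOKEN = "..."                                # hypothetical credential

payload = {
    "dam_id": "NFSA-123456",           # pull the access copy from the DAM
    "title": "Oral history interview",
    "institution": "NFSA",
    "collection": "Oral Histories",
    "access": "private",               # curator-only until marked as ready
    "contributors": ["curator@example.org"],
}

resp = requests.post(
    f"{BOWER_URL}/transcripts",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
job = resp.json()
print("Job id:", job.get("id"))

# At ~50x real time, a two-hour recording should come back in a few minutes.
duration_hours = 2.0
print(f"Estimated wait: ~{duration_hours * 60 / 50:.1f} minutes")
```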
This autoplay is a change from Amplify, which lets you, well, prefers you to, arrow-key through the entire thing, which does become quite time consuming after a while. With this you can just lean right back.

There's also this idea of the confidence filter at the top. Where we are trying to find things that are actually wrong, because we're trying to find training data, we have the ability to filter these transcripts and ask Whisper to tell us where it thinks the transcript is incorrect. We had hoped this would give us lots of training data faster. Anecdotally, in a very vibe-based way, I don't think Whisper is particularly good at working out when it's got things wrong. It's really not very good with Australian place names, and that's the stuff it gets wrong most often. So this idea of being able to filter things by confidence, which we thought would be really useful: we're not sure. With a number of the things we're talking about today, we haven't had enough time or run enough experiments to be completely sure. But being able to pull out just the lines that need to be corrected would be a huge help if we can get it working.

So what did we actually need in the end? If you're going to build these kinds of systems, you need really fine-grained access control. You need some kind of playback, lean-back mode, so that curators can just watch the thing and correct anything that's wrong. We kept all of the messy stuff out of the main UX, so the benchmarking and test data are created somewhere else. We discovered, because we're dealing with files that are up to 12 or 24 hours long (some of our news recordings go on for a day), that we really needed the API to be able to cursor through specific files and play back tiny segments of audio, so we're not loading a file of multiple tens of gigabytes into a browser, which was obviously a complete disaster. It can now play things directly from the API.

And we really needed a provenance strategy. One of the things we're doing here is that each transcript is done with a very specific version of Bowerbird: version 1, version 2, version 3. We keep all the training data up to a particular point in time, and every two weeks or every month or whatever, we run all of that training data through and call the result a new version of Bowerbird. This means that a year from now, when Bowerbird has, we hope, improved significantly, we'll be able to rerun all the machine-generated transcripts and have slightly better transcripts coming through. There's a lot of other work we've been doing on provenance; I'm not going to go into it here, because we could spend the entire session on it.

Let's move on to how we're enriching the engine's vocabulary. The goal was to improve comprehension of Australian English while ensuring that comprehension of American English does not decline. So we're looking at word error rate, which anyone who's worked with transcription will have seen, and the real-time factor. We started out with the GigaSpeech dataset, which is very YouTube, spoken word, quite similar to the sort of collection we've got: are we just as accurate with American speech? Really interestingly, we've managed to make it just as accurate but slightly slower, and we're not quite sure why.
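As a reference point for the two benchmark numbers used in this talk, this is a minimal sketch of computing word error rate (using the jiwer library) and a real-time factor expressed, as the speaker does, as a speed multiple. It illustrates the metrics, not the actual Bowerbird benchmarking harness; the example sentences are invented.

```python
# Minimal sketch of the two benchmark metrics discussed: word error rate
# (via jiwer) and real-time factor. Illustrative only; not Bowerbird's harness.
import time
import jiwer

reference  = "we took the ute down to wagga wagga in the arvo"
hypothesis = "we took the youth down to wogger wogger in the arvo"

# WER counts substitutions, insertions and deletions over reference words,
# so lower is better; a WER of 0.175 means ~17.5% of words have an error.
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")

def real_time_factor(transcribe, audio_path: str, audio_seconds: float) -> float:
    """RTF as quoted in the talk: audio duration / processing time, so an
    RTF of 53 means 53 minutes of audio transcribed per minute of compute."""
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```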
We're still looking into why it's slower; we suspect it's just the amount of time files spend moving around between different systems. But the big question everyone's going to ask is: have we made it any better with Australians? No.

I'd like to point out that what we're trying to do is have an engine we can transcribe our entire collection with, and I thought the earlier presentation was very interesting on the point that sometimes bad machine learning is worse than no machine learning. So where is it with Australian English? It's 82.43% accurate; word error rate runs backwards, so about 17.5% of words have an error in them. The number at the bottom, by the way, the RTF, shows it currently moving at 53.36 times normal speed, so we can transcribe 53 minutes of audio in one minute. It's also worth pointing out that no Whisper model we've looked at, V3 or any of the others, has a lower than 15% error rate on our corpus. Our corpus is not huge at this juncture; we are trying to build it out. Where has Cathy gone? I'd love to have a chat with Cathy about what you're doing there.

But we decided, okay, this is not how machine learning works, but can we have a directed learning plan? If Whisper has 680,000 hours of training material in it, can we put a small amount, or really quite a lot, of Australian training data into it and try to force-feed it Australian? The idea was that we would put together a whole bunch of Australian place names and a whole bunch of Australian slang, and we would use an LLM to turn them into stories, because you can't just recite place names to Whisper and expect it to learn them; they've got to be in some kind of context of sentences and paragraphs. So we used an LLM (we didn't build one) to turn the vocabulary into stories, and I'll show you one of those in a second. They are pure Australian poetry; the funniest part of this was actually reading the scripts.

We asked a whole bunch of different people to record them for us. It wasn't particularly racially diverse; we are working on First Nations material after this, and for now we have been trying to get the basics, the benchmarks and so on, out the door. We asked these 40 people to recite the stories, we had the recordings transcribed, and we had them corrected. This is an example of a script. All the scripts are named after the first three slang words in them, so this one is beauty, pint, ute. The index for this thing is absolutely hilarious; the word rissole turns up all the time. For the Australians in the audience, I think the funniest thing about all of this is that the LLM is absolutely convinced that the Australian slang term bikkie, which is a biscuit, is actually an adjective meaning great. So it talks about bikkie sunsets, it talks about bikkie cars, and I'm actually thinking about using that permanently because it just really works.

We ended up having to build a workflow system to manage getting these scripts to people, checking that we had them back, checking that corrections had been made, generating PDFs, and writing the instructions you can see in the lower corner, because we worked with a bunch of people who are less confident recording voice notes, so we had to say to them: it doesn't matter if you um and ah.
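The talk doesn't say which LLM or prompt the team used, so the following is only a hedged sketch of the 'turn slang and place names into a recordable story' step, using the OpenAI chat completions API as a stand-in. The vocabulary lists and prompt wording are illustrative, not the NFSA's actual training material.

```python
# Hedged sketch of generating a recordable "story" script from Australian
# slang and place names. The model, prompt and vocabulary are illustrative
# stand-ins; the talk does not specify what the NFSA actually used.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

slang = ["beauty", "pint", "ute", "arvo", "servo", "rissole", "bikkie"]
place_names = ["Wagga Wagga", "Albury", "Wodonga", "Cremorne", "Gundagai"]

prompt = (
    "Write a short, natural-sounding story of about 300 words, suitable for "
    "someone to read aloud. Use each of these Australian slang words at least "
    f"once, in its correct sense: {', '.join(slang)}. "
    f"Mention each of these place names at least once: {', '.join(place_names)}. "
    "Keep the sentences conversational, as if told by an Australian speaker."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

script = response.choices[0].message.content
# Name the script after its first three slang words, as described in the talk
# (e.g. "beauty, pint, ute").
print(f"Script '{', '.join(slang[:3])}':\n{script}")
```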
The biggest issue we had with this is that originally we had a very broad set of Australian place names, some of them little, quite obscure places that not everybody knows about, and we really were not sure that everyone would be able to pronounce all of these names correctly, so we wound it back to cities and towns over a certain size. Ettamogah survived because I grew up in Albury and Ettamogah is quite close, but I don't know that everyone would pronounce that correctly, to be honest. At the very last minute, on the weekend before this went out, we were busily reducing the vocabulary, and we'd like to go back and broaden it again once we're ready.

With that done, how do we verify that the engine is actually getting better at Australian? In general, when you're building these systems, you have a hell of a lot of training data and a tiny amount of test data; the GigaSpeech dataset is 2,500 hours of training data and 50 hours of test data. We knew that we had an enormous number of corrected PDFs and Word documents where our curators had already transcribed oral histories. So obviously the extremely easy thing we were going to try was extracting a whole bunch of unstructured data from PDFs and Word documents. I don't know if anyone else has ever tried doing this kind of thing; it's really painful. You can see here that this text is obviously perfect, because humans have corrected it and nothing could possibly go wrong with a human-corrected transcript. We wanted to take this corrected text, run the audio through Bowerbird, and then make sure that the benchmark text we were going to use, which has to be correct, actually matched what was in the original document.

You might have looked at that previous slide, if I can just go back, and said, well, they look reasonably similar, I can't really find anything wrong. But you can see that two names are spelt incorrectly; this is an older transcript, so it uses an ampersand instead of the word and; and Jeff's name is spelt both of the alternate ways that Jeff can be spelt. So we ended up having to build yet another interface where somebody could go through, with little buttons you can press, and it will highlight all the ways in which the two pieces of text disagree with each other. An external person, a lady called Sarah, went through the extremely tedious work of going through, I don't know, something like 200 hours of transcripts that we've used for benchmarks, working her way through this kind of thing. As you can see from this point down here, the system that was supposed to match up the broken-up Word document with the Bowerbird segments often just completely lost track of where it was, so Sarah had to go back and do a lot of this by hand.

There are a lot of time-consuming issues in this part: trying to extract unstructured text, mismatched segments. In general, with transcription, one of the biggest issues we've found, and one we didn't deal with in the UI, is that Whisper doesn't always notice when the speaker has changed. So you really need the UI to be able to say speaker 1 is speaking to this point and speaker 2 from this point, and by the way, don't mix the IDs up when you break this line down; maybe we can call this line 57 and this one line 57A, because that kind of stuff works beautifully in databases.
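As a sketch of the kind of comparison that interface performs, highlighting where a human-corrected document and a machine transcript disagree, here is a small example using Python's standard difflib. It illustrates the matching problem only; it is not the NFSA's actual alignment tool, which also has to cope with segment boundaries that drift, and the two sample sentences are invented.

```python
# Sketch of highlighting disagreements between a human-corrected transcript
# and a machine-generated one, using the standard library's difflib.
# Illustration of the matching problem only; not the NFSA's actual tool.
import difflib

human   = "Jeff drove the ute out to Wagga Wagga & met Geoff at the servo."
machine = "Geoff drove the youth out to Wogga Wogga and met Jeff at the servo."

matcher = difflib.SequenceMatcher(None, human.split(), machine.split())

for op, h1, h2, m1, m2 in matcher.get_opcodes():
    if op == "equal":
        continue
    # Report each word-aligned span where the two texts disagree.
    print(f"{op:8} human: {' '.join(human.split()[h1:h2])!r:30} "
          f"machine: {' '.join(machine.split()[m1:m2])!r}")

# Word-level agreement ratio: a rough proxy for how much hand-checking a
# segment will need before it can be trusted as benchmark data.
print(f"similarity: {matcher.ratio():.2%}")
```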
So yeah, it was a very entertaining, complicated process, and it took up a lot of the final stage of getting this ready.

The final thing I'm going to talk about today: we are trying to present this work in public, and we are now here in front of you asking, where does Bowerbird go from here? We would really like to talk to other cultural institutions about whether you can use the back end of this, the API. We would really like, at some point, to put this in front of the public. Anyone who's worked with crowdsourcing will be familiar with these kinds of diagrams: if we actually want a functional, foundational model of Australian English, then we need a lot of users to correct a lot of transcripts to improve the quality, which will hopefully bring in a whole bunch of organisations who want to bring in more users, who bring in more corrections, and so on. Around and around it goes, and hopefully at some point we have actually made a dent in Bowerbird's ability to understand Australian.

So that's the super fast run through the entire thing. While we're here, I'd just like to thank our oral histories team, who have been absolutely fantastic with this; they have really gone above and beyond in terms of helping. There are two links in the presentation here: our principles for machine learning and artificial intelligence, if you'd like to use or borrow them, and Rebecca Coronel and Keir Winesmith, two of our executives, talking about the conversational archive. That's it from me. This is the QR code, and I'll be getting the answers to it, so if you just want the presentation, please just say 'send me the presentation'. There's also a box where you can offer a question or a comment, and we're actually happy to get comments. Please get in touch. Thank you.
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.