
Presenter: Miguel Escobar Varela
Miguel Escobar Varela offers an overview of the early stages of a collaborative project that brings historians, digital humanists and computer scientists together to explore the interconnectedness of Singapore's history with the broader Malay-speaking world during the late colonial period. The study takes a look at Malay-language periodicals published in Singapore, utilising a unique methodology that combines optical character recognition (OCR) and large language models to convert newspapers and magazines written in Jawi to Rumi, enabling the processing and analysis of a vast corpus of historical periodicals at an unprecedented scale.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
I know I've got a tight 15 minutes, but I've also got a timer over there. Hello, everyone. It's fantastic to be here at this conference. I've been learning a great deal from all of you over the past two days, and now it's my turn to share a little bit about what we've been doing. I'll be speaking about a pilot project that we're working on. I'm an academic researcher, so my role is a bit different to that of some of the people here in the audience, but I do work with the National Library of Singapore, that is, with the people who are custodians of this collection.

So what I want to do, and what my team wants to do, looks like this. We have the Jawi newspapers, and we want to extract images and text so that we can do systematic analysis of themes, formats, and events. The flow of this arrow here might be very familiar to you: many of you have collections that you want AI to convert, in some way, into structured data. That might be video, audio, or other kinds of data. But you might be less familiar with the Jawi newspapers themselves, although I know there are some Singaporeans and Malaysians in the audience, so a shout-out to all of you.

For those of you not familiar with this, let me tell you a little about it. The thing to bear in mind is that Singapore is a profoundly multicultural society. We have four official languages: English, Tamil, Mandarin Chinese, and Malay. The de facto everyday language today, the lingua franca, is English. But up to about 70 years ago, Malay was the lingua franca. Currently, Malay is written in the Roman script, a romanised Malay called Rumi. But up until maybe 50 years ago or so, it was still common to have Malay written in the Arabic script, something that we call Jawi.
So Jawi is an adaptation of the Arabic script with a few additional letters to represent the phonemes of the Malay language. So anyway, we've got these Jawi newspapers. We've got about 80,000 pages, which is kind of small compared to the millions of items in the collections that many of you have. But what's interesting is that we have a full run covering about 100 years. The reason it's not so big, for a newspaper, is that the issues used to be pretty small: typically a single issue has maybe 5 to 10 pages. So they're small, but they're quite varied. They have a lot of reports on news from around the world, sports coverage, poetry, a ton of advertising. And they cover a really interesting time in history, from the mid-19th century to the mid-20th century, so the late colonial period and the early nationalist period in Singapore, and they offer coverage of what my colleague Dr. Faiza Zakaria, who is a historian, calls vernacular cosmopolitanism: how people were interpreting change at this pivotal moment in history.

But they're also really fascinating as objects, so I just want to give you a sense of what they look like, and with these big screens here maybe you can appreciate it. You can see photographs, you can see different items. And I want to zoom in on the image in the centre. I don't know how well you can see it over there, but you can probably guess that that's beer being poured. It's a stout, from Guthrie and Co., and it says "Bulldog Stout" in English. And then the second line, because Jawi is written right to left, spells that out in the Arabic script.
So it's a kind of phonetic approximation of "Bulldog Stout": a combination of English and Malay written in the Arabic script. What's also fascinating about this from a cultural perspective is that some people still use the Jawi script today, but it now has a more religious connotation. For a lot of people it's connected to a more conservative view of Islam, which is all well and good, but that's a very recent phenomenon. Something like this would be unimaginable today: nobody would run a beer advertisement in Jawi. But this is from the 1920s. So only 100 years ago we had this kind of vibrant multicultural intersection and dialogue going on in Jawi, in Malay. And people don't know this history. Even when I show this to Singaporeans who are familiar with the history, everyone is surprised, and everyone loves that example. So I thought you'd appreciate it here too.

There's a ton of things we want to do with this collection, including looking at images and formats, but for now I want to concentrate on our initial experiments with language. There are many challenges to working with this material. One is the sheer number of pages: 80,000 is not nothing. Also, Jawi is not commonly used anymore. So one of the first things we did was build capacity for people who want to learn how to read Jawi. We created flashcards. I don't know if anybody has used Anki decks for learning languages; it's amazing, I use it all the time. We created decks specifically for Jawi, first for learning the letters. There are only about 40 or so: the letters of the Arabic script plus a few adapted to the Malay context. And then we also created a web app, in Streamlit, for practising reading Jawi. I don't have the URL there, but it's on my website.
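As an aside on how a most-common-words reading drill like that can be put together: the frequency list itself is a few lines of Python. This is a minimal sketch, not the project's actual code; the tiny corpus is invented for illustration, and it assumes you already have romanised (Rumi) text to count:

```python
from collections import Counter
import re

def most_common_words(texts, n=100):
    """Return the n most frequent (word, count) pairs across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on word characters; crude but
        # serviceable for romanised Malay.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)

# Invented two-line corpus, purely for illustration
corpus = [
    "berita dari singapura dan dunia",
    "berita sukan dan berita dunia",
]
print(most_common_words(corpus, 3))
```

In the real app this would run over the transliterated corpus, with the words rendered back in Jawi for reading practice.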
So if anybody wants to brush up your Jawi reading skills, do give me a shout. By the way, this is a small team: historians, computer scientists, and the odd digital humanist in the midst. All of us speak Malay, but most of us didn't read Jawi. I myself didn't read Jawi at the beginning of the project, but we wanted to learn as part of it, so we built this app to help us learn, and we extracted the most common words. In this Streamlit app you get the hundred most common words used in these Malay newspapers, and you can practise reading them.

It's great that more people can read Jawi and interact directly with this collection, but we still have the problem of scale: 80,000 pages. This is where a computational solution comes in. What we want to do is OCR, then transliteration, and then the systematic analysis. By the way, when I created these slides I didn't realise I was going to have this massive screen, so this is just a placeholder. The page is actually from the collection, but the OCR shown doesn't correspond to that transliteration; in case you can read it, you can even see that one says 1936 and the other doesn't. I just thought people would be too far away to ever see that. Anyway, we want a computational approach to this pipeline, which will be very familiar to people here.

Some people would say, well, why don't you just skip the OCR and go straight to transliteration, directly from these high-quality PDFs scanned by the National Library
of Singapore, who, I should say, are partners whom I acknowledge in this collaboration. So we could go directly from the high-quality scans to the romanised Malay that most people know how to read. The problem is that we wanted to take advantage of existing OCR pipelines that might work with Arabic, or at least that's what we thought at the beginning; some problems are coming up in the next few slides. We could also have done just OCR, working with a UTF-8 version of the Jawi and going straight to the systematic analysis, avoiding the transliteration step. But we want to take advantage of these really great LLMs that have been developed in Singapore: AI Singapore has built large language models trained from scratch on Southeast Asian languages, and they work really well with Malay and Indonesian. So we want to do things like theme and format classification, named-entity recognition, all the kinds of tasks that work pretty well with these models. We want the best of both worlds: an OCR'd UTF-8 version of the Arabic script, but also the romanised Malay.

Now, the first challenge, I would say, is that there's no data. There was no good corpus of transliterated texts that we could use for training. There are a bunch of transliterated materials; the Jawi Transcription Project, for instance, was done at the National University of Singapore a few years back, but they only transliterated, even though it's called a transcription project, weirdly. There was nothing transcribed directly into Jawi. So that's the first thing we did as part of the seed grant we had: we created this data set, which is now available on Hugging Face.
You can just go to my website over there, mevsg, and you'll find it. It's a very small but gold-standard data set for Jawi OCR, should you ever need that. It's line by line, so it doesn't solve all the problems, but it was manually created with very carefully controlled quality. And it has a spread of languages, English and Malay, and of the kinds of things you see in the papers: captions, titles, all that diversity. But it's pretty small; we're talking about 1,200 sentences or so.

Then we tried to see how well out-of-the-box OCR pipelines work here. The Surya library has been doing the rounds on Twitter as an excellent library for OCR, and its creators report 98.1% accuracy for Arabic, which sounds pretty awesome. So we said, okay, let's see how well it performs on our data. And the answer is: not great. By the way, this is a strange way to report things. That 29% accuracy is just what you get when you subtract the character error rate, so the character error rate is about 70%; I thought it would be more dramatic to present it this way. And I won't even show you the word error rate, because it wouldn't be fair to this model. It basically didn't get anything right; it's essentially zero. Because there are no shared words, right? Arabic uses the same letters, but the vocabulary is different, just as Dutch is written with the same characters as English but the words are not the same.

So what we did then was fine-tune. We were trying some other things, but a couple of weeks back Qwen2-VL dropped, and I've been telling everyone I've run into at this conference: this is a game changer.
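A quick note on the metric behind these numbers: the accuracy figures are just 100 × (1 − CER), where the character error rate is the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. A minimal illustration in Python; this is the standard definition, not our actual evaluation harness:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Two substitutions in a 7-character reference: CER = 2/7
print(cer("bintang", "benteng"))
```

So a reported 29% "accuracy" corresponds to a CER of about 0.7, and 90% to a CER of about 0.1.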
We were able to train it on a single A100 machine on Lambda Labs for about six hours, only about 1,000 steps, and we got 90% accuracy. Which is still not great; we're not going to deploy this on the entire collection anytime soon, but it shows this is theoretically possible. And it's much better, a whopping 60 points better than the other model. So I'm feeling good about this. By the way, our model is also available on Hugging Face if you want to look at it: it's the Qwen for Jawi model.

Okay, I'm going to go pretty fast now. Transliteration is a really hard problem, because you don't always write the vowels. Sometimes you do, but not always. So, for example: what is this word? That's a total trick question, because it really depends. Let me explain what these letters are. Again, we're reading right to left: this is a B, this is an N, this is a T, this is an NG. So what is b-n-t-ng? Well, it could be any number of things. It could be banteng, it could be banting, it could be benteng, or it could be bintang. The last two are pretty common: benteng is a fortification, and bintang is a star. Totally different things. This example, by the way, I've taken from Mulaika Hijjas, a scholar of Jawi based at SOAS University of London. She uses it to point out that it's actually pretty easy to learn to recognise the characters, but interpreting them in context is a whole other story. This is where we think LLMs can be good: post-OCR correction. So we created a data set for b-n-t-ng disambiguation: we manually created 500 examples of each possible form of the word, each with the correct reading.
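To make the ambiguity concrete in code: given the consonant skeleton b-n-t-ng, the candidate readings come from inserting vowels between the consonants and keeping those that are real words. A toy sketch; the vowel pattern and the tiny lexicon are illustrative only, not our disambiguation pipeline:

```python
from itertools import product

SKELETON = ["b", "n", "t", "ng"]  # consonants as written in Jawi
VOWELS = ["a", "e", "i", "o", "u"]

# Tiny illustrative lexicon of real Malay words matching b-n-t-ng
LEXICON = {"banteng", "banting", "benteng", "bintang"}

def candidates(skeleton, vowels):
    """Generate every CvCCvC reading of a four-consonant skeleton."""
    return [
        skeleton[0] + v1 + skeleton[1] + skeleton[2] + v2 + skeleton[3]
        for v1, v2 in product(vowels, repeat=2)
    ]

readable = [w for w in candidates(SKELETON, VOWELS) if w in LEXICON]
print(readable)  # the four plausible readings
```

Picking among those four is where context, and hence the LLM, has to come in: the written form alone cannot distinguish a fortification from a star.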
And we are currently training a model; we're trying to see if SEA-LION, the Southeast Asian Languages in One Network models trained in Singapore, can work for this disambiguation problem. I'm pretty hopeful that it will, but I don't have the results just yet. I was hoping to have them ready by today, but it still needs a bit more training, so watch this space on Hugging Face. But I believe this is the kind of thing that LLMs have been shown, time and again, to be extremely good at. And we hope at some point to do not just disambiguation on this toy example, but also more complicated contexts, where you can correct OCR and also guess the vowels from the surrounding text. Again, I might be optimistic, but I think I'm in the company of people who are similarly carefully optimistic about technology, and these are the kinds of things I think are totally within reach.

So this is our work so far, but it's of course an oversimplification of the many more things we want to do. Eventually we want to do things like document reconstruction. Everybody who's worked with newspapers knows this problem, right? Your article begins in one column and continues somewhere else, maybe on another page. I'm not reporting on that today, but since I have one more minute: we've also been able to build a classifier, just a BERT model, that predicts which passage is the next continuation, given a set of candidates. And it's extremely good at these kinds of simpler tasks, the ones that are not generative in the way we use that word today. So that's where we are now. And we are looking for friends and funding to expand, so, FF24.
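On that article-continuation classifier: the task shape is pair scoring, i.e. given the end of one column and several candidate passages, score which one continues it. Our model is a fine-tuned BERT; the sketch below substitutes a crude word-overlap heuristic as a stand-in scorer just to show the interface, and the Malay snippets are invented:

```python
def overlap_score(ending, candidate):
    """Stand-in for a BERT pair classifier: fraction of the ending's
    words that reappear in the candidate passage."""
    a = set(ending.lower().split())
    b = set(candidate.lower().split())
    return len(a & b) / len(a) if a else 0.0

def best_continuation(ending, candidates):
    """Pick the candidate most likely to continue `ending`."""
    return max(candidates, key=lambda c: overlap_score(ending, c))

ending = "harga getah di pasaran singapura terus"
candidates = [
    "sukan bola sepak di johor",                # unrelated sports item
    "naik kerana permintaan getah di pasaran",  # continues the rubber story
]
print(best_continuation(ending, candidates))
```

A real BERT model replaces `overlap_score` with a learned probability over (ending, candidate) pairs; the surrounding selection logic stays the same.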
We're applying for some funding, so we hope that will come through and that people will find this interesting. But I'm also really looking for advice and suggestions, and for anybody you know who might be interested in this kind of work, so please come talk to me. Thank you very much.
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.