
Presenter: Joshua Ng
In 2022, Archives New Zealand spearheaded an innovative proof-of-concept project to explore the potential of machine learning and hyperscale cloud computing in auto-classifying digital public records and surfacing information of interest to Māori. Joshua Ng provides an overview of the process, the solutions that were developed and what to consider in the design and usage of these tools.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
Want to learn more about this event?
Visit the Fantastic Futures 2024 Hub
This transcript was generated by NFSA Bowerbird and may contain errors.
Tēnā koutou katoa. He mihi tēnei ki te mana whenua. Nō Malaysia au. E noho ana au ki Te Whanganui-a-Tara. He kaitātari matapopore matihiko au. E mahi ana au ki Te Rua Mahara o te Kāwanatanga. Ko Joshua Ng tōku ingoa. Tēnā koutou, tēnā koutou, tēnā tātou katoa. My name is Joshua Ng. I'm a Malaysian based in Wellington, and I'm a digital preservation analyst at Archives New Zealand. Today I'll be sharing with you a machine learning proof of concept that we did in 2022 to automate record classification. Slight correction: it was mainly to classify public records according to our disposal authorities, but at the same time we were also trying to see whether or not we could surface some of the public records that are of interest to Māori. So those were the two goals. I'll start with a little bit about Archives New Zealand. We are the national archive of New Zealand, and we want Archives New Zealand to lead the way as a steward and protector of the memory of government in the digital age. This means we have to reimagine how we store, appraise, and process information and records. This proof of concept was done two years ago, before OpenAI sort of released the Kraken. So we were trying things out, and this is what we have done. As we all know, everyone produces lots of data. Government agencies produce lots of data as well, and most of it is born digital. Lots of it doesn't need to be kept and does not have long-term value. But some obviously does, and how do we know which is which? It's important to keep the records that matter, so that they serve as the long-term memory of the government. And we are a national archive; we move a bit slowly. Our current systems definitely need to evolve. We have lots of legacy systems, and a lot of systems that were designed with paper records in mind. I'm sure some of you are also familiar with that.
Meanwhile, out in government agency land, where we are the regulator, there are lots of databases, emails, legacy systems, and shared network drives full of content. We have heard about these a couple of times these past few days: they are very difficult to handle, and it is definitely not possible for a human to sort through them one by one. So we needed to try a new approach, and this was an opportunity to do so. We wanted to see whether machine learning tools could help us with this, and at the same time take advantage of hyperscale cloud capabilities. This is interesting because, as some of you might not know, there are currently no data centers from the big tech players in New Zealand yet. Amazon, Microsoft, Google: none of them actually have a strong presence, or any presence, in New Zealand. So we were trying to see, as a proof of concept, whether or not any of this could be done. But at the same time, it had to be done very carefully, because those data centers are actually not in New Zealand. And of course, machine learning tools, as we have all heard, are supposed to do things faster, surface information from unsorted data, and improve the efficiency of information management. So we started the whole thing with a proof of concept. Thanks to the Digital Government Partnership Innovation Fund, we got a little bit of money, and we had less than half a year, about four to five months, to see how we could streamline the appraisal process. This is a process that is supposed to be done by all agencies within their own organization, to classify records according to a disposal authority that is agreed and approved by Archives New Zealand and the agency. So we were trying to streamline it, and we wanted to see how we could assign and classify records under the appropriate disposal authority.
At the same time, since we were doing this, we wanted to see whether we could identify material of importance to communities, especially the Māori community. Our partner was the Ministry for Primary Industries. We were really happy that they partnered with us, because we couldn't do this by ourselves; we needed to work with an agency that has actual data. The way this kind of machine learning works is that you have to have human-labelled data before you can actually train the model, and the Ministry for Primary Industries had a set of data that allowed us to do that. We also worked with the Ministry of Justice, who had undertaken a similar proof of concept, and they were there to share their experience and guide us along the way. And most importantly, there were two big players, Microsoft and Amazon Web Services, our technology partners, who actually helped us develop the proof-of-concept solutions. We had enough money to run operationally, and Microsoft and AWS, trying to enter the New Zealand market, were very keen to do something, so we partnered with them. We carried out the proof of concept in four parts: planning, preparing a test environment, preparing the data, and solution building and testing. We spent a lot of time defining the scope, because streamlining the appraisal process can mean anything, and we landed on automatic classification of public records, a very narrow scope. We also spent a lot of time preparing the test environments, because this was a tricky situation. None of the partners have a presence in New Zealand, but we have very strong data sovereignty requirements, which you heard about this morning. So we needed to figure out how to do this without having the data move outside of New Zealand. We did a lot of workarounds and prepared a test environment so that data doesn't really go out. Preparing the data was also something we spent a lot of time on.
Yes, the Ministry for Primary Industries had a set of data, but as it happened, they were classifying that data in preparation for a transfer to Archives New Zealand. It was not meant for a proof of concept like this, so we actually had to massage and transform the data so that it fitted the proof of concept we were trying to do. Most of the time was spent getting ready before we could actually put the data into the systems that the partners had developed and tested. And voila, the outcome was a success. We got about 75 to 88% accuracy in identifying the correct disposal class, and we discovered that with more data, the accuracy improved, no surprises there. We were also able to find some records of interest using the Māori subject headings that our National Library colleagues have developed. It was a proof of concept, so we were kind of like, yeah, it kind of works. 88% accuracy is great, but not good enough, because we weren't sure how we could push it higher, and we knew that a human would still need to review the results. But that is a success. Now a little bit of a solution overview, not in too much detail. Microsoft and Amazon actually do it slightly differently. Microsoft labels the documents in SharePoint, so we kind of had to put them in SharePoint; then they are transferred into a data lake, their own solution, and only then are they scanned and classified within the data lake. The solution then predicts attributes and extracts them out again. That is the Microsoft way of doing it. Amazon doesn't really have the suite of enterprise tools that Microsoft has, so Amazon actually developed some of these tools from scratch, most likely from open-source libraries, and in the end applied them to our data. They prepared and ingested the data into an S3 bucket, and only then extracted the text from the S3 bucket and classified it.
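The supervised classification at the heart of the proof of concept can be sketched in a few lines. This is a minimal illustration only, not the partners' actual solution: a tiny multinomial naive Bayes classifier over bags of words, with made-up class labels (RETAIN, DESTROY) standing in for real disposal authority classes.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class DisposalClassifier:
    """Minimal multinomial naive Bayes over bags of words.

    The labels are hypothetical disposal classes; a real disposal
    authority defines its own class codes and descriptions.
    """

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # token counts per class
        self.vocab = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(doc_count / total_docs)
            label_total = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (label_total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of labelled example records, this sketch picks whichever hypothetical class shares the most vocabulary with a new document; the proof of concept worked the same way in principle, just with the partners' tooling and a far larger human-labelled set.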
What they provided was also a sort of dashboard for us to review, so we could check whether or not records were correctly classified. As for the design considerations we had when building all this: we quickly realized that we don't have the capability. Just over lunch, we were talking about how great it would be to have the resources the NFSA has, like software engineers, software developers, a full-time employee just focusing on AI and machine learning. We have none of those. There are only three people in Archives New Zealand who are even remotely connected to digital; everyone else is doing everything else that we do at Archives New Zealand. So we had to wear multiple hats while doing this, and we definitely need to improve our capability if we're going to push this further. We also had a lot of issues with the training data. Getting it classified by a human is an arduous task. Before we found the Ministry for Primary Industries, we looked around all of the New Zealand government agencies and asked who else would be keen to join us, and it took a long time to find a partner, because those data are sitting in shared drives and they're not touched, for reasons that we all know: it's probably too hard, it's not the priority, and there are so many other information management things that every agency needs to focus on. This is like the last thing on their priority list. So having a human classify the records is actually a big ask. The other thing we learned is that a larger training set is more accurate. We started off with 100 files, and accuracy was only 35%, no surprises there. Then we increased it to about 1,000 files, and the accuracy went up to about 85%. Trust was also a big thing. We did a lot of running around convincing everyone that this was going to be OK, because we were going to safeguard the data.
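That jump from 35% accuracy at 100 files to 85% at 1,000 is a classic learning curve. A sketch of how you might measure one is below; the keyword-vote classifier and all of the data are deliberately simple stand-ins, not the models the partners built.

```python
import random
from collections import Counter, defaultdict

def train_keyword_classifier(train_docs, train_labels):
    """Toy classifier: each token votes for the labels it co-occurred with."""
    votes = defaultdict(Counter)
    for doc, label in zip(train_docs, train_labels):
        for tok in doc.lower().split():
            votes[tok][label] += 1

    def predict(doc):
        tally = Counter()
        for tok in doc.lower().split():
            tally.update(votes[tok])
        return tally.most_common(1)[0][0] if tally else None

    return predict

def learning_curve(train_fn, docs, labels, sizes, holdout_frac=0.2, seed=0):
    """Hold out a fixed test split, then train on growing prefixes of the
    remaining labelled pool and report holdout accuracy at each size."""
    rng = random.Random(seed)
    paired = list(zip(docs, labels))
    rng.shuffle(paired)
    n_test = max(1, int(len(paired) * holdout_frac))
    test, pool = paired[:n_test], paired[n_test:]
    accuracies = {}
    for size in sizes:
        subset = pool[:size]
        predict = train_fn([d for d, _ in subset], [l for _, l in subset])
        correct = sum(predict(d) == l for d, l in test)
        accuracies[size] = correct / len(test)
    return accuracies
```

The point the numbers in the talk make is exactly what a curve like this shows: holding everything else fixed, accuracy rises as the human-labelled pool grows, which is why the labelling bottleneck matters so much.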
The data was not going to go outside. We had everyone sign an agreement that we were not going to misuse this data, and we got the partners to sign it too, so that they can't use this data to train whatever models they have. This was a proof of concept for us only. We did a lot of this kind of background administrative work to make sure everyone was OK with it. And if we were going to do this again, we would definitely need to work with our Māori communities to get a better understanding of what it is that they're interested in, so that we can take that input into the approach. Because we were just trying to do a proof of concept, we did reach out to a couple of communities, but there was not a lot of interaction, and we also had a very limited amount of time. So we definitely need to work closely with them moving forward. Among the system considerations: disposal authorities were written for paper records and obviously need updating. We need to rewrite some of them so that they fit the digital era, and also make sure that the ontology is in place so that they are usable by machines. I talked about working with Māori, and we also wanted to align our work with the broader government information system. We are the regulator of New Zealand government information, but we can't do this by ourselves, so we definitely need to work with others and align with some of the newer things that are coming out, like the Algorithm Charter and the AI governance policies that we have. So what's next? Obviously, this was a success, and we are really excited that this can be done. One interesting thing that happened is that after the project finished in 2022, we quickly ran out of funding. The project ended, and then we asked around again, and hey, everyone is just doing other things.
We couldn't really find anyone to proceed to the next phase and run a pilot. We are still trying to find someone to work with us, but we haven't got there yet. But here is something fresh out of the oven. I just saw a demo last Thursday; it's not in my slides because I only saw it last Thursday. There is a demo of a solution using generative AI to actually classify the data. What is great about this idea is that we no longer need to pre-label the data before the machine looks at it. The difference is that it uses an LLM, and the secret sauce is writing the correct prompt. What we saw was someone translating all of the disposal authorities that were written for paper records into LLM prompts, teaching the LLM what it is that we are looking for, and then applying it to documents. Again, this is very early stage; it's not even a proof of concept. But we are really keen to pursue it, because it would mean moving this project forward again with a different method, and we are really interested in automating the classification of records. It would solve a problem that we have had for a very long time, and we hope it will unblock the thing we're trying to do. That's the end of my presentation. Any questions? No questions? No questions, sorry. You can talk to me after this.
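The prompt-driven idea from that demo boils down to turning the disposal authority itself into the prompt. A sketch of the prompt construction is below; everything in it is invented for illustration (the class codes, descriptions, and wording), and the actual call to an LLM endpoint is deliberately left out, since the talk does not name a provider or model.

```python
# Hypothetical disposal classes; a real disposal authority defines its own
# codes, descriptions, and retention actions.
DISPOSAL_CLASSES = {
    "DA-1.1": "Policy advice and ministerial briefings. Retain as public archives.",
    "DA-2.3": "Routine administrative correspondence. Destroy after seven years.",
    "DA-3.5": "Records documenting engagement with Maori communities. Retain as public archives.",
}

def build_classification_prompt(disposal_classes, document_text):
    """Compose a single prompt asking an LLM to choose the best-matching
    disposal class for a record and justify the choice."""
    lines = [
        "You are classifying a government record against a disposal authority.",
        "Choose exactly one class code from this list:",
        "",
    ]
    for code, description in disposal_classes.items():
        lines.append(f"- {code}: {description}")
    lines += [
        "",
        "Record:",
        document_text,
        "",
        "Answer with the class code on the first line,",
        "then one sentence explaining your choice.",
    ]
    return "\n".join(lines)
```

The returned string would then be sent to whichever LLM a future pilot settles on, with the first line of the response parsed as the predicted class; and, as with the 2022 proof of concept, a human reviewer would still need to check the result.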
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.