
Presenter: Javier de la Rosa
Javier de la Rosa takes us through the research, evaluations and conclusions resulting from the Mímir Project: an initiative by the Norwegian government that aims to assess the significance and influence of copyrighted materials in the development and performance of generative large language models, specifically tailored for Norwegian languages.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
First, I'm going to check the clicker. Yeah, it seems to be working fine. Okay, today I'm going to be speaking about the Mímir project, which is an initiative among different institutions in Norway. We were tasked by the government to evaluate the impact that copyrighted materials may or may not have on the performance of generative language models. So I'll talk a little bit about why we're doing this, why the government gave us that mandate, and obviously the results that came out of it.

Just to give a little bit of background, we know that we're now living in this sort of revolution of LLMs. Everything that can be done can be done through an LLM. Natural language processing is basically pervasive everywhere. But for training these artifacts, we all know that vast amounts of data are needed, which raises a lot of legal and ethical questions about the proper way of compensating authors.

If we look briefly at the timeline for this: it was around November 2022 that ChatGPT was released. It was a before-and-after moment for the community because it was the first successful public-facing product that put a conversational interface on top of a language model. Before ChatGPT, we all knew maybe about GPT models, some people had heard about BERT models, T5 models, things like that. But these models were not at the fingertips of everyone, regardless of expertise. A few months later, OpenAI released a new model called GPT-4, which was an improvement over the foundation model behind ChatGPT, or we think it was, since ChatGPT was a model adjusted from the existing GPT-3-point-something model that OpenAI kept private. GPT-4 was a foundational model trained on next-word prediction that could also be fine-tuned and adapted for many different use cases, one of them being an interface like a chat.

But not a lot of time passed until different author organizations realized that these models were, in some cases, reproducing their content verbatim. So a rain of lawsuits started in the United States and also in Europe. The whole thing was shaking the industry because there were no proper ways of crediting this work. The kind of data that OpenAI used to develop these models was actually unknown, and it seems like they're not going to publish or say anything about the data they have been using. There are interviews with Sam Altman, the guy behind OpenAI, and I say "the guy behind" because before, it was a non-profit organization looking to improve society. They had a board making decisions, and then the guy basically did a coup, kind of, and he's now leading the whole thing, turning it into a for-profit company. But that's probably derailing my presentation a little bit.

So in November of 2023, the rights holders' organizations of Norway, in resonance with rights holders' organizations around the globe, wrote a letter to the Norwegian government. I work for the National Library of Norway, at a small AI lab, and we develop language models for Norwegian society. We release them publicly, not only as open weights, but mostly as open source models as well. Before ChatGPT, no one was very interested in any of this, so we were releasing what are called encoder-only models. These are models that are not able to produce text. We also released T5 models, which are sequence-to-sequence. They are good for converting one piece of something into another piece of something, but they are not generative in nature, like GPT is.
They are good for translations and things like that. But we also started experimenting with autoregressive modeling, GPT-style. And right after all of these lawsuits started to happen, the rights holders' organizations in Norway said, hey, you are working for this organization, you are part of the public administration, so we want these many millions if you are to keep training models without crediting properly or without establishing a compensation scheme to pay the authors of this content.

So the government, before starting a conversation with them, decided that it needed to have more information. They wanted to make better data-informed decisions before establishing any sort of compensation for the authors. So we got a mandate. It was a very brief letter, a very government-style kind of mandate, extremely vague, in which we were tasked with the whole organization of a massive research project, spanning different institutions, and going from ideation to results in six months. We obviously panicked a little bit, because it was a lot of work. So that's how the Mímir project came to exist.

So in January, we started conversations with other organizations: the University of Oslo, with the Language Technology Group, and also the technical university in Trondheim, NTNU, which are probably the most technically oriented universities with NLP expertise in Norway, but also with Sigma2, which is a partially government-owned company that handles all the high-performance computing in Norway. So we created a consortium and then we started thinking about how we could do this, and what part of this mandate was actually the task of this Mímir project that we are talking about.

The mandate had three different objectives. The first is examining the value of copyrighted material in the training of generative models; the second is assessing the basis for a possible compensation scheme; and the third is preparing proposals, actual proposals with money, to compensate these authors. The Mímir project is only concerned with the first objective, that is, doing the research to establish whether there is any impact or not. And we tried to convey to the government that whatever results we got should not be directly translated into numbers. This is more to inform a possible conversation, because the other two objectives are eminently political. We didn't have a lot to say on those two elements because we were only tasked with the actual research aspect of all of this.

So in order to do all of this, we had to design a methodology following the best standards in the field. We assembled a very large and diverse corpus, and then we specified different trainings of models and a whole new set of benchmarks for Norwegian. Norwegian is a very small language compared to English or Spanish or Chinese, so a lot of the benchmarks that are already there for English, and that can be easily used, are not available for Norwegian. We had to create all of them from scratch, which is extremely time consuming, but also demands labor: we had to pay the people doing those annotations and reviewing translations from usually bad benchmarking data for English. I'll talk about that a bit later. So all of that had to be done before we were even able to start giving answers to this mandate.

In terms of data, we gathered a huge corpus using sources from different partners. We have agreements with news publishers, the biggest news publishers in Norway. We also have the legally deposited books in the National Library of Norway.
We also have other agreements with TV channels, so we gathered everything that we could. We created two different corpora. One is called the Mímir base and the other one we internally call the Mímir extended. The difference between these two is that the Mímir base was not supposed to contain any copyrighted material. And when I say copyrighted material here, we need to be very pragmatic about the definition of copyright. In this sense, copyrighted content is related to works by authors that can potentially be represented by rights holders' organizations in Norway. This purposely excludes, for example, internet data. There is no legal mechanism by which the Norwegian government can pay internet authors unless they are members of one of the rights holders' organizations in Norway. So this is a very technical and legal aspect of it, because copyright is, in fact, law. There is a Copyright Act in Norway. So we were very careful in our definition of copyright for the project.

These are some numbers: a number of documents and a number of words, which are, of course, very low when you compare them with the number of words these models are usually trained on. These days it's very common to have datasets in the trillions of tokens, or trillions of words, while we only had a few billion to experiment with.

Moreover, we were also interested not only in the comparison between copyrighted material and non-copyrighted material, excuse me, but also in what aspects of these copyrighted materials can make the models better or worse. And for that specific purpose, we defined different delta datasets, adding or excluding different aspects of this material. For example, we have a dataset that is only books, another one that is only newspapers, one that is both, one that only has fiction books, one that is only nonfiction books, one that is nonfiction plus news, a kind of factuality-oriented dataset. We also looked into translated material versus originally Norwegian material. We have many different ablations of the data, just to see which aspects were more impactful in the training of these models.

Again, here you can see that we basically got ourselves in trouble, because when you ablate all of these different aspects, you end up with sizes that are very disparate from each other, so they are kind of hard to compare against each other. For example, for translated books, we basically have something like 100,000 books, and when you compare that to the actual collection of original Norwegian literature, it is almost half a million books.
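To make that delta setup a bit more concrete, here is a minimal Python sketch of how such ablation mixes could be composed from source categories. The category names and the compose helper are hypothetical, for illustration only; this is not the actual Mímir data pipeline.

```python
# Hypothetical sketch: composing "delta" training mixes from source categories.
# Category names are illustrative, not the real Mímir corpus.

BASE = ["web", "wikipedia", "public_documents"]  # assumed non-copyrighted core

DELTAS = {
    "base": [],
    "base+books": ["fiction_books", "nonfiction_books"],
    "base+newspapers": ["newspapers"],
    "base+books+newspapers": ["fiction_books", "nonfiction_books", "newspapers"],
    "base+fiction": ["fiction_books"],
    "base+nonfiction": ["nonfiction_books"],
    "base+factual": ["nonfiction_books", "newspapers"],
    "base+translated_books": ["translated_books"],
    "base+original_norwegian_books": ["original_norwegian_books"],
}

def compose(delta_name: str) -> list[str]:
    """Return the source categories that make up one training mix."""
    return BASE + DELTAS[delta_name]

if __name__ == "__main__":
    for name in DELTAS:
        print(f"{name}: {compose(name)}")
```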
Designing the training protocol for this was very difficult because we also had a lot of limitations in terms of time. We were very constrained, we only had like six months to do everything. So we designed the training protocol to be basically uniform across all of the trainings. We did trainings of models both from scratch, Mistral-like models, but also from pre-trained weights, to do comparisons. And then all of the other delta datasets, we trained them equally for the same number of steps. That, of course, might have some impact, because there is some research suggesting that up to six iterations of your data, the model is still learning something. After that, there are no more useful training steps that you can do, because the model is not going to learn anything new. So that was kind of the maximum number of repetitions we allowed ourselves to train for, which, using the training regime and the training parameters we had, was the equivalent of 10,000 steps, I think, for the different ablations, and around 40,000 for the biggest from-scratch and from-pre-trained-weights models.

That translated into 17 or 18 different trainings, which were a nightmare to coordinate because we had three different institutions. We were training using Google Cloud TPUs, provided for free through Google's TPU Research Cloud. TPUs are a different kind of accelerator device that is not very common, so there are not a lot of code bases for them. Then one university was using their own Idun cluster of GPUs, and the other one was using the European high-performance computing cluster in Finland called LUMI. So we had to create a control model to make sure that all the different trainings, on different infrastructure and using different code bases, were actually producing models that were compatible, that you could compare. It was kind of tough to handle all of this.

But probably the most difficult part of all was the evaluation. We didn't have data to do any kind of evaluation. So the National Library decided to spend, I think it was like a million kroner, which is around 100,000 US dollars, and we paid that money to the universities to hire students. They were actually very well compensated for their work. I think we got around 60 students to annotate, review, and create from zero some of these datasets. In most cases, the existing datasets are not very good. When you look closely at these English datasets, they are not as good as you think they are. They have a lot of mistakes, a lot of ambiguity, a lot of contextual responses. They are not easily translatable. Sometimes they are about gallons and miles, and in Norway we use liters and kilometers, so math problems were not directly translatable. We found a lot of those things, which basically forced us to do some of these datasets from scratch. Luckily enough, the students were extremely committed to the task, and in a matter of two months, we had almost 28 different common natural language processing tasks ready to create a benchmark to evaluate all of the models that we had already trained or that were under training.

We grouped all of these tasks into nine higher-level skills, as we like to call them, dealing with sentiment analysis, fairness, world knowledge, specific syntax, summarization, which is a very deep linguistic task in which you have to have very deep knowledge of the language you are dealing with, and also some more linguistically oriented analyses made by linguists of the Language Bank at the National Library of Norway.

Evaluating these models is basically running inference on them, so you also need a lot of compute and GPUs to do this. We all donated our time and our resources; it was all basically done for free, because the government told us to do it. And it was kind of hard to collect more than 2,000 different scores, all disparate: some ranging from minus 1 to 1, other tasks from 0 to 1, other tasks from 0 to 100; for some tasks lower is better, for some tasks higher is better. So we had to renormalize all of these, put them into a table, and pick different metrics to be able to compare all of that.
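As a rough illustration of that renormalization step, here is a minimal Python sketch that maps scores with different ranges and directions onto a common 0-to-1, higher-is-better scale. The task names, ranges, and numbers are invented for the example; the actual Mímir evaluation may normalize differently.

```python
# Minimal sketch: renormalizing disparate benchmark scores onto a common
# 0-1, higher-is-better scale. Task names and ranges are invented examples.

TASK_SPECS = {
    # task: (min_score, max_score, higher_is_better)
    "sentiment": (-1.0, 1.0, True),
    "summarization": (0.0, 100.0, True),
    "error_rate_task": (0.0, 1.0, False),  # lower is better, so we flip it
}

def normalize(task: str, score: float) -> float:
    lo, hi, higher_is_better = TASK_SPECS[task]
    scaled = (score - lo) / (hi - lo)  # map onto [0, 1]
    return scaled if higher_is_better else 1.0 - scaled

if __name__ == "__main__":
    print(normalize("sentiment", 0.2))        # -> 0.6
    print(normalize("summarization", 45.0))   # -> 0.45
    print(normalize("error_rate_task", 0.1))  # -> 0.9
```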
We ran many different prompts. This is just one example, asking what happened before you put up a poster for a lost dog, and then the options are: A, the dog is lost, and other options like you're training it, things like that. So the obvious correct answer is A. This is just one example of a prompt for one single task within one of the many high-level tasks in the hierarchy that we have. And for only this, we ended up with tables like this. So we had to find a proper way to summarize all of this into something that was easy to digest, because the ultimate goal of the project was to inform conversations at a political level. They're not going to look into these amazing colored tables, right? So we instead made amazing-looking plots.

Aggregating all of this is extremely hard, because there are even social choice theories about how you can actually put all of these scores together so that they make sense while being absolutely fair. So we ended up just doing a summation of all of the scores and ranking the models by performance with a cumulative sum of all the scores. Here we can see that one of the best models was, obviously, the extended model using all the copyrighted material. But also, when you look at the other two in red, sorry, I cannot see the screen, small screen from here, you see that, that's my 20 minutes, when you're using the base dataset with the nonfiction books or the newspapers alone, the model is also performing really well.

So we thought, okay, let's just try to dig a bit deeper into this. We also analyzed just the models fine-tuned from Mistral, and they are obviously performing way better than every other model, because Mistral is a very powerful model. So in the ablations, when compared to the performance of the models trained on the base dataset, we saw that fiction was detrimental to the performance of the models, right? And this is actually funny, because when we talked to the rights holders' organizations, they understood it extremely quickly. They didn't even hesitate. They were like, yes, of course, it's obviously detrimental. The reason, we believe, is that all of these tasks are designed for factuality, in a sense. They do not in any way positively analyze the creativity of a model. They are not designed in a way that you can get the highest scores when the model is writing something more creatively or expressing ideas in a different way. They are very factual because, in fact, they are designed for commercial purposes. These datasets are basically related to commercial use cases, and most of the time there are big companies behind the creation of these datasets. So I think it's important that we start thinking about other ways of analyzing these models, because not everything needs to be monetizable.

We also did some ablations from scratch just to see exactly what was happening here, and then we calculated the average gain over the base model for all the different ablations, and we saw exactly that: adding fiction to the base model was causing a performance loss of 1.4%.
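To show what that "average gain over the base model" could look like computationally, here is a small Python sketch under the assumption that every model has one normalized score per task. The task names and numbers are made up for illustration and do not come from the actual report.

```python
# Sketch: mean relative gain (in %) of an ablation over the base model,
# averaged across tasks. All scores below are invented for illustration.

base_scores = {"reading": 0.60, "world_knowledge": 0.55, "common_sense": 0.50}

ablations = {
    "base+fiction": {"reading": 0.59, "world_knowledge": 0.55, "common_sense": 0.49},
    "base+newspapers": {"reading": 0.63, "world_knowledge": 0.58, "common_sense": 0.48},
}

def average_gain(scores: dict[str, float]) -> float:
    """Average relative change versus the base model across all tasks, in percent."""
    gains = [(scores[t] - base_scores[t]) / base_scores[t] for t in base_scores]
    return 100.0 * sum(gains) / len(gains)

if __name__ == "__main__":
    for name, scores in ablations.items():
        print(f"{name}: {average_gain(scores):+.1f}% vs base")
```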
We did the same for all the different high-level skills, for example for sentiment analysis, and we also did it for fairness and truthfulness. There are some variations there, but I think the most interesting ones are not reading comprehension or world knowledge, but common sense reasoning. Interestingly, here, newspapers were very detrimental for common sense reasoning, right? That kind of makes sense to me as well. But I think it was very interesting to see, for example, that in the use of the Norwegian language, adding fiction was very detrimental. But I think it might be summarization here.

When you look at the variation and the readability, which was a very linguistically focused benchmark that we created, adding fiction books was performing really, really well. We call this the Jon Fosse paradox. Jon Fosse is last year's Nobel laureate in Literature, and he makes use of language that doesn't use commas or full stops or anything like that; he just writes very linearly. So to an unaccustomed reader, when you read that, and I don't read Norwegian, but I can take a look at the books, it seems like a very linear chunk of text without structure. But the models were scoring very high on this kind of benchmark when you started to add fiction. So I think it's important to keep thinking about how to properly evaluate other aspects of these kinds of models.

We also did other comparisons. Actually, sorry, I'm just going to go quickly over this. We also built some instruction tuning datasets. In the same vein that ChatGPT is an adaptation of a GPT foundation model, we also did some of those adaptations. And in this case, we were actually able to match, using only Norwegian data, the performance of the models trained on Mistral's pre-existing weights. So as you start adding instructions to your data, the gap in performance between the base model without copyrighted materials and the extended model with the copyrighted material decreases. So the more practical the model gets, the less important the copyrighted material is. We also did some comparisons per written language, because we have two different written languages in Norway, Bokmål and Nynorsk.

So, just to conclude: there seems to be, in fact, empirical evidence supporting the idea that copyrighted material has an impact on performance, actually making it better. As you add more fine-tuning data, the difference seems to shrink, and adding instruction fine-tuning consistently increases the performance of the models. With all of this, we created a report that was released on July 1st or so, I think, and that was the basis for the government to start conversations with the rights holders' organizations. Those conversations are still ongoing, it seems like they're never going to end, but it's still a starting point. And as a matter of fact, coming out of all of this, the state budget for 2025 already included a new part of the budget specifically dedicated to AI, and specifically dedicated to the National Library, to keep doing this and keep pushing for language modeling in Norwegian society. Thank you.
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.