
Presenter: Emmanuelle Bermès
The Musée des Arts Décoratifs in Paris (MAD) partnered with the École nationale des chartes (ENC) and the National Library of France (BnF) to bring together expertise in artificial intelligence and to prototype the automated annotation of large paper collections. Emmanuelle Bermès shares their experiments using the YOLO (You Only Look Once) model on the Royère collection of architecture and furniture drawings to perform object detection using a specific ontology.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
Thank you. Bonjour. Before I start my presentation, I would like to acknowledge the colleagues who did the work I am going to present today and who unfortunately cannot be here: in particular Marion, a junior researcher who did most of the engineering you will see in this presentation; Natasha, who was an intern with Marion; Benedict, who was the sponsor of the project; and Jean-Philippe, whom many of you probably know, who was the expert with us on this project.

This is a project we have been working on for one year in Paris with the design museum, the Musée des Arts Décoratifs (MAD). It is a museum with massive collections, and we were working in particular with the Department of Drawings, Wallpapers and Photographs. They hold nearly one million objects, and most of them are not described or catalogued at all. This is a very important challenge for them, since a thorough description of museum collections is a legal obligation in France, and of course it also matters for discovery. So, in terms of strategy, they really wanted to experiment with AI to see if it could help them create descriptions for these uncatalogued collections, and that is why they reached out to us.

We started working on a very specific collection inside this department: the collection of the designer Jean Royère. Royère is quite a famous designer in France, and his pieces of furniture still sell at very high prices on the market; you can find Royère furniture at auction for one million euros, for instance. So there is a very big challenge for the museum, which hosts Royère's archives and drawings, to provide stakeholders with the documentation they need, in particular to check, when a piece of furniture comes up at auction, whether it is a real Royère or a fake. For that, they need to go back to the archives, and on the screen you have some examples of the 18,000 drawings in this collection. Some of them are representations of a single piece of furniture; others show different pieces of furniture arranged in a room. And then you have these more challenging drawings: the technical tracings, with all the very precise dimensions of the pieces of furniture. These are really the ones that matter for this specific use case, and they are also the ones that are obviously more challenging to identify with AI. The computer vision state of the art currently works quite well for things like the drawings in colour that you can see behind me, but with these very fine tracings full of dimensions, it is much more difficult to recognize that the same object may appear in both drawings.

For those of you who may not be completely familiar with computer vision, these are the kinds of things we can do with state-of-the-art AI. On the left, you have object detection: the idea is to identify whether a specific object is present in the drawing. For that, we are using a model called YOLO. You get a bounding box, so you are basically drawing a rectangle over the part of the image where the object can be found, together with a tag that says there is an armchair in this part of the drawing. On the right-hand side is a different approach, called segmentation: this kind of tool can identify very precisely the shape of the objects within an image.
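[Editor's note: as an illustration only, not the project's actual code, here is a minimal sketch of this detection step using the ultralytics YOLO package; the checkpoint and image file names are hypothetical placeholders.]

```python
from ultralytics import YOLO

# load a pretrained checkpoint; in the real workflow this would be a
# checkpoint fine-tuned on the Royère ground truth, not the generic one
model = YOLO("yolov8n.pt")

# run detection on one digitized drawing (hypothetical file name)
results = model("royere_drawing.jpg")

# each detection is a bounding box plus a class tag and a confidence score
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]     # e.g. "armchair"
    x1, y1, x2, y2 = box.xyxy[0].tolist()      # rectangle corners in pixels
    print(f"{label}: ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), confidence {float(box.conf):.2f}")
```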
So segmentation is much more precise, and it has a different purpose. Testing both in our first experiments, we had better results with object detection, so we went with YOLO.

This is the process that was designed by Marion. It is an iterative workflow where she starts by creating ground truth, that is, manually annotating some of the drawings to show where the objects are. Then comes model training: she has been fine-tuning YOLO with these manual annotations. Once the model is trained, she deploys it on new data, and with the results of this inference, she manually corrects the annotations created by the model, to check whether the objects she was trying to find were correctly detected. These corrections create new annotations that can then be fed back into the ground truth, and you iterate to progressively improve the model. That is what Marion has been doing.

But this ground-truth creation, the manual drawing of bounding boxes and tagging of the objects present in the images, takes a lot of time and effort, and it is quite cumbersome to do. So she has been looking for strategies to make this annotation easier. One of these strategies is segmentation, which I showed you just earlier: the idea is that instead of having to draw the boxes around the objects, she would use a model that pre-selects the shape of the object, and she would then only have to add the tags. This is still under testing, so I have nothing to show about it yet.

The other approach, which was quite successful, is the idea of using image deformation. What you see on the screen is the matrix of values associated with each image. Some of these values relate to the scale of the image, some relate to the perspective, and at the bottom you have the ratio, which we do not touch. The idea is that by changing these values for each image, you can deform it and simulate a perspective, as if you were looking at the object from a different angle. Here you have an example with a piece of furniture by Royère: on the top, the actual drawing, and on the bottom, the drawing deformed in perspective, as if it were seen from another angle. This method allowed us to double the number of annotations in our ground truth: Marion annotated about 160 images manually, and thanks to this manipulation we had more than 300, making the model more efficient, in particular at recognizing the same object in a different perspective or a different position in a room.
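[Editor's note: to make the iterative workflow concrete, here is a minimal sketch of the annotate-train-correct loop, assuming the ultralytics YOLO API; the dataset config and directory names are hypothetical, and the manual correction step happens outside the code.]

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # start from a pretrained checkpoint

for round_ in range(3):                          # each round grows the ground truth
    # fine-tune on the currently verified manual annotations
    model.train(data="royere_ground_truth.yaml", epochs=50)

    # run inference on a batch of still-unannotated drawings; save_txt writes
    # YOLO-format label files that an annotator can then correct by hand
    model.predict(source="unannotated_drawings/", save_txt=True, conf=0.25)

    # after manual correction, the validated labels are merged back into the
    # dataset referenced by royere_ground_truth.yaml (done outside this script)
```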
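[Editor's note: the perspective deformation described above can be sketched with OpenCV, applying a 3x3 homography to a scanned drawing; the matrix values here are illustrative, not the project's actual parameters.]

```python
import cv2
import numpy as np

img = cv2.imread("royere_drawing.png")           # hypothetical scan
h, w = img.shape[:2]

# 3x3 homography: the top-left block holds the scale/shear values, the
# bottom row holds the perspective values, and the final coefficient
# (the ratio) is left untouched at 1
H = np.array([
    [1.0,   0.05, 0.0],
    [0.02,  1.0,  0.0],
    [1e-4,  2e-4, 1.0],
], dtype=np.float64)

warped = cv2.warpPerspective(img, H, (w, h))     # simulate another viewing angle
cv2.imwrite("royere_drawing_warped.png", warped)
```

In practice, the existing bounding-box annotations have to be transformed with the same matrix so that each deformed image arrives already annotated.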
So, in terms of results, I am presenting here the metrics from the project. On the left, you can see the detailed metrics for the different types of objects we were trying to recognize, and on the right is the confusion matrix. If you are not familiar with that kind of representation: on the vertical axis, you have the objects annotated in the ground truth that we wanted to find, and along the bottom, you have the predictions made by the model. Normally, if the model is working well, you should see a beautiful lit-up diagonal showing that it predicted the objects we were expecting. You can see that the results are not so bad, but even after several iterations of fine-tuning, some things remain difficult to identify; for instance, it is difficult to distinguish a chair from an armchair, distinctions that are very fine. On the other hand, some things are predicted very well by the model; it can easily find whether there is text in the image. But some things are more difficult to find, like lamps. If you look at the detailed metrics on the left, there is a class labelled luminaire, which is French for lamp, and you can see that the model found about half of these lamps very efficiently, but the other half are simply absent from the predictions. This shows the level of precision we are looking for in order to actually deliver the use case the people in the museum expect.

In terms of limitations, we need more accurate detection, and we need more fine-tuning, so more ground truth. One of the challenges is also the user interface: the museum curators, the people in charge of the collections, want to be able to evaluate the results of this object detection process themselves, so we need to provide an interface.

In order to do that, we tried a completely different approach. This is work developed by Natasha, based on work by Jean-Philippe. We tried to use CLIP. CLIP was introduced yesterday by Peter Reynolds, so I will not say much more about it, except that it is a pretty straightforward way to index a collection visually. Natasha, our intern, was able to inject the Royère collection into an application based on CLIP, which very easily, without any fine-tuning, gave us keyword search and natural language search. So if you are looking for a lamp, you get some lamps very easily. There are also limits. One of them is that CLIP works quite well if you are looking for a type of furniture, but some of the things the people in the museum wanted to look for were the specific styles of Jean Royère. The designer had patterns that he would reuse across his different works: for instance, here on the right, you can see a desk and a bed that both have this criss-cross shape, which he called croisillons. They want to be able to look for that kind of thing: not the object in general, but rather the style of the object. Here again there are limitations, and if you want to answer a complex use case, you really need to fine-tune the models, so fine-tuning CLIP is one of the things we are going to try in the future.
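[Editor's note: as a sketch of the underlying idea only, not Natasha's actual application, here is what CLIP-based natural language search over a handful of scans can look like with the Hugging Face transformers implementation; the model name and file paths are assumptions.]

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["drawing_001.png", "drawing_002.png", "drawing_003.png"]  # hypothetical scans
images = [Image.open(p) for p in paths]

# embed the text query and all images in the same vector space
inputs = processor(text=["a drawing of a lamp"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the query-to-image similarities; rank images by it
scores = outputs.logits_per_text[0]
for path, score in sorted(zip(paths, scores.tolist()), key=lambda t: -t[1]):
    print(f"{path}: {score:.2f}")
```

No fine-tuning is involved here, which matches the talk's point: generic CLIP handles "a lamp" well, but a designer-specific style such as croisillons is exactly where an off-the-shelf model starts to fail.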
So, in terms of future prospects, we want to adapt this iterative approach to other types of collections within the MAD. We want to explore additional image deformation methods: the one we used only concerned perspective, but here you have an example of an affine transformation, which is another one. We want to investigate further with CLIP and also test new models; we heard about Florence-2, for instance, yesterday. But furthermore, in our new project (we got funding for one more year, so this project now has a continuation), we really feel we need to address the challenge of bringing the team into the project and convincing the staff that what we are doing is genuinely useful for their needs.

This is a very quick overview of the new collection we are going to work on. This time it is not drawings but photographs, and with the few examples you have on screen, you can see the type of challenges we are going to face. Paul-Henri was a photographer who did a lot of architecture photography. We want to be able to identify not a church or a building, but this specific church, and to geolocate the monuments he photographed. We also have the issue of trying to find not objects but shapes or aesthetics, like the spiral, which is one of the shapes that comes back again and again in Paul-Henri's work; he was really working on the naturalization of architecture. These are the kinds of patterns and shapes we want to be able to find.

Finally, in that project, we want to work on how AI is going to change the work of the museum. We are trying to discuss with them the fact that automating part of their image description process is not a matter of taking a human who does something, removing the human, replacing them with an AI, and expecting everything to work the same way. AI shifts the way things work globally, across the overall process of the museum. What we sense is that there may be a new paradigm where, instead of having metadata enrichment at the end of the process, it would be more interesting to start by digitizing, then apply AI, and then move on to the human work that exploits this first step of automation. And of course we will have a big focus on quality and transparency of results, and on interfaces that make it possible for the museum curators to understand what the AI is generating, why it works, how it fails, and how they can reuse this type of work in their own processes.

In just a few seconds, I also wanted to highlight that we have a big-sister project: the French National Library project, Gallica Images. Gallica is the digital library of the BnF, and they are currently in an industrialization phase of their computer vision projects: they are fully automating the indexing of iconographic content inside their digital library, extracting all the images found within the digitized documents, applying state-of-the-art AI models to describe the content, and then repeating iteratively over the different types of documents they have. It is a very big volume, more than 10 million digitized items in Gallica, and they have a very wide range of document types, even wider than what we have at the MAD. So this is definitely very inspiring. One of the main challenges is that all the computer vision models we are using evolve so fast: when you build these pipelines, you really need to think about the fact that you will need to be able to remove a model you have been using and replace it with a new one that works better. This is one of the challenges of creating such pipelines.
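[Editor's note: one way to prepare for that model turnover, sketched here under assumptions of our own (none of these names come from the Gallica Images codebase), is to make the indexing pipeline depend on a small interface rather than on any single model.]

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Annotation:
    label: str                                   # e.g. "armchair"
    score: float                                 # model confidence
    box: tuple[float, float, float, float]       # x1, y1, x2, y2 in pixels

class VisionModel(Protocol):
    """Anything that can describe an image; concrete models plug in here."""
    name: str
    version: str
    def describe(self, image_path: str) -> list[Annotation]: ...

def index_collection(paths: list[str], model: VisionModel) -> dict[str, dict]:
    # keep the model identity next to each result, so annotations can be
    # regenerated when this model is retired and a better one is plugged in
    return {
        p: {"model": f"{model.name}-{model.version}", "annotations": model.describe(p)}
        for p in paths
    }
```

Recording which model produced each annotation is the design choice that makes the "remove one model, plug in a better one" scenario tractable at Gallica's scale.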
And that's all for me. I think we don't have questions now, but if you want to speak with me at the break, you're very welcome to do so. Thank you.

The National Film and Sound Archive of Australia acknowledges Australia's Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.