
Presenter: Francis Crimmins
The vast majority of the images held by the National Library of Australia do not have any detailed textual description, and keyword searches rely on matches against the title and author fields. Francis Crimmins showcases the Library's experiments with a system that uses an AI vision model to generate detailed textual descriptions of images from the catalogue; these descriptions are then converted into vector embeddings used for search and for exploring similar images.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
Want to learn more about this event?
Visit the Fantastic Futures 2024 Hub
This transcript was generated by NFSA Bowerbird and may contain errors.
Good afternoon, everyone. Today I'm going to talk about some work we did at the National Library of Australia to investigate how AI vision models could be used to improve the findability of images in the library's collection. This is work I did with my colleague Kent Fitch.

There are over 200,000 publicly available images in the library's catalogue, and this is an example of one of them. On the right you can see some of the metadata associated with the image. The title actually refers to the small set of photographs this image comes from; we have the photographer, and the description again talks about the set of 19 photographs it belongs to. In terms of copyright, all images in the dataset we're working with are out of copyright and were created before 1955, and all of them are publicly available from the NLA catalogue and Trove.

This is an example of a description generated by an AI model for that image. I'm not going to read all of it out, but you can see it's quite detailed and in-depth, and I will point out a few aspects. At the top it says this is a black and white image featuring an individual in a dramatic pose and costume, and that the person appears to be Robert Helpmann based on the text at the bottom of the image. So it's already making some inferences: that this could be a dancer or an actor, and that they're in a dramatic pose. It has extracted the text from the bottom of the image and correctly inferred that the individual is Robert Helpmann. If you read the whole thing, it also talks about, for example, the use of makeup and the looseness of the costume, again suggesting this could be a dancer; in fact he was a dancer as well as an actor. So we're getting a really rich description of this particular image.

What we wanted to do was see whether text descriptions like that one could be used to provide a standard keyword search over these descriptions, giving improved precision and recall when searching the pictures. In information retrieval, precision is how relevant the returned results are to the user, and recall is what proportion of the relevant results within the collection are actually returned. The descriptions also allow us to provide a similar-images feature, and we can cluster images based on them as well.

Really, if you think about it, we've gone from an image to text, and so we're able to use all of the standard text infrastructure: search engine software like Solr or Elasticsearch, clustering libraries over the descriptions, and so on. It's almost as though we're transcribing the image into a text format that we can then work with using these standard technologies. Clustering also allows us to better understand what is in the collection by surfacing image groupings we may not be aware of. One thing I want to make clear is that in the work reported here we are only using the vision models themselves; we are not training them with the data we're working with.

What we're showing now is an example of a set of results returned for the query "a dancer or actor in a dramatic pose".
You can see we're getting quite relevant results: the image we were looking at earlier is in position three along the top. This is a grid display, with the most relevant result at the top left and the rest following in reading order. If you imagine that each of these images has a similarly detailed text description, then we're able to support these kinds of queries. For a lot of these images, if we were searching on the metadata alone, they probably wouldn't be returned for a query like this.

So, at a high level, how do these vision models work? They're trained on very large datasets containing image–text pairs, and they learn relationships between the visual content and the language. I'm sure many of you have used Google Image Search: when Google crawls the web, as well as downloading text web pages it also downloads images, and it associates the text around or near an image with that image to help provide an image search facility. That same type of data is used to train these AI vision models. As the model is trained, it learns to recognise patterns, objects, and features: for example, that if there's a tree in the picture there are probably other trees nearby, or a river, or a road; or that if there's a large building there may be other tall buildings nearby, because it's a street scene from a city. Once the model has been trained, it's capable of tasks such as image description and object recognition, as we've seen.

Here's another set of results, for the query "two birds on a branch". The titles you can see here contain the common name of the bird and the Latin or scientific name, but it's unlikely that the metadata is going to say "two birds on a branch". Again, by adding this additional layer of descriptive metadata, we can support queries like this.

The model also has knowledge of architectural styles. For the query "Art Deco" we get this set of results, and if you scrolled down the screen there would be a lot more. The top two images at the top left have the term "Art Deco" in their metadata, so if you were searching across just the metadata they would be returned, but none of the other images contain that detail in their existing metadata. So we're leveraging the fact that the model we're using has, as part of its training, learned to identify particular architectural styles. You're actually in an Art Deco building at the moment, I realise, and I guess we have kind of dark and moody lighting too, which is a funny coincidence given I made these slides a while ago. The point of this slide is that the model can make aesthetic judgements. Again, as part of its training it makes inferences, and some of these text descriptions might say something like "dark and moody lighting", so we're able to support queries like this as well. It's about supporting a larger set of queries, and different types of queries. One of the talks yesterday had some good examples of using the CLIP model to support these kinds of more abstract queries as well.

So how do we generate image descriptions? From the NLA catalogue, we processed a sample of 5,000 out-of-copyright images and extracted their metadata: title, author, original description, notes, et cetera.
The URL of each image is then sent to a vision model endpoint, which is asked to describe the image, with the title passed in as part of the prompt. So the prompt is along the lines of: "Please describe this image. For reference, this is the title of the image", followed by the title. That gives the model a bit more information to go on when generating its description. Once we have the text description, we send it to a separate embedding endpoint, which returns a numeric embedding for it.

So what is an embedding? It's a dense vector representation of data, such as words or images, that captures semantic relationships and similarities. Similar data points, such as words with related meanings, are mapped to vectors that are close to each other in the embedding space. We use a combination of embeddings of the generated descriptions and of the images to provide search, similarity, and clustering capabilities.

This is a screenshot of a visualisation of an embedding space from Google Research. It was originally an interactive demo: a set of points in a three-dimensional space that you can rotate and zoom in and out of. I've zoomed in on the word "museum", which you can see in the centre, and in the right-hand column we can see the nearest points in the original space: words close to "museum" such as "gallery", "institute", "library", and so on. A point to make here is that it's not just about words but also the concepts and ideas those words point to: words that are close together in this space are considered to be similar or related in some fashion. If we then imagine the embeddings we generate from the images, you can picture the images being placed within this space, with similar images close together.

If we go back to our query, "a dancer or actor in a dramatic pose", I've highlighted one of the images in the middle row on the left. We have a detailed generated description for it, and we have an embedding as well. If we click on that image, we can then see images that are considered to be similar. What's happening behind the scenes is that we send a nearest-neighbour query to Solr, asking it to return the nearest neighbours of this particular image, and that's the set of images we get back. The image we clicked on has now moved up to the top left, the image considered closest is the one beside it, and so on along the row. These, I think, are called theatre postcards, from around the late 1800s and early 1900s. So we started with an initial text query, "dancer or actor in a dramatic pose", got our initial set of results, and then clicked on an individual image to explore what's related or similar to it. You can keep drilling down — you could click on this one here and might get images that are more colourised, and so on — and you can navigate back and forth. It's a combination of search and browse.

If we go back to our result set again, you can see I've now highlighted the image down at the bottom right. If I click on that, because I want to see similar images, this is the set we get back, and the image we clicked on has moved up to the top left.
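To make that pipeline concrete, here is a minimal sketch of the index-time step described above: ask a vision model endpoint to describe an image (passing the title in the prompt, as in the talk), embed the resulting description, and store both in Solr. The endpoint URLs, payload shapes, and field names are hypothetical stand-ins, not the NLA's actual implementation.

```python
import requests

VISION_ENDPOINT = "https://example.org/vision/describe"   # hypothetical endpoint
EMBED_ENDPOINT = "https://example.org/embed"               # hypothetical endpoint
SOLR_UPDATE = "http://localhost:8983/solr/images/update/json/docs"

def describe_image(image_url: str, title: str) -> str:
    """Ask a vision model to describe the image, passing the title in the prompt."""
    prompt = (
        "Please describe this image. "
        f"For reference, this is the title of the image: {title}"
    )
    resp = requests.post(VISION_ENDPOINT,
                         json={"image_url": image_url, "prompt": prompt},
                         timeout=60)
    resp.raise_for_status()
    return resp.json()["description"]

def embed_text(text: str) -> list[float]:
    """Turn the description into a dense vector via a text-embedding endpoint."""
    resp = requests.post(EMBED_ENDPOINT, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["embedding"]

def index_image(image_id: str, image_url: str, title: str) -> None:
    """Store the generated description and its embedding alongside the metadata in Solr."""
    description = describe_image(image_url, title)
    doc = {
        "id": image_id,
        "title": title,
        "ai_description": description,              # searchable with ordinary keyword queries
        "desc_embedding": embed_text(description),  # dense vector field for KNN queries
    }
    requests.post(SOLR_UPDATE, json=[doc],
                  params={"commit": "true"}, timeout=30).raise_for_status()

if __name__ == "__main__":
    # Made-up identifier, URL, and title, purely for illustration.
    index_image("img-0001",
                "https://example.org/images/img-0001.jpg",
                "Robert Helpmann in costume")
```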
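The similar-images lookup, and the text search path described later, then comes down to a K-nearest-neighbour query over those stored vectors. A sketch of what such a query could look like, assuming a Solr 9 dense vector field named `desc_embedding` as above; the core and field names are assumptions:

```python
import json
import requests

SOLR_SELECT = "http://localhost:8983/solr/images/select"

def similar_images(query_vector: list[float], top_k: int = 12) -> list[dict]:
    """Return the top_k nearest neighbours of an embedding stored in Solr."""
    # Solr's {!knn} query parser searches a DenseVectorField for the vectors
    # closest to the one supplied after the local params.
    knn_query = f"{{!knn f=desc_embedding topK={top_k}}}{json.dumps(query_vector)}"
    resp = requests.post(SOLR_SELECT,
                         data={"q": knn_query, "fl": "id,title,score"},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# Text search works the same way: embed the user's query text first
# (e.g. "a dancer or actor in a dramatic pose") and pass that vector in,
# so search and similar-images share the same index and query machinery.
```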
Looking again at that similarity result set, the two images beside the one we clicked on, where you can see the figure is looking to the left, are considered the closest to it: the nearest neighbours. Then, reading left to right, we start to drift away; those images are not as close to the original image we clicked on.

Another aspect we looked at was clustering. We used the HDBSCAN library, a fairly modern clustering algorithm, to cluster the numeric embeddings of the image descriptions. For each cluster we want a representative image, and the one we use is the image whose embedding is closest to the cluster's mean embedding. The implementation generates a static set of clusters, allowing this overview of image groupings in the collection to be displayed without needing any initial text query: we're not giving it a query to work with, we're just generating the clusters from the embeddings of each image.

These are some of the clusters we got back. The top left is the cluster with the largest number of images in its group. Just to clarify, these are not all of the clusters within the library's overall image collection; they come from just the 5,000 images we sampled. You can see the Japanese print style we were looking at, and there are some maps. If I click on the image at the top right, I get images to do with ballet, and you can see they're quite similar: you can see the whole stage, the company behind the principal dancers, and so on, but again, as we move downward, the content starts to drift away. Another example cluster is watercolours of flowers. Within the overall image collection there are — I might not be getting the terminology right — sub-collections or micro-collections, where the image metadata says this is part of a small collection to do with botanical drawings or watercolours of flowers or whatever. So we would expect the clustering approach to identify those image categories. This may not be an actual mini-collection within what we're looking at, but the point is that if you have some predefined categories, you'd expect the clustering to surface them as well.

I don't think this next one is an official micro-collection — maybe it should be. I just like it because it made me laugh when I first saw it. As I said, we're looking at out-of-copyright images from before 1955, so it's probably not surprising that we're getting images like these. But again, this is an example of content we don't necessarily know is there, at least within the sample we gathered, and the clustering allows us to surface that kind of information, which I think would be very helpful and useful to curatorial staff, for example.

So which models did we evaluate? We looked at CLIP, which was talked about in some of the other talks. It's an open-source model that aligns text and images into the same vector space, which allows direct comparison. We ran it on an internal server, so no data was sent to external parties.
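Before moving on to the other models we evaluated, here is a rough sketch of the clustering step described a moment ago: cluster the description embeddings with HDBSCAN and pick, for each cluster, the image whose embedding is closest to the cluster mean. How the embeddings and identifiers are loaded, and the parameter values, are assumptions for illustration.

```python
import numpy as np
import hdbscan  # pip install hdbscan

def cluster_images(embeddings: np.ndarray, image_ids: list[str],
                   min_cluster_size: int = 15) -> dict[int, dict]:
    """Cluster description embeddings and pick a representative image per cluster."""
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embeddings)

    clusters: dict[int, dict] = {}
    for label in sorted(set(labels)):
        if label == -1:          # HDBSCAN marks noise points with -1; skip them
            continue
        members = np.where(labels == label)[0]
        centroid = embeddings[members].mean(axis=0)
        # Representative image: the member whose embedding is closest to the mean.
        distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
        representative = members[int(np.argmin(distances))]
        clusters[label] = {
            "size": int(len(members)),
            "representative": image_ids[representative],
            "members": [image_ids[i] for i in members],
        }
    return clusters
```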
We also looked at OpenAI GPT-4 Vision, a proprietary model from OpenAI that was considered state of the art at the beginning of 2024. It takes about 15 seconds per image on average, although that includes the network round-trip time. Before we used the service, we took a close look at its terms of service, which say that data sent to this API will not be used to train or improve OpenAI models, and that images are not retained by OpenAI, so we would call it a zero-retention system. The last model we evaluated was Microsoft's Phi-3.5 Vision, also an open-source model. It takes around four seconds per image for a description of eight to ten sentences, and we ran it on an internal server as well, so no data was sent outside.

The description we were looking at earlier was generated by the proprietary model, and this is the description generated by Phi-3.5. You can see it's a lot shorter and not as detailed, and it makes some incorrect assumptions: for example, it assumes that Robert Helpmann is the photographer, whereas he's actually the subject of the photograph. So it's not doing as well here, but I will say that for other images it did quite well, and we'll see how it fared in the evaluation when we compared the different models. The other thing to say is that this is a fast-moving field: an open-source model released just last week by the Allen Institute provides quality similar to the OpenAI proprietary descriptions, if not better.

So how did we do our evaluation? For the Phi-3.5 and OpenAI descriptions, we generate embeddings using an open-source system called Nomic, and then we store the embeddings in Solr. A point to make here is that if your organisation has some technical capability and is already using systems like Solr or Elasticsearch, then once you've generated these descriptions and embeddings it's actually quite easy to store them in Solr or Elastic, and you can provide a search out of the box using them. I think that's quite important, along with the point that the open-source models are catching up very quickly to the proprietary ones.

We convert an incoming text query to an embedding using Nomic and then run a K-nearest-neighbour (KNN) query: K is just the number of results we want back, and NN stands for nearest neighbour. All image-similarity searches use embeddings of the image descriptions and a KNN query. For CLIP, we're able to use its own embeddings because, as I mentioned, it maps both images and text into the same embedding space.

This is the query set we used for our evaluation. We wanted to be able to compare the different systems across a fixed set of queries rather than letting people put in ad hoc queries, and we classified them into different types. There are specific item and place-name queries like Duntroon or the Sydney Cricket Ground, person names like William Morris Hughes, categories like book covers or propaganda posters, styles such as "Japanese print with lots of red" or "black and white images of big city streets", and then more abstract queries as well, like "people looking very happy, smiling, or laughing outside".

This is a screenshot of the evaluation interface we used. You can see that the query at the top is "bridge construction".
We're doing an A/B comparison: behind the scenes we select two of the systems at random (they must be different) and display the results for each of them side by side. Across the top there are buttons where a user can say A is slightly, somewhat, or much better, with the same options for B, plus a button in the middle for no preference. For this example, if you look at the images, the two result sets look pretty similar — you can scroll down to see more — so if I were evaluating this response I would probably say no preference, because to me these would be considered pretty much equal.

This is the interface for evaluating similarity results, so image similarity. The image at the top left of each set is the one we want to find similar images for. You can see that set A is doing a pretty good job of returning other images of churches and buildings in that particular style. Set B is not doing so well: image number three looks more like a hotel or a pub, which isn't really that relevant, and there's an interior image and a city scene as well. So if I were evaluating this, I'd probably say side A is doing much better.

So what kind of results did we get back? For image similarity, this table shows the head-to-head win ratio as a percentage. Looking at the leftmost column: when CLIP was competing against just the catalogue metadata, it was preferred 86% of the time; when it was compared against the OpenAI descriptions, it was preferred 67% of the time; and so on. The overall score is a weighted average across all the comparisons a model took part in, with the points for a draw shared — a draw being when users click the no-preference button. The overall score for CLIP came out ahead: it was preferred 71% of the time for image similarity.

Then for the plain search results there's a much larger table, which I won't go through in full, but basically more blue is better, and the variant down the bottom is doing better overall. We looked at catalogue metadata, Phi-3.5, CLIP, and OpenAI on their own, and then some combinations. For the combined variants, we use a single Solr query with weighted OR clauses. For example, for the variant right down the bottom, which did best overall, we send a query to Solr — which has fields for the CLIP embedding, the OpenAI description embedding, and the catalogue metadata — and weight what comes back: 50% to the CLIP result, 30% to the OpenAI description embedding, and 20% to the catalogue metadata. Comparing that against catalogue metadata on its own, users preferred it 93% of the time.

These results are from over 900 judgments by 15 NLA staff. As I said, more blue is better: the variant at the bottom was preferred 66% of the time against the version at the top, and so on, and its overall score comes in at 66%. But you can see it's not that different from some of the other variants; we get 60% for the variant above it, which uses Phi-3.5 rather than OpenAI.
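To illustrate what that 50/30/20 weighting means in practice, here is a small sketch that approximates it on the client side: run the three retrieval signals separately, normalise their scores, and blend them with the stated weights. The NLA work did this inside a single Solr query with weighted OR clauses; this stand-alone version just shows the weighting idea, and the function names and example scores are made up.

```python
from collections import defaultdict

def normalise(results: list[tuple[str, float]]) -> dict[str, float]:
    """Scale one signal's scores into [0, 1] so the weights are comparable."""
    if not results:
        return {}
    top = max(score for _, score in results) or 1.0
    return {doc_id: score / top for doc_id, score in results}

def blend(clip_hits, desc_hits, metadata_hits,
          weights=(0.5, 0.3, 0.2), top_k=20):
    """Combine three ranked lists of (doc_id, score) with the 50/30/20 weighting."""
    combined = defaultdict(float)
    for hits, weight in zip((clip_hits, desc_hits, metadata_hits), weights):
        for doc_id, score in normalise(hits).items():
            combined[doc_id] += weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example with made-up scores: a document that ranks well on the CLIP signal
# dominates, but description-embedding and metadata matches still contribute.
print(blend(
    clip_hits=[("img-1", 0.92), ("img-2", 0.80)],
    desc_hits=[("img-2", 14.1), ("img-3", 9.7)],
    metadata_hits=[("img-3", 3.2)],
))
```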
Coming back to the table, if you look at the score there, 52: the variant that came out on top is only preferred 52% of the time when compared against the 80% CLIP, 20% Phi-3.5 combination. So for organisations that don't want to use proprietary models, we can see that for this particular use case, using an open-source model like Microsoft's Phi-3.5 is feasible and you can get results of similar quality. And as I said, the open-source models are continuing to catch up to the proprietary ones.

In summary, we're saying that AI vision models can be used to provide detailed descriptions of images. The text description of an image from a vision model allows us to provide a standard keyword search over these descriptions, giving improved precision and recall when searching for pictures. It also allows us to provide a similar-images feature, and we can cluster images based on these descriptions. Embeddings of the text and of the images can also be incorporated into providing these capabilities. Our evaluation of search results showed that the variant shown there was considered best 66% of the time in competition with the other variants, though other combined variants were quite close to it. For image similarity, CLIP was the clear winner, preferred 71% of the time.

The way I think of this is that it's quite possible some of the images I've shown you here today have hardly been seen at all since they were first catalogued. So I think there's a lot of promise that these technologies can be used to help surface images, help curators better understand what's in their own collections, and allow people to find images from this historical archive. So thank you very much.
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.