
Presenter: Peter Leonard (for Peter Broadwell)
The ability to recognise and digitally encode human movements and actions, both in individual and interpersonal contexts, holds great potential for elevating the accessibility and understanding of digitised audiovisual materials. As part of the Machine Intelligence for Motion Exegesis (MIME) project, collaborators worked to apply new transformer AI-based video analysis models to archival recordings of theatre performances, political speeches and other events, exemplifying the methods’ potential relevance to any video collection containing human movement.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
Thank you so much for the chance to come and speak today. It feels like it was only earlier today rather than a year ago. But I'm excited to be able to present the work of some really amazing colleagues at Stanford Libraries. Peter Broadwell on the left, Simon Wiles in the middle and Vijoy Abraham on the right are all members of the Stanford University Libraries team, and they've been working very closely with the professor of Theater and Performance Studies, Michael Rau. What you'll see in a second is what they've been working on, which is really exciting. It goes under the title Machine Intelligence for Motion Exegesis, which is an acronym that forms MIME. And I'm actually going to see if I can play this. It's a project to use pose estimation to study recorded performances. This is the same performance of, I believe, Fondly Do We Hope, Fervently Do We Pray, being processed by two different algorithms: the former state of the art on the left and the current state of the art on the right. What I'd love you to do is to keep your eye on the differences between the two videos, because this talk is really about precisely these advancements in person detection and in tracking motion and pose.

The question this project really started with was: what if we could get the same quality of data that you can get when a performance happens in a motion capture studio, with those multiple camera angles and the sort of globes or dots glued to the performers' bodies, but without having to have had that performance occur in a specialized space? What if you could go back to any historical video that was recorded on videotape or DVD and get this amount of information? This was actually very hard until recently. The video on the left is representative of the former state of the art, a convolutional, feature-based approach. It did incorporate some spatial and temporal tracking, but it wasn't really able to keep people disambiguated when they passed behind each other or when the view was temporarily occluded. You can see this shortcoming in the numbers on top of their heads, which count how many times the system lost track of who was who. The video on the right, which you have been watching, uses transformers both for individual person detection and for multi-person tracking in time and space.

And so this advancement really allows us to go forward with the MIME project, the purpose of which is to explore the uses of human pose estimation and related technologies for analyzing video recordings. The initial corpus is actually recordings, I believe, of the work of Professor Michael Rau, theatrical performances from Stanford, which gives us an easier story in terms of copyright and access to the material. But really, what the project has resulted in is a very advanced, researcher-focused web interface that uses all sorts of interesting interface affordances to let you search through pose and action, all stored in a Postgres database with some really interesting extensions for vector similarity search. We're happy to acknowledge the support of the Stanford Human-Centered AI program.
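To make the vector similarity search just mentioned a bit more concrete, here is a minimal sketch of how tracked poses could be queried for nearest neighbours in Postgres. The talk only says the project uses "extensions for vector similarity search", so the pgvector extension, the poses table, its columns and the flattened 34-value keypoint layout below are all illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: nearest-neighbour pose search in Postgres with pgvector.
# Table name, column names and the 34-dimensional flattened (x, y) keypoint
# layout are assumptions for illustration only.
import psycopg2

conn = psycopg2.connect("dbname=mime_demo")
cur = conn.cursor()

# A query pose: 17 keypoints flattened to 34 numbers, normalised beforehand.
query_pose = [0.0] * 34
pose_literal = "[" + ",".join(str(v) for v in query_pose) + "]"

cur.execute(
    """
    SELECT video_id, frame_number, track_id
    FROM poses
    ORDER BY keypoints <-> %s::vector  -- pgvector's Euclidean-distance operator
    LIMIT 10;
    """,
    (pose_literal,),
)
for video_id, frame_number, track_id in cur.fetchall():
    print(video_id, frame_number, track_id)
```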
What are the types of questions, the research inquiries, that might be supported by being able to track pose and action on videotape? There's the possibility of detecting what we call directorial signatures, choreography, performance blocking, individual acting. You can think of this as a visual corollary to Moretti's distant reading. You might be able to identify characteristic poses and actions in a performance. You could track a performance as it goes through rehearsal and evolves over time, as the director or the choreographer or the team change things, or actors respond to other ways of working or other ideas. The system I'm going to show you in just a second is most decidedly for researchers rather than for a public to use, but I would be curious about folks' feedback and possible applicability beyond the space of theater. I do think, in broad terms, the notion of tracking actions and poses is another one of those multimodal engagements or interventions into a video archive that could find purchase in sports videos or in cinematic narrative film as well.

This is MIME. The American designer Iris Apfel once said, more is more and less is a bore, and this has informed our approach to interface design. At the top left is a timeline, which includes time-series data such as tracked poses and faces per shot, the boundaries of shots, movement between frames and, most crucially, how interesting a pose or an action is. Now, a pose's interestingness we define by taking a kind of normalized average of a 17-dimensional vector. So imagine just a stick figure with her arms straight up and down: we can do the math and say this pose is very different from a standard stick figure pose. For actions, and measuring the interestingness of actions, we actually use a 60-element embedding that I'll discuss in just a couple of slides. But the key thing is, as I scrub through the video on the left, what I get is a target pose in the bottom left, and the timeline up top will actually highlight poses that match this particular actor's, I guess they're lifting their left hand up, and it'll highlight that in the timeline. And that's a really exciting way to experience the performance or the video, but it's not the only way that you can select what we call a target pose. The other thing you can do, if you're feeling up to it, is perform in front of your own webcam and strike a pose, so to speak, and that will actually be input into the system, and it will find matching poses and actions. And then we also have this other interface which lets you arrange a sort of stick figure and say, I need somebody posed like this, can you find it in the corpus?
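As a rough illustration of the "interestingness" score described above, the sketch below scores a pose by its normalized average keypoint deviation from a neutral reference pose. The choice of neutral pose and the centring and scaling scheme are assumptions for illustration, not the project's exact formula.

```python
# Illustrative sketch: score how "interesting" a pose is as its average
# keypoint deviation from a neutral reference pose. The normalisation
# (centre on the keypoint centroid, scale by the largest keypoint radius)
# is an assumption, not MIME's exact definition.
import numpy as np

def pose_interestingness(keypoints: np.ndarray, neutral: np.ndarray) -> float:
    """keypoints, neutral: arrays of shape (17, 2) holding (x, y) coordinates."""
    def normalise(pose: np.ndarray) -> np.ndarray:
        centred = pose - pose.mean(axis=0)             # remove position
        scale = np.linalg.norm(centred, axis=1).max()  # remove apparent size
        return centred / (scale if scale > 0 else 1.0)

    a, b = normalise(keypoints), normalise(neutral)
    # Average per-keypoint distance from the neutral pose.
    return float(np.linalg.norm(a - b, axis=1).mean())
```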
Underlying these advancements, of course, is the vision transformer, which follows the advent of text transformers. We already heard about the paper Attention Is All You Need, from 2017; the vision transformer papers came out in 2021. It extracts features differently than previous convolutional neural networks. It still uses patches, but it uses flattened patches. To my eyes, that's a little less intuitive than the way convolutions worked. But it turns out not to really matter, because this interesting combination of a multi-layer perceptron layer with multi-headed self-attention allows the model to essentially teach itself, or to learn, what parts of the image are important in relationship to one another. So the key point here is that this works very well. And vision transformers are trained very similarly to text transformers. I'll see if I can bring that up there.

In a text transformer we might mask out specific words, so we might take a sentence like "I went to Starbucks and bought a blank", and the text transformer would eventually learn that it was very likely that I was going to buy a latte or a flat white, and unlikely that I was going to buy a cat. The same idea applies here: we mask out sections of images and then expect the vision transformer to work out what is likely in that particular patch. And what's kind of interesting here is that the video models we're using, I think they're masked autoencoders, turn out to be self-supervising in a really interesting way. What this means is that if your research corpus is cooking videos and you run a model trained, or rather fine-tuned, on cooking videos, it turns out to be very good at downstream tasks like recognizing grabbing an ingredient, or where the hands are in the video. The same is true of sports, of martial arts, of theater. So it really makes a big difference: having a corpus of theatrical videos in this case is actually really important for us and delivers very good results.

What MIME is about is pose analysis via a pipeline of specialized models, including many of the models I've just described. It's not one monolithic model; it is many things glued together. And I think this is very common in a lot of other projects. The Distant Viewing Toolkit out of the University of Richmond, Lauren and Taylor's project, also glues together multiple models in order to achieve really interesting results. This actually includes some pretty bog-standard models such as scene detection, and also human body mesh representation, which is one of a couple of different ways of understanding a human body: you could have a skeleton, you could have a 3D volumetric mesh. There are some interesting pose bounding-box detections and things like that. What's really key, though, is two models from UC Berkeley. One of these comes from Kanazawa's KAIR lab and one of them comes from Professor Malik's BAIR lab, and they come together to produce a model for predicting human appearance, location and pose, or PHALP. I'm going to talk a little bit about that on the next slide.

The goal of this approach really is to follow people as long as we can in these theatrical recordings. We don't want to miss them when they step behind a curtain, or the lighting changes, or the camera angle changes. The longer we can track them, and crucially the longer we can track them through an inferred 3D space, the better, because some of our models are trying to extrapolate structure from what is a two-dimensional viewpoint in a camera. You don't want to consider each of these frames as isolated in time. You want to know what the 3D space is. You want to know what the actors look like: their hair, their appearance, their makeup, their clothing. All of this comes together to track these people as long as humanly possible, and that results in much better data long-term. You can see a little bit more detail here about how the vision transformer and a bespoke, language-model-style transformer are used to track 3D people in the video. Once we track them, we can assign their pose sequences to what we call tracklets. So we're tracking people and we're understanding their poses. We've also got this notion that is kind of unique to the transformers moment, I would suppose, which is the ability to interpolate missing poses rather than lose them entirely.
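As a simplified illustration of that interpolation idea, the sketch below fills in missing poses inside a tracklet with straight-line interpolation between the nearest observed frames. The actual transformer-based trackers described above use learned appearance, location and pose cues rather than linear interpolation; this is only to show the shape of the problem.

```python
# Simplified sketch: fill in poses for frames where a tracked person was
# occluded, using linear interpolation between the nearest observed poses.
import numpy as np

def fill_missing_poses(tracklet):
    """tracklet: list of (17, 2) keypoint arrays, with None where occluded."""
    observed = [i for i, pose in enumerate(tracklet) if pose is not None]
    filled = list(tracklet)
    for i, pose in enumerate(tracklet):
        if pose is not None:
            continue
        prev = max((k for k in observed if k < i), default=None)
        nxt = min((k for k in observed if k > i), default=None)
        if prev is None or nxt is None:
            continue  # cannot interpolate before the first or after the last sighting
        t = (i - prev) / (nxt - prev)
        filled[i] = (1 - t) * tracklet[prev] + t * tracklet[nxt]
    return filled

# Example: a 3-frame tracklet where the middle frame was occluded.
track = [np.zeros((17, 2)), None, np.ones((17, 2))]
restored = fill_missing_poses(track)  # middle frame becomes the halfway pose
```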
So imagine I'm walking across the stage and I walk behind a column, or I walk behind another actor, another performer. It's disastrous if we lose that continuity and we think there's a new person who just sprang out of nowhere. What we really want to do is actually interpolate missing poses. Now, this might sound similar to hallucination or something, but it actually is kind of a good thing, because you're trying to keep continuity and you're trying to understand motion throughout an entire performance. We need a system that is resilient to these transient blips. Unlike in Star Trek, people don't just disappear suddenly; usually they're still there. We need to have the ability to track them throughout all of this.

So let's talk a little bit more about what is enabled here. If we can actually track multiple people individually across multiple seconds, or even minutes if we get lucky, of a video, we can do far more sophisticated analysis of their movements and actions, and you might even say their interactions, over time. This could include really simple things like how much a specific person moves and how close they tend to be to other actors. It can also lead up to something I've been talking about, which is action detection. And that's actually really interesting, because action detection allows you to go beyond static poses and understand something more about what's actually going on in the video.

So let's move over to this model. Action detection, or action recognition, is what the model on the screen here is doing. And what you can do is look at some of the final labels on the right side of the screen. These labels might seem pretty banal: they're things like standing or running or pointing. That isn't really very exciting. If all you had was a label of somebody standing in Henry V, that is not very exciting. But just like with convolutional neural networks, just like with ImageNet solvers such as Inception, it turns out that the real power in the model is one layer behind the final label. If we step into the penultimate, or semi-final, layer of this action detection model, we get this feature space of about 60 dimensions that leads up to the difference between standing and sitting. That is actually a very powerful space. It's more powerful than standing and sitting, because it lets us describe something as, say, halfway between standing and sitting, and from that create a sort of crouch or a crawl or something like that. We can describe it using the vision system that understands action, without being beholden to the labels themselves. We could even curate a kind of theater-specific set of actions based on these dimensions, or do the same in any kind of domain that we're running this stuff on top of.

Let's return to the interface. And here, instead of looking at a Stanford theatrical performance, let's look at a theatrical performance from the Japanese nō tradition. What we've done here is we can see the models and techniques that we've just described at work. At the right are the actions and poses that closely match the query pose or action on the bottom left, embodied by an actor, using inferred 3D pose keypoints and 60-element action vectors. So we're able to do this really powerful query, and it isn't just, am I standing like this at the front of the stage: we can actually lift that out and find that pose even if my back is turned or if I'm rotated 90 degrees.
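To illustrate why that penultimate-layer space is so useful, here is a minimal sketch of comparing and blending roughly 60-dimensional action embeddings instead of final labels. The embeddings below are random placeholders; how MIME actually extracts and stores its action vectors is not specified in this talk.

```python
# Illustrative sketch: work in the ~60-dimensional penultimate-layer action
# space rather than with final labels such as "standing" or "sitting".
import numpy as np

def action_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two action embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Placeholder embeddings standing in for real model outputs.
standing, sitting = np.random.rand(60), np.random.rand(60)

# Something "halfway between standing and sitting" (a crouch, say) can be
# expressed as a blend of embeddings, without needing a dedicated label.
crouch_query = 0.5 * standing + 0.5 * sitting
print(action_similarity(crouch_query, standing))
```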
So remember, we talked earlier about interpolating or inferring the 3D space. If you infer the 3D space, you can then compute the poses and the actions invariant to their axial rotation. We see this inferred positioning, I think, there in the bottom left: it's recreated a 3D space with some of these poses. And that is a really exciting visualization for us, because it suggests that we've successfully recreated at least a primitive representation of the theatrical space.

Now, I'm going to have to leave any questions you might have about the technical details of this project to an email to Peter Broadwell and the other members of this team. But I know that he and everybody else who's worked on the MIME project would love to talk about what we've accomplished, the challenges that remain, the questions of how we index and search and do semantic labeling, and really the 10,000-foot-view questions about the possibilities of this in a large-scale filmic archive. Nations like Norway have digitized almost all of their filmic culture, their video culture, and countries like Australia have done an amazing amount of work as well. As I said at the beginning of my talk, I think this really represents the possibility of another form of analysis. You might call it another form of multimodal analysis, one that would complement auditory analysis, analysis of the subtitles, and analysis of the speech with Whisper. It all adds up to producing a more complex and sophisticated understanding of this culture. Thank you very much.
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.