
Presenter: Svetlana Koroteeva
Traditional music cataloguing in libraries involves a lot of manual work, where librarians categorise each piece of music by genre. This process is time-consuming and sometimes lacks consistency. Svetlana Koroteeva presents her study on developing a tool that can correctly identify genres from bibliographic records and audio features, from selecting a model and preparing the data through to training the model.
Technology, language, history and creativity converged in Canberra for four days as cultural leaders gathered for the world's first in-depth exploration of the opportunities and challenges of AI for the cultural sector.
This transcript was generated by NFSA Bowerbird and may contain errors.
Good afternoon, everyone. I'm going to talk about my master's project, Exploring Auto-classification of a Music Dataset for the National Library of New Zealand. I worked on this project with Richard Robertson, a metadata workflow specialist. The National Library of New Zealand plays a crucial role in preserving and providing access to New Zealand's national heritage. As part of this role, we have developed a significant collection of published New Zealand music: approximately 36,000 releases across formats including vinyl, cassette, CD and digital. Even with the high level of automation in modern libraries, there is still a lot of manual work. Here you can see four main cataloguing approaches. The first is the musical composition code based on the MARC 21 vocabulary. The second is the selection of appropriate Library of Congress subject headings. The third is a free-text description written by cataloguers. And, very relevantly, we also assign appropriate Māori subject headings, Ngā Upoko Tukutuku terms.

Let's talk about most of these approaches in detail. The first is the musical composition code. Here you can see part of a bibliographic record, field 008, and in this field you can see two letters, R and C. That stands for rock. These two characters correspond to a term, and you can see the table of these terms on the right side of the slide. The music composition code has limited depth, especially for popular music genres. It also has a code called ZZ, or "other", and a lot of music falls into this gap.

The next approach is Library of Congress subject headings. Here you can see the 650 field. These subject headings have more depth than the MARC 21-based vocabulary, and potentially they could be mapped to a more suitable descriptive vocabulary. In 2019 we stopped adding Library of Congress subject headings to music records, as it was no longer sustainable to spend the time selecting and assigning them. Since we stopped, we have collected 10,000 music recordings.

The next one is free-text description. Here you can see an example of a 500 field. The cataloguer added a description based on listening and on information available from appropriate sources. Examples of these descriptions are the following: experimental and improvised electronic noise music; alternative indie lo-fi rock music; instrumental heavy metal and stoner rock music with progressive rock elements. The sources for these descriptions could be record stores and music platforms like Bandcamp and Spotify.

With the classification of recorded music, we have to remember that it is not static, it is fluid. New original styles are being created, genres evolve into sub-genres, and previously used vocabulary can be reused. Classification of music also has some limitations: we are limited by what we know and how we interpret what we hear, and a controlled vocabulary system changes only after the changes have already happened.

That was about cataloguing; let us move on to automated classification. Within the scope of this project I am using composition codes, which are going to be our labels. The dataset consists of 21,000 folders. We call them archival packages, as they were extracted from our library preservation system. An archival package is a set of folders with unique identifiers as names, which also contains an XML file with descriptive information, called METS, and the music file itself. For this project we didn't use the original files, only MP3 access copies, and the total size of this set is more than one terabyte. With the original formats, the size would be much bigger.
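[The talk itself includes no code for this step, but as a rough illustration of where the composition code lives: in MARC 21 music records, the two-character form-of-composition code sits at positions 18-19 of field 008. The sketch below reads it with the pymarc library; the file name and the small genre lookup table are illustrative assumptions, not material from the talk.]

```python
# Sketch: reading the MARC 21 "form of composition" code from field 008
# (character positions 18-19 in music records) with pymarc.
# "records.mrc" and the code table excerpt are assumptions for illustration.
from pymarc import MARCReader

# Small assumed excerpt of the MARC 21 composition-code vocabulary.
COMPOSITION_CODES = {
    "rc": "Rock music",
    "jz": "Jazz",
    "fm": "Folk music",
    "zz": "Other",
}

with open("records.mrc", "rb") as handle:
    for record in MARCReader(handle):
        fields_008 = record.get_fields("008")
        if not fields_008:
            continue
        code = fields_008[0].data[18:20]  # two-character form of composition
        identifier = record["001"].data if record["001"] else "unknown"
        print(identifier, code, COMPOSITION_CODES.get(code, "unmapped"))
```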
Before diving into the magic of machine learning, let's remember what machine learning is. Let's think of it as an aggregation of mathematical and statistical methods. And as we all know, the core of mathematics is numbers, so we need to convert our materials to numbers.

Let's start with the music. Music can be converted to numbers very naturally, as it has a lot of numerical features like tempo, frequency, volume and pitch. We can also use technical audio characteristics such as mel spectrograms, spectral centroid, chroma features and tonnetz. For this project I used the Python library librosa, and extracting features with librosa is pretty straightforward. I used only four of them: MFCCs, the mel spectrogram, the chroma vector and tonnetz. Here you can see code snippets. Even though the code is very simple and pretty straightforward, the process took me one week on two machines, and that is given that I used only the first 30 seconds of each music file. As a result of this code, we obtain a huge CSV file. Each row of this CSV file represents a music track with its name and identifier, plus around 500 columns of numbers.

Then let's see what other features we can obtain from this dataset. Let's look at the text. Here you can see an example of a bibliographic record. Compared to other bibliographic materials, music records do not contain much textual information, but we will see in this project how this small amount of information can improve our overall accuracy. I extracted all composition codes from the 008 field across the whole dataset, and here on the diagram you can see the structure of the dataset. The set is very imbalanced: some genres, like rock or song, have thousands of instances, while others, like requiem or anthem, have only a few; they are not even visible.

And now we need to convert this text to numbers. How can we do that? There are several approaches. For this particular project I used BERT, which stands for Bidirectional Encoder Representations from Transformers. This highly effective algorithm was developed by a group of Google researchers for natural language processing tasks such as classification, question answering and translation. There are other techniques, for example TF-IDF; however, I previously tried both techniques on bibliographic records, and BERT performed much, much better than TF-IDF. Here you can see code snippets. For this process I used the Python library TensorFlow together with a link to the model. As a result of the code, we obtain another CSV file. Here, on the very right, you can see the BERT embeddings. These numbers are word vectors, or contextual word embeddings, which we are going to use for further classification.

Then we need to join both CSV files. Let's review the entire process. We have storage with archival packages. We extract music features from the MP3 file and add them to a CSV; then, using system identifiers, we access the bibliographic record, extract the textual information from it, convert it to word vectors and add it to the same CSV file. And now we are ready for machine learning. Here on this slide you can see examples of different machine learning libraries available in Python. We are lucky here: we can use supervised methods, as we have very good quality labels, which are the result of the work of our cataloguing team. For this project I used the XGBoost classifier and the Random Forest classifier. Here on the right you can see a very simple example of how to use these libraries.
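[The feature-extraction snippets shown on the slides are not reproduced in the transcript. Below is a minimal sketch of what extracting MFCC, mel spectrogram, chroma and tonnetz features from the first 30 seconds of each MP3 with librosa could look like; the file paths, the identifier and the mean-over-time aggregation are illustrative assumptions.]

```python
# Sketch of per-track feature extraction with librosa, as described in the talk:
# MFCC, mel spectrogram, chroma and tonnetz from the first 30 seconds of an MP3.
# Paths, identifiers and the mean-over-time aggregation are assumptions.
import librosa
import numpy as np
import pandas as pd

def extract_features(mp3_path: str) -> np.ndarray:
    # Load only the first 30 seconds, as in the project.
    y, sr = librosa.load(mp3_path, duration=30)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)
    # Collapse the time axis so every track becomes one fixed-length row.
    return np.concatenate([m.mean(axis=1) for m in (mfcc, mel, chroma, tonnetz)])

rows = []
for identifier, path in [("IE123456", "access_copies/IE123456.mp3")]:  # illustrative
    rows.append([identifier, *extract_features(path)])

pd.DataFrame(rows).to_csv("audio_features.csv", index=False)
```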
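[The BERT step is likewise only described, not shown. Here is a minimal sketch using TensorFlow Hub; the specific model handles and the use of the pooled output are assumptions, since the talk only mentions TensorFlow and a linked model.]

```python
# Sketch of turning the textual part of a bibliographic record into BERT
# embeddings with TensorFlow Hub. The model handles below are assumptions.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 - registers ops required by the preprocessor

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

record_text = tf.constant([
    "Alternative indie lo-fi rock music",  # illustrative record description
])
embeddings = encoder(preprocessor(record_text))["pooled_output"]
print(embeddings.shape)  # (1, 768): one contextual vector per record
```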
So we take our CSV file, assign the column which is going to be the label, split it into test and train sets, feed it to the model, predict, and get a set of predicted labels. As I mentioned previously, the set is very imbalanced. This is not a problem as such, since libraries like XGBoost and Random Forest can manage imbalance very well. However, some of the classes have only a few examples, and this can be a problem when using standard machine learning methods. For this particular project I used Reptile, which is a few-shot meta-learning algorithm, and a model-agnostic meta-learning algorithm called MAML. If you're interested in the code, I can share it.

How does it work? First, we need to group some of the classes into one. For example, here we have eleven classes with very few instances; let's group them together and call them "others". And this is our entire pipeline. At the same time, we train on those instances labelled as "others" with the MAML and Reptile algorithms. And then we can use it: the first model classifies everything, and for those items predicted as "others", there is a second classification step with the few-shot learning model.

Here you can see a confusion matrix, used to measure the success of machine learning models. Actual labels are on the left, predicted labels on the bottom, and the diagonal represents the labels that were predicted correctly. For this project I tried different combinations: XGBoost and Random Forest with and without BERT, and also Reptile and MAML with BERT embeddings. And the winners are XGBoost and Reptile. XGBoost shows 96% accuracy and Reptile 90%, which is pretty good but could also be better.

What next? First, this classification is only by composition code, but we would be interested in classifying the set by subject headings. Besides, in our dataset we have examples for only 47 music genres, but there are 74 altogether on the Library of Congress website. What should we do with those genres we don't have in the collection? Should we bring in external examples, or maybe adapt existing machine learning models, like, for example, a BERT for music or a Spotify model? Or maybe do something more interesting, like collaboration between libraries to share music examples or even models.

Thank you for listening. On the slide is a reference to the models I used for this project, and if you have any questions, please contact me at this email.
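[To make the classification step concrete, here is a hedged sketch of the train/test/predict loop with XGBoost and scikit-learn, including the grouping of rare composition codes into an "others" class. The column names, the rarity threshold of 20 instances and the 80/20 split are assumptions; the few-shot second stage (Reptile/MAML) is not shown.]

```python
# Sketch of the classification step described above: the joined CSV of audio
# features and BERT embeddings, labelled with composition codes, split into
# train/test sets and fed to XGBoost. Column names and thresholds are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier

data = pd.read_csv("combined_features.csv")

# Group the rarest composition codes into a single "others" class, which a
# separate few-shot model (Reptile / MAML) can handle in a second stage.
counts = data["composition_code"].value_counts()
rare = counts[counts < 20].index
data["label"] = data["composition_code"].where(
    ~data["composition_code"].isin(rare), "others")

X = data.drop(columns=["identifier", "composition_code", "label"])
y = data["label"].astype("category").cat.codes  # XGBoost expects numeric labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```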
The National Film and Sound Archive of Australia acknowledges Australia’s Aboriginal and Torres Strait Islander peoples as the Traditional Custodians of the land on which we work and live and gives respect to their Elders both past and present.