The aim of this project is to identify the bird species making a call or song in a recording. The project is out of personal interest: having recently bought a house with a garden, I can hear a lot of birds now that it's spring, and I am quite keen to know what they are. Eventually, I would like to port this project over to a Raspberry Pi and build a 'smart' bird feeder that tells me which birds are visiting my garden. For the moment, I am sticking to the simpler task of building a desktop model that 'more or less works'.

Dataset

There are various public birdsong datasets, e.g. the Cornell birdsong dataset and the BirdCLEF dataset on Kaggle. However, they mostly cover American birds, and as I live in the UK, I would like my dataset to consist of birds found in the UK/Europe. Using this British Birdsong Dataset as a guide to which species occur in the UK, I wrote a script to first gather the metadata for recordings of these species on xeno-canto.org, then download the relevant recordings. The recordings are filtered for quality A/B so as to reduce the need to clean the data and remove noise. While noise might help the model generalise better to a real-life scenario, it makes it harder to build a decent baseline classifier to start off with, and there is a reasonable number of training samples to work with even after filtering, so I decided not to download lower-quality recordings.
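As a rough illustration, the download script could look something like the sketch below. The species list and output directory are placeholders, and the JSON fields are based on the public xeno-canto API v2, so they are worth double-checking against the current API docs:

    import json
    import os
    import urllib.parse
    import urllib.request

    # Placeholder species list -- in practice this comes from the
    # British Birdsong Dataset.
    SPECIES = ["Vanellus vanellus", "Troglodytes troglodytes"]
    OUT_DIR = "data/recordings"

    def fetch_recordings(species, quality):
        """Query the xeno-canto API (v2) for one species at one quality grade."""
        query = urllib.parse.quote(f"{species} q:{quality}")
        url = f"https://xeno-canto.org/api/2/recordings?query={query}"
        with urllib.request.urlopen(url) as resp:
            # Pagination over 'numPages' is omitted for brevity.
            return json.load(resp)["recordings"]

    def download_species(species):
        # Keep only quality A and B recordings.
        recordings = fetch_recordings(species, "A") + fetch_recordings(species, "B")
        target = os.path.join(OUT_DIR, species.replace(" ", "_").lower())
        os.makedirs(target, exist_ok=True)
        for rec in recordings:
            # 'file' is the download URL and 'id' the catalogue number.
            urllib.request.urlretrieve(rec["file"], os.path.join(target, rec["id"] + ".mp3"))

    for species in SPECIES:
        download_species(species)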

Challenges with the dataset

  1. Noise: despite filtering for quality A and B, some recordings are still quite noisy

  2. Class imbalance: the most common class has over 1,000 samples, whereas the least common has only around 10

  3. Recording length: the recordings range from very short (under ~10 s) to very long (over ~5 min). Certain models need inputs of the same length, so I'll need to decide how to divide up the samples. In addition, some longer recordings contain long stretches where the bird is not singing; if I naively divide the recordings into chunks, this might hurt the model, since some chunks would be just noise/silence (see the sketch after this list)

  4. Presence of other birds in the recordings: usually the main birdcall in a recording is from the target bird, but other birds are sometimes present (below is an example recording of a northern lapwing with a few other birds audible)

    northern_lapwing/35970.wav

    https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0e7a5c7d-b34f-44e9-920f-268e57192915/35970.wav
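To make the chunking problem in point 3 concrete, here is a minimal sketch of naive fixed-length chunking with a crude energy filter to drop near-silent chunks. The chunk length and RMS threshold here are arbitrary assumptions, not tuned values:

    import numpy as np

    SAMPLE_RATE = 16000   # matches the 16 kHz resampling described below
    CHUNK_SECONDS = 5     # assumed chunk length, not a tuned value

    def chunk_waveform(waveform, rms_threshold=0.01):
        """Split a 1-D float waveform in [-1, 1] into fixed-length chunks,
        dropping chunks whose RMS energy falls below the threshold."""
        chunk_len = SAMPLE_RATE * CHUNK_SECONDS
        chunks = []
        for start in range(0, len(waveform) - chunk_len + 1, chunk_len):
            chunk = waveform[start:start + chunk_len]
            # RMS energy as a crude 'is the bird actually singing?' proxy.
            if np.sqrt(np.mean(chunk ** 2)) >= rms_threshold:
                chunks.append(chunk)
        return chunks

This discards the trailing partial chunk and will still let through loud non-bird noise, but it is enough to stop pure silence ending up in the training set.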

Constructing a dataset suitable for modelling

Different models need different input formats. For Yamnet (see below), which makes framewise predictions, I can feed the waveform as-is into the dataset construction, whereas for spectrograms, I need to chop each recording into equal-length chunks before converting them into spectrograms to feed a model that uses 2D convolutions.
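For the spectrogram route, each fixed-length chunk can be converted with librosa. A minimal sketch, with the FFT/mel parameters as illustrative defaults rather than the values the model ends up using:

    import librosa
    import numpy as np

    def to_log_mel(chunk, sample_rate=16000, n_mels=64):
        """Convert one fixed-length waveform chunk to a log-mel spectrogram."""
        mel = librosa.feature.melspectrogram(
            y=chunk, sr=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels)
        # Log-compress so the 2D CNN sees a sensible dynamic range.
        return librosa.power_to_db(mel, ref=np.max)

Since every chunk has the same length, the spectrograms all come out with the same shape, which is exactly what a model with 2D convolutions needs.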

For both cases, I downsampled all audio to a uniform 16 kHz mono (as this is the input format Yamnet expects).
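With librosa this is a one-liner, since it resamples and downmixes on load (using the lapwing recording above as the example file):

    import librosa

    # sr=16000 resamples and mono=True downmixes; this matches the
    # input format Yamnet expects.
    waveform, sr = librosa.load("northern_lapwing/35970.wav", sr=16000, mono=True)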

The above recording as a waveform:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/559bc140-0086-4cc0-8720-fb4ff608f2e6/example_wav.png

And as a spectrogram: