Monitoring bird populations is a cornerstone of biodiversity assessment and conservation. Manual identification of species from field recordings is labor‐intensive and requires expert ornithological knowledge. This project demonstrates how a Convolutional Neural Network (CNN) can be trained to automatically classify bird species from their songs, using publicly available audio data and open‐source tools.
All audio data for this project were sourced from https://xeno-canto.org/, a website that provides recordings of wild animal sounds from around the world.
The aim of this work is to train a neural network to classify 10 common Greek bird species. Initially, 200 samples per species were collected and preprocessed so that they could be fed into a neural network. A CNN was then trained and its performance was evaluated on the 10 species.
To download the audio files from https://xeno-canto.org/, we used the API provided for this purpose (https://xeno-canto.org/explore/api), which greatly simplified gathering the recordings and organizing them into folders. The plan was to store them on Google Drive and use Colab for the entire processing pipeline.
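As a rough illustration of this setup (not the exact script used), the snippet below mounts Google Drive in a Colab notebook and issues a single query against the xeno-canto API; the query string, the storage path, and the JSON field names follow the v2 API format but should be treated as assumptions.

```python
# Minimal sketch (assumed setup, not the original script): mount Google Drive
# in Colab and query the xeno-canto API (v2) for one species.
import requests
from google.colab import drive

drive.mount('/content/drive')                      # makes Drive available under /content/drive
DATA_DIR = '/content/drive/MyDrive/bird_songs'     # hypothetical storage folder

# The v2 API takes a free-text query plus optional tags (see
# https://xeno-canto.org/explore/api) and returns JSON.
resp = requests.get('https://xeno-canto.org/api/2/recordings',
                    params={'query': 'Turdus merula'})
data = resp.json()
print(data['numRecordings'])                       # total matching recordings
print(data['recordings'][0]['file'])               # direct URL of the first audio file
```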
Of the audio files downloaded, some were in MP3 and others in WAV format. The recordings contained noise or other background sounds (such as calls of other birds, running water, or cars) and moments of silence at the beginning, in the middle, and at the end. Recording duration ranged from a few seconds (0:10) up to 20–25 minutes. Quality was rated on a scale of A, B, C, D, and E.
Using the available API, we retrieved only files with quality A and a duration between 10 and 210 seconds. The audio files were organized into folders per bird species.
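A sketch of how these filters might be applied during download is shown below; the species names, the tag syntax ("q:A", "len:10-210"), and the folder layout are illustrative assumptions rather than the exact script used.

```python
# Sketch (assumptions: tag syntax and folder layout): download only quality-A
# recordings between 10 and 210 seconds, one folder per species.
import os
import requests

SPECIES = ['Turdus merula', 'Parus major']        # hypothetical subset of the 10 species
DATA_DIR = '/content/drive/MyDrive/bird_songs'    # hypothetical storage folder

for species in SPECIES:
    out_dir = os.path.join(DATA_DIR, species.replace(' ', '_'))
    os.makedirs(out_dir, exist_ok=True)
    # 'q:A' restricts quality, 'len:10-210' restricts duration in seconds
    resp = requests.get('https://xeno-canto.org/api/2/recordings',
                        params={'query': f'{species} q:A len:10-210'})
    # Only the first page of results is used here for brevity; the real API
    # response is paginated.
    for rec in resp.json()['recordings'][:200]:   # cap at ~200 samples per species
        audio = requests.get(rec['file'])          # 'file' holds the download URL
        ext = os.path.splitext(rec['file-name'])[1] or '.mp3'
        with open(os.path.join(out_dir, f"XC{rec['id']}{ext}"), 'wb') as f:
            f.write(audio.content)
```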
At this stage, all .mp3 and .wav files were converted to WAV with a single channel and a 16 kHz sampling rate using the pydub library.
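A minimal sketch of this conversion step with pydub might look as follows (file paths are placeholders; MP3 decoding requires ffmpeg to be available):

```python
# Sketch: convert any .mp3/.wav input to mono 16 kHz WAV with pydub.
from pydub import AudioSegment

def to_mono_16k_wav(src_path: str, dst_path: str) -> None:
    audio = AudioSegment.from_file(src_path)   # handles both MP3 and WAV
    audio = audio.set_channels(1)              # single channel (mono)
    audio = audio.set_frame_rate(16000)        # 16 kHz sampling rate
    audio.export(dst_path, format='wav')

to_mono_16k_wav('XC123456.mp3', 'XC123456.wav')   # hypothetical file names
```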
Next, silence was trimmed from the beginning and end of each clip using a 20 dB threshold, the duration was fixed at 5 seconds (using librosa), noise reduction was applied (using noisereduce), peak normalization to [-1, 1] was performed (the y-axis on the left part of Image 1), and the processed files were saved as .wav. To prepare the audio for a neural network, each clip was converted into a Mel spectrogram (right part of Image 1) for feature extraction. The resulting 128×128 spectrograms were saved in folders, ready to be fed into the network when needed.
Labels were then extracted and stored in a CSV file (file paths to the Mel spectrograms, plus an integer label per bird species). The dataset was then split 70/15/15 into training, validation, and test sets, using a stratified split so that class proportions were preserved and the split was reproducible.
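One way to produce such a stratified 70/15/15 split is with two calls to scikit-learn's train_test_split; the CSV column names and the random seed below are assumptions.

```python
# Sketch: stratified 70/15/15 split of the labels CSV using scikit-learn.
# Column names ('path', 'label') and the random seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('labels.csv')                    # one row per spectrogram: path, integer label

train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df['label'], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df['label'], random_state=42)

print(len(train_df), len(val_df), len(test_df))   # roughly 70% / 15% / 15%
```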
Image 1: Time domain (left) and frequency domain (right), i.e. a Mel spectrogram, of the same audio file (species: Turdus merula [Common Blackbird])
Input: 1×128×128
[Conv Block 1]
• Conv2D(1→16, 3×3, pad=1) → 16×128×128
• BatchNorm2d(16), ReLU
• MaxPool2d(2) → 16×64×64
[Conv Block 2]
• Conv2D(16→32, 3×3, pad=1) → 32×64×64
• BatchNorm2d(32), ReLU
• MaxPool2d(2) → 32×32×32
[Conv Block 3]
• Conv2D(32→64, 3×3, pad=1) → 64×32×32
• BatchNorm2d(64), ReLU
• MaxPool2d(2) → 64×16×16
Classifier:
• Flatten → 16,384
• Linear(16,384→128), ReLU, Dropout(0.3)
• Linear(128→10) → logits
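The module names in the listing (Conv2D, BatchNorm2d, MaxPool2d, Dropout) suggest PyTorch, so a sketch of the listed architecture in PyTorch is given below; training details (optimizer, loss, epochs) are not part of the listing and are omitted.

```python
# Sketch of the architecture listed above, assuming PyTorch.
import torch
import torch.nn as nn

class BirdSongCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        def block(c_in, c_out):
            # Conv -> BatchNorm -> ReLU -> 2x2 max pooling (halves H and W)
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.classifier = nn.Sequential(
            nn.Flatten(),                     # 64 x 16 x 16 = 16,384 features
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes))        # raw logits for the 10 species

    def forward(self, x):                     # x: (batch, 1, 128, 128)
        return self.classifier(self.features(x))

model = BirdSongCNN()
print(model(torch.randn(4, 1, 128, 128)).shape)   # torch.Size([4, 10])
```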