
Summary
- The work presents a simple contrastive learning framework that requires no data augmentation; it relies only on sampling different segments from each sequence.
- Evaluated on nine audio-related downstream tasks that go beyond speech, the model outperforms its supervised counterparts.
- The paper conducts an extensive evaluation of different model configurations.
The problems
Prior self-supervised works are limited to particular domains (e.g., speech).
The solution
- Each input pair consists of two segments sampled from the same sequence, which are treated as the positive pair; segments from other sequences serve as negatives. Apart from this sampling and the contrastive loss, no data augmentation is used.
- Train on a large and diverse dataset, AudioSet, and evaluate on nine different audio-related downstream tasks.
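The sampling-and-contrast idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `encode` function is a hypothetical stand-in for the audio encoder, and the loss is a standard in-batch InfoNCE with cosine similarity, which is an assumption about the exact contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment_pair(sequence, segment_len):
    """Sample two random segments from one sequence; they form a positive pair."""
    starts = rng.integers(0, len(sequence) - segment_len + 1, size=2)
    return (sequence[starts[0]:starts[0] + segment_len],
            sequence[starts[1]:starts[1] + segment_len])

def encode(segment):
    # Hypothetical stand-in for the audio encoder: a trivial 2-d summary feature.
    return np.array([segment.mean(), segment.std()])

def info_nce_loss(anchors, positives, temperature=0.1):
    """In-batch contrastive loss: each anchor's positive is the matching row;
    all other rows of the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy, diagonal targets

# Toy batch: 4 "audio" sequences, two segments each, no augmentation applied.
batch = [rng.standard_normal(1000) for _ in range(4)]
pairs = [sample_segment_pair(s, segment_len=200) for s in batch]
anchors = np.stack([encode(a) for a, _ in pairs])
positives = np.stack([encode(b) for _, b in pairs])
loss = info_nce_loss(anchors, positives)
print(float(loss))
```

Note that the only source of "views" here is the segment sampling itself; removing augmentation simplifies the pipeline at the cost of relying on within-sequence variability for the learning signal.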
Thoughts