
Summary
- The work presents a simple contrastive learning framework that requires no data augmentation; it relies only on sampling different segments from each sequence.
- Evaluated on nine audio-related downstream tasks that go beyond speech, the model outperforms its supervised counterparts.
- The paper conducts an extensive evaluation of different model configurations.
The problems
Prior self-supervised works are limited to particular domains (e.g., speech).
The solution
- Each input pair consists of two segments sampled from the same sequence, which are treated as the positive pair; segments from other sequences serve as negatives. Apart from this sampling and the contrastive loss, no data augmentation is used.
- Train on a large and diverse dataset, AudioSet, and evaluate on nine different audio-related downstream tasks.
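The sampling-and-contrast idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `encode` function is a hypothetical stand-in for the audio encoder, and the loss is a standard in-batch InfoNCE with cosine similarity, which is an assumption about the exact contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment_pair(sequence, segment_len):
    """Sample two random segments from one sequence; they form a positive pair."""
    starts = rng.integers(0, len(sequence) - segment_len + 1, size=2)
    return (sequence[starts[0]:starts[0] + segment_len],
            sequence[starts[1]:starts[1] + segment_len])

def encode(segment):
    # Hypothetical stand-in for the audio encoder: a trivial 2-d summary feature.
    return np.array([segment.mean(), segment.std()])

def info_nce_loss(anchors, positives, temperature=0.1):
    """In-batch contrastive loss: each anchor's positive is the matching row;
    all other rows of the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy, diagonal targets

# Toy batch: 4 "audio" sequences, two segments each, no augmentation applied.
batch = [rng.standard_normal(1000) for _ in range(4)]
pairs = [sample_segment_pair(s, segment_len=200) for s in batch]
anchors = np.stack([encode(a) for a, _ in pairs])
positives = np.stack([encode(b) for _, b in pairs])
loss = info_nce_loss(anchors, positives)
print(float(loss))
```

Note that the only source of "views" here is the segment sampling itself; removing augmentation simplifies the pipeline at the cost of relying on within-sequence variability for the learning signal.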
Thoughts