tl;dr

ImageBind is one of the recent multi-modal works that goes beyond audio-visual learning. It proposes using a single modality as an anchor, chosen as the one with the most paired data available (images). This lets otherwise unpaired, disparate modalities converge to a shared representation, enables emergent cross-modal behaviors, and sidesteps the requirement that every data sample contain all modalities.
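A minimal sketch of the anchor-based objective (symmetric InfoNCE, which the paper uses), assuming PyTorch; the function name, tensor shapes, and temperature value are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb: torch.Tensor, mod_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between image (anchor) embeddings and another
    modality's embeddings; both have shape (batch, dim)."""
    img = F.normalize(img_emb, dim=-1)            # unit-norm embeddings
    mod = F.normalize(mod_emb, dim=-1)
    logits = img @ mod.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Matched (image, modality) pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```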

Reading Notes

| Limitation | Proposed Solution |
| --- | --- |
| True multimodal learning needs a large amount of fully paired data, i.e. samples where every modality is present. | Use images as the anchor modality and train on image-paired data with each of the other modalities separately. |
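Concretely, each training batch only needs images paired with one other modality, so training can alternate over independent (image, X) datasets. A hypothetical sketch of that loop, assuming `infonce_loss` from the sketch above is in scope; the encoders and batches are random stand-ins, not real models or dataloaders:

```python
import itertools
import torch
import torch.nn as nn

dim = 512
image_enc = nn.Linear(2048, dim)                          # stand-in encoders
other_enc = nn.ModuleDict({"text": nn.Linear(768, dim),
                           "audio": nn.Linear(1024, dim)})
opt = torch.optim.AdamW(itertools.chain(image_enc.parameters(),
                                        other_enc.parameters()), lr=1e-4)

def sample_pair_batch(modality: str, n: int = 8):
    """Stand-in for a dataloader of (image, modality) feature pairs."""
    in_dim = {"text": 768, "audio": 1024}[modality]
    return torch.randn(n, 2048), torch.randn(n, in_dim)

# No single sample ever carries all modalities; pair types just alternate.
for modality in ["text", "audio", "text", "audio"]:
    images, others = sample_pair_batch(modality)
    loss = infonce_loss(image_enc(images), other_enc[modality](others))
    opt.zero_grad()
    loss.backward()
    opt.step()
```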

KPIs:

The paper claims ImageBind outperforms:

  1. Specialist models trained directly on audio-text pairs
  2. Supervised models for audio-event detection
  3. Prior work on a wide variety of compositional tasks (not centered on images)
  4. Prior results on several other zero-/few-shot learning benchmarks

Design:

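The paper uses a separate transformer encoder per modality (e.g. a ViT on images and on audio spectrograms), each followed by a modality-specific projection head into one shared, normalized embedding space. A minimal sketch, with linear layers standing in for the real backbones and with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Stand-in encoder: backbone + projection head into the shared space."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.backbone = nn.Linear(in_dim, embed_dim)   # stand-in for a ViT etc.
        self.proj = nn.Linear(embed_dim, embed_dim)    # projection head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))
        return nn.functional.normalize(z, dim=-1)      # unit-norm embeddings

embed_dim = 512
encoders = nn.ModuleDict({
    "image": ModalityEncoder(2048, embed_dim),   # anchor modality
    "text":  ModalityEncoder(768,  embed_dim),
    "audio": ModalityEncoder(1024, embed_dim),
})

# All outputs live in one space, so modalities that were never trained
# against each other (e.g. audio and text) can still be compared directly.
audio_emb = encoders["audio"](torch.randn(4, 1024))
text_emb = encoders["text"](torch.randn(4, 768))
similarity = audio_emb @ text_emb.t()
```

Because every encoder lands in the same space, modalities that were only ever paired with images remain mutually comparable, which is where the emergent zero-shot behaviors come from.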