tl;dr

ImageBind is one of the recent multi-modal works that goes beyond audio-visual learning. It proposes using a single modality as an anchor, chosen as the one with the most paired data available (images). This lets otherwise unpaired, disparate modalities converge to a shared representation, enables emergent cross-modal behaviors, and sidesteps the requirement that every data sample contain all modalities.
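A minimal sketch of the anchor-based objective (symmetric InfoNCE, which the paper uses), assuming PyTorch; the function name, tensor shapes, and temperature value are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb: torch.Tensor, mod_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between image (anchor) embeddings and another
    modality's embeddings; both have shape (batch, dim)."""
    img = F.normalize(img_emb, dim=-1)            # unit-norm embeddings
    mod = F.normalize(mod_emb, dim=-1)
    logits = img @ mod.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Matched (image, modality) pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```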

Reading Notes

| Limitation | Proposed Solution |
| --- | --- |
| True multimodal learning needs a large amount of fully paired data, i.e. samples where every modality is present. | Use images as the anchor modality and train on image-paired data with each of the other modalities separately. |
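Concretely, each training batch only needs images paired with one other modality, so training can alternate over independent (image, X) datasets. A hypothetical sketch of that loop, assuming `infonce_loss` from the sketch above is in scope; the encoders and batches are random stand-ins, not real models or dataloaders:

```python
import itertools
import torch
import torch.nn as nn

dim = 512
image_enc = nn.Linear(2048, dim)                          # stand-in encoders
other_enc = nn.ModuleDict({"text": nn.Linear(768, dim),
                           "audio": nn.Linear(1024, dim)})
opt = torch.optim.AdamW(itertools.chain(image_enc.parameters(),
                                        other_enc.parameters()), lr=1e-4)

def sample_pair_batch(modality: str, n: int = 8):
    """Stand-in for a dataloader of (image, modality) feature pairs."""
    in_dim = {"text": 768, "audio": 1024}[modality]
    return torch.randn(n, 2048), torch.randn(n, in_dim)

# No single sample ever carries all modalities; pair types just alternate.
for modality in ["text", "audio", "text", "audio"]:
    images, others = sample_pair_batch(modality)
    loss = infonce_loss(image_enc(images), other_enc[modality](others))
    opt.zero_grad()
    loss.backward()
    opt.step()
```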

KPIs:

The paper claims ImageBind outperforms:

  1. Specialist models trained directly on audio-text pairs
  2. Supervised models for audio-event detection
  3. Prior work on a wide variety of compositional tasks (not centered on images)
  4. Prior results on several other zero-/few-shot learning benchmarks

Design:

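The paper uses a separate transformer encoder per modality (e.g. a ViT on images and on audio spectrograms), each followed by a modality-specific projection head into one shared, normalized embedding space. A minimal sketch, with linear layers standing in for the real backbones and with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Stand-in encoder: backbone + projection head into the shared space."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.backbone = nn.Linear(in_dim, embed_dim)   # stand-in for a ViT etc.
        self.proj = nn.Linear(embed_dim, embed_dim)    # projection head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))
        return nn.functional.normalize(z, dim=-1)      # unit-norm embeddings

embed_dim = 512
encoders = nn.ModuleDict({
    "image": ModalityEncoder(2048, embed_dim),   # anchor modality
    "text":  ModalityEncoder(768,  embed_dim),
    "audio": ModalityEncoder(1024, embed_dim),
})

# All outputs live in one space, so modalities that were never trained
# against each other (e.g. audio and text) can still be compared directly.
audio_emb = encoders["audio"](torch.randn(4, 1024))
text_emb = encoders["text"](torch.randn(4, 768))
similarity = audio_emb @ text_emb.t()
```

Because every encoder lands in the same space, modalities that were only ever paired with images remain mutually comparable, which is where the emergent zero-shot behaviors come from.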