๐Ÿ–‡๏ธ 1. CLIP

1.1 What is CLIP?

CLIP revolutionizes the image classification paradigm.

Standard image classification uses semantically meaningless one-hot labels as supervision.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8adce09f-4fd1-4aeb-969e-4b04bcfd639a/file-20200309-118956-1cqvm6j.webp

Traditional Supervision: 5

One-hot Label: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

Supervision from CLIP:

"A cute dog wearing a mask looks like he is worried about the virus."

[Blog]

CLIP (Contrastive Language-Image Pre-Training)

Training: image-text pairs collected from the Internet.

Testing: compute the similarity between the image embedding and the embeddings of candidate text prompts.

Therefore, it obtains impressive "zero-shot" capabilities.
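The zero-shot testing procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not the real CLIP pipeline: the toy vectors stand in for the outputs of CLIP's image and text encoders, and the prompt names in the comments are hypothetical.

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so that dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Pick the text prompt whose embedding is most similar to the image's.

    image_emb: (d,) embedding from the image encoder
    text_embs: (n_classes, d) embeddings of prompts such as
               "a photo of a cat", "a photo of a dog", ...
    """
    image_emb = normalize(image_emb[None, :])
    text_embs = normalize(text_embs)
    sims = image_emb @ text_embs.T            # cosine similarities, shape (1, n)
    probs = np.exp(sims / temperature)        # softmax over prompts
    probs = probs / probs.sum()
    return int(np.argmax(probs)), probs.ravel()

# Toy example: the image embedding is closest to the second prompt
image = np.array([0.1, 0.9, 0.0])
prompts = np.array([[0.9, 0.1, 0.0],   # e.g. "a photo of a cat"
                    [0.1, 0.9, 0.0],   # e.g. "a photo of a dog"
                    [0.0, 0.0, 1.0]])  # e.g. "a photo of a plane"
best, probs = zero_shot_classify(image, prompts)
```

Because the class "labels" are just text embeddings, any set of prompts can be swapped in at test time, which is what makes the zero-shot behavior possible.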

Paper Title: Learning Transferable Visual Models From Natural Language Supervision

Released on January 5, 2021.

1.2 How does CLIP work?

Training Stage

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/751e0dbf-d594-4c49-9cd3-3382ff81796c/Untitled.png

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/05e29506-1252-4b22-97d0-6a5797008de5/Screen_Shot_2021-04-08_at_1.14.48_AM.png
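The training stage pairs each image in a batch with its caption and pushes matched pairs together while pushing mismatched pairs apart. Below is a minimal numpy sketch of that symmetric contrastive objective, assuming the encoders already produced per-pair feature vectors; the random features and the temperature value here are toy stand-ins, not CLIP's actual encoders or learned temperature.

```python
import numpy as np

def clip_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_feats, text_feats: (N, d) outputs of the image and text encoders.
    Matched pairs share a row index, so they lie on the diagonal of the
    similarity matrix.
    """
    # L2-normalize, then compute pairwise cosine-similarity logits
    I = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    T = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = I @ T.T / temperature               # (N, N)

    labels = np.arange(len(logits))              # correct match = diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
# Perfectly aligned pairs (identical features) should give a much lower
# loss than randomly mismatched features
loss_aligned = clip_loss(feats, feats)
loss_random = clip_loss(feats, rng.normal(size=(4, 8)))
```

The key design choice is that one batch yields N positive pairs and N^2 - N negatives for free, so large batches give the contrastive signal many hard negatives.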