InfoNCE & Metric Learning

Noise Contrastive Estimation & InfoNCE

Instance Discrimination in the unsupervised setting
a. When there are no labels and the dataset has N samples, treating every sample as its own class (N-way classification) has been shown to yield useful features
(Unsupervised Feature Learning via Non-parametric Instance Discrimination)

b. What if N is too large? → Noise Contrastive Estimation
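
The N-way classification above can be sketched as a softmax over similarities between one feature and all N stored instance features. This is only a minimal illustration: the function name, the toy `memory` features, and the `temperature` value are assumptions, not taken from the original post.

```python
import math

def instance_probabilities(query, memory, temperature=0.1):
    """Softmax over instances: probability that `query` is instance i, for every i."""
    # Similarity of the query feature to each stored instance feature.
    logits = [sum(q * m for q, m in zip(query, feat)) / temperature for feat in memory]
    # Subtract the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)  # this sum runs over all N instances
    return [e / total for e in exps]

# Three hypothetical L2-normalised instance features (N = 3).
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = instance_probabilities([1.0, 0.0], memory)
predicted_instance = probs.index(max(probs))
```

The denominator touches every one of the N instances, which is why the cost becomes a problem when N is very large.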
Noise Contrastive Estimation for Unsupervised Feature Learning (NCE)

a. Sample a positive sample and negative samples (from a noise distribution), then treat the problem as binary classification

ex1). image $x_1, x_2$

ex2). When training word2vec, the loss is computed with negative sampling or hierarchical softmax instead of a softmax over the entire vocabulary
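
As a rough sketch of the binary-classification view, the following implements a word2vec-style negative sampling loss (a simplified relative of NCE, not the full NCE estimator). The vectors and function names are made up for illustration.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(context, positive, negatives):
    """Binary-classification loss: the positive pair is labelled 1, noise pairs 0."""
    loss = -log_sigmoid(dot(context, positive))
    for neg in negatives:
        loss -= log_sigmoid(-dot(context, neg))
    return loss

context = [1.0, 0.5]
# Positive aligned with the context, negative pointing away: low loss.
good = negative_sampling_loss(context, [1.0, 0.5], [[-1.0, -0.5]])
# Roles swapped: high loss.
bad = negative_sampling_loss(context, [-1.0, -0.5], [[1.0, 0.5]])
```

Only the sampled negatives enter the loss, so the cost no longer depends on the vocabulary size (or on N).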
InfoNCE loss

a. Uses categorical cross-entropy to identify the positive sample among several negative samples

b. formulation

c. $f(x)^Tf(x^+)$

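d. SimCLR and MoCo, which use this loss, achieved SOTA on a variety of downstream tasks through unsupervised image representation learning (BYOL, often grouped with them, reaches similar results without negative samples)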
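
One way to read "categorical cross-entropy with the positive as the correct class" is the sketch below. It scores one anchor against one positive and several negatives with a dot product; the `temperature` parameter and the toy vectors are assumptions for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy over [positive, *negatives] with the positive as the true class."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # log-sum-exp computed stably by subtracting the maximum logit.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]
# The positive matches the anchor: loss close to zero.
loss_easy = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# A negative matches the anchor better than the positive: large loss.
loss_hard = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```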

Mathematical view of NCE & InfoNCE

(reference : Contrastive Predictive Coding)

Intuition

a. Encode the information shared across different parts of high-dimensional data

ex). information shared between nearby words in a sentence / information shared across adjacent image patches

b. Learn from context

  1. target : image → context : augmented image
  2. target : image patch → context : adjacent image patches or pixels
  3. target : word → context : adjacent or preceding words
  4. target : video frame → context : adjacent video frames
  5. target : video clip → context : concurrent video transcript (sentence)
  6. target : image (when paired with caption) → context : paired caption

Why does InfoNCE maximize the mutual information between target and context?
a. Let data instance $X$

ex). X : a single sentence
x (target) : a token near the context
c (context) : the tokens around x (fixed)

b. Basic mathematical Intuition
- Given context $c$
- context $c$
- In this setting, among the N (= batch size) samples, the positive sample $x_{pos}$
InfoNCE loss function formulation

$\mathcal{L} = -\mathbb{E}_{x}\left[\log \frac{f(x,c)}{\sum_{x'\in X} f(x',c)}\right]$

where $f(x,c) = \exp (v_x^T v_c)$
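
A minimal sketch of this formula over a batch, assuming that row i of `targets` is the positive for row i of `contexts` and that the other rows of the batch serve as negatives. The function name and the toy vectors are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_info_nce(targets, contexts):
    """Mean of -log( f(x_i, c_i) / sum_j f(x_j, c_i) ) with f(x, c) = exp(v_x . v_c)."""
    total = 0.0
    for i, c in enumerate(contexts):
        scores = [dot(x, c) for x in targets]  # log f(x_j, c_i) for every x_j in the batch
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - scores[i]
    return total / len(contexts)

targets = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
# Each context matches its own target: low loss.
aligned = batch_info_nce(targets, contexts=[[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
# Contexts rotated by one position, so every pairing is wrong: high loss.
shuffled = batch_info_nce(targets, contexts=[[0.0, 2.0], [-2.0, 0.0], [2.0, 0.0]])
```

When the embeddings carry no information (for example, all zeros), every candidate scores the same and the loss equals $\log N$.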
How does minimizing the loss function above correspond to maximizing the mutual information between $x_{pos}$ and $c$?
- $MI(x;c) = \sum_{x,c} p(x,c)\log\frac{p(x|c)}{p(x)}$, i.e. the expectation of the log density ratio $\log\frac{p(x|c)}{p(x)}$
  - As in GPT or BERT, $f(x,c)$
  - Maximizing $f(x_{pos},c)$ means maximizing the density ratio, which in turn maximizes a lower bound on the mutual information between $x_{pos}$ and $c$.
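
For reference, the argument can be written out as follows. This is a sketch of the result reported in the Contrastive Predictive Coding paper, not a full proof; $f^{*}$ (the optimal score function) and $N$ (the number of samples in the denominator of the loss) are notation introduced here.

$$f^{*}(x,c) \propto \frac{p(x|c)}{p(x)}$$

$$MI(x;c) \ge \log N - \mathcal{L}$$

The first line says the loss is minimized when $f$ is proportional to the density ratio. The second says that pushing $\mathcal{L}$ down pushes the lower bound on $MI(x;c)$ up, and that the bound can be at most $\log N$, so it becomes tighter as $N$ grows.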

Contrastive Learning Applications

Image Representation Learning (Unsupervised Setting)
a. dataset : a dataset consisting of images only
b. context-target
- target : image patches → context : adjacent image patches
- target : image → context : augmented images
c. ex). SimCLR, BYOL, MoCo
Vision-Language representation learning
a. dataset : a dataset of image-caption pairs
b. context-target
- context : image → target : paired caption
- context : caption → target : paired image
c. ex). CLIP, ALIGN, FLAVA, Florence
Video Representation Learning
a. dataset : data made up of clip-transcript (text) pairs, such as HowTo100M
b. context-target
- target : clip → context : paired transcript
  (or transcripts from -k time steps to +k time steps) → the idea proposed in MIL-NCE
c. ex). MIL-NCE, UniVL, MERLOT
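
The "transcripts from -k to +k time steps" idea can be sketched as an InfoNCE variant whose numerator sums over several candidate positives. The code below is a simplified, single-clip illustration under that reading, not the MIL-NCE reference implementation. All names and vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mil_nce(clip, positive_texts, negative_texts):
    """-log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ), s = clip . text."""
    pos = [dot(clip, t) for t in positive_texts]
    neg = [dot(clip, t) for t in negative_texts]
    top = max(pos + neg)  # shared shift for numerical stability
    pos_sum = sum(math.exp(s - top) for s in pos)
    neg_sum = sum(math.exp(s - top) for s in neg)
    return math.log(pos_sum + neg_sum) - math.log(pos_sum)

clip = [1.0, 1.0]
# Transcripts within +-k steps of the clip; only some may actually describe it.
nearby_transcripts = [[1.0, 1.0], [0.1, 0.2]]
unrelated_transcripts = [[-1.0, -1.0], [-0.5, 0.0]]
loss = mil_nce(clip, nearby_transcripts, unrelated_transcripts)
# Keeping only the weakly aligned transcript as the positive gives a higher loss.
single = mil_nce(clip, nearby_transcripts[1:], unrelated_transcripts)
```

Because the positives are summed, one well-aligned transcript among the candidates is enough to keep the loss low, which fits clips whose narration is not perfectly synchronized with the video.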

Author And Source

More material on this topic (InfoNCE & Metric Learning) can be found in the original post: https://velog.io/@dongdori/InfoNCE-Metric-Learning
