MAB: Multi-Armed Bandit | Notion

Background

문제 정의

$$ q_*(a) \doteq \mathbb{E}\left[R_t \mid A_t=a\right] $$

$t:$ time step or play number
$A_t:$ 시간 $t$에 play한 action
$R_t:$ 시간 t에 받은 reward
$q_*(a):$ 액션 a에 따른 reward의 실제 기대값

모든 action에 대한 $q_*(a)$를 정확히 알고 있다면 문제 해결

하지만 실제 reward의 true distribution을 모르기 때문에 추정

$q_*(a)$에 대한 시간 $t$에서의 추정치 $Q_t(a)$를 최대한 정밀하게 구하는 것이 목표!

추정 가치가 최대인 action을 선택하는 것을 greedy action이라고 한다.

greedy action 선택 → exploitation
다른 action 선택 → exploration

MAB 알고리즘

Greedy Algorithm

Epsilon-Greedy Algorithm

UCB: Upper Confidence Bound

MAB를 이용한 추천 시스템

MAB 알고리즘 심화

Thompson Sampling

LinUCB