Introduction

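Prior work

has tried to make pre-training more efficient by modifying components such as the training objective.

This paper

splits the computation of the attention layer into four information flows and proposes an architecture that focuses on the meaningful flows
implements this focus by organizing the training process into serialized stages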

Information flow

positional information
token information

Each token carries the two kinds of information above. This information is passed between tokens through the four flows below, and the self-attention module is what carries out these flows.

Position (P) and token (T) info among unmasked tokens
P, T info from unmasked to masked tokens
P info among masked tokens
P info from masked to unmasked tokens
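
One way to picture the four flows above is as four disjoint blocks of the self-attention pattern. The sketch below is only an illustration, not the paper's implementation. It assumes that entry [i][j] means "token i (the receiver) attends to token j (the sender)"; the function name partition_attention_flows and the flow labels are hypothetical names chosen for these notes.

```python
from typing import Dict, List

FLOW_NAMES = (
    "unmasked_to_unmasked",  # P, T info among unmasked tokens
    "unmasked_to_masked",    # P, T info from unmasked to masked tokens
    "masked_to_masked",      # P info among masked tokens
    "masked_to_unmasked",    # P info from masked to unmasked tokens
)


def partition_attention_flows(is_masked: List[bool]) -> Dict[str, List[List[bool]]]:
    """Split an n x n attention pattern into four boolean flow masks.

    Assumption (not from the paper): flows[name][i][j] is True when
    receiver token i attends to sender token j and the pair belongs
    to the flow called `name`.
    """
    n = len(is_masked)
    # One independent n x n grid of False values per flow.
    flows = {name: [[False] * n for _ in range(n)] for name in FLOW_NAMES}

    for i, receiver_masked in enumerate(is_masked):
        for j, sender_masked in enumerate(is_masked):
            if not sender_masked and not receiver_masked:
                name = "unmasked_to_unmasked"
            elif not sender_masked and receiver_masked:
                name = "unmasked_to_masked"
            elif sender_masked and receiver_masked:
                name = "masked_to_masked"
            else:
                name = "masked_to_unmasked"
            flows[name][i][j] = True
    return flows
```

For example, calling partition_attention_flows([False, True, False, True]) on a four-token sequence with tokens 1 and 3 masked assigns every (receiver, sender) pair to exactly one flow, so a model could in principle enable or disable each block separately.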