Abstract

Larger language models memorize training data faster across all settings. … They memorize before overfitting and forget less throughout the training process. … Nouns and numbers are memorized first.

Introduction

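Memorization has long been studied in connection with generalization and was viewed negatively, for example as a symptom of overfitting. That view shifted once the fact that large models remember their training data came to be seen as something that improves generalization instead. More recently, work has tried to understand the dynamics of memorization in language models during training.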

Contribution

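Characterize how memorization dynamics during training change with model size.
Present the dynamics called the forgetting curve, and empirically verify that it has a lower bound and that this lower bound changes with model scale.
In particular, show that nouns and numbers are memorized faster than other parts of speech (POS).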
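
The per-POS comparison in the last item can be illustrated with a minimal sketch. It assumes memorization is scored as exact next-token match, which is an assumption made here because these notes do not state the metric. The function name memorization_by_pos and the toy tokens and tags are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def memorization_by_pos(predicted, target, pos_tags):
    """Return, per POS tag, the fraction of target tokens predicted exactly.

    Assumption: "memorized" means the predicted token equals the target token.
    """
    if not (len(predicted) == len(target) == len(pos_tags)):
        raise ValueError("all inputs must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, gold, tag in zip(predicted, target, pos_tags):
        totals[tag] += 1
        if pred == gold:
            hits[tag] += 1
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Made-up toy data, for illustration only (not results from the paper).
target = ["the", "cat", "ate", "3", "fish"]
predicted = ["a", "cat", "ate", "3", "fish"]
pos_tags = ["DET", "NOUN", "VERB", "NUM", "NOUN"]
scores = memorization_by_pos(predicted, target, pos_tags)
```

Computing such scores at successive training checkpoints would let one compare how quickly each POS category is memorized.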

Background and Related Work

Memorization in Language Models

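Early on, memorization was treated as a problem, termed unintended memorization, in the context of extraction attacks, membership inference attacks, and the like.
More recently, however, it has come to be seen as an important ingredient of generalization in tasks such as QA, with work showing that models can encode a substantial amount of real-world knowledge.
This behavior has been found to depend on factors such as model size, training data duplication, and prompting context length.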

Language Model Training Dynamics