Abstract

Loss scales as a power law with model size, dataset size, and the amount of compute used for training

dependency of overfitting on model/dataset size and the dependency of training speed on model size

Larger models are significantly more sample-efficient, … training very large models on a relatively modest amount of data

Introduction

[TBD] Performance depends strongly on scale, weakly on model shape

Performance depends mainly on three factors: model size, dataset size, and training compute.

In contrast, it is much less affected by model shape, such as width or depth.

Smooth power laws

Performance (test loss) changes smoothly as each of the three factors above is varied.

When a single factor is varied, the other two do not affect the trend, as long as they are not the bottleneck.
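A minimal sketch of what a smooth power law in a single factor looks like, assuming the generic form L(x) = (x_c / x)^alpha, where x stands for any one of the three factors. The function names and the constants `TRUE_X_C` and `TRUE_ALPHA` are hypothetical and chosen only for illustration; they are not values from the paper. A power law is a straight line in log-log space, so the exponent can be recovered with a linear fit.

```python
import numpy as np


def power_law_loss(x, x_c, alpha):
    """Loss as a power law in one scale factor x: L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def fit_power_law(x, loss):
    """Recover (x_c, alpha) with a straight-line fit in log-log space.

    log L = alpha * log(x_c) - alpha * log(x), so the slope is -alpha
    and the intercept is alpha * log(x_c).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return x_c, alpha


# Hypothetical constants for illustration only (not taken from the paper).
TRUE_X_C, TRUE_ALPHA = 1.0e13, 0.08

# Synthetic "measurements": one factor is swept, the others are assumed
# not to be a bottleneck, so the loss follows a clean power law.
sizes = np.logspace(5, 9, num=9)
losses = power_law_loss(sizes, TRUE_X_C, TRUE_ALPHA)

fit_x_c, fit_alpha = fit_power_law(sizes, losses)
print(f"alpha = {fit_alpha:.4f}, x_c = {fit_x_c:.3e}")
```

On real measurements the points would carry noise, and the fit would only hold in the range where the other two factors are not limiting.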
