Mamba: Linear-Time Sequence Modeling with Selective State Spaces

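Prior SSMs addressed long-range dependency (LRD) and complexity issues, yielding model architectures that can be computed in either recurrent or convolutional form.

However, they performed poorly on text-based tasks, and that is the problem Mamba sets out to solve.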

Analyses of Transformers, which perform well on these text-based tasks, point to two key capabilities:

- Copying mechanism
- Induction heads

Mamba is the architecture proposed to bring these capabilities to SSMs.

Specifically, the model is designed so that it can perform selective copying.

The Mamba paper identifies this inability to selectively route information based on input content as a key structural weakness in prior models, and it introduces a solution by making certain SSM parameters input-dependent.
The resulting model is called a selective SSM (or S6).

… allows the model to filter out irrelevant information and retain important information indefinitely during sequence processing

[Figure: comparison of several architectures]

Motivation

Structured SSMs provide a principled way to handle long sequences with linear complexity, yet prior SSMs were constrained by time-invariant parameters, meaning they could not adapt to input content and thus missed important context-dependent behavior.

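For tractability, prior SSMs assume linear time invariance (LTI). Under this assumption, the model cannot acquire the two key capabilities of attention layers, (1) the copying mechanism and (2) induction heads, and the paper analyzes this as the cause of the poor performance on text-based tasks.

→ This is why a selective architecture is introduced.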

Approach

Selection Mechanism (Input-Dependent SSM Dynamics): Rather than having fixed state evolution, Mamba’s SSM changes its parameters as a function of the input at each time step.
- Making the SSM input-dependent means computing the parameters the SSM needs from the input itself.
- This structure gives the SSM the capability that Transformers have of representing and exploiting relationships within a sentence.
- In particular, because the role of the attention operation, which was the bottleneck, is handled by an SSM-based operation, the complexity advantage is retained.
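
The selection mechanism described above can be sketched as a plain sequential scan. This is a minimal illustration under stated assumptions, not the paper's implementation: it handles a single scalar input channel with a diagonal state matrix, makes the step size Δ and the B and C vectors functions of the current input, and uses the simplified discretization Ā = exp(ΔA), B̄ = ΔB. The function name `selective_scan` and the weights `w_b`, `w_c`, `w_delta`, `b_delta` are hypothetical names chosen for this sketch.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a, w_b, w_c, w_delta, b_delta):
    """Sequential scan of a one-channel selective SSM (illustrative only).

    xs      : list of scalar inputs x_1..x_L
    a       : diagonal of the state matrix A (N negative floats)
    w_b, w_c: hypothetical weights mapping x_t to B_t and C_t (N floats each)
    w_delta, b_delta: hypothetical weight and bias for the step size
    """
    n = len(a)
    h = [0.0] * n  # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        # All three quantities below depend on the current input x.
        delta = softplus(w_delta * x + b_delta)
        b_t = [w * x for w in w_b]
        c_t = [w * x for w in w_c]

        # Discretize with the input-dependent step size.
        a_bar = [math.exp(delta * a_i) for a_i in a]
        b_bar = [delta * b_i for b_i in b_t]

        # State update and readout: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C_t h_t.
        h = [a_bar[i] * h[i] + b_bar[i] * x for i in range(n)]
        ys.append(sum(c_t[i] * h[i] for i in range(n)))
    return ys
```

In this sketch a step size near zero leaves the state almost unchanged and ignores the current input, while a large step size decays the old state and lets the current input dominate. That is one way to read the "filter out irrelevant information and retain important information" behavior: the input itself decides how much of the state to keep.
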
Hardware-Aware Efficient Implementation: Making the SSM time-varying means we can no longer use the fast Fourier transform (FFT)-based convolution trick that prior SSMs used for parallel computation.
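
To see why the convolution trick depends on time invariance, the following sketch compares the two views of an LTI SSM. It assumes a diagonal, already discretized system with fixed parameters `a_bar`, `b_bar`, and `c`; the function names are hypothetical. The convolution is written as a direct sum for clarity, whereas an FFT would compute the same result faster.

```python
def lti_recurrent(xs, a_bar, b_bar, c):
    # Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
    n = len(a_bar)
    h = [0.0] * n
    ys = []
    for x in xs:
        h = [a_bar[i] * h[i] + b_bar[i] * x for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


def lti_kernel(length, a_bar, b_bar, c):
    # Convolutional view: one fixed kernel K[k] = C A_bar^k B_bar,
    # valid only because the parameters do not change over time.
    n = len(a_bar)
    return [
        sum(c[i] * (a_bar[i] ** k) * b_bar[i] for i in range(n))
        for k in range(length)
    ]


def causal_conv(xs, kernel):
    # Direct causal convolution: y_t = sum_{k=0..t} K[k] * x_{t-k}.
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, -0.5, 2.0, 0.25]
a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.3], [1.0, -2.0]

y_rec = lti_recurrent(xs, a_bar, b_bar, c)
y_conv = causal_conv(xs, lti_kernel(len(xs), a_bar, b_bar, c))
print(all(abs(p - q) < 1e-12 for p, q in zip(y_rec, y_conv)))
```

Once the discretized parameters depend on the input at each step, the weight linking input position s to output position t becomes a product of step-specific factors rather than a function of t - s alone. No single kernel exists to precompute, so the model has to be evaluated through the recurrence instead.
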