Training

https://arxiv.org/abs/2408.11796
https://arxiv.org/abs/2410.16215
- many ablations
https://arxiv.org/abs/2410.17215
- offline logits + re-weighting
https://openreview.net/forum?id=IcVSKhVpKu
- Centered Kernel Alignment (CKA)
https://arxiv.org/abs/2502.02671
- online + offline
https://arxiv.org/abs/2509.01649
- adverse effect on in-context learning?
https://arxiv.org/abs/2506.07900
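
For the CKA note above, a minimal sketch of the standard linear CKA similarity between two sets of activations, which may help when reading that paper. This is the textbook formula, not necessarily the exact variant used in the linked work; the names teacher_acts and student_acts are hypothetical.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n, d1) and Y (n, d2) for the same n inputs."""
    # Center every feature column so the Gram matrices are centered.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical activations: 64 inputs, teacher width 32, student width 16.
rng = np.random.default_rng(0)
teacher_acts = rng.standard_normal((64, 32))
student_acts = rng.standard_normal((64, 16))
print(linear_cka(teacher_acts, student_acts))
```

The score lies in [0, 1] and is invariant to orthogonal transforms and isotropic scaling of either input, which is why it can compare layers of different widths.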

Data

https://arxiv.org/abs/2509.20186v1

Architecture

https://arxiv.org/abs/2502.11089
- https://github.com/fla-org/native-sparse-attention

for each query q_t:
  # 1) Compression: coarse global search over block representatives
  Kcmp, Vcmp = compress(K[:t], V[:t])     # block reps

  # 2) Selection: top-n important blocks → top-m tokens within those blocks
  B = top_blocks(q_t, K[:t])              # block-level scoring
  Ksel, Vsel = top_tokens(q_t, K[B], V[B])

  # 3) Sliding window: the w most recent local tokens
  Kwin, Vwin = K[t-w:t], V[t-w:t]

  # 4) Three-branch attention, combined by gating
  y = g_cmp*Attn(q_t, Kcmp,Vcmp) + g_sel*Attn(q_t, Ksel,Vsel) + g_win*Attn(q_t, Kwin,Vwin)
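
The pseudocode above can be made runnable as a rough single-query NumPy sketch. Several parts are assumptions made for illustration and are not taken from the paper or the linked repository: mean pooling stands in for the compressor, blocks are scored against the compressed keys, the gates are passed in as fixed scalars rather than predicted from the input, and every function and parameter name here is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn(q, K, V):
    """Scaled dot-product attention for one query vector q of shape (d,)."""
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def compress(K, V, block):
    """Mean-pool each full block into one representative (assumed compressor)."""
    n = len(K) // block
    if n == 0:
        raise ValueError("need at least one full block of keys")
    Kc = K[: n * block].reshape(n, block, -1).mean(axis=1)
    Vc = V[: n * block].reshape(n, block, -1).mean(axis=1)
    return Kc, Vc

def sparse_attention_step(q, K, V, gates, block=4, top_n=2, top_m=4, window=4):
    """One query step over the causal prefix K, V of shape (t, d)."""
    t = len(K)
    # 1) Compression: block representatives for a coarse global view.
    Kcmp, Vcmp = compress(K, V, block)
    # 2) Selection: top-n blocks, then top-m tokens inside those blocks.
    top_blocks = np.argsort(Kcmp @ q)[-top_n:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top_blocks])
    keep = idx[np.argsort(K[idx] @ q)[-top_m:]]
    Ksel, Vsel = K[keep], V[keep]
    # 3) Sliding window: the most recent `window` tokens.
    Kwin, Vwin = K[max(0, t - window):t], V[max(0, t - window):t]
    # 4) Gated sum of the three attention branches.
    g_cmp, g_sel, g_win = gates
    return (g_cmp * attn(q, Kcmp, Vcmp)
            + g_sel * attn(q, Ksel, Vsel)
            + g_win * attn(q, Kwin, Vwin))

rng = np.random.default_rng(0)
t, d = 16, 8
K = rng.standard_normal((t, d))
V = rng.standard_normal((t, d))
q = rng.standard_normal(d)
y = sparse_attention_step(q, K, V, gates=(0.3, 0.4, 0.3))
print(y.shape)
```

With these defaults each query attends to at most t/block + top_m + window keys instead of all t, which is the point of the three-branch design; a real implementation would batch this over queries and heads and fuse it into a kernel.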