1. DeepSeek-V3.2 Sparse Attention

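On top of the MLA architecture, a separately trained Lightning Indexer selects the top-k tokens, and attention is computed over only those tokens.
(Opinion) The same approach could be applied to GQA, but since GQA caches full K/V rather than a compressed vector, the memory load may be somewhat higher.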
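
The selection step above can be sketched for a single query token as follows. This is a minimal pure-Python illustration, not the DeepSeek kernel: the function names (`indexer_scores`, `topk_sparse_attention`) are hypothetical, the scoring rule (weighted sum over indexer heads of ReLU(q · k)) is my reading of the Lightning Indexer, and MLA's latent compression is left out entirely.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def indexer_scores(index_queries, head_weights, index_keys):
    """Assumed indexer rule: score(s) = sum_j w_j * ReLU(q_j . k_s)."""
    return [
        sum(w * max(0.0, dot(q, k)) for q, w in zip(index_queries, head_weights))
        for k in index_keys
    ]

def topk_sparse_attention(query, keys, values, scores, top_k):
    """Softmax attention restricted to the top_k tokens by indexer score."""
    keep = sorted(range(len(keys)), key=lambda s: scores[s], reverse=True)[:top_k]
    keep.sort()  # back to positional order
    scale = 1.0 / math.sqrt(len(query))
    probs = softmax([dot(query, keys[s]) * scale for s in keep])
    dim = len(values[0])
    out = [sum(p * values[s][d] for p, s in zip(probs, keep)) for d in range(dim)]
    return out, keep

# Toy data: 4 past tokens, one indexer head with a 1-d index key per token.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
index_keys = [[1.0], [0.0], [2.0], [-3.0]]

scores = indexer_scores([[1.0]], [1.0], index_keys)
out, keep = topk_sparse_attention([1.0, 0.0], keys, values, scores, top_k=2)
print(keep, out)
```

The indexer works on its own small query/key vectors, so ranking tokens is cheap compared with running full attention over every token; only the kept indices reach the main attention.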

Framework Support

training
- https://github.com/deepseek-ai/FlashMLA → need to check whether the indexer supports only fp8
inference
- https://lmsys.org/blog/2025-09-29-deepseek-V32/
- https://developers.redhat.com/articles/2025/10/03/deepseek-v32-exp-vllm-day-0-sparse-attention-long-context-inference

2. Native Sparse Attention (DeepSeek)

Runs three kinds of attention in total and combines their outputs with gate scores
- compressed attn: attention over compressed blocks
- selected attn: uses the compressed blocks to compute importance scores → top-k selection
- sliding attn: sliding window
A KV compression MLP and a gating module are added
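
The three branches and the gated sum can be sketched for a single query at the last position as follows. This is a simplified illustration under loud assumptions: mean pooling stands in for the learned KV compression MLP, the gates are fixed numbers instead of outputs of the gating module, blocks do not overlap, and all names (`nsa_single_query`, `mean_pool`, `attend`) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Plain softmax attention for one query; returns (output, probabilities)."""
    scale = 1.0 / math.sqrt(len(query))
    probs = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return out, probs

def mean_pool(vectors):
    # Stand-in for the learned compression MLP (assumption).
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nsa_single_query(query, keys, values, block_size, top_n, window, gates):
    blocks = [range(i, min(i + block_size, len(keys)))
              for i in range(0, len(keys), block_size)]
    # 1) compressed attn: attend over one pooled key/value per block.
    ck = [mean_pool([keys[i] for i in b]) for b in blocks]
    cv = [mean_pool([values[i] for i in b]) for b in blocks]
    out_cmp, block_probs = attend(query, ck, cv)
    # 2) selected attn: reuse the block probabilities as importance scores,
    #    keep the top_n blocks, and attend over their original tokens.
    top = sorted(range(len(blocks)), key=lambda b: block_probs[b], reverse=True)[:top_n]
    sel = sorted(i for b in top for i in blocks[b])
    out_sel, _ = attend(query, [keys[i] for i in sel], [values[i] for i in sel])
    # 3) sliding attn: attend over the most recent `window` tokens.
    out_win, _ = attend(query, keys[-window:], values[-window:])
    # Gated sum of the three branch outputs (fixed gates here, by assumption).
    g_cmp, g_sel, g_win = gates
    out = [g_cmp * c + g_sel * s + g_win * w
           for c, s, w in zip(out_cmp, out_sel, out_win)]
    return out, sel

keys = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[float(i), 1.0] for i in range(6)]
out, sel = nsa_single_query([1.0, 0.0], keys, values,
                            block_size=2, top_n=1, window=2, gates=(0.2, 0.5, 0.3))
print(sel, out)
```

Reusing the compressed-branch probabilities as block importance scores is what lets the selected branch avoid a separate scoring pass over all tokens.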

Framework Support

training
- ‣ → sliding attn uses native FlashAttention (fa); it does not seem to be optimized for fa3
- https://github.com/XunhaoLai/native-sparse-attention-triton → used for comparison in InfLLM-v2
- https://github.com/fla-org/native-sparse-attention