thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Language: Python
#attention #inference_acceleration #llm #quantization
Stars: 145 Issues: 6 Forks: 3
https://github.com/thu-ml/SageAttention
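A minimal usage sketch, assuming the pip package name sageattention and the sageattn signature shown in the project README (intended as a drop-in for PyTorch's scaled_dot_product_attention):

```python
import torch
from sageattention import sageattn  # assumes `pip install sageattention`

# Toy tensors in (batch, heads, seq_len, head_dim) layout.
q = torch.randn(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 8, 1024, 64, dtype=torch.float16, device="cuda")

# tensor_layout="HND" means (batch, heads, seq, dim); the INT8
# quantization of Q/K happens inside the kernel, so no model
# changes are needed beyond swapping the attention call.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```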
mit-han-lab/nunchaku
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Language: Cuda
#diffusion_models #flux #genai #lora #mlsys #quantization
Stars: 249 Issues: 10 Forks: 13
https://github.com/mit-han-lab/nunchaku
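An illustrative sketch of the SVDQuant decomposition described in the title, not nunchaku's actual API: a 16-bit low-rank branch absorbs the outlier-heavy directions of a weight matrix, and the residual is quantized to 4 bits. The rank and the uniform quantizer below are assumptions for illustration.

```python
import torch

def svdquant_decompose(W: torch.Tensor, rank: int = 32, bits: int = 4):
    # Truncated SVD captures the dominant (outlier-heavy) directions.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] * S[:rank]   # (out_features, rank), kept in high precision
    R = Vh[:rank, :]             # (rank, in_features)
    residual = W - L @ R

    # Uniform symmetric quantization of the now outlier-free residual.
    qmax = 2 ** (bits - 1) - 1
    scale = residual.abs().max() / qmax
    q = torch.clamp(torch.round(residual / scale), -qmax - 1, qmax)
    return L, R, q, scale

W = torch.randn(512, 512)
L, R, q, scale = svdquant_decompose(W)
W_hat = L @ R + q * scale        # reconstruction used at inference time
print((W - W_hat).abs().mean())  # small residual quantization error
```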
dipampaul17/KVSplit
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
Language: Python
#apple_silicon #generative_ai #kv_cache #llama_cpp #llm #m1 #m2 #m3 #memory_optimization #metal #optimization #quantization
Stars: 222 Issues: 1 Forks: 5
https://github.com/dipampaul17/KVSplit
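A back-of-the-envelope sketch of where the memory savings come from with 8-bit keys and 4-bit values; the model shape is an illustrative Llama-7B-like assumption, not KVSplit's code:

```python
# KV cache size scales linearly with the bit-width of keys plus values.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, k_bits, v_bits):
    per_token = n_layers * n_kv_heads * head_dim * (k_bits + v_bits) / 8
    return seq_len * per_token

fp16 = kv_cache_bytes(32, 32, 128, 8192, 16, 16)
split = kv_cache_bytes(32, 32, 128, 8192, 8, 4)  # 8-bit keys, 4-bit values

print(f"FP16 KV cache:  {fp16 / 2**20:.0f} MiB")
print(f"K8/V4 KV cache: {split / 2**20:.0f} MiB "
      f"({1 - split / fp16:.0%} smaller)")
# ~62% smaller before per-block scale overhead, consistent with the
# ~59% reduction the project reports.
```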