Hacker News
24.1K subscribers
118K links
Top stories from https://news.ycombinator.com (100+ points)
Contribute to the development here: https://github.com/phil-r/hackernewsbot
Also check https://t.me/designer_news

Contact: @philr
Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon (🔥 Score: 150+ in 2 hours)

Link: https://readhacker.news/s/6uBAK
Comments: https://readhacker.news/c/6uBAK

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.
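For intuition: keys enter attention before the softmax, so quantization error in K perturbs the logits, where small differences get amplified exponentially; values enter after the softmax and are only mixed linearly by the attention weights, so their error tends to average out. A minimal NumPy sketch of where each tensor enters (illustrative only, not code from the KVSplit patch):

```python
import numpy as np

def attention(Q, K, V):
    # K enters before the softmax: quantization error here perturbs
    # every logit, and softmax amplifies logit differences exponentially.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # V enters after the softmax: its error is only mixed linearly by
    # the weights, so per-token noise tends to average out.
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
```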
I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:
- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- Both configurations use the same total bits, yet K8V4 causes ~7× less perplexity degradation (0.86% vs. 6.06%)
This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
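Back-of-the-envelope math for why the savings compound with context length (shapes approximate TinyLlama-1.1B and are assumed here, not taken from the post; the measured ~59% sits a bit below the naive 62.5% presumably because quantized formats also store per-block scales):

```python
def kv_cache_bytes(seq_len, n_layers=22, n_kv_heads=4, head_dim=64,
                   key_bits=16, val_bits=16):
    # One K and one V vector per token, per layer, per KV head.
    # Defaults approximate TinyLlama-1.1B (assumption, not from the post).
    per_token = n_layers * n_kv_heads * head_dim * (key_bits + val_bits) / 8
    return seq_len * per_token

fp16 = kv_cache_bytes(8192)                          # 16-bit K and V
k8v4 = kv_cache_bytes(8192, key_bits=8, val_bits=4)  # 8-bit K, 4-bit V
print(f"FP16: {fp16 / 2**20:.0f} MiB, K8V4: {k8v4 / 2**20:.0f} MiB")
print(f"naive reduction: {1 - k8v4 / fp16:.1%}")     # 62.5%, before scale overhead
```

Because every term is linear in seq_len, the absolute bytes saved grow with context, which is where the 2-3× longer-context headroom comes from.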
Implementation was straightforward:
1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors (sketched after this list)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
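A sketch of what step 2 boils down to, in Python for brevity (the real patch lives in llama.cpp's C/Metal code; these helper names are hypothetical stand-ins):

```python
import numpy as np

def quantize_blockwise(x, bits, block=32):
    # Stand-in for llama.cpp's existing block-wise quantizers (e.g. Q8_0/Q4_0):
    # one symmetric scale per block of 32 values.
    shape = x.shape
    x = x.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).reshape(shape)  # dequantized, for simulation only

def store_kv(k, v, key_bits=8, val_bits=4):
    # The whole trick in miniature: the same quantization path, invoked
    # with a different bit-width for K than for V (K8V4 by default here).
    return quantize_blockwise(k, key_bits), quantize_blockwise(v, val_bits)
```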
Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.
GitHub: https://github.com/dipampaul17/KVSplit
ClojureScript 1.12.42 (Score: 151+ in 11 hours)

Link: https://readhacker.news/s/6uBDi
Comments: https://readhacker.news/c/6uBDi
Wow@Home – Network of Amateur Radio Telescopes (Score: 150+ in 12 hours)

Link: https://readhacker.news/s/6uCht
Comments: https://readhacker.news/c/6uCht
Palette lighting tricks on the Nintendo 64 (Score: 151+ in 5 hours)

Link: https://readhacker.news/s/6uDgM
Comments: https://readhacker.news/c/6uDgM