Offshore

orms and residuals: stabilize learning and prediction, making deep networks possible - positional encodings (like RoPE): tell the model where each token sits in the sequence - so "cat" and "catastrophe" aren't confused by position - by stacking these layers…

few-shot for niche tasks

- common pitfalls
- OOM? out of memory. Model or context too big, quantize or shrink context
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source

- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing-just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming

- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)

- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)

- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking next token
- RAG: retrieval-augmented generation, add real info

- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)

- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones

- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, code)
- quantization types: FP16 (full), INT8 (quantized), INT4/NF4 (more quantized), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts

- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware

- avoid legacy .bin (pickle risk), use safetensors for safety

- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work

- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware

- master these, and you can run (and reason about) almost any modern model locally
tweet

1 view08:36