Ahmad
RT @TheAhmadOsman:
- local llms 101
- running a model = inference (using model weights)
- inference = predicting the next token based on your input plus all tokens generated so far
- together, these make up the "sequence"
- tokens ≠ words - they're…
- norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence
- so "cat" and "catastrophe" aren't confused by position

- by stacking these layers (sometimes dozens or even hundreds)
- transformers build a complex understanding of your prompt, context, and conversation history

- transformer recap:
- decoder-only: model only predicts what comes next, each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
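- roughly what one of those stacked decoder blocks looks like, as a PyTorch sketch (schematic only: real models swap in RMSNorm, put RoPE inside attention, and use GQA/MQA, none of which are shown here; the 4096/32/11008 sizes are Llama-7B-style placeholders)

```python
# schematic pre-norm decoder block: norm -> self-attention -> residual, norm -> MLP -> residual
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=4096, n_heads=32, d_ff=11008):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)      # real models usually use RMSNorm here
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                   # 2-layer feed-forward, GELU activation
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, causal_mask):
        # causal_mask: upper-triangular mask so each token only attends to earlier tokens
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                            # residual connection around attention
        x = x + self.mlp(self.mlp_norm(x))          # residual connection around the MLP
        return x
```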

- VRAM: memory, the bottleneck
- VRAM must fit (worked example after this list):
1. weights (main model, whether quantized or not)
2. KV cache (per token, per layer, per head)
- weights:
- FP16: ~2 bytes/param → 7B = ~14GB
- 8-bit: ~1 byte/param → 7B = ~7GB
- 4-bit: ~0.5 byte/param → 7B = ~3.5GB
- add 10–30% for runtime overheads
- KV cache:
- rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB)
- some runtimes support KV cache quantization (8/4-bit) = big savings
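- a back-of-the-envelope version of that memory math (plain Python; the 7B / 32-layer / 4K-token numbers are just the Llama-like example above, and "GB" here means 1e9 bytes):

```python
# rough VRAM budget: weights + KV cache + runtime overhead

def weights_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # K and V per token per layer = 2 * n_kv_heads * head_dim elements
    # GQA/MQA models have fewer KV heads, which shrinks this directly
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * n_tokens / 1e9

w = weights_gb(7, 0.5)               # 7B params at 4-bit (~0.5 byte/param) -> ~3.5 GB
kv = kv_cache_gb(32, 32, 128, 4096)  # Llama-7B-like, fp16 cache: ~0.5 MB/token -> ~2 GB at 4K
budget = (w + kv) * 1.2              # +20% overhead, middle of the 10-30% range above
print(f"weights ~{w:.1f} GB, kv cache ~{kv:.1f} GB, plan for ~{budget:.1f} GB")
```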

- throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
- offload to CPU? expect MASSIVE slowdown

- GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
- CPU spill = sadness (check device_map and memory fit)

- quantization: reduce precision for memory wins (sometimes a tiny quality hit)
- FP32 = full precision; FP16/BF16 = half precision
- INT8/INT4/NF4 = quantized
- 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, coding)
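- a minimal 4-bit NF4 load sketch with Hugging Face transformers + bitsandbytes (needs a CUDA GPU; the model id below is a placeholder, use whatever fits your VRAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 "sweet spot" quant mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 even though weights are 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spills to CPU if VRAM is short -> big slowdown
)
```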

- KV cache quantization: even more memory saved for long contexts (check runtime support)

- formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU/portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
- protip: avoid legacy .bin (pickle risk), use safetensors for safety

- everything is a tradeoff
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = speed/memory, but maybe less accurate
- local = more control/knobs, more work

- what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating KV cache each step
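- a rough sketch of that per-token loop with an explicit KV cache, using the transformers low-level forward API (greedy decoding for simplicity; model and tokenizer are whatever causal LM you loaded above):

```python
import torch

@torch.no_grad()
def generate_greedy(model, tokenizer, prompt, max_new_tokens=64):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None                      # the KV cache, empty before the first pass
    generated = input_ids
    next_input = input_ids                      # first pass runs the whole prompt (prefill)
    for _ in range(max_new_tokens):
        out = model(input_ids=next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values   # cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: top-1
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token                 # later passes only feed the newest token
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```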

- decoding = how next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from top k
- top-p: pick from smallest set with ≥p prob
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic = greedy decoding (no sampling), or sampling with a fixed seed
- tune for your use-case: chat, summarization, code
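- a toy temperature + top-p sampler over one logits vector, to make those knobs concrete (NumPy; this is the general idea, not any particular runtime's implementation):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # sharpen/soften
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax
    order = np.argsort(-probs)                          # most likely tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1  # smallest set with >= p mass
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))  # renormalize the nucleus

# temperature -> 0 approaches greedy (always top-1); higher temperature = more random
# top_p = 1.0 disables nucleus filtering; top-k would simply be keep = order[:k]
```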

- serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for local API endpoint
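- most of those servers (vLLM, llama.cpp server, TabbyAPI) expose an OpenAI-compatible API, so one client works against all of them; a minimal sketch, with the port and model name as placeholders for whatever you're serving locally:

```python
from openai import OpenAI

# point the standard OpenAI client at the local server instead of the cloud
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-model",   # placeholder; must match the name the local server registered
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the KV cache in one sentence."},
    ],
    temperature=0.7,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```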

- local ≠ offline; run it, serve it, or build apps on top

- fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or few-shot for niche tasks
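- a minimal LoRA attachment sketch with the peft library (the rank and target modules below are typical Llama-style starting points, not a tuned recipe; model is the base model you already loaded):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # adapter rank: small = cheap, less capacity
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections on Llama-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapters train; the base stays frozen
```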

- common pitfalls
- OOM? out of memory. Model or context too big, quantize or shrink context
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source

- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing, just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming

- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)

- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves VRAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)
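- a crude way to get that tokens/sec number with transformers: warm up once, then time a single generate() call (model and tokenizer are whatever you loaded for the checklist):

```python
import time

def tokens_per_sec(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)    # warmup: the first pass is the slowest
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

# print(f"{tokens_per_sec(model, tokenizer, 'Hello, world'):.1f} tok/s")
```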

- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking next token
- RAG: retrieval-augmented generation, add real info

- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)

- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones
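- what that looks like with transformers' apply_chat_template: the template ships with the tokenizer, so you never hand-write the system/user/assistant markup (again assuming model and tokenizer from a chat-tuned checkpoint):

```python
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a KV cache?"},
]

# tokenize with the model's own chat template and append the assistant-turn marker
prompt_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(prompt_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][prompt_ids.shape[-1]:], skip_special_tokens=True))
```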

- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, code)
- quantization types: FP16 (unquantized), INT8 (quantized), INT4/NF4 (more aggressively quantized), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts

- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware

- avoid legacy .bin (pickle risk), use safetensors for safety

- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work

- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware

- master these, and you can run (and reason about) almost any modern model locally
Ahmad
RT @TheAhmadOsman: can’t write code because Cursor and Codex are both down thanks to the aws-us-east-1 outage?

tired of Anthropic’s weekly limits and nerfed models?

with one command and a few GPUs,
you can route Claude Code to your own local LLM with ZERO downtime

Buy a GPU https://t.co/aj8r201V83

i built a simple tool that makes

Claude Code work with any local LLM

full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works

Buy a GPU https://t.co/7nYsId4Uyu
- Ahmad
Ahmad
RT @TheAhmadOsman: the Buy a GPU website & guide is launching this week

so, what should you expect? https://t.co/e36YLjAdoo
Ahmad
RT @TheAhmadOsman: > today this guy axes FAIR at Meta
> so this is a quick recap of his origin story
> and why he should not be the one
> making that decision

> Alexandr Wang, born January 1997
> age 19, drop out of MIT
> co-found Scale AI
> "what if we label data, but mid?"
> convince every LLM company that this is fine

> 2016–2023
> flood the market with barely-labeled goat photos and out-of-context Reddit takes
> call it “foundational data”
> raise billions
> valuation hits $7.3B
> everyone claps

> 2025
> sell Scale AI to Meta for $14B
> not a typo.
> fourteen. billion. dollars.
> join Meta as Chief AI Officer
> rename division to Meta Superintelligence Labs
> start saying things like “AGI by 2027” in interviews

> meanwhile, researchers:
> "the data from Scale is trash"
> models hallucinate goat facts and mislabel wheelchairs as motorcycles
> AI alignment folks are malding
> i am Alexandr. unbothered. moisturized. thriving.

> ranked #1 in Times Top Grifters of All Time
> beat out SBF, Elizabeth Holmes, and your favorite VC
> literally built an empire out of copy-pasted Amazon Mechanical Turk tasks

> mfw I labeled 4chan posts for pennies and turned it into a 14B exit
> mfw I am now leading Meta's quest for godlike AI
> mfw data quality was never part of the business model
> never bet against the grind
Ahmad
RT @TheAhmadOsman: all the snarky replies i get about how local models “don’t stand a chance”

make one thing clear

people are still judging based on LLaMA 2

if they touched Qwen 3 32B or 30B‑A3B for even a second,

they’d realize they’re stuck in 2023

open models have gotten SO GOOD
Ahmad
RT @TheAhmadOsman: i love having my own private UNRESTRICTED COMPUTE

Buy a GPU https://t.co/6r8c46owH7
Ahmad
RT @TheAhmadOsman: Feynman was right.

In a world of rented APIs and black-box models,
one truth remains:

> “What I cannot create, I do not understand.”

This fall, in this timeline:

> Buy a GPU
> Learn LLMs

Understand the machine.
Create with it. https://t.co/9C438y6n7a
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN

the more i push my single Tenstorrent QuietBox Blackhole,

the more i see just how much headroom this thing has

counting down until my 4x TT‑QuietBox Blackhole cluster arrives

this cluster's going to be an absolute beast https://t.co/lN9VsITgDs