Ahmad
RT @TheAhmadOsman: whatʼs stopping you from becoming a chad like Gilfoyle and building your own servers?

the PATH to becoming a GREAT engineer starts this way https://t.co/kyIAI083w6
Ahmad
RT @TheAhmadOsman: Comparing & Contrasting Recent LLM Architectures

> DeepSeek-V3/R1
> OLMo 2
> Gemma 3
> Mistral Small 3.1
> Llama 4
> Qwen3 (dense+MoE)
> SmolLM3
> Kimi K2
> GPT-OSS

Are 2025 LLMs really that different from each other?

MoE, MLA, GQA, sliding window, normalization games & more. https://t.co/JWg9cde34M
Ahmad
RT @TheAhmadOsman: > youʼre OpenAI
> hire a small army of ex-Meta ad and monetization people
> a Slack channel just for ex-Facebook staff
> bring in the full “targeted ads” playbook

> launch a browser
> users install it, and OpenAI collects personalized, granular data at scale
> it’s a browser-shaped surveillance device
> it’s a mapping machine of your workflows
> itʼs a reverse-engineering tool for the internetʼs data pipelines, deployed at scale via their users

> launch Sora 2
> a TikTok‑style social network
> infinite AI-generated video feed
> you create or remix clips, upload your face, become the cameo star
> every scroll, like, remix is another data point, another ad signal
> their model learns exactly what hooks you and dials up the dopamine
> you’re not just watching, you’re training their algorithm for better ad targeting
> viral videos driven by your input + their algorithm = your attention refined into $$$
> “your feedback helps us improve the experience” (yeah, for advertisers)

> launch “Pulse”
> reads your chats while you sleep
> remembers you wanna visit Bora Bora
> knows your kid is 6 months old and
> “thinks” of your baby's milestones
> suggests developmental toys next
> “it's for your convenience”
> actually laying the groundwork for targeted ads using memory

> internal memo: some people already think ChatGPT shows ads
> OpenAI staff: “might as well then”

> congrats, you’re back in the Facebook era
> except this time, you’re training the algo yourself

> Buy a GPU
> run your LLMs locally
> reject adware LLMs before it’s too late
Ahmad
RT @TheAhmadOsman: last week, Karpathy dropped the ULTIMATE guide to speed-running your way into LLMs

in this project, you’ll build all the essentials, all under 8k lines of code

> train the tokenizer — new rust implementation
> pretrain a transformer LLM on fineweb
> evaluate CORE score across a bunch of benchmarks

> midtrain — user-assistant convos from smoltalk,
> multiple choice Qs, tool use

> sft, then eval the chat model on:
> world knowledge MCQ (arc-e/c, mmlu)
> math (gsm8k)
> code (humaneval)

> rl the model (optionally) on gsm8k with “grpo”

> efficient inference:
> kv cache, fast prefill/decode
> tool use (python interpreter, sandboxed)
> access via cli or chatgpt-like webui

> write a single markdown report card,
> summarizing + gamifying the whole pipeline

the model you’ll build (see the small pytorch sketch after this list):

> rotary only (no positional embeddings)
> qk norm
> untied embedding / unembedding
> norm after token embedding
> relu² mlp
> no biases in linears
> rmsnorm (no learnable params)
> mqa (multi-query attention)
> logit softcap
> optimizer: muon + adamw
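
> (not in the original tweet) a rough pytorch sketch of two of those pieces, rmsnorm with no learnable params and the relu² mlp with bias-free linears; class and variable names are mine:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rms_norm(x, eps=1e-6):
        # rmsnorm with no learnable scale: just divide by the root-mean-square
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    class ReluSquaredMLP(nn.Module):
        # relu² mlp, no biases in the linears
        def __init__(self, dim, hidden):
            super().__init__()
            self.up = nn.Linear(dim, hidden, bias=False)
            self.down = nn.Linear(hidden, dim, bias=False)
        def forward(self, x):
            return self.down(F.relu(self.up(x)) ** 2)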

if i’d had this a couple years ago i’d have dodged half the pain and skipped double the rabbit holes

happy hacking
Ahmad
RT @TheAhmadOsman: - in 2025, your focus SHOULD NOT be CUDA
- the real bottlenecks are:
- data, inference, evals, dataloaders, infra in general

- want to get good?
- mess with PyTorch & JAX
- study inference infra like vLLM & SGLang
- build better eval pipelines
- learn how models run end-to-end
Ahmad
RT @TheAhmadOsman: here is my twitter growth strategy: https://t.co/HhA6C07zPZ

here is my twitter growth strategy: https://t.co/luJa9ihS2n
- Min💙
Ahmad
RT @TheAhmadOsman: ollama alternatives

> lmstudio
> llama.cpp
> exllamav2/v3
> vllm
> sglang

among many others

like literally anything is better than ollama lmao
Ahmad
RT @TheAhmadOsman: all the snarky replies i get about how local models “don’t stand a chance”

make one thing clear

people are still judging based on LLaMA 2

if they touched Qwen 3 32B or 30B‑A3B for even a second,

they’d realize they’re stuck in 2023

open models have gotten SO GOOD
Ahmad
RT @TheAhmadOsman: - local llms 101

- running a model = inference (using model weights)
- inference = predicting the next token based on your input plus all tokens generated so far
- together, these make up the "sequence"

- tokens ≠ words
- they're the chunks representing the text a model sees
- they are represented by integers (token IDs) in the model
- "tokenizer" = the algorithm that splits text into tokens
- common types: BPE (byte pair encoding), SentencePiece
- token examples:
- "hello" = 1 token or maybe 2 or 3 tokens
- "internationalization" = 5–8 tokens
- context window = max tokens model can "see" at once (2K, 8K, 32K+)
- longer context = more VRAM for KV cache, slower decode
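
- (sketch, not from the original thread) the token stuff in code, assuming hugging face transformers is installed and using gpt2 as a stand-in model:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # a BPE tokenizer
    ids = tok.encode("internationalization")           # text -> token IDs (integers)
    print(ids)                                         # several IDs, not one per word
    print(tok.convert_ids_to_tokens(ids))              # the chunks the model actually sees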

- during inference, the model predicts next token
- by running lots of math on its "weights"
- model weights = billions of learned parameters (the knowledge and patterns from training)

- model parameters: usually billions of numbers (called weights) that the model learns during training
- these weights encode all the model's "knowledge" (patterns, language, facts, reasoning)
- think of them as the knobs and dials inside the model, specifically computed to recognize what could come next
- when you run inference, the model uses these parameters to compute its predictions, one token at a time

- every prediction is just: model weights + current sequence → probabilities for what comes next
- pick a token, append it, repeat, each new token becomes part of the sequence for the next prediction
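
- (sketch, not from the original thread) that whole loop in a few lines of pytorch + transformers, with gpt2 as a stand-in and plain greedy picking:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    seq = tok.encode("the capital of france is", return_tensors="pt")
    for _ in range(10):                                    # one token per step
        with torch.no_grad():
            logits = model(seq).logits                     # weights + sequence -> scores over the vocab
        next_id = logits[0, -1].argmax()                   # greedy: take the most likely token
        seq = torch.cat([seq, next_id.view(1, 1)], dim=1)  # append it, repeat
    print(tok.decode(seq[0]))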

- models are more than weight files
- neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below)
- weights: billions of learned numbers (parameters, not "tokens"; they're learned by training on tokens)
- tokenizer: how text gets chunked into tokens (BPE/SentencePiece)
- config: metadata, shapes, special tokens, license, intended use, etc
- sometimes: a chat template is required for chat/instruct models, or else you get gibberish (tiny sketch below)
- you give a model a prompt (your text, converted into tokens)
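
- (sketch, not from the original thread) what the chat template step looks like with transformers; the model id is just a placeholder instruct model:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    messages = [{"role": "user", "content": "hello"}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)   # role markers + special tokens the model was trained on; skip this and you get gibberish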

- models differ in parameter size:
- 7B means ~7 billion learned numbers
- common sizes: 7B, 13B, 70B
- bigger = stronger, but eats more VRAM/memory & compute
- the model computes a probability for every possible next token (softmax over vocab)
- picks one: either the highest (greedy) or
- samples from the probability distribution (temperature, top-p, etc)
- then appends that token to the sequence, then repeats the whole process
- this is generation:
- generate: predict, sample, append
- over and over, one token at a time
- rinse and repeat
- each new token depends on everything before it; the model re-reads the sequence every step

- generation is always stepwise: token by token, not all at once
- mathematically: model is a learned function, f_θ(seq) → p(next_token)
- all the "magic" is just repeating "what's likely next?" until you stop

- all conversation "tokens" live in the KV cache, or the "session memory"

- so what's actually inside the model?
- everything above (tokens, weights, config) is just setup for the real engine underneath

- the core of almost every modern llm is a transformer architecture
- this is the skeleton that moves all those numbers around
- it's what turns token sequences and weights into predictions
- designed for sequence data (like language),
- transformers can "look back" at previous tokens and
- decide which ones matter for the next prediction

- transformers work in layers, passing your sequence through the same recipe over and over
- each layer refines the representation, using attention to focus on the important parts of your input and context
- every time you generate a new token, it goes through this stack of layers, every single step

- inside each transformer layer:
- self-attention: figures out which previous tokens are important to the current prediction
- MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness
- layer norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence
- so "cat" and "catastrophe" aren't confused by position

- by stacking these layers (sometimes dozens or even hundreds)
- transformers build a complex understanding of your prompt, context, and conversation history

- transformer recap:
- decoder-only: model only predicts what comes next, each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
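
- (sketch, not from the original thread) the residual wiring of one decoder layer in pytorch; attn and mlp here stand in for real attention and mlp modules:

    import torch.nn as nn

    class DecoderBlock(nn.Module):
        # pre-norm transformer layer: attention + mlp, each wrapped in a residual connection
        def __init__(self, dim, attn, mlp):
            super().__init__()
            self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.attn, self.mlp = attn, mlp
        def forward(self, x):
            x = x + self.attn(self.n1(x))   # self-attention: look back over previous tokens
            x = x + self.mlp(self.n2(x))    # feed-forward mlp: per-token processing
            return x                        # stack N of these, then project to vocab logits

    # shape sanity check only: DecoderBlock(512, attn=nn.Identity(), mlp=nn.Identity())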

- VRAM: memory, the bottleneck
- VRAM must fit:
1. weights (main model, whether quantized or not)
2. KV cache (per token, per layer, per head)
- weights:
- FP16: ~2 bytes/param → 7B = ~14GB
- 8-bit: ~1 byte/param → 7B = ~7GB
- 4-bit: ~0.5 byte/param → 7B = ~3.5GB
- add 10–30% for runtime overheads
- KV cache:
- rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB)
- some runtimes support KV cache quantization (8/4-bit) = big savings
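
- (sketch, not from the original thread) that napkin math as a tiny python helper:

    def estimate_vram_gb(params_b=7, bytes_per_param=2, ctx_tokens=4096,
                         kv_mb_per_token=0.5, overhead=0.2):
        # back-of-the-envelope only; real usage depends on runtime, batch size, attention impl
        weights_gb = params_b * bytes_per_param          # e.g. 7B at fp16 -> ~14GB
        kv_gb = ctx_tokens * kv_mb_per_token / 1024      # e.g. 4K tokens at 0.5MB -> ~2GB
        return (weights_gb + kv_gb) * (1 + overhead)

    print(estimate_vram_gb())                     # fp16 7B, 4K ctx -> ~19GB
    print(estimate_vram_gb(bytes_per_param=0.5))  # 4-bit 7B, 4K ctx -> ~7GB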

- throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
- offload to CPU? expect MASSIVE slowdown

- GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
- CPU spill = sadness (check device_map and memory fit)

- quantization: reduce precision for memory wins (sometimes a tiny quality hit)
- FP32/FP16/BF16 = full or half precision (unquantized)
- INT8/INT4/NF4 = quantized
- 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
- math-heavy or finicky tasks degrade first (logic, coding, multi-step reasoning)
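
- (sketch, not from the original thread) 4-bit nf4 loading via transformers + bitsandbytes, assuming both are installed; the model id is a placeholder:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # the 4-bit sweet spot mentioned above
        bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",             # placeholder model id
        quantization_config=bnb,
        device_map="auto",
    )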

- KV cache quantization: even more memory saved for long contexts (check runtime support)

- formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU/portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
- protip: avoid legacy .bin (pickle risk), use safetensors for safety
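
- (sketch, not from the original thread) running a gguf file through the llama-cpp-python bindings; the path and prompt are placeholders, check the project docs for current flags:

    from llama_cpp import Llama

    llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=-1)  # -1 = offload all layers to GPU
    out = llm("why run llms locally?", max_tokens=64)
    print(out["choices"][0]["text"])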

- everything is a tradeoff
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = speed/memory, but maybe less accurate
- local = more control/knobs, more work

- what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating KV cache each step

- decoding = how next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from top k
- top-p: pick from smallest set with ≥p prob
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic output = greedy decoding (or fix a seed if you sample)
- tune for your use-case: chat, summarization, code
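
- (sketch, not from the original thread) temperature + top-p in a few lines of pytorch, for a single 1-D logits vector:

    import torch

    def sample_next(logits, temperature=0.8, top_p=0.9):
        # greedy would just be logits.argmax(); this reshapes the distribution and samples instead
        probs = torch.softmax(logits / temperature, dim=-1)   # higher temperature = flatter = more random
        sorted_p, sorted_ids = probs.sort(descending=True)
        keep = sorted_p.cumsum(-1) - sorted_p < top_p         # smallest set of tokens covering >= p probability
        sorted_p = sorted_p * keep
        return sorted_ids[torch.multinomial(sorted_p / sorted_p.sum(), 1)]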

- serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for local API endpoint
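
- (sketch, not from the original thread) the fastapi option, assuming fastapi + uvicorn are installed; generate() is a hypothetical helper wrapping whichever runtime you picked above:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Req(BaseModel):
        prompt: str
        max_tokens: int = 128

    @app.post("/generate")
    def generate_endpoint(req: Req):
        text = generate(req.prompt, max_tokens=req.max_tokens)  # hypothetical wrapper around your local model
        return {"text": text}

    # run with: uvicorn app:app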

- local ≠ offline; run it, serve it, or build apps on top

- fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or[...]