Ahmad
RT @TheAhmadOsman: > be us
> Larry & Sergey
> at Stanford with a crawler and a dream
> accidentally organize the entire internet
> call it Google
> build search, email, maps, docs, OS, phones, browser, car, satellite, thermostat, AI lab, TPU farm, and quantum computer
> 2025
> everyone talking about AGI
> OpenAI: “we need data, sensors, feedback, and scale”
> us: staring at Google Maps, YouTube, Gmail, Android, Waymo, Pixel, Fitbit, Docs, Calendar, Street View, and Earth Engine
> "damn. guess we already did that."
> YouTube: 2.6M videos/day
> Android: 3B phones, streaming sensor data 24/7
> Gmail: 1.8B inboxes of human priors
> Search: global-scale RLHF
> Waymo: 71M miles of real-world self-driving footage
> Google Earth: modeled the entire planet
> also your calendar
> people training LLMs on books and PDFs
> we train on humanity
> every click, swipe, tap, misspelled search, scroll, and bookmark
> feedback loop from hell (or heaven)
> depends who you ask
> OpenAI: “we need $100B for GPUs”
> us: already built TPUs
> custom silicon
> datacenters pre-co-located with planetary data lakes
> no egress, no latency
> just vibes and FLOPs
> coders: fine-tuning on GitHub repos
> us: 2 BILLION lines of internal code
> labeled, typed, tested
> every commit is a training signal
> Code LLMs dream of being our monorepo
> AGI recipe?
> multimodal perception
> real-world feedback
> giant codebase
> scalable compute
> alignment signals
> embodied sensors
> user data for days
> yeah we’ve had that since like 2016
> no investor decks
> no trillion-dollar hype rounds
> just a 25-year accidental simulation of Earth
> running in prod
> OpenAI raises $1T to build AGI
> investors call it revolutionary
> us: quietly mapping 10M new miles in Street View
> syncing another 80PB of Earth imagery
> collecting another year of Fitbit biosignals
> enjoy your foundation model
> we OWN the foundation
> people: “but Google is fumbling”
> true
> we’re fumbling in 120 countries simultaneously
> with the greatest compute footprint and research team on Earth
> fumble hard enough and you loop back into winning
> AGI?
> we don’t need to build it
> it’s already inside the building
> powered by Chrome tabs and doc revisions
> mfw we spent 20 years indexing reality
> mfw our data is so good it scares us
> mfw the only thing stopping us from AGI is a meeting between four VPs and one confused lawyer
> call it research
> call it scale
> call it “planetary simulation-as-a-service”
> we call it Tuesday
Ahmad
RT @TheAhmadOsman: - you are
- a random CS grad with 0 clue how LLMs work
- get tired of people gatekeeping with big words and tiny GPUs
- decide to go full monk mode
- 2 years later you can explain attention mechanisms at parties and ruin them
- here’s the forbidden knowledge map
- top to bottom, how LLMs *actually* work
- start at the beginning
- text → tokens
- tokens → embeddings
- you are now a vector of floating point numbers in high-dimensional space
- vibe accordingly
- positional embeddings:
- absolute: “i am position 5”
- rotary (RoPE): “i am a sine wave”
- alibi: “i penalize attention by distance like a hater”
- attention is all you need
- self-attention: “who am i allowed to pay attention to?”
- multihead: “what if i do that 8 times in parallel?”
- QKV: query, key, value
- sounds like a crypto scam
- actually the core of intelligence (minimal code sketch at the end of this thread)
- transformers:
- take your inputs
- smash them through attention layers
- normalize, activate, repeat
- dump the logits
- congratulations, you just inferred a token
- sampling tricks for the final output (sketch at the end of this thread):
- temperature: how chaotic you want to be
- top-k: only sample from the top K options
- top-p: sample from the smallest group of tokens whose probabilities sum to p
- beam search? never ask about beam search
- kv cache = cheat code
- saves past keys & values
- lets you skip reprocessing old tokens
- turns a 90B model from “help me I’m melting” to “real-time genius”
- long context hacks:
- sliding window: move the attention like a scanner
- infini attention: attend sparsely, like a laser sniper
- memory layers: store thoughts like a diary with read access
- mixture of experts (MoE):
- not all weights matter
- route tokens to different sub-networks
- only activate ~3B params out of 80B
- “only the experts reply” energy
- grouped query attention (GQA):
- fewer keys/values than queries
- improves inference speed
- “i want to be fast without being dumb”
- normalization & activations:
- layernorm, RMSnorm
- gelu, silu, relu
- they all sound like failed Pokémon
- but they make the network stable and smooth
- training goals:
- causal LM: guess the next word
- masked LM: guess the missing word
- span prediction, fill-in-the-middle, etc
- LLMs trained on the art of guessing and got good at it
- tuning flavors:
- finetuning: new weights
- instruction tuning: “please act helpful”
- rlhf: reinforcement from vibes and clickbait prompts
- dpo: direct preference optimization — basically “do what humans upvote”
- scaling laws:
- more data, more parameters, more compute
- loss goes down predictably
- intelligence is now a budget line item
- bonus round:
- quantization:
- post-training quantization (PTQ)
- quant-aware training (QAT)
- models shrink, inference gets cheaper
- gguf, awq, gptq — all just zip files with extra spice
- training vs inference stacks:
- deepspeed, megatron, fschat — for pain
- vllm, tgi, tensorRT-LLM — for speed
- everyone has a repo
- nobody reads the docs
- synthetic data:
- generate your own training set
- model teaches itself
- feedback loop of knowledge and hallucination
- welcome to the ouroboros era
- final boss secret:
- you can learn *all of this* in ~2 years
- no PhD
- no 10x compute
- just relentless curiosity, good bookmarks, and late nights
- the elite don’t want you to know this
- but now that you do
- choose to act
- start now
- build the models
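a minimal PyTorch sketch of the self-attention / QKV step from the thread above (single head, causal mask); shapes and the random init are made up for illustration, this is not any particular model's code:

```python
# minimal causal self-attention sketch (single head for clarity)
# assumptions: PyTorch, a batch of token embeddings x with shape (B, T, D)
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    B, T, D = x.shape
    q = x @ w_q            # "what am i looking for?"
    k = x @ w_k            # "what do i contain?"
    v = x @ w_v            # "what do i pass along?"
    scores = q @ k.transpose(-2, -1) / math.sqrt(D)    # (B, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))  # can't attend to the future
    weights = F.softmax(scores, dim=-1)                # "who am i allowed to pay attention to?"
    return weights @ v                                 # weighted mix of values

# multi-head = run this several times in parallel with a smaller D per head, then concat
B, T, D = 2, 8, 16
x = torch.randn(B, T, D)
w_q, w_k, w_v = (torch.randn(D, D) / math.sqrt(D) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)   # torch.Size([2, 8, 16])
```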
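and a sketch of the sampling tricks (temperature, top-k, top-p) rolled into one function; the thresholds and toy vocab size are arbitrary:

```python
# sketch of the sampling tricks above: temperature, top-k, top-p (nucleus)
# assumptions: logits is a 1-D tensor over the vocab; numbers are illustrative
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / max(temperature, 1e-8)           # higher temperature = more chaos
    if top_k is not None:                              # only sample from the top K options
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    if top_p is not None:                              # smallest set whose probabilities sum to p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
        keep[0] = True                                 # always keep the most likely token
        probs = torch.zeros_like(probs).scatter(0, sorted_idx[keep], sorted_probs[keep])
        probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50)                               # toy vocab of 50 tokens
print(sample_next_token(logits, temperature=0.8, top_k=10, top_p=0.9))
```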
Ahmad
RT @TheAhmadOsman: > be you
> want to actually learn how LLMs work
> sick of “just start with linear algebra and come back in 5 years”
> so i decided to build my own roadmap
> no fluff. no detours. no 200-hour generic ML playlists
> just the stuff that actually gets you from “what’s a token?” to “I trained a mini-GPT with LoRA adapters and FlashAttention”
> goal: build, fine-tune, and ship LLMs
> not vibe with them. not "learn the theory" forever
> build them
> you will:
> > build an autograd engine from scratch
> > write a mini-GPT from scratch
> > implement LoRA and fine-tune a model on real data
> > hate CUDA at least once
> > cry
> > keep going
> 5 phases
> if you already know something? skip
> if you're lost? rewatch
> if you’re stuck? use DeepResearch
> this is a roadmap, not a leash
> by the end: you either built the thing or you didn’t
> phase 0: foundations
> > if matrix multiplication is scary, you’re not ready yet
> > watch 3Blue1Brown’s linear algebra series
> > MIT 18.06 with Strang, yes, he’s still the GOAT
> > code Micrograd from scratch (Karpathy)
> > train a mini-MLP on MNIST
> > no frameworks, no shortcuts, no mercy
> phase 1: transformers
> > the name is scary
> > it’s just stacked matrix multiplies and attention blocks
> > Jay Alammar + 3Blue1Brown for the “aha”
> > Stanford CS224N for the theory
> > read "Attention Is All You Need" only AFTER building mental models
> > Karpathy's "Let's Build GPT" will break your brain in a good way
> > project: build a decoder-only GPT from scratch
> > bonus: swap tokenizers, try BPE/SentencePiece
> phase 2: scaling
> > LLMs got good by scaling, not magic
> > Kaplan paper -> Chinchilla paper
> > learn Data, Tensor, Pipeline parallelism
> > spin up multi-GPU jobs using HuggingFace Accelerate
> > run into VRAM issues
> > fix them
> > welcome to real training hell
> phase 3: alignment & fine-tuning
> > RLHF: OpenAI blog -> Ouyang paper
> > SFT -> reward model -> PPO (don’t get lost here)
> > Anthropic's Constitutional AI = smart constraints
> > LoRA/QLoRA: read, implement, inject into HuggingFace models (minimal sketch at the end of this thread)
> > fine-tune on real data
> > project: fine-tune gpt2 or distilbert with your own adapters
> > not toy examples. real use cases or bust
> phase 4: production
> > this is the part people skip to, but you earned it
> > inference optimization: FlashAttention, quantization, sub-second latency
> > read the paper, test with quantized models
> resources:
> math/coding:
> > 3Blue1Brown, MIT 18.06, Goodfellow’s book
> PyTorch:
> > Karpathy, Zero to Mastery
> transformers:
> > Alammar, Karpathy, CS224N, Vaswani et al
> scaling:
> > Kaplan, Chinchilla, HuggingFace Accelerate
> alignment:
> > OpenAI, Anthropic, LoRA, QLoRA
> inference:
> > FlashAttention
> the endgame:
> > understand how these models actually work
> > see through hype
> > ignore LinkedIn noise
> > build tooling
> > train real stuff
> > ship your own stack
> > look at a paper and think “yeah I get it”
> > build your own AI assistant, infra, whatever
> make it all the way through?
> ship something real?
> DM me.
> I wanna see what you built.
> happy hacking.
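a from-scratch sketch of the LoRA idea referenced in phase 3: freeze a pretrained Linear, learn a low-rank update on top. the rank, alpha, and layer sizes here are illustrative, not the paper's exact recipe:

```python
# minimal LoRA sketch: freeze a Linear layer, train only a low-rank update on top
# assumptions: PyTorch; rank/alpha values are illustrative
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                  # frozen pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trained, small random init
        self.B = nn.Parameter(torch.zeros(out_f, rank))         # starts at zero => no-op at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(4, 768)
print(layer(x).shape)                                           # torch.Size([4, 768])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                                # only the A/B adapter params train
```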
Ahmad
RT @TheAhmadOsman: GLM 4.5 > KIMI K2 > QWEN 3 235B NON-THINKING > Qwen 3 CODER 480B
For Agentic coding tools
GLM 4.5 with Claude Code is the closest thing to Opus 4 imo
Ahmad
RT @TheAhmadOsman: Comparing & Contrasting Recent LLM Architectures
> DeepSeek-V3/R1
> OLMo 2
> Gemma 3
> Mistral Small 3.1
> Llama 4
> Qwen3 (dense+MoE)
> SmolLM3
> Kimi K2
> GPT-OSS
Are 2025 LLMs really that different from each other?
MoE, MLA, GQA, sliding window, normalization games & more. https://t.co/JWg9cde34M
Ahmad
RT @TheAhmadOsman: ollama alternatives
> lmstudio
> llama.cpp
> exllamav2/v3
> vllm
> sglang
among many others
like literally anything is better than ollama lmao
Ahmad
RT @TheAhmadOsman: pro tip:
tell codex-cli or claude code to
generate relevant pre-commit hooks for your project
Ahmad
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ dimensions are common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine (quick check at the end of this thread)
- embeddings compress semantics; they are lossy by design
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails (retrieval sketch at the end of this thread)
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wrong generation vs prompt injection
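a quick numpy check of the vector-space bullets (L2 vs dot vs cosine, and why unit-length vectors make dot product the same as cosine); the vectors are random stand-ins for real embeddings:

```python
# quick check of the "vector space 101" bullets above
# assumptions: numpy; random vectors stand in for real embedding outputs
import numpy as np

a, b = np.random.randn(768), np.random.randn(768)

print("L2 distance:", np.linalg.norm(a - b))                  # straight-line distance
print("dot product:", a @ b)                                  # alignment + magnitude
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))    # angle only
print("cosine:", cosine)

# normalize to unit length => dot product becomes exactly the cosine similarity
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print("unit dot == cosine:", np.isclose(a_unit @ b_unit, cosine))

# memory math from above: 768-dim float32 ≈ 3 KB per vector
print(768 * 4 / 1024, "KB per vector")
```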
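and a brute-force sketch of the RAG loop: embed the question, cosine top-k over chunks, build the prompt. the embed() here is a placeholder (hash-seeded random vectors), so it won't retrieve semantically; it just shows the plumbing for the exact-search case the thread says is fine up to ~100k vectors:

```python
# brute-force RAG retrieval sketch: embed query, cosine top-k over chunks, build prompt
# assumptions: numpy; embed() is a stand-in for a real sentence-embedding model
import numpy as np

def embed(text: str) -> np.ndarray:
    # placeholder encoder: deterministic-ish random vector, NOT semantic
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)                        # unit length so dot == cosine

chunks = [
    {"text": "Refunds are processed within 5 business days.", "source": "faq.md"},
    {"text": "Our API rate limit is 100 requests per minute.", "source": "api.md"},
    {"text": "Support is available Monday through Friday.",    "source": "support.md"},
]
index = np.stack([embed(c["text"]) for c in chunks])    # (num_chunks, dim)

def retrieve(question: str, k: int = 2):
    q = embed(question)
    scores = index @ q                                  # cosine similarity over all chunks
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str, hits) -> str:
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return ("Answer using only the context below and cite sources.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

hits = retrieve("how long do refunds take?")
print(build_prompt("how long do refunds take?", hits))
```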
Ahmad
RT @TheAhmadOsman: last week, Karpathy dropped the ULTIMATE guide to speed-running your way into LLMs
in this project, you’ll build all the essentials, all under 8k lines of code
> train the tokenizer — new rust implementation
> pretrain a transformer LLM on fineweb
> evaluate CORE score across a bunch of metrics
> midtrain — user-assistant convos from smoltalk,
> multiple choice Qs, tool use
> sft, then eval the chat model on:
> world knowledge MCQ (arc-e/c, mmlu)
> math (gsm8k)
> code (humaneval)
> rl the model (optionally) on gsm8k with “grpo”
> efficient inference:
> kv cache, fast prefill/decode
> tool use (python interpreter, sandboxed)
> access via cli or chatgpt-like webui
> write a single markdown report card,
> summarizing + gamifying the whole pipeline
the model you’ll build (toy sketch of a couple of these choices at the end of this thread):
> rotary embeddings only (no learned absolute positional embeddings)
> qk norm
> untied embedding / unembedding
> norm after token embedding
> relu² mlp
> no biases in linears
> rmsnorm (no learnable params)
> mqa (multi-query attention)
> logit softcap
> optimizer: muon + adamw
if i had this a couple years ago i’d have dodged half the pain and skipped double the rabbit holes
happy hacking
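two of the model choices in the spec above (relu² mlp, rmsnorm with no learnable params) are small enough to sketch; the 4x hidden width is a common default and an assumption here, not necessarily what the project uses:

```python
# toy sketch of two design choices from the spec above:
# relu^2 MLP and RMSNorm without learnable parameters
# assumptions: PyTorch; widths are illustrative
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # normalize each vector by its root-mean-square; no learnable gain, as in the spec
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class ReluSquaredMLP(nn.Module):
    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.up = nn.Linear(dim, hidden_mult * dim, bias=False)    # "no biases in linears"
        self.down = nn.Linear(hidden_mult * dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)) ** 2)                  # relu² activation

x = torch.randn(2, 16, 128)
print(ReluSquaredMLP(128)(rms_norm(x)).shape)                      # torch.Size([2, 16, 128])
```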
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
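a minimal sketch of the proxy idea: accept an Anthropic-style /v1/messages request (what Claude Code speaks) and forward it to an OpenAI-compatible endpoint such as a local vLLM server. non-streaming only; the port, env var names, and model id are assumptions, and this is an illustration of the concept, not the actual tool from the demo:

```python
# sketch: Anthropic-style /v1/messages in, OpenAI-compatible /v1/chat/completions out
# (e.g. a local vLLM server). real Claude Code traffic uses content blocks + streaming,
# which this sketch does not handle; env vars and defaults are assumptions
import os
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
UPSTREAM = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")  # vLLM's default port
MODEL = os.getenv("UPSTREAM_MODEL", "local-model")                   # whatever vLLM was launched with

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    msgs = body.get("messages", [])
    if body.get("system"):                       # Anthropic puts the system prompt at top level
        msgs = [{"role": "system", "content": body["system"]}] + msgs
    async with httpx.AsyncClient(timeout=600) as client:
        resp = await client.post(
            f"{UPSTREAM}/chat/completions",
            json={
                "model": MODEL,
                "messages": msgs,
                "max_tokens": body.get("max_tokens", 1024),
                "temperature": body.get("temperature", 0.7),
            },
        )
    text = resp.json()["choices"][0]["message"]["content"]
    return {                                     # reshape into an Anthropic-style response
        "id": "msg_local",
        "type": "message",
        "role": "assistant",
        "model": MODEL,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }
```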
Ahmad
RT @TheAhmadOsman: - in 2025, your focus SHOULD NOT be CUDA
- the real bottlenecks are:
- data, inference, evals, dataloaders, infra in general
- want to get good?
- mess with PyTorch & JAX
- study inference infra like vLLM & SGLang
- build better eval pipelines
- learn how models run end-to-end
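one way to act on "build better eval pipelines": a tiny exact-match eval against any OpenAI-compatible server (vLLM and SGLang both expose one). the endpoint, model name, and test cases are placeholders:

```python
# tiny eval-pipeline sketch: exact-match scoring against an OpenAI-compatible server
# assumptions: requests; endpoint, model, and cases are placeholders
import requests

BASE_URL = "http://localhost:8000/v1"     # wherever your inference server is listening
MODEL = "my-local-model"                  # whatever the server was launched with

CASES = [
    {"prompt": "What is 2 + 2? Answer with just the number.", "expected": "4"},
    {"prompt": "What is the capital of France? One word.",    "expected": "Paris"},
]

def ask(prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,             # keep evals as deterministic as possible
            "max_tokens": 16,
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"].strip()

correct = sum(case["expected"].lower() in ask(case["prompt"]).lower() for case in CASES)
print(f"accuracy: {correct}/{len(CASES)}")
```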
Ahmad
RT @TheAhmadOsman: - local llms 101
- running a model = inference (using model weights)
- inference = predicting the next token based on your input plus all tokens generated so far
- together, these make up the "sequence"
- tokens ≠ words
- they're the chunks representing the text a model sees
- they are represented by integers (token IDs) in the model
- "tokenizer" = the algorithm that splits text into tokens
- common types: BPE (byte pair encoding), SentencePiece
- token examples:
- "hello" = 1 token or maybe 2 or 3 tokens
- "internationalization" = 5–8 tokens
- context window = max tokens model can "see" at once (2K, 8K, 32K+)
- longer context = more VRAM for KV cache, slower decode
- during inference, the model predicts next token
- by running lots of math on its "weights"
- model weights = billions of learned parameters (the knowledge and patterns from training)
- model parameters: usually billions of numbers (called weights) that the model learns during training
- these weights encode all the model's "knowledge" (patterns, language, facts, reasoning)
- think of them as the knobs and dials inside the model, tuned during training to recognize what could come next
- when you run inference, the model uses these parameters to compute its predictions, one token at a time
- every prediction is just: model weights + current sequence → probabilities for what comes next
- pick a token, append it, repeat, each new token becomes part of the sequence for the next prediction
- models are more than weight files
- neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below)
- weights: billions of learned numbers (parameters, not "tokens", but learned from training tokens)
- tokenizer: how text gets chunked into tokens (BPE/SentencePiece)
- config: metadata, shapes, special tokens, license, intended use, etc
- sometimes: chat templates are required for chat/instruct models, or else you get gibberish
- you give a model a prompt (your text, converted into tokens)
- models differ in parameter size:
- 7B means ~7 billion learned numbers
- common sizes: 7B, 13B, 70B
- bigger = stronger, but eats more VRAM/memory & compute
- the model computes a probability for every possible next token (softmax over vocab)
- picks one: either the highest (greedy) or
- samples from the probability distribution (temperature, top-p, etc)
- then appends that token to the sequence, then repeats the whole process
- this is generation:
- generate: predict, sample, append
- over and over, one token at a time
- rinse and repeat
- each new token depends on everything before it; the model re-reads the sequence every step
- generation is always stepwise: token by token, not all at once
- mathematically: model is a learned function, f_θ(seq) → p(next_token)
- all the "magic" is just repeating "what's likely next?" until you stop
- all conversation "tokens" live in the KV cache, or the "session memory"
- so what's actually inside the model?
- everything above (tokens, weights, config) is just setup for the real engine underneath
- the core of almost every modern llm is a transformer architecture
- this is the skeleton that moves all those numbers around
- it's what turns token sequences and weights into predictions
- designed for sequence data (like language),
- transformers can "look back" at previous tokens and
- decide which ones matter for the next prediction
- transformers work in layers, passing your sequence through the same recipe over and over
- each layer refines the representation, using attention to focus on the important parts of your input and context
- every time you generate a new token, it goes through this stack of layers, every single step
- inside each transformer layer:
- self-attention: figures out which previous tokens are important to the current prediction
- MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness
- layer norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence
- so "cat" and "catastrophe" aren't confused by position
- by stacking these layers (sometimes dozens or even hundreds)
- transformers build a complex understanding of your prompt, context, and conversation history
- transformer recap:
- decoder-only: model only predicts what comes next, each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
- VRAM: memory, the bottleneck
- VRAM must fit:
1. weights (main model, whether quantized or not)
2. KV cache (per token, per layer, per head)
- weights:
- FP16: ~2 bytes/param → 7B = ~14GB
- 8-bit: ~1 byte/param → 7B = ~7GB
- 4-bit: ~0.5 byte/param → 7B = ~3.5GB
- add 10–30% for runtime overheads (back-of-envelope calculator at the end of this thread)
- KV cache:
- rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB)
- some runtimes support KV cache quantization (8/4-bit) = big savings
- throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
- offload to CPU? expect MASSIVE slowdown
- GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
- CPU spill = sadness (check device_map and memory fit)
- quantization: reduce precision for memory wins (sometimes a tiny quality hit)
- FP32/FP16/BF16 = full/half precision
- INT8/INT4/NF4 = quantized
- 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, coding)
- KV cache quantization: even more memory saved for long contexts (check runtime support)
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU/portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
- protip: avoid legacy .bin (pickle risk), use safetensors for safety
- everything is a tradeoff
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = speed/memory, but maybe less accurate
- local = more control/knobs, more work
- what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating KV cache each step
- decoding = how next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from top k
- top-p: pick from smallest set with ≥p prob
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic = greedy decoding, or a fixed seed if you do sample
- tune for your use-case: chat, summarization, code
- serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for local API endpoint
- local ≠ offline; run it, serve it, or build apps on top
- fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or[...]
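the predict → sample → append loop described above, written out with HuggingFace transformers so the prefill/decode split and the KV cache are visible; gpt2 and the sampling settings are just small examples:

```python
# the predict -> sample -> append loop from above, made explicit with a KV cache
# assumptions: HuggingFace transformers + PyTorch; gpt2 is just a small example model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The local LLM scene is", return_tensors="pt").input_ids
past = None                                    # the KV cache ("session memory")

with torch.no_grad():
    for _ in range(20):
        if past is None:
            out = model(ids, use_cache=True)                                 # prefill: whole prompt
        else:
            out = model(ids[:, -1:], past_key_values=past, use_cache=True)   # decode: only the new token
        past = out.past_key_values
        logits = out.logits[:, -1, :] / 0.8                # temperature
        probs = torch.softmax(logits, dim=-1)              # probability over the whole vocab
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of greedy argmax
        ids = torch.cat([ids, next_id], dim=-1)            # append, then repeat

print(tok.decode(ids[0]))
```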
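and the VRAM arithmetic as a quick calculator; the bytes-per-param, ~0.5 MB/token KV figure, and 10–30% overhead are the thread's rules of thumb, not exact numbers:

```python
# back-of-envelope VRAM math from the thread's rules of thumb
# (bytes/param by precision, ~0.5 MB per KV-cache token for a 7B-class model, 10-30% overhead)
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b: float, precision: str, context_tokens: int,
                     kv_mb_per_token: float = 0.5, overhead: float = 0.2) -> float:
    weights_gb = params_b * BYTES_PER_PARAM[precision]   # e.g. 7B fp16 ≈ 14 GB
    kv_gb = context_tokens * kv_mb_per_token / 1024      # e.g. 4K tokens ≈ 2 GB
    return (weights_gb + kv_gb) * (1 + overhead)

for precision in ("fp16", "int8", "int4"):
    print(precision, round(estimate_vram_gb(7, precision, 4096), 1), "GB")
# fp16 ≈ 19.2 GB, int8 ≈ 10.8 GB, int4 ≈ 6.6 GB (with 20% overhead)
```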