Offshore
Photo
Dimitry Nakhla | Babylon Capital®
RT @DimitryNakhla: 10 Quality Stocks Offering 33% Higher FCF Yield Than the S&P 500 (LTM) 💵
1. $MA 3.18%
2. $DHR 3.18%
3. $INTU 3.21%
4. $V 3.30%
5. $SPGI 3.61%
6. $ADP 4.19%
7. $CSU 4.21%
8. $ICE 4.94%
9. $ABNB 5.37%
10. $BKNG 5.54%
—-
$SPY FCF Yield 2.35% (LTM)
tweet
RT @DimitryNakhla: 10 Quality Stocks Offering 33% Higher FCF Yield Than the S&P 500 (LTM) 💵
1. $MA 3.18%
2. $DHR 3.18%
3. $INTU 3.21%
4. $V 3.30%
5. $SPGI 3.61%
6. $ADP 4.19%
7. $CSU 4.21%
8. $ICE 4.94%
9. $ABNB 5.37%
10. $BKNG 5.54%
—-
$SPY FCF Yield 2.35% (LTM)
$SPY The S&P 500 free cash flow yield currently sits at 2.35%. https://t.co/MQYaZZGvNE - Koyfintweet
Offshore
Photo
Ahmad
أنا عادةً بكبر دماغي، بس ديه مش أول مرة أحمد ينتقدني فيها، وكل مرة بيزداد كلامه ثقلا على النفس إنها تعديه
والمشكلة إن مفيش إنتقاد واضح أصلًا، لو أنا حسابي مؤذي إعملي بلوك/متعملش فولو، لو هدفك تساعدني وتنصحني ابعت على الخاص سواء نصيحة او استواضح، ومش هتبقى أول مرة نتواصل مع بعض يعني، انما هنقول كلوت ونسخر من غير حتى ما نبقى بننتقد المحتوى نفسه بشكل موسع ونسمع لبعض؟... يا أخي التمس لأخيك ٧٠ عذرًا
ردي عليه واضح، وربنا وحده يعلم بالنوايا، فياريت نخف على بعض ونتعامل بحسن ظن
tweet
أنا عادةً بكبر دماغي، بس ديه مش أول مرة أحمد ينتقدني فيها، وكل مرة بيزداد كلامه ثقلا على النفس إنها تعديه
والمشكلة إن مفيش إنتقاد واضح أصلًا، لو أنا حسابي مؤذي إعملي بلوك/متعملش فولو، لو هدفك تساعدني وتنصحني ابعت على الخاص سواء نصيحة او استواضح، ومش هتبقى أول مرة نتواصل مع بعض يعني، انما هنقول كلوت ونسخر من غير حتى ما نبقى بننتقد المحتوى نفسه بشكل موسع ونسمع لبعض؟... يا أخي التمس لأخيك ٧٠ عذرًا
ردي عليه واضح، وربنا وحده يعلم بالنوايا، فياريت نخف على بعض ونتعامل بحسن ظن
احمد كان أبتدى كويس و الواحد كان بيحب يسمع بيقول ايه، الاهتمام تحول الي الريتش و الكلوت (أنا عارف انه كلاوت بس كلوت أوقع)
فكل كل الكلام بقى كلام شعبوي للاستهلاك بدون اي فايده حقيقية للبيستهلك المحتوى.
كان نفسي يفضل أحمد الـresearcher او حتى يكون في توازن ما بين الطبلة و ما بين القيمة الحقيقية - Ahmedtweet
Ahmad
RT @TheAhmadOsman: - local llms 101
- running a model = inference (using model weights)
- inference = predicting the next token based on your input plus all tokens generated so far
- together, these make up the "sequence"
- tokens ≠ words
- they're the chunks representing the text a model sees
- they are represented by integers (token IDs) in the model
- "tokenizer" = the algorithm that splits text into tokens
- common types: BPE (byte pair encoding), SentencePiece
- token examples:
- "hello" = 1 token or maybe 2 or 3 tokens
- "internationalization" = 5–8 tokens
- context window = max tokens model can "see" at once (2K, 8K, 32K+)
- longer context = more VRAM for KV cache, slower decode
- during inference, the model predicts next token
- by running lots of math on its "weights"
- model weights = billions of learned parameters (the knowledge and patterns from training)
- model parameters: usually billions of numbers (called weights) that the model learns during training
- these weights encode all the model's "knowledge" (patterns, language, facts, reasoning)
- think of them as the knobs and dials inside the model, specifically computed to recognize what could come next
- when you run inference, the model uses these parameters to compute its predictions, one token at a time
- every prediction is just: model weights + current sequence → probabilities for what comes next
- pick a token, append it, repeat, each new token becomes part of the sequence for the next prediction
- models are more than weight files
- neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below)
- weights: billions of learned numbers (parameters, not "tokens", but calculated from tokens)
- tokenizer: how text gets chunked into tokens (BPE/SentencePiece)
- config: metadata, shapes, special tokens, license, intended use, etc
- sometimes: chat template are required for chat/instruct models, or else you get gibberish
- you give a model a prompt (your text, converted into tokens)
- models differ in parameter size:
- 7B means ~7 billion learned numbers
- common sizes: 7B, 13B, 70B
- bigger = stronger, but eats more VRAM/memory & compute
- the model computes a probability for every possible next token (softmax over vocab)
- picks one: either the highest (greedy) or
- samples from the probability distribution (temperature, top-p, etc)
- then appends that token to the sequence, then repeats the whole process
- this is generation:
- generate; predict, sample, append
- over and over, one token at a time
- rinse and repeat
- each new token depends on everything before it; the model re-reads the sequence every step
- generation is always stepwise: token by token, not all at once
- mathematically: model is a learned function, f_θ(seq) → p(next_token)
- all the "magic" is just repeating "what's likely next?" until you stop
- all conversation "tokens" live in the KV cache, or the "session memory"
- so what's actually inside the model?
- everything above-tokens, weights, config-is just setup for the real engine underneath
- the core of almost every modern llm is a transformer architecture
- this is the skeleton that moves all those numbers around
- it's what turns token sequences and weights into predictions
- designed for sequence data (like language),
- transformers can "look back" at previous tokens and
- decide which ones matter for the next prediction
- transformers work in layers, passing your sequence through the same recipe over and over
- each layer refines the representation, using attention to focus on the important parts of your input and context
- every time you generate a new token, it goes through this stack of layers-every single step
- inside each transformer layer:
- self-attention: figures out which previous tokens are important to the current prediction
- MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness
- layer n[...]
RT @TheAhmadOsman: - local llms 101
- running a model = inference (using model weights)
- inference = predicting the next token based on your input plus all tokens generated so far
- together, these make up the "sequence"
- tokens ≠ words
- they're the chunks representing the text a model sees
- they are represented by integers (token IDs) in the model
- "tokenizer" = the algorithm that splits text into tokens
- common types: BPE (byte pair encoding), SentencePiece
- token examples:
- "hello" = 1 token or maybe 2 or 3 tokens
- "internationalization" = 5–8 tokens
- context window = max tokens model can "see" at once (2K, 8K, 32K+)
- longer context = more VRAM for KV cache, slower decode
- during inference, the model predicts next token
- by running lots of math on its "weights"
- model weights = billions of learned parameters (the knowledge and patterns from training)
- model parameters: usually billions of numbers (called weights) that the model learns during training
- these weights encode all the model's "knowledge" (patterns, language, facts, reasoning)
- think of them as the knobs and dials inside the model, specifically computed to recognize what could come next
- when you run inference, the model uses these parameters to compute its predictions, one token at a time
- every prediction is just: model weights + current sequence → probabilities for what comes next
- pick a token, append it, repeat, each new token becomes part of the sequence for the next prediction
- models are more than weight files
- neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below)
- weights: billions of learned numbers (parameters, not "tokens", but calculated from tokens)
- tokenizer: how text gets chunked into tokens (BPE/SentencePiece)
- config: metadata, shapes, special tokens, license, intended use, etc
- sometimes: chat template are required for chat/instruct models, or else you get gibberish
- you give a model a prompt (your text, converted into tokens)
- models differ in parameter size:
- 7B means ~7 billion learned numbers
- common sizes: 7B, 13B, 70B
- bigger = stronger, but eats more VRAM/memory & compute
- the model computes a probability for every possible next token (softmax over vocab)
- picks one: either the highest (greedy) or
- samples from the probability distribution (temperature, top-p, etc)
- then appends that token to the sequence, then repeats the whole process
- this is generation:
- generate; predict, sample, append
- over and over, one token at a time
- rinse and repeat
- each new token depends on everything before it; the model re-reads the sequence every step
- generation is always stepwise: token by token, not all at once
- mathematically: model is a learned function, f_θ(seq) → p(next_token)
- all the "magic" is just repeating "what's likely next?" until you stop
- all conversation "tokens" live in the KV cache, or the "session memory"
- so what's actually inside the model?
- everything above-tokens, weights, config-is just setup for the real engine underneath
- the core of almost every modern llm is a transformer architecture
- this is the skeleton that moves all those numbers around
- it's what turns token sequences and weights into predictions
- designed for sequence data (like language),
- transformers can "look back" at previous tokens and
- decide which ones matter for the next prediction
- transformers work in layers, passing your sequence through the same recipe over and over
- each layer refines the representation, using attention to focus on the important parts of your input and context
- every time you generate a new token, it goes through this stack of layers-every single step
- inside each transformer layer:
- self-attention: figures out which previous tokens are important to the current prediction
- MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness
- layer n[...]
Offshore
Ahmad RT @TheAhmadOsman: - local llms 101 - running a model = inference (using model weights) - inference = predicting the next token based on your input plus all tokens generated so far - together, these make up the "sequence" - tokens ≠ words - they're…
orms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence
- so "cat" and "catastrophe" aren't confused by position
- by stacking these layers (sometimes dozens or even hundreds)
- transformers build a complex understanding of your prompt, context, and conversation history
- transformer recap:
- decoder-only: model only predicts what comes next, each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
- VRAM: memory, the bottleneck
- VRAM must must fit:
1. weights (main model, whether quantized or not)
2. KV cache (per token, per layer, per head)
- weights:
- FP16: ~2 bytes/param → 7B = ~14GB
- 8-bit: ~1 byte/param → 7B = ~7GB
- 4-bit: ~0.5 byte/param → 7B = ~3.5GB
- add 10–30% for runtime overheads
- KV cache:
- rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB)
- some runtimes support KV cache quantization (8/4-bit) = big savings
- throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
- offload to CPU? expect MASSIVE slowdown
- GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
- CPU spill = sadness (check device_map and memory fit)
- quantization: reduce precision for memory wins (sometimes a tiny quality hit)
- FP32/FP16/BF16 = full/floored
- INT8/INT4/NF4 = quantized
- 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, coding)
- KV cache quantization: even more memory saved for long contexts (check runtime support)
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU/portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
- protip: avoid legacy .bin (pickle risk), use safetensors for safety
- everything is a tradeoff
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = speed/memory, but maybe less accurate
- local = more control/knobs, more work
- what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating KV cache each step
- decoding = how next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from top k
- top-p: pick from smallest set with ≥p prob
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic = set a seed and no sampling
- tune for your use-case: chat, summarization, code
- serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for local API endpoint
- local ≠ offline; run it, serve it, or build apps on top
- fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or[...]
- positional encodings (like RoPE): tell the model where each token sits in the sequence
- so "cat" and "catastrophe" aren't confused by position
- by stacking these layers (sometimes dozens or even hundreds)
- transformers build a complex understanding of your prompt, context, and conversation history
- transformer recap:
- decoder-only: model only predicts what comes next, each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
- VRAM: memory, the bottleneck
- VRAM must must fit:
1. weights (main model, whether quantized or not)
2. KV cache (per token, per layer, per head)
- weights:
- FP16: ~2 bytes/param → 7B = ~14GB
- 8-bit: ~1 byte/param → 7B = ~7GB
- 4-bit: ~0.5 byte/param → 7B = ~3.5GB
- add 10–30% for runtime overheads
- KV cache:
- rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB)
- some runtimes support KV cache quantization (8/4-bit) = big savings
- throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
- offload to CPU? expect MASSIVE slowdown
- GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
- CPU spill = sadness (check device_map and memory fit)
- quantization: reduce precision for memory wins (sometimes a tiny quality hit)
- FP32/FP16/BF16 = full/floored
- INT8/INT4/NF4 = quantized
- 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, coding)
- KV cache quantization: even more memory saved for long contexts (check runtime support)
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU/portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
- protip: avoid legacy .bin (pickle risk), use safetensors for safety
- everything is a tradeoff
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = speed/memory, but maybe less accurate
- local = more control/knobs, more work
- what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating KV cache each step
- decoding = how next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from top k
- top-p: pick from smallest set with ≥p prob
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic = set a seed and no sampling
- tune for your use-case: chat, summarization, code
- serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for local API endpoint
- local ≠ offline; run it, serve it, or build apps on top
- fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or[...]
Offshore
orms and residuals: stabilize learning and prediction, making deep networks possible - positional encodings (like RoPE): tell the model where each token sits in the sequence - so "cat" and "catastrophe" aren't confused by position - by stacking these layers…
few-shot for niche tasks
- common pitfalls
- OOM? out of memory. Model or context too big, quantize or shrink context
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source
- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing-just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming
- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)
- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)
- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking next token
- RAG: retrieval-augmented generation, add real info
- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)
- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones
- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, code)
- quantization types: FP16 (full), INT8 (quantized), INT4/NF4 (more quantized), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware
- avoid legacy .bin (pickle risk), use safetensors for safety
- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work
- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware
- master these, and you can run (and reason about) almost any modern model locally
tweet
- common pitfalls
- OOM? out of memory. Model or context too big, quantize or shrink context
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source
- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing-just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming
- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)
- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)
- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking next token
- RAG: retrieval-augmented generation, add real info
- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)
- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones
- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
- math-heavy or finicky tasks degrade first (math, logic, code)
- quantization types: FP16 (full), INT8 (quantized), INT4/NF4 (more quantized), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware
- avoid legacy .bin (pickle risk), use safetensors for safety
- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work
- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware
- master these, and you can run (and reason about) almost any modern model locally
tweet
Offshore
Photo
Ahmad
RT @TheAhmadOsman: the Buy a GPU website & guide is launching this week
so, what should you expect? https://t.co/e36YLjAdoo
tweet
RT @TheAhmadOsman: the Buy a GPU website & guide is launching this week
so, what should you expect? https://t.co/e36YLjAdoo
tweet
Ahmad
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ numbers is common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine
- embeddings compress semantics; they are lossy by design
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wr[...]
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ numbers is common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine
- embeddings compress semantics; they are lossy by design
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wr[...]
Offshore
Ahmad RT @TheAhmadOsman: - you are - a normal dev who’s heard “embeddings” and “RAG” 1000x - want to know what they actually are, how they plug into LLMs - suddenly: vectors are just coordinates for meaning, not magic - first: what even is an “embedding”?…
ong generation vs prompt injection
tweet
tweet
Offshore
Photo
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
tweet
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
tweet
Offshore
Video
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
tweet
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
tweet