- norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence
  - so "cat" and "catastrophe" aren't confused by position
- by stacking these layers…
- few-shot for niche tasks
- common pitfalls
- OOM? out of memory. Model or context too big, quantize or shrink context
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source
- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing, just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming
- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)
- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template; see the sketch after this checklist)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)
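a minimal sketch of that checklist in code, assuming the Hugging Face Transformers + bitsandbytes stack on a CUDA GPU; the model name, 4-bit settings, and sampler values below are examples, not the thread's picks:

```python
# checklist sketch: pick model -> 4-bit precision -> chat template -> decoding
# assumptions: transformers + bitsandbytes installed, a CUDA GPU, and an
# example chat-tuned model small enough for consumer VRAM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example only; size it to your VRAM

# 4-bit NF4 quantization: large memory savings, minor quality drop for most tasks
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available GPU(s), spill to CPU if needed
)

# correct chat template: let the tokenizer emit the model's expected markup
messages = [{"role": "user", "content": "Explain the KV cache in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# decoding knobs: temperature/top_p control randomness
output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

once this runs and the tokens/sec look sane, the "serve as local API" step is usually just switching to a runtime with an OpenAI-compatible server (vLLM, llama.cpp's llama-server, etc.) and pointing existing clients at it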
- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking next token
- RAG: retrieval-augmented generation, add real info
- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)
- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones
- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
- precision-sensitive tasks degrade first (math, logic, code)
- precision levels: FP16 (unquantized baseline), INT8 (quantized), INT4/NF4 (more aggressive), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware
- avoid legacy .bin (pickle risk), use safetensors for safety
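to make the safetensors-vs-.bin point concrete, a tiny sketch (paths are placeholders, and the commented torch.load line is the risky legacy path):

```python
# .safetensors holds tensors + metadata only, so loading it cannot run code
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # placeholder path

# legacy .bin checkpoints are pickle files: torch.load on an untrusted file
# can execute arbitrary code. if you must, recent PyTorch's weights_only=True
# narrows the attack surface, but prefer safetensors from a verified source.
# import torch
# state_dict = torch.load("pytorch_model.bin", weights_only=True)
```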
- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work
- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory (rough math sketched below)
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware
- master these, and you can run (and reason about) almost any modern model locally
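and since "memory math" keeps coming up, here's the rough arithmetic in a few lines; the layer/head numbers are illustrative (roughly 7B-class with GQA), not any specific model card:

```python
# rough VRAM estimate: weights + KV cache. shapes below are illustrative.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1024**3

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # keys AND values (2x), per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

params = 7e9  # a 7B-class model
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5} weights: {weight_gb(params, bits):5.1f} GB")

# illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
for ctx in (4096, 32768):
    print(f"KV cache @ {ctx:>6} tokens: {kv_cache_gb(32, 8, 128, ctx):4.1f} GB")
```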
Ahmad
RT @TheAhmadOsman: the Buy a GPU website & guide is launching this week
so, what should you expect? https://t.co/e36YLjAdoo
Ahmad
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ numbers is common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine
- embeddings compress semantics; they are lossy by design
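a tiny numpy check of those three measures (vectors are made up):

```python
import numpy as np

a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.4])

l2  = np.linalg.norm(a - b)                            # straight-line distance
dot = a @ b                                            # alignment + magnitude
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # angle only

# normalize to unit length and the dot product IS cosine similarity
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(l2, dot, cos, a_hat @ b_hat)                     # last two values match
```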
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap (see the chunking sketch after this list)
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
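a sketch of token-based chunking with overlap; tiktoken is used purely as an example tokenizer, and the 500/75 numbers just follow the rule of thumb above:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer

def chunk_text(text: str, title: str, chunk_size: int = 500, overlap: int = 75):
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            # keep the title as context inside each chunk
            "text": f"{title}\n\n{enc.decode(window)}",
            "meta": {"source": title, "start_token": start},
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_text("long document text goes here ...", title="My Doc")
```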
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
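quick sanity check of that memory math plus the "exact search is fine at small scale" point, using FAISS as one concrete option (the embedding model name is just an illustrative label):

```python
import numpy as np
import faiss

dim = 768
print(f"1M float32 vectors @ {dim} dims ≈ {1_000_000 * dim * 4 / 1024**3:.1f} GB raw")

# metadata you must not lose: which model produced these vectors, and the dimension
index_meta = {"embedding_model_name": "example-embedder-v1", "dimension": dim}

vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(vectors)           # unit length => inner product == cosine
index = faiss.IndexFlatIP(dim)        # exact (brute-force) search, no ANN tricks
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar chunks by cosine
```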
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails
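the same loop as plain Python; `embed`, `vector_index`, and `generate` are placeholders for whatever embedder, vector store, and LLM client you actually use, not any specific library's API:

```python
def answer(question: str, embed, vector_index, generate, k: int = 5) -> str:
    # 1) embed the question into a single vector
    q_vec = embed(question)

    # 2) retrieve top-k similar chunks (each hit carries text + source metadata)
    hits = vector_index.search(q_vec, k)

    # 3) (optional) rerank `hits` with a cross-encoder here

    # 4) stuff the best chunks into the prompt as context
    context = "\n\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    prompt = (
        "Answer using ONLY the context below. Cite sources in [brackets].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 5) generate the answer
    return generate(prompt)
```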
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
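reciprocal rank fusion is small enough to show whole; k=60 is the commonly used smoothing constant:

```python
from collections import defaultdict

def rrf(result_lists, k: int = 60):
    """result_lists: ranked lists of doc ids, best first (e.g. dense + BM25)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # from vector search
sparse = ["d1", "d9", "d3"]   # from BM25
print(rrf([dense, sparse]))   # docs ranked high in both float to the top
```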
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
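a reranking sketch with sentence-transformers; the model name is a common public cross-encoder, treat it as an example rather than a recommendation:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # the cross-encoder reads query + chunk together and scores relevance
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```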
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wrong generation vs prompt injection
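the offline side of that eval fits in a few lines; `retrieve` is a placeholder for your own search function and the test set is obviously made up:

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mrr(relevant: set, retrieved: list) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

test_set = [
    {"query": "refund policy", "relevant": {"doc_12"}},
    {"query": "api rate limits", "relevant": {"doc_3", "doc_44"}},
]

def evaluate(retrieve, k: int = 5):
    recalls = [recall_at_k(t["relevant"], retrieve(t["query"], k), k) for t in test_set]
    mrrs    = [mrr(t["relevant"], retrieve(t["query"], k)) for t in test_set]
    return sum(recalls) / len(recalls), sum(mrrs) / len(mrrs)
```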
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
Ahmad
RT @TheAhmadOsman: the Tenstorrent QuietBox Blackhole
> is a 3.2 Tb/s Ethernet mesh
> that pools memory
> and scales almost linearly
> when you daisy‑chain more boxes
the TT-QuietBox Blackhole comes with
> ~80 lbs liquid-cooled chassis
> AMD EPYC 8124P, 16c/32t
> 512 GB DDR5 ECC
> 4 TB NVMe
> ASRock Rack SIENAD8‑2L2T w/ 2x 10 GbE + IPMI
> 4x Blackhole p150c cards, totalling:
> 560 Tensix Cores
> 64 “big” RISC-V cores
> 128 GB GDDR6
> 840 MB On‑Chip SRAM
> 3.2 Tb/s Ethernet mesh
> 16x QSFP‑DD 800G ports for card⇔card comms
> 8x passive direct‑attach copper (DAC) cables (0.6m)
> all of this is powered by a single
> 1650W Platinum PSU, passively cooled
> ready to daisy-chain to the next QuietBox
> also, opensource stack (TT‑Forge → TT‑NN → TT‑Metalium)
the interconnect is the star
> what does “4x QSFP‑DD 800G” actually mean?
> QSFP‑DD = Quad Small Form‑Factor Pluggable — Double Density
> 8 electrical lanes per port
> ~100 Gb/s per lane using PAM4 signalling
> total: 800 Gb/s full‑duplex per port → ~100 GB/s usable each way after Ethernet framing + FEC
each card talks directly to its siblings over QSFP‑DD 800G
> 4 ports per card x 800 Gb/s each =
> 3.2 Tb/s of aggregate bidirectional fabric per card
> 16 ports total per “quietbox” =
> 3.2 Tb/s internal mesh across all 4 cards
> this is your NVLink replacement
> no PCIe bottlenecks, no host-side relays
> just a true east-west ethernet fabric
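the arithmetic behind those numbers, nothing more (raw line rates, before Ethernet framing + FEC overhead):

```python
lanes_per_port = 8
gbps_per_lane  = 100                             # PAM4 lanes on an 800G QSFP-DD port
port_gbps      = lanes_per_port * gbps_per_lane  # 800 Gb/s per port
card_gbps      = 4 * port_gbps                   # 3200 Gb/s = 3.2 Tb/s fabric per card
port_gbytes    = port_gbps / 8                   # ~100 GB/s each way, pre-overhead
print(port_gbps, card_gbps, port_gbytes)
```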
there’s a hard rule
> the QSFP‑DD 800G ports are passive
> they only connect to other Blackhole cards via direct‑attach copper (DAC)
> max length = 2 meters; no optics, no switches, no uplinks to your ethernet fabric
> Blackhole fabric is its own world: card⇔card, box⇔box, nothing else
daisy‑chain the DACs and you’re all set, add more boxes and enjoy the 3.2 Tb/s ethernet mesh that pools memory and scales almost linearly
pretty sleek hardware UX, more soon
Ahmad
be like this guy
Buy GPUs https://t.co/BjIwH1A9jE
GLM 4.5 Air running at ~60 tok/sec on 4x 3090!
3090s are still great cards to buy if you want to run inference with 100b models, locally, for your own use
https://t.co/YGVln1dcLd https://t.co/76u0DcrVwN - mconcat
AkhenOsiris
Secular shifts are powerful. No need to be early to capitalize on massive gains. Cloud computing, social media, streaming content, etc etc.
And now AI. When ChatGPT debuted (Nov. 2022), NVDA quickly 3x'd in 8 months. It then consolidated and did nothing for the next 6 months. That's well over a year to analyze, research, absorb data. Watch how the narrative unfolds.
Since then, NVDA has 3.5x'd (with a few major drawdowns along the way). Second derivative rates may have finally peaked, but capex, tokens, API calls, etc are still growing. Models are being trained on ever larger clusters, with the latest chips (i.e. Blackwell).
The party will need further advancement, uptake, monetization to keep going, but hard to say with any conviction yet whether it is over or not.
Ahmad
who here would like to see a build video guide for multiple RTX PRO 6000s?
already got the hardware ordered for a couple of 3090s and 5090s build guides btw
yes, there'll be GPU giveaways ;)
first video guide before Thanksgiving
anyway, Buy a GPU keeps on winning
@TheAhmadOsman Probably nothing 👀🔥 https://t.co/sCr29jVV7J - Mike Bradley