Ahmad
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ numbers are common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine
- embeddings compress semantics; they are lossy by design
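a minimal NumPy sketch of those three closeness measures (toy 3-dim vectors here; real embeddings have hundreds of dimensions):

```python
import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.1, 0.8, 0.5])

l2 = np.linalg.norm(a - b)                               # straight-line distance
dot = float(np.dot(a, b))                                # alignment + magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only

# unit-normalize both vectors: now dot product and cosine are the same number
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_hat, b_hat), cosine)
```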
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
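a sketch of the chunking step under the rule of thumb above; chunk_size=500 and overlap=75 are illustrative values, and tiktoken stands in for whichever tokenizer matches your embedding model:

```python
import tiktoken  # any tokenizer works; tiktoken is just a convenient example

def chunk_text(text: str, title: str, chunk_size: int = 500, overlap: int = 75) -> list[dict]:
    """Split text into overlapping token windows, keeping the title inside each chunk."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": f"{title}\n\n{enc.decode(window)}",           # title kept as context
            "metadata": {"source": title, "start_token": start},  # for citations/filters
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks
```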
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
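a brute-force exact-search sketch plus the memory arithmetic; at larger scale you would swap this matmul for an ANN index (HNSW, IVF/PQ) behind the same interface:

```python
import numpy as np

def exact_top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Brute-force cosine search: index is (N, dim) float32 with unit-normalized rows."""
    q = query / np.linalg.norm(query)
    scores = index @ q                     # dot product == cosine on unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# memory math check: 1M vectors x 768 dims x 4 bytes (float32)
print(1_000_000 * 768 * 4 / 1e9)           # ~3.07 GB before index overhead
```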
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails
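the loop above as a sketch; embed, search, and generate are placeholders for whatever embedding model, vector index, and chat LLM you actually run:

```python
from typing import Callable

def rag_answer(
    question: str,
    embed: Callable[[str], list[float]],                # your embedding model
    search: Callable[[list[float], int], list[dict]],   # your vector index
    generate: Callable[[str], str],                     # your chat LLM
    k: int = 8,
) -> str:
    """One pass of the RAG loop: embed -> retrieve -> build prompt -> generate."""
    q_vec = embed(question)
    hits = search(q_vec, k)                             # each hit: {"title": ..., "text": ...}
    context = "\n\n".join(f"[{h['title']}]\n{h['text']}" for h in hits)
    prompt = (
        "Answer the question using ONLY the context below and cite the bracketed titles.\n"
        "If the context is insufficient, say so.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    return generate(prompt)
```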
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
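a reciprocal rank fusion sketch; it merges ranked id lists (dense + BM25) by rank position alone, and k=60 is the constant usually cited for it:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists (e.g. dense + BM25) using only each doc's rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```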
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
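a reranking sketch assuming sentence-transformers is available; the model name is just a commonly used public cross-encoder, substitute your own:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Re-score (query, chunk) pairs jointly, then keep the best top_k."""
    scores = reranker.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```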
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
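a naive sanitizer sketch for the injection point above; the regex patterns are illustrative only, and real defenses need more than string matching:

```python
import re

# illustrative instruction-like patterns, not an exhaustive blocklist
INJECTION = re.compile(
    r"ignore (all|any|previous) instructions|disregard the above|you are now", re.I
)

def sanitize_chunk(chunk: str) -> str:
    """Drop instruction-like lines from retrieved text before it enters the prompt."""
    return "\n".join(line for line in chunk.splitlines() if not INJECTION.search(line))
```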
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
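one way to keep those knobs in a single config object; the defaults here are illustrative starting points, not recommendations:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalConfig:
    chunk_size: int = 500            # tokens per chunk
    chunk_overlap: int = 75          # tokens shared between neighboring chunks
    top_k: int = 8                   # chunks passed to the LLM
    min_similarity: float = 0.3      # drop weak matches at the tail
    use_reranker: bool = True        # trade latency for quality
    metadata_filters: dict = field(default_factory=dict)  # e.g. {"tenant": "acme"}
```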
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wrong generation vs prompt injection
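an offline-eval sketch over a small test set, where each query maps to the set of passage ids it should retrieve; average these across queries:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected passages that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```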
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
Ahmad
RT @TheAhmadOsman: the Tenstorrent QuietBox Blackhole
> is a 3.2 Tb/s Ethernet mesh
> that pools memory
> and scales almost linearly
> when you daisy‑chain more boxes
the TT-QuietBox Blackhole comes with
> ~80 lbs liquid-cooled chassis
> AMD EPYC 8124P, 16c/32t
> 512 GB DDR5 ECC
> 4 TB NVMe
> ASRock Rack SIENAD8‑2L2T w/ 2x 10 GbE + IPMI
> 4x Blackhole p150c cards, totalling:
> 560 Tensix Cores
> 64 “big” RISC-V cores
> 128 GB GDDR6
> 840 MB On‑Chip SRAM
> 3.2 Tb/s Ethernet mesh
> 16x QSFP‑DD 800G ports for card⇔card comms
> 8x passive direct‑attach copper (DAC) cables (0.6m)
> all of this is powered by a single
> 1650W Platinum PSU, passively cooled
> ready to daisy-chain to the next QuietBox
> also, opensource stack (TT‑Forge → TT‑NN → TT‑Metalium)
the interconnect is the star
> what does “4x QSFP‑DD 800G” actually mean?
> QSFP‑DD = Quad Small Form‑Factor Pluggable — Double Density
> 8 electrical lanes per port
> ~100 Gb/s per lane using PAM4 signalling
> total: 800 Gb/s full‑duplex per port → ~100 GB/s usable each way after Ethernet framing + FEC
each card talks directly to its siblings over QSFP‑DD 800G
> 4 ports per card x 800 Gb/s each =
> 3.2 Tb/s of aggregate bidirectional fabric per card
> 16 ports total per “quietbox” =
> 3.2 Tb/s internal mesh across all 4 cards
> this is your NVLink replacement
> no PCIe bottlenecks, no host-side relays
> just a true east-west ethernet fabric
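the bandwidth arithmetic behind those numbers, worked out in a few lines of Python (rounded, ignoring framing/FEC overhead):

```python
lanes_per_port = 8
gbit_per_lane = 100                                    # PAM4 serdes, ~100 Gb/s per lane
port_gbit = lanes_per_port * gbit_per_lane             # 800 Gb/s per QSFP-DD port
port_gbyte_each_way = port_gbit / 8                    # ~100 GB/s usable each direction
ports_per_card = 4
card_fabric_tbit = ports_per_card * port_gbit / 1000   # 3.2 Tb/s of fabric per card
print(port_gbit, port_gbyte_each_way, card_fabric_tbit)  # 800 100.0 3.2
```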
there’s a hard rule
> the QSFP‑DD 800G ports are passive
> they only connect to other Blackhole cards via direct‑attach copper (DAC)
> max length = 2 meters; no optics, no switches, no uplinks to your ethernet fabric
> Blackhole fabric is its own world: card⇔card, box⇔box, nothing else
daisy‑chain the DACs and you're all set; add more boxes and enjoy the 3.2 Tb/s ethernet mesh that pools memory and scales almost linearly
pretty sleek hardware UX, more soon
Ahmad
be like this guy
Buy GPUs https://t.co/BjIwH1A9jE
GLM 4.5 Air running at ~60 tok/sec on 4x 3090!
3090s are still great cards to buy if you want to run inference with 100B models, locally, for your own use
https://t.co/YGVln1dcLd https://t.co/76u0DcrVwN - mconcat
AkhenOsiris
Secular shifts are powerful. No need to be early to capitalize on massive gains. Cloud computing, social media, streaming content, etc etc.
And now AI. When ChatGPT debuted (Nov. 2022), NVDA quickly 3x'd in 8 months. It then consolidated and did nothing for the next 6 months. That's well over a year to analyze, research, absorb data. Watch how the narrative unfolds.
Since then, NVDA has 3.5x'd (with a few major drawdowns along the way). Second derivative rates may have finally peaked, but capex, tokens, API calls, etc are still growing. Models are being trained on ever larger clusters, with the latest chips (i.e. Blackwell).
The party will need further advancement, uptake, monetization to keep going, but hard to say with any conviction yet whether it is over or not.
Ahmad
who here would like to see a build video guide for multiple RTX PRO 6000s?
already got the hardware ordered for a couple of 3090s and 5090s build guides btw
yes, there'll be GPU giveaways ;)
first video guide before Thanksgiving
anyway, Buy a GPU keeps on winning
@TheAhmadOsman Probably nothing 👀🔥 https://t.co/sCr29jVV7J - Mike Bradley
Clark Square Capital
Couldn't decide, so we are doing two idea threads: a Japan only, and a special sit only. Lesss go! https://t.co/5NiyqPdEnZ
Ok, guys. It's been about a month since the last idea thread. What's a good prompt for the next one? I will pick the best one and use that. - Clark Square Capital