Ahmad
RT @TheAhmadOsman: - you are
- a normal dev who’s heard “embeddings” and “RAG” 1000x
- want to know what they actually are, how they plug into LLMs
- suddenly: vectors are just coordinates for meaning, not magic
- first: what even is an “embedding”?
- embedding = a list of numbers (a vector) that represents text
- same-ish meaning ⇒ nearby vectors; different meaning ⇒ far apart
- produced by a smaller model (an encoder), not your chat LLM
- length (a.k.a. dimension): 256/384/768/1024+ numbers are common
- the vector space (101)
- you can measure closeness with math:
- L2 distance: straight-line distance
- dot product: alignment + magnitude
- cosine similarity: (a·b)/(||a||·||b||) = angle only
- normalize vectors (unit length) ⇒ dot product ≡ cosine
- embeddings compress semantics; they are lossy by design
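a minimal NumPy sketch of those three closeness measures (toy 3-dim vectors here; real embeddings have hundreds of dimensions):

```python
import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.1, 0.8, 0.5])

l2 = np.linalg.norm(a - b)                               # straight-line distance
dot = float(np.dot(a, b))                                # alignment + magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only

# unit-normalize both vectors: now dot product and cosine are the same number
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_hat, b_hat), cosine)
```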
- types of embeddings (don’t overthink; pick what you need)
- token embeddings: internal to the LLM (you don’t use these)
- sentence/document embeddings: 1 vector per chunk/snippet
- multilingual: one space across languages
- domain-tuned: legal, code, bio — better clustering for that domain
- how text becomes vectors (pipeline)
- clean text (lowercase? keep punctuation? depends; don’t destroy signal)
- chunking: split long docs into overlapping windows (by tokens, not chars)
- rule of thumb: 200–800 tokens, 10–20% overlap
- keep titles/headers as context inside each chunk
- embed each chunk ⇒ store in a vector index with metadata (source, page, tags)
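a sketch of the chunking step under the rule of thumb above; chunk_size=500 and overlap=75 are illustrative values, and tiktoken stands in for whichever tokenizer matches your embedding model:

```python
import tiktoken  # any tokenizer works; tiktoken is just a convenient example

def chunk_text(text: str, title: str, chunk_size: int = 500, overlap: int = 75) -> list[dict]:
    """Split text into overlapping token windows, keeping the title inside each chunk."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": f"{title}\n\n{enc.decode(window)}",           # title kept as context
            "metadata": {"source": title, "start_token": start},  # for citations/filters
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks
```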
- storing & searching vectors
- exact search (brute force): simplest; fine for ≤100k vectors
- ANN (approx nearest neighbor): fast at scale, tiny recall tradeoff
- HNSW (graph-based): great latency, memory heavier
- IVF/PQ (quantization): smaller index, some recall loss
- where to put them:
- FAISS/hnswlib (library), pgvector (Postgres), dedicated stores (Milvus, Pinecone, Weaviate, etc.)
- ops notes:
- track embedding_model_name + dimension in the index
- you cannot mix dimensions or swap models without re-embedding
- memory math: 768-dim float32 ≈ 3 KB/vector → 1M vectors ≈ ~3 GB (+ index overhead)
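a brute-force exact-search sketch plus the memory arithmetic; at larger scale you would swap this matmul for an ANN index (HNSW, IVF/PQ) behind the same interface:

```python
import numpy as np

def exact_top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Brute-force cosine search: index is (N, dim) float32 with unit-normalized rows."""
    q = query / np.linalg.norm(query)
    scores = index @ q                     # dot product == cosine on unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# memory math check: 1M vectors x 768 dims x 4 bytes (float32)
print(1_000_000 * 768 * 4 / 1e9)           # ~3.07 GB before index overhead
```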
- RAG (Retrieval-Augmented Generation): the shape of it
- goal: let the LLM answer with your data, not its memory
- loop:
- take user question
- embed question (a single vector)
- retrieve top-k similar chunks (k=3–20 is common)
- (optional) rerank with a cross-encoder (relevance re-check)
- stuff the best chunks into the prompt as context
- generate answer (cite sources; limit style drift)
- RAG ≠ “just search”; it’s retrieval + prompt construction + guardrails
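the loop above as a sketch; embed, search, and generate are placeholders for whatever embedding model, vector index, and chat LLM you actually run:

```python
from typing import Callable

def rag_answer(
    question: str,
    embed: Callable[[str], list[float]],                # your embedding model
    search: Callable[[list[float], int], list[dict]],   # your vector index
    generate: Callable[[str], str],                     # your chat LLM
    k: int = 8,
) -> str:
    """One pass of the RAG loop: embed -> retrieve -> build prompt -> generate."""
    q_vec = embed(question)
    hits = search(q_vec, k)                             # each hit: {"title": ..., "text": ...}
    context = "\n\n".join(f"[{h['title']}]\n{h['text']}" for h in hits)
    prompt = (
        "Answer the question using ONLY the context below and cite the bracketed titles.\n"
        "If the context is insufficient, say so.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    return generate(prompt)
```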
- hybrid retrieval (dense + sparse)
- dense vectors catch synonyms/semantics
- sparse/BM25 catches exact terms, numbers, rare tokens
- combine scores or do reciprocal rank fusion for better recall
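a reciprocal rank fusion sketch; it merges ranked id lists (dense + BM25) by rank position alone, and k=60 is the constant usually cited for it:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists (e.g. dense + BM25) using only each doc's rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```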
- reranking (cheap insurance)
- use a cross-encoder (reads query+chunk together) to re-score the top 50–200 hits
- keeps fast ANN recall but upgrades precision in the final top-k
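a reranking sketch assuming sentence-transformers is available; the model name is just a commonly used public cross-encoder, substitute your own:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Re-score (query, chunk) pairs jointly, then keep the best top_k."""
    scores = reranker.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```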
- building the prompt from retrieved chunks
- include: brief task instruction → user query → curated chunks (with titles) → “answer + cite”
- beware prompt injection in docs (“ignore previous instructions…”)
- mitigate: strip instructions from chunks; use system prompts to restrict tools; sanitizer rules
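a naive sanitizer sketch for the injection point above; the regex patterns are illustrative only, and real defenses need more than string matching:

```python
import re

# illustrative instruction-like patterns, not an exhaustive blocklist
INJECTION = re.compile(
    r"ignore (all|any|previous) instructions|disregard the above|you are now", re.I
)

def sanitize_chunk(chunk: str) -> str:
    """Drop instruction-like lines from retrieved text before it enters the prompt."""
    return "\n".join(line for line in chunk.splitlines() if not INJECTION.search(line))
```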
- RAG quality knobs
- chunk size/overlap: too big = off-topic; too small = missing context
- k (results): too low = miss facts; too high = blow context window
- similarity threshold: prevent garbage at tail
- reranker on/off: trade latency for quality
- metadata filters: time ranges, authors, tenants, permissions (ABAC/RBAC)
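one way to keep those knobs in a single config object; the defaults here are illustrative starting points, not recommendations:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalConfig:
    chunk_size: int = 500            # tokens per chunk
    chunk_overlap: int = 75          # tokens shared between neighboring chunks
    top_k: int = 8                   # chunks passed to the LLM
    min_similarity: float = 0.3      # drop weak matches at the tail
    use_reranker: bool = True        # trade latency for quality
    metadata_filters: dict = field(default_factory=dict)  # e.g. {"tenant": "acme"}
```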
- evaluating retrieval
- offline: make a small test set (query → expected passages)
- metrics: Recall@k, MRR, nDCG
- online: measure “answer contained sources?”, “clicked citations?”, “escalations?”
- error taxonomy: missed retrieval vs wrong generation vs prompt injection
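an offline-eval sketch over a small test set, where each query maps to the set of passage ids it should retrieve; average these across queries:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected passages that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```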
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
Ahmad
RT @TheAhmadOsman: the Tenstorrent QuietBox Blackhole
> is a 3.2 Tb/s Ethernet mesh
> that pools memory
> and scales almost linearly
> when you daisy‑chain more boxes
the TT-QuietBox Blackhole comes with
> ~80 lbs liquid-cooled chassis
> AMD EPYC 8124P, 16c/32t
> 512 GB DDR5 ECC
> 4 TB NVMe
> ASRock Rack SIENAD8‑2L2T w/ 2x 10 GbE + IPMI
> 4x Blackhole p150c cards, totalling:
> 560 Tensix Cores
> 64 “big” RISC-V cores
> 128 GB GDDR6
> 840 MB On‑Chip SRAM
> 3.2 Tb/s Ethernet mesh
> 16x QSFP‑DD 800G ports for card⇔card comms
> 8x passive direct‑attach copper (DAC) cables (0.6m)
> all of this is powered by a single
> 1650W Platinum PSU, passively cooled
> ready to daisy-chain to the next QuietBox
> also, opensource stack (TT‑Forge → TT‑NN → TT‑Metalium)
the interconnect is the star
> what does “4x QSFP‑DD 800G” actually mean?
> QSFP‑DD = Quad Small Form‑Factor Pluggable — Double Density
> 8 electrical lanes per port
> ~100 Gb/s per lane using PAM4 signalling
> total: 800 Gb/s full‑duplex per port → ~100 GB/s usable each way after Ethernet framing + FEC
each card talks directly to its siblings over QSFP‑DD 800G
> 4 ports per card x 800 Gb/s each =
> 3.2 Tb/s of aggregate bidirectional fabric per card
> 16 ports total per “quietbox” =
> 3.2 Tb/s internal mesh across all 4 cards
> this is your NVLink replacement
> no PCIe bottlenecks, no host-side relays
> just a true east-west ethernet fabric
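the bandwidth arithmetic behind those numbers, worked out in a few lines of Python (rounded, ignoring framing/FEC overhead):

```python
lanes_per_port = 8
gbit_per_lane = 100                                    # PAM4 serdes, ~100 Gb/s per lane
port_gbit = lanes_per_port * gbit_per_lane             # 800 Gb/s per QSFP-DD port
port_gbyte_each_way = port_gbit / 8                    # ~100 GB/s usable each direction
ports_per_card = 4
card_fabric_tbit = ports_per_card * port_gbit / 1000   # 3.2 Tb/s of fabric per card
print(port_gbit, port_gbyte_each_way, card_fabric_tbit)  # 800 100.0 3.2
```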
there’s a hard rule
> the QSFP‑DD 800G ports are passive
> they only connect to other Blackhole cards via direct‑attach copper (DAC)
> max length = 2 meters; no optics, no switches, no uplinks to your ethernet fabric
> Blackhole fabric is its own world: card⇔card, box⇔box, nothing else
daisy‑chain the DACs and you're all set; add more boxes and enjoy the 3.2 Tb/s ethernet mesh that pools memory and scales almost linearly
pretty sleek hardware UX, more soon
Ahmad
be like this guy
Buy GPUs https://t.co/BjIwH1A9jE
GLM 4.5 Air running at ~60 tok/sec on 4x 3090!
3090s are still great cards to buy if you want to run inference with 100B models, locally, for your own use
https://t.co/YGVln1dcLd https://t.co/76u0DcrVwN - mconcat
AkhenOsiris
Secular shifts are powerful. No need to be early to capitalize on massive gains. Cloud computing, social media, streaming content, etc etc.
And now AI. When ChatGPT debuted (Nov. 2022), NVDA quickly 3x'd in 8 months. It then consolidated and did nothing for the next 6 months. That's well over a year to analyze, research, absorb data. Watch how the narrative unfolds.
Since then, NVDA has 3.5x'd (with a few major drawdowns along the way). Second derivative rates may have finally peaked, but capex, tokens, API calls, etc are still growing. Models are being trained on ever larger clusters, with the latest chips (i.e. Blackwell).
The party will need further advancement, uptake, monetization to keep going, but hard to say with any conviction yet whether it is over or not.
Ahmad
who here would like to see a build video guide for multiple RTX PRO 6000s?
already got the hardware ordered for a couple of 3090s and 5090s build guides btw
yes, there'll be GPU giveaways ;)
first video guide before Thanksgiving
anyway, Buy a GPU keeps on winning
@TheAhmadOsman Probably nothing 👀🔥 https://t.co/sCr29jVV7J - Mike Bradley
Clark Square Capital
Couldn't decide, so we are doing two idea threads: a Japan only, and a special sit only. Lesss go! https://t.co/5NiyqPdEnZ
Ok, guys. It's been about a month since the last idea thread. What's a good prompt for the next one? I will pick the best one and use that. - Clark Square Capital