- norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence, so "cat" and "catastrophe" aren't confused by position
- by stacking these layers…
- few-shot for niche tasks
- common pitfalls
- OOM? out of memory: model or context too big; quantize or shrink the context (sizing sketch after this list)
- gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p
- slow? accidental CPU offload, wrong drivers, or missing FlashAttention; check CUDA/ROCm/Metal and memory fit
- unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source
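To put numbers on the OOM bullet, here's a rough sizing sketch. The 7B parameter count, 32 layers, 32 KV heads, and 128 head dim are illustrative assumptions (roughly a Llama-2-7B shape); real runtimes add overhead for activations, the CUDA context, and fragmentation, so treat the total as a floor, not a guarantee.

```python
# Rough VRAM estimate: weights + KV cache. All shapes below are illustrative assumptions.

def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float, batch: int = 1) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes * batch."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch / 1e9

# Example: assumed 7B model, ~4-bit weights, FP16 KV cache, 4096-token context.
w = weights_gb(7, 4.5)   # 4-bit quants usually land a bit above 4 bits/weight with overhead
kv = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                 context_len=4096, bytes_per_elem=2)
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```

If the total lands above your VRAM, the fixes in the list above apply in order: quantize further, shrink the context, or offload layers.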
- why run locally?
- control: all the knobs are yours to tweak:
- sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing, just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming
- challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)
- running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use correct chat template (apply_chat_template; sketch after this checklist)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as local API (or go wild and fine-tune it)
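A minimal Transformers sketch of the template + decoding steps above; the model id is a placeholder assumption, so swap in whichever chat-tuned model you actually picked.

```python
# Minimal sketch: load a chat-tuned model, apply its chat template, and sample.
# The model id is a placeholder assumption; use whatever you actually picked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"   # placeholder chat-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain the KV cache in two sentences."},
]
# apply_chat_template emits the model's own system/user/assistant markers;
# wrong markup is the usual cause of gibberish output.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```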
- glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking the next token (sampling sketch after this glossary)
- RAG: retrieval-augmented generation, ground answers in retrieved documents
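Since decoding keeps coming up, here is a toy nucleus-sampling sketch over made-up logits; real runtimes implement this for you, so this is purely to show what the temperature and top_p knobs do.

```python
# Tiny illustration of temperature + top-p (nucleus) decoding over dummy logits.
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Pick the next token id: scale by temperature, keep the smallest set of
    tokens whose cumulative probability reaches top_p, then sample from that set."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))

logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])         # made-up scores for 5 fake tokens
print(sample_top_p(logits))
```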
- misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)
- chat/instruct models usually need a special prompt template to work well
- chat template: system/user/assistant markup is required; wrong template = junk output
- base models can do few-shot chat prompting, but not as well as chat-tuned ones
- quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
- quantization is a tradeoff: memory/speed vs quality
- 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs: huge memory win, minor quality drop for most tasks (loading sketch after these bullets)
- precision-sensitive tasks degrade first: math, logic, code
- quantization levels: FP16 (unquantized baseline), INT8 (quantized), INT4/NF4 (more aggressive), etc.
- some runtimes support quantized KV cache (8/4-bit), big savings for long contexts
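A minimal sketch of the 4-bit NF4 path via Transformers' bitsandbytes integration (CUDA GPUs); the model id is a placeholder assumption, and GPTQ/AWQ checkpoints use a different loading path.

```python
# Sketch: load a placeholder model in 4-bit NF4 via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: the 4-bit sweet spot mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",             # placeholder; use the model you picked
    quantization_config=bnb_config,
    device_map="auto",
)
print(model.get_memory_footprint() / 1e9, "GB")  # rough check that it fits
```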
- formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware
- avoid legacy .bin (pickle risk), use safetensors for safety
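Why safetensors is safer: it is a flat tensor container with a plain metadata header and no executable code, so loading it cannot run an arbitrary pickle payload the way a legacy .bin can. A minimal sketch, assuming the safetensors package and a placeholder model.safetensors path:

```python
# Sketch: inspect and load tensors from a safetensors file without any pickle risk.
# "model.safetensors" is a placeholder path for whatever checkpoint you downloaded.
from safetensors import safe_open
from safetensors.torch import load_file

with safe_open("model.safetensors", framework="pt") as f:
    names = list(f.keys())                   # header is plain metadata, safe to read
print(f"{len(names)} tensors, e.g. {names[:3]}")

state_dict = load_file("model.safetensors")  # dict of name -> torch.Tensor, no code execution
```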
- everything is a tradeoff:
- smaller = fits anywhere, less power
- more context = more latency + VRAM burn
- quantization = faster/leaner, maybe less accurate
- local = full control/knobs, but more work
- final words:
- local LLMs = memory math + correct formatting
- fit weights and KV cache in memory
- use the right chat template and decoding strategy
- know your knobs: quantization, context, decoding, batch, hardware
- master these, and you can run (and reason about) almost any modern model locally
Ahmad
RT @TheAhmadOsman: the Tenstorrent QuietBox Blackhole
> is a 3.2 Tb/s Ethernet mesh
> that pools memory
> and scales almost linearly
> when you daisy‑chain more boxes
the TT-QuietBox Blackhole comes with
> ~80 lbs liquid-cooled chassis
> AMD EPYC 8124P, 16c/32t
> 512 GB DDR5 ECC
> 4 TB NVMe
> ASRock Rack SIENAD8‑2L2T w/ 2x 10 GbE + IPMI
> 4x Blackhole p150c cards, totalling:
> 560 Tensix Cores
> 64 “big” RISC-V cores
> 128 GB GDDR6
> 840 MB On‑Chip SRAM
> 3.2 Tb/s Ethernet mesh
> 16x QSFP‑DD 800G ports for card⇔card comms
> 8x passive direct‑attach copper (DAC) cables (0.6m)
> all of this is powered by a single
> 1650W Platinum PSU, passively cooled
> ready to daisy-chain to the next QuietBox
> also, opensource stack (TT‑Forge → TT‑NN → TT‑Metalium)
the interconnect is the star
> what does “4x QSFP‑DD 800G” actually mean?
> QSFP‑DD = Quad Small Form‑Factor Pluggable — Double Density
> 8 electrical lanes per port
> ~100 Gb/s per lane using PAM4 signalling
> total: 800 Gb/s full‑duplex per port → ~100 GB/s usable each way after Ethernet framing + FEC
each card talks directly to its siblings over QSFP‑DD 800G
> 4 ports per card x 800 Gb/s each =
> 3.2 Tb/s of aggregate bidirectional fabric per card
> 16 ports total per “quietbox” =
> 3.2 Tb/s internal mesh across all 4 cards
> this is your NVLink replacement
> no PCIe bottlenecks, no host-side relays
> just a true east-west ethernet fabric
there’s a hard rule
> the QSFP‑DD 800G ports are passive
> they only connect to other Blackhole cards via direct‑attach copper (DAC)
> max length = 2 meters; no optics, no switches, no uplinks to your regular Ethernet fabric
> Blackhole fabric is its own world: card⇔card, box⇔box, nothing else
daisy‑chain the DACs and you’re all set, add more boxes and enjoy the 3.2 Tb/s ethernet mesh that pools memory and scales almost linearly
pretty sleek hardware UX, more soon
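The lane math from the thread, spelled out; the 0.97 efficiency factor for framing + FEC overhead is a rough assumption, not a measured number.

```python
# Back-of-the-envelope check of the QSFP-DD 800G numbers in the thread.
lanes_per_port = 8
gbps_per_lane = 100                                    # PAM4 electrical lanes
port_gbps = lanes_per_port * gbps_per_lane             # 800 Gb/s per port
usable_gbs = port_gbps / 8 * 0.97                      # bits -> bytes, minus assumed
                                                       # framing + FEC overhead
ports_per_card = 4
card_fabric_tbps = ports_per_card * port_gbps / 1000   # 3.2 Tb/s of fabric per card

print(f"{port_gbps} Gb/s per port ~= {usable_gbs:.0f} GB/s usable each way")
print(f"{card_fabric_tbps:.1f} Tb/s aggregate per Blackhole card")
```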
Ahmad
RT @TheAhmadOsman: the Buy a GPU website & guide is launching this week
so, what should you expect? https://t.co/e36YLjAdoo
Ahmad
RT @TheAhmadOsman: i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu
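Not the actual proxy from the demo, just a minimal sketch of the idea under stated assumptions: Claude Code speaks the Anthropic Messages API, vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint, and this toy handles only simple non-streaming text turns (the real thing also needs streaming and tool-call translation). The port, upstream URL, and model name are placeholders.

```python
# Toy Anthropic-Messages -> OpenAI-chat-completions proxy (non-streaming text only).
# Assumes an OpenAI-compatible server (e.g. vLLM) at UPSTREAM; names are placeholders.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8000/v1/chat/completions"  # placeholder local endpoint
MODEL = "local-model"                                    # placeholder served model name

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers.get("Content-Length", 0))))
        # Anthropic request -> OpenAI request: fold the system prompt into messages.
        messages = []
        sys_prompt = body.get("system")
        if isinstance(sys_prompt, list):      # Anthropic allows a list of text blocks here
            sys_prompt = "".join(b.get("text", "") for b in sys_prompt)
        if sys_prompt:
            messages.append({"role": "system", "content": sys_prompt})
        for m in body.get("messages", []):
            content = m["content"]
            if isinstance(content, list):     # content blocks -> plain text (tools ignored)
                content = "".join(b.get("text", "") for b in content)
            messages.append({"role": m["role"], "content": content})
        upstream_req = urllib.request.Request(
            UPSTREAM,
            data=json.dumps({"model": MODEL, "messages": messages,
                             "max_tokens": body.get("max_tokens", 1024)}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(upstream_req) as resp:
            completion = json.loads(resp.read())
        text = completion["choices"][0]["message"]["content"]
        # OpenAI response -> Anthropic-shaped response.
        reply = {"type": "message", "role": "assistant",
                 "content": [{"type": "text", "text": text}],
                 "stop_reason": "end_turn"}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(reply).encode())

if __name__ == "__main__":
    # Point your Anthropic-API client at this address instead of api.anthropic.com.
    HTTPServer(("127.0.0.1", 8080), Proxy).serve_forever()
```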
Ahmad
RT @TheAhmadOsman: who here would like to see a build video guide for multiple RTX PRO 6000s?
already got the hardware ordered for a couple of 3090s and 5090s build guides btw
yes, there'll be GPU giveaways ;)
first video guide before Thanksgiving
anyway, Buy a GPU keeps on winning
@TheAhmadOsman Probably nothing 👀🔥 https://t.co/sCr29jVV7J - Mike Bradley
Ahmad
top 5 local LLMs to run at home in October 2025
> GLM-4.5-Air at #1
> daily driver & all-around MVP
> budget legend, agentic and coding, fits on 4x 3090s
> nearly as good as its big brother, but not as power-hungry
> GPT-OSS-120B at #2
> feels like a GPT-5 at home
> big brain, smart and consistent
> agentic, coding, but feels dry in writing
> GPT-OSS-20B at #3
> absolute speed monster
> holds up well at tool calling (short context preferred)
> pretty good at instruction following, quick bug fixes, and general use
> Qwen3-30B-A3B at #4
> best for trivia and knowledge dumps
> not as fast as GPT-OSS-20B, but way more book smart
> Qwen3-Coder-30B at #5
> very capable at coding
> will run on 16GB VRAM quantized
> GLM-4.6 honorable mention
> the local king if you’ve got 4x Pro 6000 Max-Qs
> “beast mode” for tool use, workflows, and bug fixing
App Economy Insights
$PYPL PayPal Q3 FY25:
• Deal to embed digital wallet in ChatGPT.
• TPV +8% Y/Y to $458B.
• Active accounts +1% Y/Y to 438M.
• Transactions per active -6% Y/Y to 58.
• Revenue +7% Y/Y to $8.4B ($170M beat).
• Non-GAAP EPS $1.34 ($0.14 beat).
• FY25 EPS $5.37 ($0.10 raise). https://t.co/1YbJCIXEqj
Investing visuals
Both $GOOGL and $MSFT report earnings tomorrow!
Here's what their fundamentals look like heading into earnings 👇 https://t.co/MzJMXqGC7L
App Economy Insights
$SOFI SoFi Q3 FY25:
• Members +35% to 12.6M.
• Revenue +38% Y/Y to $962M.
• Adj. revenue $950M ($55M beat).
• Adj. EBITDA +49% Y/Y to $277M.
• Non-GAAP EPS $0.11 ($0.03 beat).
• FY25 revenue growth 36% ($165M raise). https://t.co/Yk6AEgWBpi