Clark Square Capital
RT @ClarkSquareCap: Idea thread #1!
What's your favorite Japanese stock? (Any market cap/style).
Add a sentence explaining why you like it + valuation.
As usual, I will compile the responses and share them with everyone.
Please retweet for visibility. Thx in advance! 🙏
Ahmad
RT @TheAhmadOsman: My house has 33 GPUs.
> 21x RTX 3090s
> 4x RTX 4090s
> 4x RTX 5090s
> 4x Tenstorrent Blackhole p150a
Before AGI arrives:
Acquire GPUs.
Go into debt if you must.
But whatever you do, secure the GPUs. https://t.co/8U89OStknt
Ahmad
RT @TheAhmadOsman: > today this guy axes FAIR at Meta
> so this is a quick recap of his origin story
> and why he should not be the one
> making that decision
> Alexandr Wang, born January 1997
> age 19, drop out of MIT
> co-found Scale AI
> "what if we label data, but mid?"
> convince every LLM company that this is fine
> 2016–2023
> flood the market with barely-labeled goat photos and out-of-context Reddit takes
> call it “foundational data”
> raise billions
> valuation hits $7.3B
> everyone claps
> 2025
> sell Scale AI to Meta for $14B
> not a typo.
> fourteen. billion. dollars.
> join Meta as Chief AI Officer
> rename division to Meta Superintelligence Labs
> start saying things like “AGI by 2027” in interviews
> meanwhile, researchers:
> "the data from Scale is trash"
> models hallucinate goat facts and mislabel wheelchairs as motorcycles
> AI alignment folks are malding
> i am Alexandr. unbothered. moisturized. thriving.
> ranked #1 in Times Top Grifters of All Time
> beat out SBF, Elizabeth Holmes, and your favorite VC
> literally built an empire out of copy-pasted Amazon Mechanical Turk tasks
> mfw I labeled 4chan posts for pennies and turned it into a 14B exit
> mfw I am now leading Meta's quest for godlike AI
> mfw data quality was never part of the business model
> never bet against the grind
Ahmad
RT @TheAhmadOsman: pro tip:
tell codex-cli or claude code to
generate relevant pre-commit hooks for your project
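For reference, here is a minimal sketch of the kind of hook such a prompt might generate: a plain Python script saved as .git/hooks/pre-commit (and marked executable with chmod +x) that runs lint and a fast test pass before every commit. The specific tools (ruff, pytest) are illustrative assumptions, not part of the tip; a generated hook would be tailored to the repo.

```python
#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit: run lint, a format check, and fast tests
# before each commit. Swap the commands for whatever your project actually uses.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],               # lint
    ["ruff", "format", "--check", "."],   # formatting check, no rewrite
    ["pytest", "-q", "-x"],               # fast test pass, stop on first failure
]

def main() -> int:
    for cmd in CHECKS:
        print(f"pre-commit: running {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-commit: {cmd[0]} failed, aborting commit", file=sys.stderr)
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```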
Ahmad
RT @TheAhmadOsman: > be you
> want to actually learn how LLMs work
> sick of “just start with linear algebra and come back in 5 years”
> decide to build my own roadmap
> no fluff. no detours. no 200-hour generic ML playlists
> just the stuff that actually gets you from “what’s a token?” to “I trained a mini-GPT with LoRA adapters and FlashAttention”
> goal: build, fine-tune, and ship LLMs
> not vibe with them. not "learn the theory" forever
> build them
> you will:
> > build an autograd engine from scratch
> > write a mini-GPT from scratch
> > implement LoRA and fine-tune a model on real data
> > hate CUDA at least once
> > cry
> > keep going
> 5 phases
> if you already know something? skip
> if you're lost? rewatch
> if you’re stuck? use DeepResearch
> this is a roadmap, not a leash
> by the end: you either built the thing or you didn’t
> phase 0: foundations
> > if matrix multiplication is scary, you’re not ready yet
> > watch 3Blue1Brown’s linear algebra series
> > MIT 18.06 with Strang, yes, he’s still the GOAT
> > code Micrograd from scratch (Karpathy); a tiny autograd sketch follows after this thread
> > train a mini-MLP on MNIST
> > no frameworks, no shortcuts, no mercy
> phase 1: transformers
> > the name is scary
> > it’s just stacked matrix multiplies and attention blocks
> > Jay Alammar + 3Blue1Brown for the “aha”
> > Stanford CS224N for the theory
> > read "Attention Is All You Need" only AFTER building mental models
> > Karpathy's "Let's Build GPT" will break your brain in a good way
> > project: build a decoder-only GPT from scratch
> > bonus: swap tokenizers, try BPE/SentencePiece
> phase 2: scaling
> > LLMs got good by scaling, not magic
> > Kaplan paper -> Chinchilla paper
> > learn Data, Tensor, Pipeline parallelism
> > spin up multi-GPU jobs using HuggingFace Accelerate
> > run into VRAM issues
> > fix them
> > welcome to real training hell
> phase 3: alignment & fine-tuning
> > RLHF: OpenAI blog -> Ouyang paper
> > SFT -> reward model -> PPO (don’t get lost here)
> > Anthropic's Constitutional AI = smart constraints
> > LoRA/QLoRA: read, implement, inject into HuggingFace models (a bare-bones LoRA sketch follows after this thread)
> > fine-tune on real data
> > project: fine-tune gpt2 or distilbert with your own adapters
> > not toy examples. real use cases or bust
> phase 4: production
> > this is the part people skip to, but you earned it
> > inference optimization: FlashAttention, quantization, sub-second latency
> > read the paper, test with quantized models
> resources:
> math/coding:
> > 3Blue1Brown, MIT 18.06, Goodfellow’s book
> PyTorch:
> > Karpathy, Zero to Mastery
> transformers:
> > Alammar, Karpathy, CS224N, Vaswani et al
> scaling:
> > Kaplan, Chinchilla, HuggingFace Accelerate
> alignment:
> > OpenAI, Anthropic, LoRA, QLoRA
> inference:
> > FlashAttention
> the endgame:
> > understand how these models actually work
> > see through hype
> > ignore LinkedIn noise
> > build tooling
> > train real stuff
> > ship your own stack
> > look at a paper and think “yeah I get it”
> > build your own AI assistant, infra, whatever
> make it all the way through?
> ship something real?
> DM me.
> I wanna see what you built.
> happy hacking.
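To make the phase 0 Micrograd item concrete (see the forward reference above), here is a minimal autograd sketch in the same spirit: scalar Values that remember their parents, and a backward pass that applies the chain rule in reverse topological order. It is not Karpathy's code, just the core pattern, trimmed to add, multiply, and tanh.

```python
# Scalar autograd in the spirit of Micrograd: each Value remembers its parents
# and a closure that pushes gradients back through the op that produced it.
import math

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad                     # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad        # product rule
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad       # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then run the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# a two-input neuron, forward then backward
x1, w1, x2, w2, b = Value(2.0), Value(-3.0), Value(0.0), Value(1.0), Value(6.8)
y = (x1 * w1 + x2 * w2 + b).tanh()
y.backward()
print(round(y.data, 4), round(x1.grad, 4), round(w1.grad, 4))
```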
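And for the phase 3 LoRA item, a bare-bones sketch of the idea under illustrative hyperparameters: freeze a pretrained nn.Linear and learn a low-rank update (B @ A) beside it, scaled by alpha / r. This is not the peft library's API, just the mechanism.

```python
# Freeze a pretrained nn.Linear, learn a low-rank update (B @ A) beside it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# wrap one projection of a GPT-2-sized layer (names and sizes are illustrative)
lora = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
x = torch.randn(2, 10, 768)
print(lora(x).shape)                                     # torch.Size([2, 10, 768])
print(sum(p.numel() for p in lora.parameters() if p.requires_grad), "trainable params")
```

Because B starts at zero, the wrapped layer behaves exactly like the frozen original until training moves the adapter, which is why LoRA can be bolted onto an existing checkpoint safely.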
Ahmad
RT @TheAhmadOsman: can’t write code because Cursor and Codex are both down thanks to the aws-us-east-1 outage?
tired of Anthropic’s weekly limits and nerfed models?
with one command and a few GPUs,
you can route Claude Code to your own local LLM with ZERO downtime
Buy a GPU https://t.co/aj8r201V83
i built a simple tool that makes
Claude Code work with any local LLM
full demo:
> vLLM serving GLM-4.5 Air on 4x RTX 3090s
> Claude Code generating code + docs via my proxy
> 1 Python file + .env handles all requests
> nvtop showing live GPU load
> how it all works
Buy a GPU https://t.co/7nYsId4Uyu - Ahmad
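As a rough illustration of the setup described above (not the author's actual tool), a deliberately tiny, non-streaming proxy could accept Anthropic-style /v1/messages requests and forward them to a local OpenAI-compatible server such as vLLM, then wrap the reply back in Anthropic's response shape. Streaming, tool use, and images are ignored here; the port and model name are assumptions, and Claude Code would be pointed at the proxy via ANTHROPIC_BASE_URL.

```python
# A deliberately tiny, NON-streaming sketch of the proxy idea (not the author's tool):
# translate Anthropic-style /v1/messages requests into OpenAI-style chat completions
# against a local server (e.g. vLLM), then wrap the reply back in Anthropic's shape.
# Ignores streaming, tool use, and images. Run it, then point Claude Code at it with
# ANTHROPIC_BASE_URL=http://localhost:9000 (port and model name are assumptions).
import httpx
import uvicorn
from fastapi import FastAPI, Request

app = FastAPI()
LOCAL_OPENAI_URL = "http://localhost:8000/v1/chat/completions"   # vLLM's default port
LOCAL_MODEL = "zai-org/GLM-4.5-Air"                              # whatever vLLM is serving

def flatten(content):
    """Anthropic content may be a plain string or a list of blocks; keep text blocks."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    oai_messages = []
    if body.get("system"):
        oai_messages.append({"role": "system", "content": flatten(body["system"])})
    for m in body.get("messages", []):
        oai_messages.append({"role": m["role"], "content": flatten(m["content"])})
    async with httpx.AsyncClient(timeout=300) as client:
        r = await client.post(LOCAL_OPENAI_URL, json={
            "model": LOCAL_MODEL,
            "messages": oai_messages,
            "max_tokens": body.get("max_tokens", 1024),
        })
    text = r.json()["choices"][0]["message"]["content"]
    return {                       # minimal Anthropic-style message response
        "id": "msg_local",
        "type": "message",
        "role": "assistant",
        "model": body.get("model", LOCAL_MODEL),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 0, "output_tokens": 0},
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9000)
```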
Ahmad
RT @TheAhmadOsman: missed but not forgotten :(
i am sorry Google fumbled you, Gemini 2.5 Pro 03-25, you were the bestest of models https://t.co/tBwqOWJJrB
Ahmad
RT @TheAhmadOsman: - you are
- a random CS grad with 0 clue how LLMs work
- get tired of people gatekeeping with big words and tiny GPUs
- decide to go full monk mode
- 2 years later i can explain attention mechanisms at parties and ruin them
- here’s the forbidden knowledge map
- top to bottom, how LLMs *actually* work
- start at the beginning
- text → tokens
- tokens → embeddings
- you are now a floating point number in 4D space
- vibe accordingly
- positional embeddings:
- absolute: “i am position 5”
- rotary (RoPE): “i am a sine wave”
- alibi: “i scale attention by distance like a hater”
- attention is all you need
- self-attention: “who am i allowed to pay attention to?”
- multihead: “what if i do that 8 times in parallel?”
- QKV: query, key, value
- sounds like a crypto scam
- actually the core of intelligence
- transformers:
- take your inputs
- smash them through attention layers
- normalize, activate, repeat
- dump the logits
- congratulations, you just inferred a token
- sampling tricks for the final output (sketched in code after this thread):
- temperature: how chaotic you want to be
- top-k: only sample from the top K options
- top-p: sample from the smallest group of tokens whose probabilities sum to p
- beam search? never ask about beam search
- kv cache = cheat code (QKV attention plus the cache are sketched in code after this thread)
- saves past keys & values
- lets you skip reprocessing old tokens
- turns a 90B model from “help me I’m melting” to “real-time genius”
- long context hacks:
- sliding window: move the attention like a scanner
- infini attention: attend sparsely, like a laser sniper
- memory layers: store thoughts like a diary with read access
- mixture of experts (MoE):
- not all weights matter
- route tokens to different sub-networks
- only activate ~3B params out of 80B
- “only the experts reply” energy
- grouped query attention (GQA):
- fewer keys/values than queries
- improves inference speed
- “i want to be fast without being dumb”
- normalization & activations:
- layernorm, RMSnorm
- gelu, silu, relu
- they all sound like failed Pokémon
- but they make the network stable and smooth
- training goals:
- causal LM: guess the next word
- masked LM: guess the missing word
- span prediction, fill-in-the-middle, etc
- LLMs trained on the art of guessing and got good at it
- tuning flavors:
- finetuning: new weights
- instruction tuning: “please act helpful”
- rlhf: reinforcement from vibes and clickbait prompts
- dpo: direct preference optimization — basically “do what humans upvote”
- scaling laws:
- more data, more parameters, more compute
- loss goes down predictably
- intelligence is now a budget line item
- bonus round:
- quantization (a one-tensor int8 example follows after this thread):
- post-training quantization (PTQ)
- quant-aware training (QAT)
- models shrink, inference gets cheaper
- gguf, awq, gptq — all just zip files with extra spice
- training vs inference stacks:
- deepspeed, megatron, fschat — for pain
- vllm, tgi, tensorRT-LLM — for speed
- everyone has a repo
- nobody reads the docs
- synthetic data:
- generate your own training set
- model teaches itself
- feedback loop of knowledge and hallucination
- welcome to the ouroboros era
- final boss secret:
- you can learn *all of this* in ~2 years
- no PhD
- no 10x compute
- just relentless curiosity, good bookmarks, and late nights
- the elite don’t want you to know this
- but now that you do
- choose to act
- start now
- build the models
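To ground the QKV and kv-cache bullets above, here is a single attention head in plain PyTorch with an optional cache, so decoding a new token reuses the keys and values of old tokens instead of recomputing them. Single head, no RoPE, no fused kernels: the dimensions and structure are simplified for illustration.

```python
# One attention head with an optional KV cache, simplified for illustration.
# Real models use many heads, RoPE/ALiBi, and fused attention kernels.
import math
import torch
import torch.nn as nn

class TinySelfAttention(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)    # project into query, key, value
        self.out = nn.Linear(d_model, d_model)
        self.d = d_model

    def forward(self, x, cache=None):
        # x: (batch, new_tokens, d_model); cache: (past_keys, past_values) or None
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if cache is not None:                         # kv cache: reuse old keys/values
            past_k, past_v = cache
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)
        scores = q @ k.transpose(1, 2) / math.sqrt(self.d)    # (batch, new, total)
        # causal mask: a new token may only attend to positions up to its own index
        new, total = q.size(1), k.size(1)
        mask = torch.ones(new, total).tril(diagonal=total - new).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
        y = scores.softmax(dim=-1) @ v
        return self.out(y), (k, v)                    # hand the updated cache back

# prefill a 5-token prompt, then decode one token without reprocessing the prompt
attn = TinySelfAttention()
y, cache = attn(torch.randn(1, 5, 64))
y_next, cache = attn(torch.randn(1, 1, 64), cache)
print(y.shape, y_next.shape)   # (1, 5, 64) and (1, 1, 64)
```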
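The sampling bullets (temperature, top-k, top-p) applied to one decoding step's logits; the default values are illustrative, not recommendations.

```python
# temperature, top-k, and top-p (nucleus) sampling for one step's logits.
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-6)            # <1 sharper, >1 more chaotic
    if top_k is not None:                               # keep only the K best logits
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = logits.softmax(dim=-1)
    if top_p is not None:                               # smallest set of tokens summing to >= p
        sorted_p, idx = probs.sort(dim=-1, descending=True)
        keep = (sorted_p.cumsum(dim=-1) - sorted_p) < top_p   # always keeps the top token
        probs = torch.zeros_like(probs).scatter(-1, idx, sorted_p * keep)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)

print(sample_next(torch.randn(1, 1000)).item())         # one sampled token id from a fake vocab
```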
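And the quantization bullet in its simplest possible form: a symmetric, per-tensor int8 round-trip of a single weight matrix. Real PTQ/QAT schemes (AWQ, GPTQ, etc.) add calibration data and per-channel scales; this only shows why the memory drops 4x while the error stays small.

```python
# Symmetric per-tensor int8 round-trip of one weight matrix (toy PTQ).
import torch

w = torch.randn(4096, 4096)                    # pretend this is a pretrained weight
scale = w.abs().max() / 127.0                  # one scale for the whole tensor
w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)
w_deq = w_int8.float() * scale                 # what inference actually multiplies with
print("max abs error:", (w - w_deq).abs().max().item())
print("bytes: fp32", w.numel() * 4, "-> int8", w.numel())
```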
Ahmad
RT @TheAhmadOsman: working on getting nanochat training running with TT‑NN
the more i push my single Tenstorrent QuietBox Blackhole,
the more i see just how much headroom this thing has
counting down until my 4x TT‑QuietBox Blackhole cluster arrives
this cluster's going to be an absolute beast https://t.co/lN9VsITgDs
Ahmad
RT @TheAhmadOsman: https://t.co/ealbNXzGbX
𝕏 premium should add a “see who viewed your profile” feature - Nate
Ahmad
RT @TheAhmadOsman: GLM 4.5 > KIMI K2 > QWEN 3 235B NON-THINKING > Qwen 3 CODER 480B
For Agentic coding tools
GLM 4.5 with Claude Code is the closest thing to Opus 4 imo