God of Prompt
RT @godofprompt: 🚨 Holy shit… Stanford just published the most uncomfortable paper on LLM reasoning I've read in a long time.
This isn't a flashy new model or a leaderboard win. It's a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they're doing great.
The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied.
Non-embodied reasoning is what most benchmarks test, and it's further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation).
Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints.
Across all three categories (informal, formal, and embodied), the same failure patterns keep showing up.
> First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process.
> Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply.
> Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn't stable to begin with; it just happened to work for that phrasing. (A minimal probe for this is sketched below.)
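To make "robustness failure" concrete, here's a minimal sketch of the kind of perturbation probe such evals use. Everything here is hypothetical: `ask` stands in for whatever LLM call you actually make.

```python
# Minimal phrasing-robustness probe (hypothetical names throughout).
from collections import Counter
from typing import Callable

def robustness_probe(ask: Callable[[str], str], paraphrases: list[str]) -> dict:
    """Ask one question in several phrasings and measure agreement.

    Stable reasoning should give one answer regardless of wording;
    answer flips across paraphrases are exactly the failures above.
    """
    answers = [ask(p) for p in paraphrases]
    majority, majority_n = Counter(answers).most_common(1)[0]
    return {
        "answers": answers,
        "majority_answer": majority,
        "agreement_rate": majority_n / len(answers),  # 1.0 = fully stable
    }

# Toy usage with a deliberately brittle stand-in "model":
if __name__ == "__main__":
    brittle = lambda p: "$0.05" if "pricier" not in p else "$0.10"
    paraphrases = [
        "A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much is the ball?",
        "Together a ball and a bat cost $1.10, and the bat is $1.00 pricier. How much is the ball?",
    ]
    print(robustness_probe(brittle, paraphrases))  # agreement_rate: 0.5
```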
One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated.
This is worse than being wrong, because it trains users to trust explanations that don't correspond to the actual decision process.
Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience.
Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable.
The authors don't just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance.
But they're very clear that none of these are silver bullets yet.
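As one concrete reading of "external verification" (a sketch under assumptions, not the paper's actual method): don't trust the model's explanation; accept an answer only after an independent check passes. `generate` and `verify` below are placeholders, not any vendor's API.

```python
# External-verification loop: keep sampling until an independent
# checker accepts the answer, or give up and escalate.
from typing import Callable, Optional

def verified_answer(generate: Callable[[str], str],
                    verify: Callable[[str], bool],
                    prompt: str,
                    max_tries: int = 3) -> Optional[str]:
    for _ in range(max_tries):
        candidate = generate(prompt)
        if verify(candidate):   # e.g., unit tests, a solver, a checker tool
            return candidate
    return None                 # fall back to a human or a safer path

# Toy usage: check a claimed product with plain arithmetic instead of
# trusting the model's chain of thought.
if __name__ == "__main__":
    fake_generate = lambda p: "408"                  # stand-in model output
    check = lambda ans: ans.strip() == str(17 * 24)  # independent verifier
    print(verified_answer(fake_generate, check, "What is 17 * 24?"))  # 408
```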
The takeaway isn't that LLMs can't reason.
It's more uncomfortable than that.
LLMs reason just enough to sound convincing, but not enough to be reliable.
And unless we start measuring how models fail, not just how often they succeed, we'll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.
That's the real warning shot in this paper.
Paper: Large Language Model Reasoning Failures
The Transcript
LYFT CFO: "We delivered record financial performance in 2025 across all metrics, including all-time-high cash flow generation exceeding $1.1 billion...we remain right on track to hit our long-term targets."
$LYFT: -15% AH https://t.co/JNjkPeJDK9
God of Prompt
RT @alex_prompter: 🚨 The guy who built Anthropic's defenses against AI bioterrorism just quit.
Mrinank Sharma led Anthropic's Safeguards Research Team. His job was literally making sure Claude doesn't help bad actors do bad things.
His resignation letter: "The world is in peril. And not just from AI, or bioweapons, but from a whole series of interconnected crises."
He also said he has "repeatedly seen how hard it is to truly let our values govern our actions" inside the organization.
This is the company that positioned itself as the "safe" AI lab. The one founded specifically because OpenAI wasn't careful enough.
Now their safety lead is walking away, saying the pressure to "set aside what matters most" is real.
He's leaving to study poetry. Not joining a competitor. Not starting a startup. Poetry.
When your AI safety researcher chooses poems over production, that tells you something about what's happening behind closed doors.
Today is my last day at Anthropic. I resigned.
Here is the letter I shared with my colleagues, explaining my decision. https://t.co/Qe4QyAFmxL - mrinank
The Transcript
Ford CEO: "...a strong 2025 in a dynamic and often volatile environment."
$F: +1.5% AH https://t.co/DGOZHaa76H
Fiscal.ai
Cloudflare just added $353 million in new Remaining Performance Obligations.
That's their largest quarterly increase ever.
$NET https://t.co/ysJMNkpFgg
App Economy Insights
RT @EconomyApp: A quick look at the memory crunch:
📲 $QCOM: Memory Wall
🎮 $SONY: Hardware Retreat
☁️ $ARM: Cloud AI Engine Ignites
Full breakdown of the AI ripple effect.
https://t.co/XthMtTFjEX
God of Prompt
RT @rryssf_: 🦞 OpenClaw has 114,000+ GitHub stars and the whole tech world is losing its mind over it.
But here's what nobody's showing you: the setup process that made 90% of people quit before their first agent sent a single message.
Node.js configs, gateway daemons, Tailscale tunnels, security hardening...
There's now a way to skip all of it.
Here's what I found:
A plug-and-play approach that turns autonomous agents into something you can spin up in minutes and wire into any API you want.
https://t.co/FDCVmoBR3o
For, say, a daily AI newsletter:
Benjamin Hernandez
$ELAB touched $2.02 in late trading, +18% booked. The market rewarded patience today; early birds will catch the next move at the bell.
Live setups: https://t.co/71FIJIdBXe
Message me "HI" to get my trade plan for the open.
$SOFI $HOOD $PLTR $GME $SNDK
Deep Value Recovery: $JZXN
Recommendation: $JZXN near $2.18. Even after a 63% rally, $JZXN remains fundamentally undervalued relative to its $1B token acquisition plans.
One-line why: This is a technical "mean reversion" play to the 200-day EMA near $1.65. https://t.co/J3Mm5EADUe - Benjamin Hernandez
Dimitry Nakhla | Babylon Capital®
RT @DimitryNakhla: Chris Hohn on why Aerospace sits firmly in his investable universe:
"Aerospace is a sector we've come to understand where the barriers to entry are multiple… hard assets, contracts, network effects… intellectual property, contracts, installed base, regulatory switching costs."
___
The lesson:
The most durable businesses don't rely on one moat; they stack multiple barriers to entry. Each layer makes disruption harder; together, they create near-immunity.
___
Why multiple barriers matter:
Hard Assets → capital intensity discourages new entrants
Contracts → long-dated agreements with OEMs & airlines
Network Effects → scale advantages in service, parts, and support
Intellectual property → decades of engineering know-how that can't be replicated quickly
Installed Base → once equipment is flying, customers can't easily switch
Regulation & Certification → enormous time, cost, and risk to gain approval
Switching Costs → safety, reliability, and downtime risks deter change
Each layer makes disruption harder.
___
5 High-Quality Aerospace businesses worth adding to your watchlist:
1. $GE GE Aerospace
3-Year CAGR: +58%
2. $HWM Howmet Aerospace
3-Year CAGR: +76%
3. $TDG TransDigm Group
3-Year CAGR: +20%
4. $HEI Heico
3-Year CAGR: +23%
5. $RTX RTX Corporation
3-Year CAGR: +27%
When investors talk about "disruption risk," sectors with layered moats like aerospace are often underestimated. Patience, and respect for barriers, tends to be rewarded.
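(Digest note: "3-Year CAGR" in the list above is the annualized return, not the total. A quick sanity check, using GE's +58% figure and a made-up $100 starting price:)

```python
# CAGR = (end/start)**(1/years) - 1; a +58%/yr CAGR compounds to ~3.94x
# over three years. The $100 start price is a placeholder.
def cagr(start: float, end: float, years: float) -> float:
    return (end / start) ** (1 / years) - 1

start, years = 100.0, 3
end = start * 1.58 ** years                              # +58% compounded
print(f"total multiple: {end / start:.2f}x")             # -> 3.94x
print(f"recovered CAGR: {cagr(start, end, years):.0%}")  # -> 58%
```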
___
Video: Norges Bank Investment Management | Investment Conference 2025 (07/23/2025)
DAIR.AI
RT @omarsar0: This team has been publishing some really interesting work on diffusion LLMs.
LLaDA 2.1 is a 100B discrete diffusion LLM with a draft-then-edit approach.
It hits a peak speed of 892 tokens/s on complex coding tasks.
Autoregressive models commit to every token permanently, but LLaDA 2.1 can go back and fix mistakes mid-generation. The error-handling capabilities are worth looking into.
What if an LLM could EDIT its own tokens in real time, not just generate them? 🤯
Introducing LLaDA2.1, a diffusion model that breaks from autoregressive dominance. It drafts fast, then fixes its own mistakes on the fly with Token-to-Token editing.
The result? 892 tokens/sec on a 100B model.
⚡ 892 TPS on HumanEval+ (coding)
⚡ 801 TPS on BigCodeBench
🧠 Real-time self-correction via T2T editing
✅ @lmsysorg SGLang Day 0 support → production-ready now
A "non-consensus" architecture now challenging the mainstream. Open-sourced TODAY.
#LLaDA #TokenEditing #OpenSource #LLM #dLLM - Ant Open Source
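To make "draft-then-edit" concrete, here's a toy sketch of the control flow only. This is not the LLaDA 2.1 algorithm: the drafter is random and the confidence scorer is an oracle, where a real diffusion LLM would score and resample with its own logits.

```python
# Draft-then-edit in miniature: emit a full-length draft in one cheap
# pass, then iteratively rewrite the least-confident positions. An
# autoregressive decoder can't do the second step: its tokens are final
# the moment they're sampled.
import random

TARGET = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a+b"]
VOCAB = sorted(set(TARGET)) + ["sub", "x", "y"]

def draft(length: int) -> list[str]:
    """Stage 1: fast, sloppy full-length draft (here: random tokens)."""
    return [random.choice(VOCAB) for _ in range(length)]

def confidence(tokens: list[str]) -> list[float]:
    """Oracle scorer for the demo; a real model uses its own logits."""
    return [1.0 if t == g else 0.0 for t, g in zip(tokens, TARGET)]

def edit_pass(tokens: list[str], max_edits: int = 3) -> list[str]:
    """Stage 2, token-to-token editing: rewrite the worst few slots."""
    scores = confidence(tokens)
    worst = sorted(range(len(tokens)), key=lambda i: scores[i])[:max_edits]
    out = list(tokens)
    for i in worst:
        if scores[i] < 1.0:
            out[i] = TARGET[i]  # real model: resample from its denoiser
    return out

tokens = draft(len(TARGET))
for _ in range(len(TARGET)):    # a few refinement rounds is plenty here
    tokens = edit_pass(tokens)
print(" ".join(tokens))         # -> def add ( a , b ) : return a+b
```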