Paper #Paper #Multimodal #ImageGeneration #VideoGeneration
Lance: Unified Multimodal Modeling by Multi-Task Synergy
👤 Fengyi Fu, Mengqi Huang, Shaojin Wu et al.
🎯 Task
Unified multimodal understanding and generation
💡 Idea
Instead of one shared visual path or bolted-on modules, Lance uses a shared interleaved multimodal context with dual MoE streams: one expert for text+semantic understanding, one for VAE-latent generation, plus modality-aware RoPE and staged multi-task training.
✨ Why it's interesting
With only 3B activated params and a 128-GPU budget, it substantially outperforms prior open-source unified models on image and video generation while keeping strong understanding.
💻 Repo
bytedance/Lance · 314 stars
paper
via @Papers.Data.Code
Paper #Paper #Multimodal #ComputerUseAgents #Benchmarking
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
👤 Jinbiao Wei, Qianran Ma, Yilun Zhao et al.
🎯 Task
Computer-use agent evaluation and benchmark generation
💡 Idea
Instead of screenshot or LLM-judge evaluation, it uses app-specific state verifiers over real software, then self-refines them from execution disagreements to synthesize and score realistic desktop tasks automatically.
✨ Why it's interesting
Covers 33 apps and 1,000 tasks. Verifiers align better with humans than LLM judges. Best agent hits 68.3% success; open models drop sharply vs OSWorld.
💻 Repo
echo0715/OpenComputer
paper
via @Papers.Data.Code
Paper #Paper #LLM #ReinforcementLearning #LongContext
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
👤 Minxuan Lv, Tiehua Mei, Tanlong Du et al.
🎯 Task
Long-context reinforcement learning for LLMs
💡 Idea
Instead of retrieval-path-heavy QA data and uniform rewards, it trains on 9 long-context capability tasks with task-native metrics, then replaces vanilla GRPO's prompt-level scaling with task-mean normalization plus difficulty-adaptive reweighting.
✨ Why it's interesting
On Qwen3-30B-A3B, average long-context score rises from 60.1 to 69.8; TMN-Reweight reaches 63.0 on 4B vs 62.2 with vanilla GRPO.
💻 Repo
xiaoxuanNLP/GoLongRL
paper
via @Papers.Data.Code
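The contrast between prompt-level and task-level scaling in the GoLongRL post can be sketched in a few lines. The post does not give the paper's formulas, so only `grpo_advantages` (the standard per-prompt group normalization) is established practice. `task_mean_normalized` and `difficulty_weight` are hypothetical names, and two things are assumed: that the task-level scale is the mean of the per-prompt standard deviations, and that rewards lie in [0, 1].

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO: centre and scale rewards within one prompt's rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def task_mean_normalized(groups, eps=1e-6):
    """ASSUMPTION, not the paper's formula: centre per prompt, but scale every
    prompt of a task by one shared value (the mean per-prompt std of that task).
    `groups` is a list of per-prompt reward lists that all belong to one task."""
    task_scale = mean(pstdev(g) for g in groups)
    return [[(r - mean(g)) / (task_scale + eps) for r in g] for g in groups]

def difficulty_weight(rewards):
    """ASSUMPTION: harder prompts (lower mean reward in [0, 1]) get more weight."""
    return 1.0 - mean(rewards)

# Two prompts from the same task: one with widely spread rewards, one with
# tightly clustered rewards.
groups = [[0.0, 1.0], [0.4, 0.6]]
per_prompt = [grpo_advantages(g) for g in groups]
per_task = task_mean_normalized(groups)
```

Under per-prompt scaling both groups end up with advantages of about ±1, so the small 0.4 vs 0.6 gap is amplified to the same magnitude as the 0 vs 1 gap. With the shared task-level scale in this sketch, the second group keeps proportionally smaller advantages.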
🔥 Repo #Repo #LLM #Pretraining #HierarchicalReasoningModel
HRM-Text
👤 sapientinc
🎯 Task
Efficient foundation model pretraining
💡 Idea
Pretrain HRM text generation models from scratch on 8-16 H100s with built-in data packing, distributed training, benchmark evaluation, and checkpoint export to Transformers format.
✨ Why it's interesting
Claims 130-600x less compute and 150-900x less data; reference runs train 0.6B-1B models in 46-50 hours on 8-16 H100s.
💻 Repo
sapientinc/HRM-Text · 580 stars (+580 3d)
Python
via @Papers.Data.Code
Paper #Paper #CV #VideoGeneration #Quantization
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
👤 Yukang Chen, Luozhou Wang, Wei Huang et al.
🎯 Task
Long video generation infrastructure
💡 Idea
Instead of complex multi-stage long-video pipelines, it directly fine-tunes an AR diffusion model and co-designs sequence parallelism with teacher forcing. Balanced SP pairs clean/noisy chunks per rank, while end-to-end NVFP4 enables W4A4 inference, KV-cache compression, and async decoding.
✨ Why it's interesting
Up to 2.15x faster training and 1.84x faster inference; 45.7 FPS, 21.9 ms/frame, and memory cut from 35.4 GB to 19.4 GB.
💻 Repo
NVlabs/LongLive · 1.4k stars
Python
paper
via @Papers.Data.Code
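As a quick sanity check on the LongLive-2.0 numbers above, the frame rate and per-frame latency should be reciprocals of each other, and the memory figures imply a relative saving. The helper names below are illustrative only; the inputs are the figures quoted in the post.

```python
def ms_per_frame(fps):
    """Per-frame latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps

def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

latency = ms_per_frame(45.7)               # about 21.9 ms, matching the post
saving = relative_reduction(35.4, 19.4)    # about 0.45, i.e. roughly 45% less memory
print(f"{latency:.1f} ms/frame, {saving:.1%} memory reduction")
```

So 45.7 FPS and 21.9 ms/frame are the same measurement expressed two ways, and the drop from 35.4 GB to 19.4 GB is a cut of a little under half.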
Paper #Paper #LLM #ReinforcementLearning #Reasoning
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
👤 Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.
🎯 Task
LLM RL checkpoint extrapolation
💡 Idea
Instead of running full RLVR, estimate each tensor's dominant rank-1 update direction from early checkpoints and linearly extrapolate its coefficient. Unlike raw weight or logit extrapolation, it uses the low-rank RLVR geometry as a denoised predictor.
✨ Why it's interesting
With 15-20% of RLVR steps, RELEX matches or nears full RLVR on MATH: 71.6 vs 71.5, 85.6 vs 85.5, 87.4 vs 88.5 across 3 models.
💻 Repo
weizhepei/RELEX
paper
via @Papers.Data.Code
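The idea of rank-1 extrapolation can be illustrated on a single small weight matrix. This is a minimal sketch under stated assumptions, not the RELEX implementation: the direction is taken from the top singular pair of the latest early update (found by power iteration), each checkpoint's coefficient is its projection onto that direction, and the coefficient is fitted with an ordinary least-squares line. All function names are hypothetical.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_singular_pair(m, iters=50):
    """Power iteration for the leading left/right singular vectors of m.
    Assumes the all-ones start vector is not orthogonal to the answer."""
    mt = [list(col) for col in zip(*m)]
    v = normalize([1.0] * len(m[0]))
    for _ in range(iters):
        u = normalize(matvec(m, v))
        v = normalize(matvec(mt, u))
    return normalize(matvec(m, v)), v

def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def extrapolate(base, checkpoints, steps, target_step):
    """Predict weights at target_step from a few early checkpoints."""
    deltas = [[[w - b for w, b in zip(wr, br)] for wr, br in zip(ck, base)]
              for ck in checkpoints]
    u, v = top_singular_pair(deltas[-1])          # rank-1 direction u v^T
    coeffs = [sum(ui * x for ui, x in zip(u, matvec(d, v))) for d in deltas]
    slope, intercept = fit_line(steps, coeffs)    # coefficient trajectory
    c = slope * target_step + intercept
    return [[b + c * ui * vj for b, vj in zip(br, v)] for br, ui in zip(base, u)]

# Toy data: updates grow linearly along one fixed rank-1 direction.
base = [[1.0, 0.0], [0.0, 1.0]]
u0, v0 = [0.6, 0.8], [0.8, 0.6]
steps = [1, 2, 3]
checkpoints = [[[b + 2 * t * ui * vj for b, vj in zip(br, v0)]
                for br, ui in zip(base, u0)] for t in steps]
predicted = extrapolate(base, checkpoints, steps, target_step=10)
```

On this toy data the update really is rank-1 and linear in the step, so the prediction at step 10 is exact. Real RLVR updates are only approximately so, which is why the post describes the low-rank structure as a denoised predictor rather than an exact one.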
Paper #Paper #LLM #Reasoning #ReinforcementLearning
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
👤 Guobin Shen, Xiang Cheng, Chenxiao Zhao et al.
🎯 Task
Reasoning reinforcement learning for math and code
💡 Idea
Instead of pulling the policy toward a privileged self-teacher that rewards shortcut tokens and suppresses deliberation, AntiSD reverses the signal: ascend student-teacher JSD, with an entropy gate to stop once teacher confidence collapses.
✨ Why it's interesting
Across 5 models (4B-30B), it matches GRPO in 2-10x fewer steps and improves final avg accuracy by up to 11.5 points.
💻 Repo
FloyedShen/AntiSD · 11 stars
Python
paper
via @Papers.Data.Code
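The two ingredients named in the AntiSD post, Jensen-Shannon divergence (JSD) and an entropy gate, are easy to show on toy next-token distributions. The JSD and entropy definitions are standard. The gating rule is an assumption made for illustration (zero out the term when teacher entropy exceeds a threshold), and `gated_divergence` and `max_teacher_entropy` are hypothetical names, not the repo's API.

```python
import math

def kl(p, q):
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gated_divergence(student, teacher, max_teacher_entropy=1.0):
    """ASSUMPTION: the term to be maximised is the student-teacher JSD,
    dropped to zero once the teacher is too uncertain to be a useful signal."""
    if entropy(teacher) > max_teacher_entropy:
        return 0.0
    return jsd(student, teacher)

confident_teacher = [0.9, 0.1, 0.0, 0.0]
collapsed_teacher = [0.25, 0.25, 0.25, 0.25]
student = [0.4, 0.4, 0.1, 0.1]
active = gated_divergence(student, confident_teacher)   # positive
gated = gated_divergence(student, collapsed_teacher)    # 0.0, gate closed
```

A training loop would ascend this quantity rather than descend it, which is the sign flip the post describes.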
Dataset #Dataset #LLM #MultiTurnDialogue #UserModeling
ThoughtTrace
👤 SCAI-JHU
🎯 Task
User modeling in multi-turn dialogue
💡 Idea
2,155 real-world conversations from 1,058 users across 20 LLMs, with 10,174 message-level thought annotations: 7 reason types on user turns and 5 reaction types on assistant turns.
✨ Why it's interesting
Makes latent user intent and satisfaction measurable from real chats; authors show gains for behavior prediction (+41.7%) and alignment (+25.6% win rate).
2,155 conversations, 10,174 thought annotations
dataset · paper · repo
via @Papers.Data.Code
Paper #Paper #Audio #SpeechRecognition #Robustness
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
👤 Zhifei Xie, Kaiyu Pang, Haobin Zhang et al.
🎯 Task
Robust automatic speech recognition
💡 Idea
Instead of training on isolated mild noise, it scales to 54 physically plausible compound acoustic scenarios and trains ASR progressively from acoustic perception to semantic recovery, then uses WER-gated token- vs sentence-level rewards to handle both local errors and hallucinated/omitted transcripts.
✨ Why it's interesting
Beats prior SOTA on adverse ASR: 45.69% vs 54.01% on VOiCES R4-B-F, 21.49% vs 29.34% on NOIZEUS Sta-0; >30% relative WER drop on compound scenarios.
💻 Repo
xzf-thu/Mega-ASR
paper
via @Papers.Data.Code
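A WER-gated reward like the one in the Mega-ASR post can be sketched as follows. Word error rate (word-level edit distance divided by reference length) is standard. The gate threshold, the reward values, and the name `gated_reward` are assumptions for illustration: below the gate the reward tracks local word errors, above it the whole transcript is treated as a failure.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return word_errors(ref, hyp) / max(len(ref.split()), 1)

def gated_reward(ref, hyp, gate=0.5):
    """ASSUMPTION: low-WER outputs get a fine-grained reward for local errors,
    high-WER outputs (hallucinated or omitted transcripts) get a flat penalty."""
    w = wer(ref, hyp)
    if w <= gate:
        return ("token", 1.0 - w)
    return ("sentence", -1.0)

near_miss = gated_reward("the cat sat", "the bat sat")   # one substitution
omitted = gated_reward("the cat sat", "")                # empty transcript
```

The near miss stays on the token-level branch with a reward of about 0.67, while the empty transcript has a WER of 1.0 and falls through to the sentence-level penalty.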
Weekly Digest · May 16 – May 23
#WeeklyDigest
Papers
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
#Reasoning #ReinforcementLearning #TestTimeScaling
Unified SFT-RL scaling ▶ reaches IMO gold line
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
#VideoGeneration #DiffusionModels #Distillation
Causal Forcing++ distillation ▶ real-time 1-2 step video
Self-Distilled Agentic Reinforcement Learning
#ReinforcementLearning #KnowledgeDistillation #LLMAgents
Gated self-distillation RL ▶ beats GRPO on LLM agents
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
#ComputerUseAgents #Benchmarking #Evaluation
Verifier-grounded desktop tasks ▶ auditable agent evaluation
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
#VideoGeneration #Quantization #ParallelTraining
NVFP4 long video stack ▶ faster training, inference, lower memory
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
#LongContextModeling #VisionLanguageModels #DocumentVQA
Long-doc VQA pretraining ▶ extends LVLMs to 128K+
💻 Repos
sapientinc/HRM-Text
#Pretraining #HierarchicalReasoningModel #Flashattention
HRM text pretraining ▶ trains 0.6B-1B on 8-16 H100s
facebookresearch/vggt-omega
#DepthEstimation #CameraPose #3DReconstruction
Multi-image feed-forward model ▶ infers camera pose and depth
yyfz/Warp-as-History
#VideoGeneration #CameraControl #Lora
Warped history conditioning ▶ camera-controlled video generation
Datasets
Orchard
#SoftwareEngineering #ToolUse #GuiAgent
Dual agent trajectories ▶ train and evaluate coding GUI agents
ThoughtTrace
#MultiTurnDialogue #UserModeling #Alignment
ThoughtTrace dataset ▶ measures latent intent
➡️ Tomorrow: NLP & LLM Monthly
via @Papers.Data.Code
Monthly · NLP & LLM · Apr 24 – May 24
#MonthlyDigest #NLP #LLM
Papers
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
#Reasoning #ReinforcementLearning #TestTimeScaling
Unified SFT-RL scaling ▶ reaches IMO gold line
Self-Distilled Agentic Reinforcement Learning
#ReinforcementLearning #KnowledgeDistillation #LLMAgents
Gated self-distillation RL ▶ beats GRPO on LLM agents
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
#SearchAgents #SupervisedFineTuning #ToolUse
10.6k hard trajectories ▶ SOTA ~30B search agents
δ-mem: Efficient Online Memory for Large Language Models
#MemoryMechanisms #Attention #ParameterEfficientTuning
Online associative memory ▶ steers attention for long-horizon tasks
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
#ReinforcementLearning #Reasoning #ModelEditing
Rank-1 RLVR extrapolation ▶ matches full RLVR early
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
#ReinforcementLearning #LongContext #MultitaskLearning
23K long-context RLVR ▶ raises average score to 69.8
💻 Repos
antirez/ds4
#Metal #KvCache #OpenaiCompatible
Metal local inference ▶ 1M context with disk KV cache
sapientinc/HRM-Text
#Pretraining #HierarchicalReasoningModel #Flashattention
HRM text pretraining ▶ trains 0.6B-1B on 8-16 H100s
facebookresearch/ProgramBench
#Benchmark #SoftwareEngineering #ReverseEngineering
Program reconstruction benchmark ▶ tests LM reverse engineering
Datasets
SWE-chat
#CodingAgent #AgentTraces #HumanAICollaboration
SWE-chat dataset ▶ studies human-agent coding workflows
Orchard
#SoftwareEngineering #ToolUse #GuiAgent
Dual agent trajectories ▶ train and evaluate coding GUI agents
ThoughtTrace
#MultiTurnDialogue #UserModeling #Alignment
ThoughtTrace dataset ▶ measures latent intent
via Papers.Data.Code
⚡ Trends
▸ Reinforcement learning is shifting toward structured credit assignment and more stable objectives
▸ Test-time scaling increasingly uses agentic search, verification loops, and multi-agent coordination
▸ LLM agents are trained on richer long-horizon trajectories for search, research, and tool use
TL;DR
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Unified SFT, RL, and test-time scaling reaches gold-level olympiad reasoning on 30B models.
antirez/ds4
Practical local DeepSeek serving with 1M context and persistent KV cache.
💡 LLM progress is shifting toward scalable reasoning and agentic interaction optimization.
via Papers.Data.Code
Monthly digest week starts tomorrow: May 2026.
Top papers, repos and datasets land at @papersdatacode_digests Mon–Wed.