Weekly Digest | Apr 18 – Apr 25
#WeeklyDigest
Papers
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
#MultimodalLearning #ImageGeneration #DiffusionModels
Discrete diffusion LLM ▶ unifies understanding and generation
Learn more...
AgentSPEX: An Agent SPecification and EXecution Language
#LLMAgents #WorkflowOrchestration #ProgramSynthesis
YAML agent workflows ▶ beats baselines on 7 benchmarks
Learn more...
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
#AutonomousDriving #TrajectoryPrediction #WorldModels
Latent CoT VLA ▶ beats explicit CoT
Learn more...
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
#VideoGeneration #DiffusionTransformers #HumanObjectInteraction
CoInteract diffusion framework ▶ more stable, realistic HOI videos
Learn more...
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
#TextToImage #FewStepGeneration #FlowMatching
Text-conditioned MeanFlow ▶ 0.90 GenEval in 4 steps
Learn more...
💻 Repos
cosmicstack-labs/mercury-agent
#AIAgents #TelegramBots #ToolUse
TypeScript CLI Telegram agent ▶ approval-based 24/7 tool use
Learn more...
yaojingang/geo-citation-lab
#SearchEvaluation #CitationAnalysis #WebCrawling
Citation analysis dataset ▶ studies search and citation choices
Learn more...
intertwine/dspy-agent-skills
#AgentSkills #Dspy #Gepa
DSPy agent skills ▶ coding workflows with tests
Learn more...
Datasets
ParseBench
#DocumentParsing #OCR #LayoutDetection
ParseBench dataset ▶ evaluates enterprise document parsers
Learn more...
Nemotron-Personas-Korea
#SyntheticData #Personas #Korean
Synthetic Korean persona dataset ▶ model training and evaluation
Learn more...
➡️ Tomorrow: NLP & LLM Monthly
via @Papers.Data.Code
Monthly: Computer Vision | Apr 03 – May 03
#MonthlyDigest #CV
Papers
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
#TextToVideo #ReinforcementLearning #3DConsistency
RL text-to-video tuning ▶ boosts 3D consistency PSNR
Learn more...
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
#MultimodalLearning #ImageGeneration #DiffusionModels
Discrete diffusion LLM ▶ unifies multimodal understanding and generation
Learn more...
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
#MultimodalLearning #ImageGeneration #VisionLanguageModels
Direct patch embeddings ▶ unified multimodal understanding and generation
Learn more...
Vista4D: Video Reshooting with 4D Point Clouds
#NovelViewSynthesis #VideoDiffusion #3DReconstruction
4D point cloud reshooting ▶ best camera and 3D consistency
Learn more...
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
#VideoGeneration #DiffusionTransformers #HumanObjectInteraction
Diffusion HOI video synthesis ▶ improves stability and contact realism
Learn more...
💻 Repos
facex-engine/facex
#FaceVerification #Webassembly #CpuInference
Local face embeddings ▶ browser CPU verification
Learn more...
Datasets
Sleep Health & Daily Performance Dataset
#Classification #Regression #HealthConditions
Synthetic sleep health dataset ▶ benchmarks 3 prediction tasks
Learn more...
⚡ Trends
▸ Unified multimodal models increasingly merge visual understanding, generation, and editing end-to-end.
▸ Video generation methods add explicit 3D or geometry grounding for consistency.
▸ Specialized architectural priors target realism in controllable human and camera-centric video synthesis.
TL;DR
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
RL fine-tuning strengthens 3D consistency in video generation without architecture or inference changes
facex-engine/facex
Tiny local face verification runs fast in the browser and on CPU
💡 Vision models are converging toward unified, geometry-grounded, practically deployable generation systems.
via @Papers.Data.Code
Paper #Paper #Multimodal #VideoGeneration #DiffusionModels
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
👤 Houyuan Chen, Hong Li, Xianghao Kong et al.
🎯 Task
Multimodal video generation
💡 Idea
Stochastic condition masking enables omni-directional generation; decoupled gated LoRA adds per-modality adapters only for targets; cross-modal self-attention shares keys/values across modalities for alignment.
✨ Why it's interesting
Competitive with state of the art across tasks; robust on in-the-wild inputs despite training on fewer than 1k videos.
💻 Repo
houyuanchen111/UniVidX (44 stars)
paper
via @Papers.Data.Code
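To make the UniVidX "Idea" above more concrete, here is a minimal toy sketch of two of its ingredients: randomly choosing which modalities are generation targets (the rest act as clean conditions), and switching a LoRA-style adapter on only for targets. This is not the authors' code. The modality names, the 0.5 sampling probability, and the scalar "layer" are all assumptions made for illustration.

```python
import random

# Hypothetical modality names; the post does not list the actual ones.
MODALITIES = ["rgb", "depth", "normal", "albedo"]


def sample_condition_mask(modalities, rng, p_target=0.5):
    """Randomly mark each modality as a target (True) or a condition (False).

    Resamples until at least one modality is a target, so there is always
    something to generate. p_target is an assumed value.
    """
    while True:
        is_target = {m: rng.random() < p_target for m in modalities}
        if any(is_target.values()):
            return is_target


def lora_gates(is_target):
    """Adapter gate per modality: 1.0 for targets, 0.0 for conditions."""
    return {m: 1.0 if t else 0.0 for m, t in is_target.items()}


def gated_linear(x, base_weight, lora_delta, gate):
    """Scalar toy layer: y = (W + gate * dW) * x.

    With gate 0.0 the frozen base weight is used unchanged; with gate 1.0
    the low-rank update (here a single scalar) is added.
    """
    return (base_weight + gate * lora_delta) * x


rng = random.Random(0)
mask = sample_condition_mask(MODALITIES, rng)
gates = lora_gates(mask)
for name in MODALITIES:
    role = "target" if mask[name] else "condition"
    print(name, role, gated_linear(2.0, 3.0, 0.5, gates[name]))
```

Because the split is resampled at every training step, one model sees every direction of conditioning, which is one way to read "omni-directional generation" in the post.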
GitHub
[SIGGRAPH 2026 / TOG] Official code of the paper "UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors". - houyuanchen111/UniVidX
Paper #Paper #Tabular #DiffusionModels #DecisionTrees
Trees to Flows and Back: Unifying Decision Trees and Diffusion Models
👤 Sai Niranjan Ramachandran, Suvrit Sra
🎯 Task
Tabular generation and tree-to-network distillation
💡 Idea
Tree-flow correspondence maps refined tree partitions to PF-ODEs and diffusion dynamics to hierarchies; GTSM unifies boosting and score matching. TREEFLOW conditions flows on tree paths, and DSM-TREE distills full tree decisions.
✨ Why it's interesting
TREEFLOW is 2× faster; best TSTR on 3/5 and best Wasserstein on 4/5 benchmarks.
paper
via @Papers.Data.Code
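The card above reports Wasserstein distance as one of its benchmark metrics. As a reminder of what that number measures, here is a minimal sketch for the simplest case: two equal-size one-dimensional samples (for example, one real and one synthetic column), where the Wasserstein-1 distance reduces to the mean absolute difference between sorted values. The function name and the toy columns are assumptions for illustration; the post does not describe the paper's actual evaluation protocol.

```python
def wasserstein_1d(real, synthetic):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal sample sizes, the optimal transport plan pairs the i-th
    smallest real value with the i-th smallest synthetic value.
    """
    if len(real) != len(synthetic) or not real:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


# Toy columns (made up): the synthetic column is the real one shifted by 1.
real_column = [0.0, 1.0, 2.0]
synthetic_column = [3.0, 1.0, 2.0]
print(wasserstein_1d(real_column, synthetic_column))  # 1.0
```

A lower value means the synthetic column's distribution sits closer to the real one; identical multisets of values give 0.0 regardless of row order.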
arXiv.org
Trees to Flows and Back: Unifying Decision Trees and Diffusion Models
Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp...
Weekly Digest | May 02 – May 09
#WeeklyDigest
Papers
MolmoAct2: Action Reasoning Models for Real-world Deployment
#VisionLanguageAction #EmbodiedReasoning #ImitationLearning
Vision-language-action model ▶ beats VLA baselines on 7 benchmarks
Learn more...
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
#AgenticRL #ToolUse #MultimodalReasoning
Failure-aware multimodal search RL ▶ +13.8 points on 7 benchmarks
Learn more...
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
#SearchAgents #SupervisedFineTuning #ToolUse
10.6k hard trajectories ▶ SOTA among ~30B search agents
Learn more...
Heterogeneous Scientific Foundation Model Collaboration
#AgentSystems #FoundationModels #ScientificAI
LLM-FM agent interface ▶ scientific tasks on structured data
Learn more...
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
#3DGeneration #ArticulatedObjects #DiffusionModels
Two-stage 3D generation ▶ simulation-ready articulated assets
Learn more...
MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
#MotionCapture #PoseEstimation #3DHumanPoseEstimation
Monocular motion capture ▶ predicts animation-ready rotations
Learn more...
💻 Repos
PKU-YuanGroup/TIDE
#DiscreteDiffusion #Distillation #CodeGeneration
Cross-architecture distillation ▶ 0.6B student, lower cost
Learn more...
Vinayak-VG/GenWildSplat
#3DReconstruction #GaussianSplatting #NovelViewSynthesis
Sparse-view 3D reconstruction ▶ 3D Gaussian splat in 3s
Learn more...
YanFangCS/GenLIP
#VisionEncoder #AutoregressivePretraining #OCR
Autoregressive ViT pretraining ▶ strong document and OCR gains
Learn more...
Datasets
SWE-chat
#CodingAgent #AgentTraces #HumanAICollaboration
SWE-chat dataset ▶ studies human-agent coding workflows
Learn more...
gpic
#ImageGeneration #PermissiveLicense #ImageText
Permissive 100M image corpus ▶ visual generation research
Learn more...
➡️ Tomorrow: Multimodal & Agents Monthly
via @Papers.Data.Code
Monthly: Multimodal & Agents | Apr 10 – May 10
#MonthlyDigest #Multimodal #Agents
Papers
MolmoAct2: Action Reasoning Models for Real-world Deployment
#VisionLanguageAction #EmbodiedReasoning #ImitationLearning
Vision-language-action model ▶ beats VLA baselines on 7 benchmarks
Learn more...
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
#AgenticRL #ToolUse #MultimodalReasoning
Failure-aware multimodal search RL ▶ +13.8 points on 7 benchmarks
Learn more...
Heterogeneous Scientific Foundation Model Collaboration
#AgentSystems #FoundationModels #ScientificAI
LLM-FM agent interface ▶ scientific tasks on structured data
Learn more...
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
#ReinforcementLearning #KnowledgeDistillation #Reasoning
PRISM pre-alignment ▶ boosts accuracy over SFT→RLVR
Learn more...
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
#VideoGeneration #DiffusionModels #ParameterEfficientFineTuning
Unified video diffusion ▶ multimodal pixel-aligned generation
Learn more...
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
#SLAM #OpenVocabulary #VisualLanguageModels
Tightly coupled VLM SLAM ▶ open-vocabulary 3D semantics in dynamic scenes
Learn more...
💻 Repos
YanFangCS/GenLIP
#VisionEncoder #AutoregressivePretraining #OCR
Autoregressive ViT pretraining ▶ strong document and OCR gains
Learn more...
RockeyCoss/LeapAlign_Code
#TextToImage #FlowMatching #PreferenceOptimization
Two-step leap trajectory ▶ preference-aligns flow-matching T2I
Learn more...
Datasets
gpic
#ImageGeneration #PermissiveLicense #ImageText
Permissive 100M image corpus ▶ visual generation research
Learn more...
MathNet v0: Olympiad Math Reasoning & Retrieval
#CompetitionMath #Multimodal #Retrieval
Multilingual Olympiad math dataset ▶ reasoning and retrieval benchmark
Learn more...
⚡ Trends
▸ Agents increasingly orchestrate external tools or specialized models through structured interfaces.
▸ Multimodal RL training adds intermediate alignment stages to preserve reasoning quality.
▸ Shared action tokenization is emerging for embodied control and cross-embodiment transfer.
TL;DR
MolmoAct2: Action Reasoning Models for Real-world Deployment
Open VLA model beats strong baselines with adaptive low-latency embodied reasoning.
💡 Multimodal agents are shifting toward tool-grounded, efficient, real-world deployment.
via @Papers.Data.Code
Paper #Paper #CV #VideoGeneration #DiffusionModels
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
👤 Yuchao Gu, Guian Fang, Yuxin Jiang et al.
🎯 Task
Any-step video generation
💡 Idea
Instead of endpoint consistency maps for fixed few-step sampling, it learns arbitrary-time flow-map transitions along the full ODE path, then uses shortcut backward simulation for on-policy distillation to cut discretization error and causal exposure bias.
✨ Why it's interesting
On 14B T2V, it gets 84.05 VBench at 4 NFEs and 84.41 at 32; beats Krea-Realtime-14B's 83.25 at 4 and rCM-14B's 83.73 at 4.
💻 Repo
NVlabs/AnyFlow (202 stars)
paper
via @Papers.Data.Code
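The AnyFlow card above contrasts step-by-step ODE sampling, which accumulates discretization error, with a flow map that jumps between arbitrary times. The toy sketch below illustrates only that general idea on a one-dimensional ODE whose flow map is known in closed form. It is not the paper's model: the ODE dx/dt = -x, the function names, and the step counts are all assumptions, and in the paper the flow map is learned by a network rather than written down.

```python
import math


def velocity(x, t):
    """Toy velocity field for the ODE dx/dt = -x (an assumed example)."""
    return -x


def euler_sample(x0, n_steps):
    """Integrate from t=0 to t=1 with n_steps Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def flow_map(x, s, t):
    """Exact flow map of the toy ODE: state at time s -> state at time t."""
    return x * math.exp(-(t - s))


def flow_map_sample(x0, n_steps):
    """Reach t=1 by composing n_steps flow-map jumps of equal length."""
    x = x0
    for i in range(n_steps):
        x = flow_map(x, i / n_steps, (i + 1) / n_steps)
    return x


exact = math.exp(-1.0)
for n in (1, 4, 32):
    print(n,
          abs(euler_sample(1.0, n) - exact),      # shrinks as n grows
          abs(flow_map_sample(1.0, n) - exact))   # ~0 for every n
```

Because flow-map jumps compose exactly, the endpoint does not depend on how many steps are taken, which is the property that makes "any-step" sampling possible once such a map has been learned well.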
Weekly Digest · May 09 – May 16
#WeeklyDigest
Papers
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
#VideoGeneration #DiffusionModels #Distillation
Any-step video diffusion ▶ 84.05 VBench at 4 NFEs
Learn more...
Flow-OPD: On-Policy Distillation for Flow Matching Models
#TextToImage #KnowledgeDistillation #ReinforcementLearning
On-policy flow distillation ▶ boosts GenEval and OCR
Learn more...
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
#VisionLanguageModels #ImageGeneration #MixtureOfExperts
NEO-unify multimodal model ▶ unifies understanding and generation
Learn more...
δ-mem: Efficient Online Memory for Large Language Models
#MemoryMechanisms #Attention #ParameterEfficientTuning
Online associative memory ▶ steers attention for long-horizon tasks
Learn more...
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
#TestTimeScaling #Reasoning #AgenticSearch
Offline replay controller ▶ improves accuracy-cost tradeoffs
Learn more...
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
#ReinforcementLearning #AgentTraining #LongContextReasoning
Rubric-guided meta-RL ▶ stagewise credit for research agents
Learn more...
💻 Repos
antirez/ds4
#Metal #KvCache #OpenaiCompatible
Metal local inference ▶ 1M context with disk KV cache
Learn more...
facebookresearch/ProgramBench
#Benchmark #SoftwareEngineering #ReverseEngineering
Program reconstruction benchmark ▶ tests LM reverse engineering
Learn more...
sparolab/KISS-IMU
#InertialOdometry #SelfSupervised #LidarPseudoLabels
Self-supervised IMU odometry ▶ denoises raw IMU with LiDAR labels
Learn more...
Datasets
AI Index Data: Growth, Talent (Cambridge/Harvard)
#GlobalAI #CountryIndicators #LongitudinalData
Global AI panel dataset ▶ cross-country trend forecasting
Learn more...
giant-permissive-image-corpus
#ImageGeneration #PermissiveLicense #ImageDataset
Permissive image corpus ▶ trains visual generation
Learn more...
➡️ Tomorrow: Efficient ML Monthly
via @Papers.Data.Code
Monthly · Efficient ML · Apr 17 – May 17
#MonthlyDigest #EfficientML
Papers
Trees to Flows and Back: Unifying Decision Trees and Diffusion Models
#DiffusionModels #DecisionTrees #KnowledgeDistillation
Trees and flows ▶ faster tabular generation
Learn more...
Datasets
MSR-ACC/TAE25
#QuantumChemistry #AtomizationEnergy #CoupledCluster
Quantum chemistry dataset ▶ trains atomization energy models
Learn more...
AI Index Data: Growth, Talent (Cambridge/Harvard)
#GlobalAI #CountryIndicators #LongitudinalData
Global AI panel dataset ▶ cross-country trend forecasting
Learn more...
WHO Global Health Indicators for Prediction
#GlobalHealth #CountryLevel #WorldBank
Global health panel data ▶ cross-country trend analysis
Learn more...
⚡ Trends
▸ Longitudinal country-level datasets increasingly target forecasting and cross-country trend analysis.
▸ Wide, linked, multi-table dataset formats are becoming standard for benchmarking.
▸ Efficiency gains come from unifying model families and distilling complex systems.
TL;DR
Trees to Flows and Back: Unifying Decision Trees and Diffusion Models
Unifies trees and diffusion, delivering faster tabular generation and effective distillation.
💡 Efficiency advances increasingly come from unifying classical structures with generative modeling.
via @Papers.Data.Code
Paper #Paper #CV #VideoGeneration #DiffusionModels
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
👤 Min Zhao, Hongzhou Zhu, Kaiwen Zheng et al.
🎯 Task
Real-time autoregressive video generation
💡 Idea
Instead of costly AR-teacher ODE trajectory distillation, it initializes few-step AR students with causal consistency distillation: same AR flow-map target, but learned from single online adjacent-step teacher updates, making frame-wise 1-2 step rollout practical.
✨ Why it's interesting
At frame-wise 2-step, beats 4-step chunk-wise Causal Forcing by +0.1 VBench Total, +0.3 Quality, +0.335 VisionReward; 50% lower first-frame latency, ~4x cheaper Stage 2.
💻 Repo
thu-ml/Causal-Forcing (665 stars)
paper
via @Papers.Data.Code
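The "Idea" above rests on consistency distillation from single adjacent-step teacher updates. The sketch below shows the generic form of that objective on a one-dimensional toy problem: take one teacher ODE step from time t to a nearby earlier time, then ask the student to give the same answer at both points. It is not the paper's causal, autoregressive variant. The ODE dx/dt = x, both student functions, and the chosen times are assumptions for illustration.

```python
import math


def teacher_velocity(x, t):
    """Toy teacher ODE dx/dt = x, so x_t = x_0 * exp(t) (an assumed example)."""
    return x


def teacher_step(x_t, t, t_prev):
    """One Euler step of the teacher ODE from time t to the adjacent t_prev."""
    return x_t + (t_prev - t) * teacher_velocity(x_t, t)


def consistency_loss(student, x_t, t, t_prev):
    """Squared gap between student outputs at two adjacent trajectory points.

    In real training the target side would be a stop-gradient (or EMA) copy
    of the student; here both sides are plain function calls.
    """
    x_prev = teacher_step(x_t, t, t_prev)
    target = student(x_prev, t_prev)
    return (student(x_t, t) - target) ** 2


def exact_student(x, t):
    """Maps any point on the toy trajectory back to its start: x * exp(-t)."""
    return x * math.exp(-t)


def identity_student(x, t):
    """A deliberately wrong student that ignores time."""
    return x


good = consistency_loss(exact_student, 1.0, 0.5, 0.4)
bad = consistency_loss(identity_student, 1.0, 0.5, 0.4)
print(good < bad)  # True
```

Each training signal here needs only one teacher step, rather than a full teacher trajectory, which is the cost difference the post points to.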
GitHub
[ICML 2026] Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation" & Causal Forci...
Weekly Digest · May 16 – May 23
#WeeklyDigest
Papers
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
#Reasoning #ReinforcementLearning #TestTimeScaling
Unified SFT-RL scaling ▶ reaches IMO gold line
Learn more...
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
#VideoGeneration #DiffusionModels #Distillation
Causal Forcing++ distillation ▶ real-time 1-2 step video
Learn more...
Self-Distilled Agentic Reinforcement Learning
#ReinforcementLearning #KnowledgeDistillation #LLMAgents
Gated self-distillation RL ▶ beats GRPO on LLM agents
Learn more...
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
#ComputerUseAgents #Benchmarking #Evaluation
Verifier-grounded desktop tasks ▶ auditable agent evaluation
Learn more...
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
#VideoGeneration #Quantization #ParallelTraining
NVFP4 long video stack ▶ faster training, inference, lower memory
Learn more...
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
#LongContextModeling #VisionLanguageModels #DocumentVQA
Long-doc VQA pretraining ▶ extends LVLMs to 128K+
Learn more...
💻 Repos
sapientinc/HRM-Text
#Pretraining #HierarchicalReasoningModel #Flashattention
HRM text pretraining ▶ trains 0.6B-1B on 8-16 H100s
Learn more...
facebookresearch/vggt-omega
#DepthEstimation #CameraPose #3DReconstruction
Multi-image feed-forward model ▶ infers camera pose and depth
Learn more...
yyfz/Warp-as-History
#VideoGeneration #CameraControl #Lora
Warped history conditioning ▶ camera-controlled video generation
Learn more...
Datasets
Orchard
#SoftwareEngineering #ToolUse #GuiAgent
Dual agent trajectories ▶ train and evaluate coding GUI agents
Learn more...
ThoughtTrace
#MultiTurnDialogue #UserModeling #Alignment
ThoughtTrace dataset ▶ measures latent intent
Learn more...
➡️ Tomorrow: NLP & LLM Monthly
via @Papers.Data.Code
#WeeklyDigest
π Papers
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
#Reasoning #ReinforcementLearning #TestTimeScaling
Unified SFT-RL scaling βΆ reaches IMO gold line
β Learn more...
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
#VideoGeneration #DiffusionModels #Distillation
Causal Forcing++ distillation βΆ real-time 1-2 step video
β Learn more...
Self-Distilled Agentic Reinforcement Learning
#ReinforcementLearning #KnowledgeDistillation #LLMAgents
Gated self-distillation RL βΆ beats GRPO on LLM agents
β Learn more...
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
#ComputerUseAgents #Benchmarking #Evaluation
Verifier-grounded desktop tasks βΆ auditable agent evaluation
β Learn more...
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
#VideoGeneration #Quantization #ParallelTraining
NVFP4 long video stack βΆ faster training, inference, lower memory
β Learn more...
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
#LongContextModeling #VisionLanguageModels #DocumentVQA
Long-doc VQA pretraining βΆ extends LVLMs to 128K+
β Learn more...
π» Repos
sapientinc/HRM-Text β
#Pretraining #HierarchicalReasoningModel #Flashattention
HRM text pretraining βΆ trains 0.6B-1B on 8-16 H100s
β Learn more...
facebookresearch/vggt-omega β
#DepthEstimation #CameraPose #3DReconstruction
Multi-image feed-forward model βΆ infers camera pose and depth
β Learn more...
yyfz/Warp-as-History β
#VideoGeneration #CameraControl #Lora
Warped history conditioning βΆ camera-controlled video generation
β Learn more...
π Datasets
Orchard
#SoftwareEngineering #ToolUse #GuiAgent
Dual agent trajectories βΆ train and evaluate coding GUI agents
β Learn more...
ThoughtTrace
#MultiTurnDialogue #UserModeling #Alignment
ThoughtTrace dataset βΆ measures latent intent
β Learn more...
β‘οΈ Tomorrow β NLP & LLM Monthly
via @Papers.Data.Code