Papers.Data.Code
Only meaningful ML signals: papers, repos & datasets. Selected, not collected. 3–4 posts/day. 📄💻📊
papers.data.code@gmail.com
📈 Monthly: Multimodal & Agents | Apr 10 – May 10
#MonthlyDigest #Multimodal #Agents

📄 Papers

MolmoAct2: Action Reasoning Models for Real-world Deployment
#VisionLanguageAction #EmbodiedReasoning #ImitationLearning
Vision-language-action model ⟶ beats VLA baselines on 7 benchmarks
→ Learn more...

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
#AgenticRL #ToolUse #MultimodalReasoning
Failure-aware multimodal search RL ⟶ +13.8 points on 7 benchmarks
→ Learn more...

Heterogeneous Scientific Foundation Model Collaboration
#AgentSystems #FoundationModels #ScientificAI
LLM-FM agent interface ⟶ scientific tasks on structured data
→ Learn more...

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
#ReinforcementLearning #KnowledgeDistillation #Reasoning
PRISM pre-alignment ⟶ boosts accuracy over SFT→RLVR
→ Learn more...

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
#VideoGeneration #DiffusionModels #ParameterEfficientFineTuning
Unified video diffusion ⟶ multimodal pixel-aligned generation
→ Learn more...

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
#SLAM #OpenVocabulary #VisualLanguageModels
Tightly coupled VLM SLAM ⟶ open-vocabulary 3D semantics in dynamic environments
→ Learn more...

💻 Repos

YanFangCS/GenLIP
#VisionEncoder #AutoregressivePretraining #OCR
Autoregressive ViT pretraining ⟶ strong document and OCR gains
→ Learn more...

RockeyCoss/LeapAlign_Code
#TextToImage #FlowMatching #PreferenceOptimization
Two-step leap trajectory ⟶ preference-aligns flow-matching T2I
→ Learn more...

📊 Datasets

gpic
#ImageGeneration #PermissiveLicense #ImageText
Permissive 100M image corpus ⟶ visual generation research
→ Learn more...

MathNet v0 — Olympiad Math Reasoning & Retrieval
#CompetitionMath #Multimodal #Retrieval
Multilingual Olympiad math dataset ⟶ reasoning and retrieval benchmark
→ Learn more...

Trends

▸ Agents increasingly orchestrate external tools or specialized models through structured interfaces.
▸ Multimodal RL training adds intermediate alignment stages to preserve reasoning quality.
▸ Shared action tokenization is emerging for embodied control and cross-embodiment transfer.

🧭 TL;DR

📄 MolmoAct2: Action Reasoning Models for Real-world Deployment
Open VLA model beats strong baselines with adaptive low-latency embodied reasoning.

💡 Multimodal agents are shifting toward tool-grounded, efficient, real-world deployment.

via @Papers.Data.Code