ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

10 Mar 2025 · Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang

Enhancing reasoning in Large Multimodal Models (#LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.


Paper: https://arxiv.org/pdf/2503.07536v1.pdf

Code: https://github.com/tidedra/lmm-r1
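
For readers curious how such a two-stage schedule might look in practice, here is a minimal, hypothetical sketch in Python. The stage driver, the reward rule, and the `train_with_rule_based_rl` helper are illustrative assumptions, not the authors' code; their actual implementation is in the repository linked above.

```python
# Hypothetical sketch of the two-stage schedule described in the abstract.
# All helpers are placeholders; the official code is at https://github.com/tidedra/lmm-r1.
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final/boxed answer matches the gold answer, else 0.0."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    predicted = match.group(1).strip() if match else response.strip().split()[-1]
    return 1.0 if predicted == gold_answer else 0.0

def train_with_rule_based_rl(model, dataset, reward_fn, steps: int):
    """Placeholder RL loop (PPO/GRPO-style) driven by a rule-based reward."""
    for _ in range(steps):
        batch = dataset.sample()
        responses = model.generate(batch["prompts"])
        rewards = [reward_fn(r, a) for r, a in zip(responses, batch["answers"])]
        model.update(batch["prompts"], responses, rewards)
    return model

# Stage 1 (FRE): strengthen reasoning with text-only data.
# model = train_with_rule_based_rl(model, text_only_reasoning_data, rule_based_reward, steps=5000)
# Stage 2 (MGT): generalize the strengthened reasoning to multimodal data.
# model = train_with_rule_based_rl(model, multimodal_reasoning_data, rule_based_reward, steps=5000)
```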

https://t.me/DataScienceT 🧡
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

📝 Summary:
UniPruneBench is a new benchmark for evaluating visual token pruning in large multimodal models (LMMs). It standardizes evaluation across tasks and models, revealing that random pruning is a strong baseline and that OCR is the task most sensitive to pruning. The pruning ratio strongly affects performance.

🔹 Publication Date: Published on Nov 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02650
• PDF: https://arxiv.org/pdf/2511.02650
• Project Page: https://uniprunebench-lmm.github.io/
• Github: https://github.com/TianfanPeng/VLMUniPruneBench
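
Since the benchmark finds random pruning to be a surprisingly strong baseline, a minimal sketch of that baseline may help make the setup concrete. The function below is illustrative only, not part of UniPruneBench's API.

```python
# Illustrative random visual-token pruning: keep a random subset of visual tokens
# at a given ratio before they reach the LMM. Not the benchmark's implementation.
import torch

def random_prune(visual_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """visual_tokens: (batch, num_tokens, dim). Returns (batch, num_kept, dim)."""
    batch, num_tokens, dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Random permutation per batch element; keep the first num_keep indices.
    idx = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)[:, :num_keep]
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, dim))

# Example: prune 576 ViT patch tokens down to 25% of their original count.
# pruned = random_prune(patch_tokens, keep_ratio=0.25)
```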

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #VisualCompression #DeepLearning #ComputerVision #AIResearch
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

📝 Summary:
MM-CRITIC is a new benchmark evaluating Large Multimodal Models' critique abilities across various dimensions and tasks. It uses expert-informed ground-truth answers and GPT-4o for reliable scoring. The benchmark provides a comprehensive assessment of leading LMMs' critique capabilities.

🔹 Publication Date: Published on Nov 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.09067
• PDF: https://arxiv.org/pdf/2511.09067
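
As a rough illustration of the LLM-as-judge scoring the summary describes, here is a hedged sketch using the OpenAI Python client. The prompt wording and scoring scale are assumptions for illustration, not MM-CRITIC's actual protocol.

```python
# Hypothetical judge-style scoring of a model's critique against an expert-informed
# reference answer; the prompt and 1-10 rubric are assumed, not MM-CRITIC's own.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_critique(question: str, reference_answer: str, model_critique: str) -> str:
    prompt = (
        "You are grading a critique written by a multimodal model.\n"
        f"Question: {question}\n"
        f"Expert reference answer: {reference_answer}\n"
        f"Model critique: {model_critique}\n"
        "Rate the critique's correctness and helpfulness from 1 to 10 and explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```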

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #MultimodalAI #AIEvaluation #Benchmarking #AIResearch
Vidi: Large Multimodal Models for Video Understanding and Editing

📝 Summary:
Vidi is a family of Large Multimodal Models for video understanding and editing, excelling at temporal retrieval in long, multimodal videos. It significantly outperforms proprietary models like GPT-4o on the new VUE-TR benchmark, which supports hour-long videos and audio queries.

🔹 Publication Date: Published on Apr 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2504.15681
• PDF: https://arxiv.org/pdf/2504.15681
• Project Page: https://bytedance.github.io/vidi-website/
• Github: https://github.com/bytedance/vidi
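
Temporal retrieval is typically scored by how well predicted time spans overlap the ground-truth spans. Below is a minimal temporal-IoU helper; treating this as the metric is an assumption for illustration, not necessarily the exact VUE-TR protocol.

```python
# Temporal IoU between a predicted and a ground-truth time span (in seconds).
# A common temporal-retrieval metric; exact VUE-TR scoring may differ.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter_start, inter_end = max(pred[0], gt[0]), min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Example: prediction [120s, 180s] vs. ground truth [150s, 210s] -> 30 / 90 ≈ 0.33.
# temporal_iou((120.0, 180.0), (150.0, 210.0))
```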

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #VideoAI #MultimodalAI #AIResearch #DeepLearning
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

📝 Summary:
LongVT is an agentic framework that improves long-video reasoning. It equips LMMs with native tool calling for global-to-local video cropping and frame resampling, grounding answers in the retrieved frames. This approach consistently outperforms existing baselines.

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20785
• PDF: https://arxiv.org/pdf/2511.20785
• Project Page: https://evolvinglmms-lab.github.io/LongVT/
• Github: https://github.com/EvolvingLMMs-Lab/LongVT

🔹 Models citing this paper:
https://huggingface.co/longvideotool/LongVT-RFT
https://huggingface.co/longvideotool/LongVT-SFT
https://huggingface.co/longvideotool/LongVT-RL

Datasets citing this paper:
https://huggingface.co/datasets/longvideotool/LongVT-Source
https://huggingface.co/datasets/longvideotool/LongVT-Parquet

Spaces citing this paper:
https://huggingface.co/spaces/longvideotool/LongVT-Demo
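
To make "thinking with long videos via native tool calling" more concrete, here is a hedged sketch of a crop-and-resample tool loop. The tool names, model interface, and control flow are illustrative assumptions rather than LongVT's actual API; see the repository above for the real implementation.

```python
# Illustrative agentic loop: the LMM first sees coarse global frames, then calls a
# cropping tool to zoom into a time window and resample frames at higher density.
# crop_and_resample, reply.tool_call, and the model interface are hypothetical.

def crop_and_resample(video, start_s: float, end_s: float, num_frames: int):
    """Placeholder tool: uniformly sample num_frames frames from [start_s, end_s]."""
    step = (end_s - start_s) / max(1, num_frames - 1)
    return [video.frame_at(start_s + i * step) for i in range(num_frames)]

def answer_with_tools(model, video, question: str, max_tool_calls: int = 3):
    frames = crop_and_resample(video, 0.0, video.duration, num_frames=32)  # global pass
    for _ in range(max_tool_calls):
        reply = model.generate(question, frames)
        if reply.tool_call is None:               # model is confident: return grounded answer
            return reply.text
        start_s, end_s = reply.tool_call.window   # model asks to zoom into a local window
        frames = crop_and_resample(video, start_s, end_s, num_frames=32)
    return model.generate(question, frames).text
```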

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #LMMs #AgenticAI #ComputerVision #AIResearch
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📝 Summary:
Large Multimodal Models struggle with long video understanding due to context limits. The DIG framework adapts frame selection to query types, using efficient uniform sampling for global queries and specialized selection for localized ones. This approach significantly improves LMM performance on ...

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04000
• PDF: https://arxiv.org/pdf/2512.04000
• Project Page: https://github.com/Jialuo-Li/DIG
• Github: https://github.com/Jialuo-Li/DIG
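
A minimal sketch of query-adaptive frame selection follows, assuming a router that sends global queries to uniform sampling and localized queries to similarity-based ranking. The routing rule and scoring functions are assumptions for illustration, not DIG's exact pipeline.

```python
# Hypothetical query-adaptive frame selection: global queries get cheap uniform
# sampling, localized queries get query-to-frame similarity ranking.
# is_global_query and the embedding inputs are placeholders.
import numpy as np

def uniform_sample(num_total_frames: int, budget: int) -> list[int]:
    return np.linspace(0, num_total_frames - 1, budget).astype(int).tolist()

def similarity_select(frame_embeds: np.ndarray, query_embed: np.ndarray, budget: int) -> list[int]:
    scores = frame_embeds @ query_embed               # relevance score per frame
    return np.argsort(scores)[-budget:][::-1].tolist()

def select_frames(query: str, frame_embeds: np.ndarray, query_embed: np.ndarray,
                  is_global_query, budget: int = 32) -> list[int]:
    if is_global_query(query):                        # e.g. "summarize the video"
        return uniform_sample(len(frame_embeds), budget)
    return similarity_select(frame_embeds, query_embed, budget)  # e.g. "when does the dog appear?"
```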

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #LMMs #MultimodalAI #DeepLearning #ComputerVision
Latent Implicit Visual Reasoning

📝 Summary:
Large Multimodal Models struggle with visual reasoning due to their text-centric nature and the limitations of prior methods. This paper introduces a task-agnostic mechanism for LMMs to discover and use visual reasoning tokens without explicit supervision. The approach achieves state-of-the-art results...

🔹 Publication Date: Published on Dec 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21218
• PDF: https://arxiv.org/pdf/2512.21218
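
As a loose illustration of what "latent visual reasoning tokens" could look like, the sketch below appends a small set of learnable embeddings to the multimodal sequence. This is a generic latent-token pattern assumed for illustration, not the paper's specific mechanism.

```python
# Generic latent-token pattern: learnable embeddings concatenated with visual and
# text tokens so the model can "reason" in them without token-level supervision.
# This is an assumed illustration, not the paper's architecture.
import torch
import torch.nn as nn

class LatentReasoningTokens(nn.Module):
    def __init__(self, num_latents: int, dim: int):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D), text_tokens: (B, Nt, D)
        batch = visual_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Sequence fed to the LMM backbone: [visual | latent reasoning | text].
        return torch.cat([visual_tokens, latents, text_tokens], dim=1)
```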

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #VisualReasoning #AI #ComputerVision #DeepLearning