✨M3DR: Towards Universal Multilingual Multimodal Document Retrieval
📝 Summary:
M3DR is a framework for multilingual multimodal document retrieval that uses contrastive training to achieve robust cross-lingual and cross-modal alignment. It overcomes English-centric limitations, showing state-of-the-art performance across 22 diverse languages.
🔹 Publication Date: Published on Dec 3, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03514
• PDF: https://arxiv.org/pdf/2512.03514
• Project Page: https://www.cognitivelab.in/blog/introducing-netraembed
• Github: https://github.com/adithya-s-k/colpali
🔹 Models citing this paper:
• https://huggingface.co/Cognitive-Lab/NetraEmbed
• https://huggingface.co/Cognitive-Lab/ColNetraEmbed
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Cognitive-Lab/NayanaIR-CrossBench
✨ Spaces citing this paper:
• https://huggingface.co/spaces/AdithyaSK/NetraEmbed
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #InformationRetrieval #NLP #CrossLingualAI #MachineLearning
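For readers who want the gist of the contrastive objective mentioned in the M3DR summary above: below is a minimal sketch of a symmetric InfoNCE-style loss between multilingual query embeddings and document-page embeddings, the standard recipe for this kind of cross-lingual, cross-modal alignment. The embedding dimension and temperature are placeholder assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of (query, document-page) pairs.

    query_emb: [B, D] multilingual text-query embeddings
    doc_emb:   [B, D] document-image embeddings; row i matches query i.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # [B, B] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # In-batch negatives: every non-matching page is a negative for the query,
    # and vice versa, which pulls the two modalities into one shared space.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random embeddings (D=128, batch of 8)
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```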
✨EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
📝 Summary:
EMMA is an efficient unified architecture for multimodal tasks like understanding, generation, and editing. It uses novel components including an autoencoder, channel-wise concatenation, and mixture-of-experts. EMMA achieves superior performance and efficiency over state-of-the-art unified models.
🔹 Publication Date: Published on Dec 4, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04810
• PDF: https://arxiv.org/pdf/2512.04810
• Project Page: https://emma-umm.github.io/emma/
• Github: https://emma-umm.github.io/emma/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #GenerativeAI #DeepLearning #AIArchitecture #EfficientAI
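The EMMA summary names channel-wise concatenation and a mixture-of-experts without detail; here is a hedged toy sketch of those two generic building blocks (made-up sizes, hard top-1 routing). It illustrates the operations in general, not EMMA's actual architecture.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 token routing over a few expert MLPs (illustrative only)."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):                    # x: [B, T, D]
        scores = self.gate(x).softmax(-1)    # [B, T, E] routing weights
        top = scores.argmax(-1)              # hard top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

# Channel-wise concatenation: fuse image and text features along the channel
# axis, project back to the model width, then run the MoE block.
B, T, D = 2, 16, 64
img_feats, txt_feats = torch.randn(B, T, D), torch.randn(B, T, D)
fused = torch.cat([img_feats, txt_feats], dim=-1)      # [B, T, 2D]
fused = nn.Linear(2 * D, D)(fused)
out = TinyMoE(D)(fused)
```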
✨UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
📝 Summary:
UnityVideo is a unified framework enhancing video generation by integrating multiple modalities and training paradigms. It uses dynamic noising and a modality switcher for comprehensive world understanding. This improves video quality, consistency, and zero-shot generalization to new data.
🔹 Publication Date: Published on Dec 8, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.07831
• PDF: https://arxiv.org/pdf/2512.07831
• Project Page: https://jackailab.github.io/Projects/UnityVideo/
• Github: https://github.com/dvlab-research/UnityVideo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #MultimodalAI #GenerativeAI #DeepLearning #AIResearch
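"Dynamic noising" and the "modality switcher" are only named in the UnityVideo summary above; the snippet below is a speculative sketch of the general idea, sampling an independent diffusion timestep per modality and gating which streams join a training step. Shapes, schedule, and names are illustrative assumptions, not the paper's implementation.

```python
import torch

def dynamic_noise(latents_by_modality, num_timesteps=1000):
    """Add noise to each modality's latents at an independently sampled timestep.

    latents_by_modality: dict of modality name -> tensor [B, ...]
    Returns noisy latents plus the sampled timesteps (the denoiser's targets).
    """
    noisy, timesteps = {}, {}
    for name, x in latents_by_modality.items():
        t = torch.randint(0, num_timesteps, (x.shape[0],))
        alpha = 1.0 - t.float() / num_timesteps          # toy linear schedule
        alpha = alpha.view(-1, *([1] * (x.dim() - 1)))
        noisy[name] = alpha.sqrt() * x + (1 - alpha).sqrt() * torch.randn_like(x)
        timesteps[name] = t
    return noisy, timesteps

def modality_switch(batch, active=("video", "depth")):
    """Keep only the modality streams selected for this training step."""
    return {k: v for k, v in batch.items() if k in active}

batch = {"video": torch.randn(2, 8, 4, 32, 32), "depth": torch.randn(2, 8, 1, 32, 32)}
noisy, ts = dynamic_noise(modality_switch(batch))
```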
✨GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.
🔹 Publication Date: Published on Jul 1, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.01006
• Explainer: https://arxivexplained.com/papers/glm-41v-thinking-towards-versatile-multimodal-reasoning-with-scalable-reinforcement-learning
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking
🔹 Models citing this paper:
• https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
• https://huggingface.co/zai-org/GLM-4.5V
• https://huggingface.co/zai-org/GLM-4.6V-Flash
✨ Spaces citing this paper:
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
• https://huggingface.co/spaces/akhaliq/anycoder
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
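A minimal inference sketch for the released checkpoint listed above, assuming a recent transformers version that supports the model through the image-text-to-text pipeline; the prompt and image URL are placeholders, not from the paper.

```python
# Hedged sketch: assumes zai-org/GLM-4.1V-9B-Thinking loads via the standard
# transformers "image-text-to-text" pipeline on a recent release.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="zai-org/GLM-4.1V-9B-Thinking",
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "Think through the chart step by step, then answer: which year had the largest increase?"},
    ],
}]
out = pipe(text=messages, max_new_tokens=512)
print(out)
```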
✨The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
📝 Summary:
This paper highlights the gap between SAM2 and SAM3. SAM2 uses spatial prompts for geometric segmentation, but SAM3 is a concept-driven multimodal model with a unified vision-language architecture. SAM3 represents a new class of foundation model for concept-driven segmentation.
🔹 Publication Date: Published on Dec 4, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06032
• PDF: https://arxiv.org/pdf/2512.06032
• Github: https://github.com/Applied-AI-Research-Lab/The-SAM2-to-SAM3-Gap-in-the-Segment-Anything-Model-Family
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageSegmentation #FoundationModels #ComputerVision #MultimodalAI #AIResearch
✨Thinking with Images via Self-Calling Agent
📝 Summary:
sCoT is a novel visual reasoning paradigm that reformulates interleaved multimodal chain-of-thought (CoT) as a language-only CoT with self-calling subagents. It improves reasoning performance and efficiency by avoiding explicit multimodal interleaving and by using group-relative policy optimization.
🔹 Publication Date: Published on Dec 9, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.08511
• PDF: https://arxiv.org/pdf/2512.08511
• Github: https://github.com/YWenxi/think-with-images-through-self-calling
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisualReasoning #MultimodalAI #LLMs #AIagents #AIResearch
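Group-relative policy optimization is named but not defined in the summary above; the snippet sketches its core step under the usual GRPO formulation, standardizing each sampled response's reward against its own group to get advantages. This is the generic recipe, not necessarily the paper's exact variant.

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within each group of samples.

    rewards: [num_prompts, group_size] scalar rewards for the responses sampled
             per prompt. Each response is scored against its own group, so no
             separate value network is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# 2 prompts, 4 sampled responses each
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                              [0.5, 0.5, 0.0, 1.0]]))
```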
✨DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
📝 Summary:
DentalGPT is a specialized dental multimodal LLM. It improves fine-grained visual understanding and reasoning using a large dataset and reinforcement learning. DentalGPT achieves superior performance in dental disease classification and VQA.
🔹 Publication Date: Published on Dec 12, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11558
• PDF: https://arxiv.org/pdf/2512.11558
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DentalGPT #DentistryAI #LLM #MultimodalAI #HealthcareTech
✨Agent S: An Open Agentic Framework that Uses Computers Like a Human
📝 Summary:
Agent S is an open agentic framework enabling autonomous GUI interaction to automate complex tasks. It employs experience-augmented hierarchical planning and an Agent-Computer Interface with MLLMs for enhanced reasoning. Agent S achieves state-of-the-art performance on OSWorld and demonstrates br...
🔹 Publication Date: Published on Oct 10, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.08164
• PDF: https://arxiv.org/pdf/2410.08164
• Related Collection: https://huggingface.co/collections/ranpox/awesome-computer-use-agents
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AgenticAI #MultimodalAI #HumanComputerInteraction #Automation #AIResearch
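A toy sketch of the experience-augmented hierarchical planning loop described in the Agent S summary above, with the memory, planner, and executor stubbed out; the real system drives each of these steps with MLLM calls issued through its Agent-Computer Interface.

```python
# Stand-ins only: the experience store, planner, and executor are toys.
from difflib import SequenceMatcher

EXPERIENCE = {  # past task -> plan that worked (narrative memory)
    "rename a file in the file manager": [
        "open file manager", "right-click file", "choose rename", "type new name",
    ],
}

def retrieve_similar(task):
    """Fetch the most similar past plan to seed the new high-level plan."""
    best = max(EXPERIENCE, key=lambda t: SequenceMatcher(None, t, task).ratio())
    return EXPERIENCE[best]

def plan(task):
    """Hierarchical planning: reuse retrieved experience as high-level subtasks."""
    return retrieve_similar(task)

def execute(subtask):
    """Stand-in for low-level GUI actions issued through the ACI."""
    print(f"[ACI] executing: {subtask}")
    return True

task = "rename a document on the desktop"
for subtask in plan(task):
    if not execute(subtask):
        break  # a real agent would replan here and store the new experience
```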
✨MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
📝 Summary:
MeViS is a multi-modal dataset for referring motion expression video segmentation, addressing the need to segment and track objects based on their motion descriptions. It provides text and audio annotations for complex videos, enabling research into motion-guided video understanding.
🔹 Publication Date: Published on Dec 11, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.10945
• PDF: https://arxiv.org/pdf/2512.10945
• Project Page: https://henghuiding.com/MeViS/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoSegmentation #MultiModalAI #ComputerVision #Dataset #MotionUnderstanding
✨Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
📝 Summary:
MMRB2 is a new benchmark for multimodal reward models, evaluating them on interleaved image and text tasks using 4,000 expert-annotated preferences. It shows top models like Gemini 3 Pro achieve 75-80% accuracy, still below human performance, highlighting areas for improvement in these models.
🔹 Publication Date: Published on Dec 18, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16899
• PDF: https://arxiv.org/pdf/2512.16899
• Github: https://github.com/facebookresearch/MMRB2/tree/main
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #RewardModels #AIbenchmark #MachineLearning #AIResearch
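The 75-80% numbers above are pairwise preference accuracies; the sketch below shows how that metric is typically computed over expert-annotated (chosen, rejected) pairs, with a dummy scoring function standing in for a real multimodal reward model.

```python
def preference_accuracy(pairs, score):
    """Fraction of annotated pairs where the reward model scores the
    expert-preferred response above the rejected one."""
    correct = sum(score(p["chosen"]) > score(p["rejected"]) for p in pairs)
    return correct / len(pairs)

# Dummy stand-in for a multimodal reward model's scalar score
score = lambda response: len(response)      # placeholder heuristic, not a real model

pairs = [
    {"chosen": "caption grounded in the image", "rejected": "generic caption"},
    {"chosen": "edit that follows the instruction", "rejected": "edit ignores it"},
]
print(preference_accuracy(pairs, score))    # 1.0 on this toy data
```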