✨M3DR: Towards Universal Multilingual Multimodal Document Retrieval
📝 Summary:
M3DR is a framework for multilingual multimodal document retrieval that uses contrastive training to achieve robust cross-lingual and cross-modal alignment. It overcomes English-centric limitations, showing state-of-the-art performance across 22 diverse languages.
🔹 Publication Date: Published on Dec 3, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03514
• PDF: https://arxiv.org/pdf/2512.03514
• Project Page: https://www.cognitivelab.in/blog/introducing-netraembed
• Github: https://github.com/adithya-s-k/colpali
🔹 Models citing this paper:
• https://huggingface.co/Cognitive-Lab/NetraEmbed
• https://huggingface.co/Cognitive-Lab/ColNetraEmbed
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Cognitive-Lab/NayanaIR-CrossBench
✨ Spaces citing this paper:
• https://huggingface.co/spaces/AdithyaSK/NetraEmbed
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #InformationRetrieval #NLP #CrossLingualAI #MachineLearning
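For readers who want the gist of the contrastive objective mentioned in the M3DR summary above: below is a minimal sketch of a symmetric InfoNCE-style loss between multilingual query embeddings and document-page embeddings, the standard recipe for this kind of cross-lingual, cross-modal alignment. The embedding dimension and temperature are placeholder assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of (query, document-page) pairs.

    query_emb: [B, D] multilingual text-query embeddings
    doc_emb:   [B, D] document-image embeddings; row i matches query i.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # [B, B] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # In-batch negatives: every non-matching page is a negative for the query,
    # and vice versa, which pulls the two modalities into one shared space.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random embeddings (D=128, batch of 8)
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```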
✨EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
📝 Summary:
EMMA is an efficient unified architecture for multimodal tasks like understanding, generation, and editing. It uses novel components including an autoencoder, channel-wise concatenation, and mixture-of-experts. EMMA achieves superior performance and efficiency over state-of-the-art unified models.
🔹 Publication Date: Published on Dec 4, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04810
• PDF: https://arxiv.org/pdf/2512.04810
• Project Page: https://emma-umm.github.io/emma/
• Github: https://emma-umm.github.io/emma/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #GenerativeAI #DeepLearning #AIArchitecture #EfficientAI
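The EMMA summary names channel-wise concatenation and a mixture-of-experts without detail; here is a hedged toy sketch of those two generic building blocks (made-up sizes, hard top-1 routing). It illustrates the operations in general, not EMMA's actual architecture.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 token routing over a few expert MLPs (illustrative only)."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):                    # x: [B, T, D]
        scores = self.gate(x).softmax(-1)    # [B, T, E] routing weights
        top = scores.argmax(-1)              # hard top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

# Channel-wise concatenation: fuse image and text features along the channel
# axis, project back to the model width, then run the MoE block.
B, T, D = 2, 16, 64
img_feats, txt_feats = torch.randn(B, T, D), torch.randn(B, T, D)
fused = torch.cat([img_feats, txt_feats], dim=-1)      # [B, T, 2D]
fused = nn.Linear(2 * D, D)(fused)
out = TinyMoE(D)(fused)
```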
✨UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
📝 Summary:
UnityVideo is a unified framework enhancing video generation by integrating multiple modalities and training paradigms. It uses dynamic noising and a modality switcher for comprehensive world understanding. This improves video quality, consistency, and zero-shot generalization to new data.
🔹 Publication Date: Published on Dec 8, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.07831
• PDF: https://arxiv.org/pdf/2512.07831
• Project Page: https://jackailab.github.io/Projects/UnityVideo/
• Github: https://github.com/dvlab-research/UnityVideo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #MultimodalAI #GenerativeAI #DeepLearning #AIResearch
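"Dynamic noising" and the "modality switcher" are only named in the UnityVideo summary above; the snippet below is a speculative sketch of the general idea, sampling an independent diffusion timestep per modality and gating which streams join a training step. Shapes, schedule, and names are illustrative assumptions, not the paper's implementation.

```python
import torch

def dynamic_noise(latents_by_modality, num_timesteps=1000):
    """Add noise to each modality's latents at an independently sampled timestep.

    latents_by_modality: dict of modality name -> tensor [B, ...]
    Returns noisy latents plus the sampled timesteps (the denoiser's targets).
    """
    noisy, timesteps = {}, {}
    for name, x in latents_by_modality.items():
        t = torch.randint(0, num_timesteps, (x.shape[0],))
        alpha = 1.0 - t.float() / num_timesteps          # toy linear schedule
        alpha = alpha.view(-1, *([1] * (x.dim() - 1)))
        noisy[name] = alpha.sqrt() * x + (1 - alpha).sqrt() * torch.randn_like(x)
        timesteps[name] = t
    return noisy, timesteps

def modality_switch(batch, active=("video", "depth")):
    """Keep only the modality streams selected for this training step."""
    return {k: v for k, v in batch.items() if k in active}

batch = {"video": torch.randn(2, 8, 4, 32, 32), "depth": torch.randn(2, 8, 1, 32, 32)}
noisy, ts = dynamic_noise(modality_switch(batch))
```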
✨GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.
🔹 Publication Date: Published on Jul 1, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.01006
• Explainer: https://arxivexplained.com/papers/glm-41v-thinking-towards-versatile-multimodal-reasoning-with-scalable-reinforcement-learning
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking
🔹 Models citing this paper:
• https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
• https://huggingface.co/zai-org/GLM-4.5V
• https://huggingface.co/zai-org/GLM-4.6V-Flash
✨ Spaces citing this paper:
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
• https://huggingface.co/spaces/akhaliq/anycoder
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
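A minimal inference sketch for the released checkpoint listed above, assuming a recent transformers version that supports the model through the image-text-to-text pipeline; the prompt and image URL are placeholders, not from the paper.

```python
# Hedged sketch: assumes zai-org/GLM-4.1V-9B-Thinking loads via the standard
# transformers "image-text-to-text" pipeline on a recent release.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="zai-org/GLM-4.1V-9B-Thinking",
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "Think through the chart step by step, then answer: which year had the largest increase?"},
    ],
}]
out = pipe(text=messages, max_new_tokens=512)
print(out)
```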
✨The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
📝 Summary:
This paper highlights the gap between SAM2 and SAM3. SAM2 uses spatial prompts for geometric segmentation, but SAM3 is a concept-driven multimodal model with a unified vision-language architecture. SAM3 represents a new class of foundation model for concept-driven segmentation.
🔹 Publication Date: Published on Dec 4, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06032
• PDF: https://arxiv.org/pdf/2512.06032
• Github: https://github.com/Applied-AI-Research-Lab/The-SAM2-to-SAM3-Gap-in-the-Segment-Anything-Model-Family
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageSegmentation #FoundationModels #ComputerVision #MultimodalAI #AIResearch
✨Thinking with Images via Self-Calling Agent
📝 Summary:
sCoT is a novel visual reasoning paradigm that reformulates interleaved multimodal chain-of-thought (CoT) as a language-only CoT with self-calling subagents. It improves reasoning performance and efficiency by avoiding explicit multimodal interleaving and by using group-relative policy optimization.
🔹 Publication Date: Published on Dec 9, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.08511
• PDF: https://arxiv.org/pdf/2512.08511
• Github: https://github.com/YWenxi/think-with-images-through-self-calling
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisualReasoning #MultimodalAI #LLMs #AIagents #AIResearch
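Group-relative policy optimization is named but not defined in the summary above; the snippet sketches its core step under the usual GRPO formulation, standardizing each sampled response's reward against its own group to get advantages. This is the generic recipe, not necessarily the paper's exact variant.

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within each group of samples.

    rewards: [num_prompts, group_size] scalar rewards for the responses sampled
             per prompt. Each response is scored against its own group, so no
             separate value network is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# 2 prompts, 4 sampled responses each
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                              [0.5, 0.5, 0.0, 1.0]]))
```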
✨DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
📝 Summary:
DentalGPT is a specialized dental multimodal LLM. It improves fine-grained visual understanding and reasoning using a large dataset and reinforcement learning. DentalGPT achieves superior performance in dental disease classification and VQA.
🔹 Publication Date: Published on Dec 12, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11558
• PDF: https://arxiv.org/pdf/2512.11558
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DentalGPT #DentistryAI #LLM #MultimodalAI #HealthcareTech
✨Agent S: An Open Agentic Framework that Uses Computers Like a Human
📝 Summary:
Agent S is an open agentic framework enabling autonomous GUI interaction to automate complex tasks. It employs experience-augmented hierarchical planning and an Agent-Computer Interface with MLLMs for enhanced reasoning. Agent S achieves state-of-the-art performance on OSWorld and demonstrates br...
🔹 Publication Date: Published on Oct 10, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.08164
• PDF: https://arxiv.org/pdf/2410.08164
• Related Collection: https://huggingface.co/collections/ranpox/awesome-computer-use-agents
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AgenticAI #MultimodalAI #HumanComputerInteraction #Automation #AIResearch
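A toy sketch of the experience-augmented hierarchical planning loop described in the Agent S summary above, with the memory, planner, and executor stubbed out; the real system drives each of these steps with MLLM calls issued through its Agent-Computer Interface.

```python
# Stand-ins only: the experience store, planner, and executor are toys.
from difflib import SequenceMatcher

EXPERIENCE = {  # past task -> plan that worked (narrative memory)
    "rename a file in the file manager": [
        "open file manager", "right-click file", "choose rename", "type new name",
    ],
}

def retrieve_similar(task):
    """Fetch the most similar past plan to seed the new high-level plan."""
    best = max(EXPERIENCE, key=lambda t: SequenceMatcher(None, t, task).ratio())
    return EXPERIENCE[best]

def plan(task):
    """Hierarchical planning: reuse retrieved experience as high-level subtasks."""
    return retrieve_similar(task)

def execute(subtask):
    """Stand-in for low-level GUI actions issued through the ACI."""
    print(f"[ACI] executing: {subtask}")
    return True

task = "rename a document on the desktop"
for subtask in plan(task):
    if not execute(subtask):
        break  # a real agent would replan here and store the new experience
```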
✨MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
📝 Summary:
MeViS is a multi-modal dataset for referring motion expression video segmentation, addressing the need to segment and track objects based on their motion descriptions. It provides text and audio annotations for complex videos, enabling research into motion-guided video understanding.
🔹 Publication Date: Published on Dec 11, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.10945
• PDF: https://arxiv.org/pdf/2512.10945
• Project Page: https://henghuiding.com/MeViS/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoSegmentation #MultiModalAI #ComputerVision #Dataset #MotionUnderstanding
✨Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
📝 Summary:
MMRB2 is a new benchmark for multimodal reward models, evaluating them on interleaved image and text tasks using 4,000 expert-annotated preferences. It shows top models like Gemini 3 Pro achieve 75-80% accuracy, still below human performance, highlighting areas for improvement in these models.
🔹 Publication Date: Published on Dec 18, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16899
• PDF: https://arxiv.org/pdf/2512.16899
• Github: https://github.com/facebookresearch/MMRB2/tree/main
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #RewardModels #AIbenchmark #MachineLearning #AIResearch
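The 75-80% numbers above are pairwise preference accuracies; the sketch below shows how that metric is typically computed over expert-annotated (chosen, rejected) pairs, with a dummy scoring function standing in for a real multimodal reward model.

```python
def preference_accuracy(pairs, score):
    """Fraction of annotated pairs where the reward model scores the
    expert-preferred response above the rejected one."""
    correct = sum(score(p["chosen"]) > score(p["rejected"]) for p in pairs)
    return correct / len(pairs)

# Dummy stand-in for a multimodal reward model's scalar score
score = lambda response: len(response)      # placeholder heuristic, not a real model

pairs = [
    {"chosen": "caption grounded in the image", "rejected": "generic caption"},
    {"chosen": "edit that follows the instruction", "rejected": "edit ignores it"},
]
print(preference_accuracy(pairs, score))    # 1.0 on this toy data
```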