ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Agent S: An Open Agentic Framework that Uses Computers Like a Human

📝 Summary:
Agent S is an open agentic framework enabling autonomous GUI interaction to automate complex tasks. It employs experience-augmented hierarchical planning and an Agent-Computer Interface with MLLMs for enhanced reasoning. Agent S achieves state-of-the-art performance on OSWorld and demonstrates broad generalizability to different operating systems.
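
As a concrete mental model, a minimal sketch of experience-augmented hierarchical planning over an Agent-Computer Interface might look like the following; all names (ExperienceStore, mllm, gui) are illustrative assumptions, not the Agent S API:

```python
# Hypothetical sketch, not the Agent S implementation: retrieve similar
# past episodes, plan subtasks with an MLLM, then ground each subtask
# into primitive GUI actions.
from dataclasses import dataclass, field

@dataclass
class ExperienceStore:
    episodes: list = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list:
        # Naive keyword match stands in for the paper's retrieval step.
        return [e for e in self.episodes if task.split()[0] in e][:k]

def plan_hierarchically(task: str, store: ExperienceStore, mllm) -> list:
    """Decompose a GUI task into subtasks, conditioned on retrieved experience."""
    examples = store.retrieve(task)
    prompt = f"Task: {task}\nSimilar past episodes: {examples}\nSubtasks:"
    return mllm.complete(prompt).splitlines()

def execute(subtasks: list, gui) -> None:
    """Ground each subtask into primitive GUI actions (click/type/scroll)
    through an Agent-Computer Interface."""
    for sub in subtasks:
        action = gui.ground(sub)  # e.g. {"op": "click", "target": "Save"}
        gui.perform(action)
```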

🔹 Publication Date: Published on Oct 10, 2024

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.08164
• PDF: https://arxiv.org/pdf/2410.08164
• Hugging Face Collection: https://huggingface.co/collections/ranpox/awesome-computer-use-agents

==================================

For more data science resources:
https://t.me/DataScienceT

#AgenticAI #MultimodalAI #HumanComputerInteraction #Automation #AIResearch
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

📝 Summary:
MeViS is a multi-modal dataset for referring motion expression video segmentation, addressing the need to segment and track objects based on their motion descriptions. It provides text and audio annotations for complex videos, enabling research into motion-guided video understanding.
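
To make "text and audio annotations" concrete, a sample in such a dataset plausibly pairs a video with a motion-phrased referring expression, its spoken form, and per-frame masks; the field names below are guesses for illustration, not the released schema:

```python
# Hypothetical sample layout (field names are guesses, not the schema).
sample = {
    "video": "videos/0001.mp4",
    "expression": "the bird taking off from the branch",  # motion-based, not appearance-based
    "audio": "audio/0001_expr.wav",                       # spoken form of the expression
    "masks": "annotations/0001/",                         # per-frame segmentation masks
}
```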

🔹 Publication Date: Published on Dec 11, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.10945
• PDF: https://arxiv.org/pdf/2512.10945
• Project Page: https://henghuiding.com/MeViS/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoSegmentation #MultiModalAI #ComputerVision #Dataset #MotionUnderstanding
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

📝 Summary:
MMRB2 is a new benchmark for multimodal reward models, evaluating them on interleaved image and text tasks using 4,000 expert-annotated preferences. It shows top models like Gemini 3 Pro achieve 75-80% accuracy, still below human performance, highlighting areas for improvement in these models.
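
For reference, pairwise preference accuracy for a reward model is typically computed as below; score is an assumed scoring interface, not the MMRB2 codebase:

```python
# Generic pairwise-accuracy sketch; `score(prompt, response) -> float`
# is an assumed reward-model interface.
def preference_accuracy(pairs, score) -> float:
    """pairs: (prompt, chosen, rejected) triples from expert annotations."""
    correct = sum(
        score(prompt, chosen) > score(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)
```

On this metric, 75-80% means the model ranks the expert-preferred response higher in roughly three of four pairs.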

🔹 Publication Date: Published on Dec 18, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16899
• PDF: https://arxiv.org/pdf/2512.16899
• Github: https://github.com/facebookresearch/MMRB2/tree/main

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #RewardModels #AIbenchmark #MachineLearning #AIResearch
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting substantial room for improvement in long-video reasoning and tool use.
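
The agentic side can be pictured as a controller loop that interleaves tool calls with reasoning; tool names and the llm/video interfaces below are hypothetical, not the LongShOTAgent implementation:

```python
# Hypothetical tool-use loop for long-video QA (not the paper's API).
TOOLS = {
    "search_transcript": lambda video, arg: video.transcript_search(arg),
    "sample_frames":     lambda video, arg: video.frames_around(arg),
}

def answer(question: str, video, llm, max_steps: int = 8) -> str:
    context = []
    for _ in range(max_steps):
        step = llm.decide(question, context)  # returns a tool call or a final answer
        if step["type"] == "answer":
            return step["text"]
        result = TOOLS[step["tool"]](video, step["arg"])
        context.append((step["tool"], result))
    return llm.best_guess(question, context)
```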

🔹 Publication Date: Published on Dec 18, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot

Datasets citing this paper:
https://huggingface.co/datasets/MBZUAI/longshot-bench

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

📝 Summary:
CASA enhances cross-attention for vision-language models by adding local text-to-text interaction. This approach substantially reduces the performance gap with costly token insertion methods on detailed visual tasks. CASA maintains efficiency and scalability for long-context multimodal applications.
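
The core idea can be illustrated with a toy attention routine in which each text query attends jointly over image tokens and a local window of preceding text tokens; shapes and windowing are illustrative, not CASA's exact formulation:

```python
# Toy version of the idea: text-to-image cross-attention augmented with
# local text-to-text keys. Not CASA's actual layer.
import torch
import torch.nn.functional as F

def casa_like_attention(text: torch.Tensor, image: torch.Tensor, window: int = 4):
    """text: (T, d) token features, image: (I, d). Returns (T, d)."""
    T, d = text.shape
    out = torch.empty_like(text)
    for t in range(T):
        lo = max(0, t - window)
        keys = torch.cat([image, text[lo:t + 1]], dim=0)  # cross + local self
        attn = F.softmax(text[t] @ keys.T / d ** 0.5, dim=-1)
        out[t] = attn @ keys  # values tied to keys in this toy version
    return out
```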

🔹 Publication Date: Published on Dec 22, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19535
• PDF: https://arxiv.org/pdf/2512.19535
• Project Page: https://kyutai.org/casa
• Github: https://github.com/kyutai-labs/casa

Models citing this paper:
https://huggingface.co/kyutai/CASA-Helium1-VL-2B

Spaces citing this paper:
https://huggingface.co/spaces/kyutai/casa-samples

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguage #MultimodalAI #AttentionMechanisms #EfficientAI #DeepLearning
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

📝 Summary:
T2AV-Compass introduces a unified benchmark for text-to-audio-video generation evaluation. It features 500 diverse prompts and a dual-level framework. Evaluations reveal current T2AV models struggle significantly with realism and cross-modal consistency.
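
A dual-level evaluation can be pictured as aggregating per-sample unimodal quality and cross-modal consistency into one benchmark score; the metric names below are assumptions, not the paper's exact protocol:

```python
# Hedged aggregation sketch (metric names are illustrative).
def compass_score(samples: list) -> dict:
    """samples: dicts with 'video_q', 'audio_q', 'av_sync' scores in [0, 1]."""
    n = len(samples)
    quality = sum(0.5 * (s["video_q"] + s["audio_q"]) for s in samples) / n
    consistency = sum(s["av_sync"] for s in samples) / n
    return {"quality": quality,
            "consistency": consistency,
            "overall": 0.5 * (quality + consistency)}
```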

🔹 Publication Date: Published on Dec 24, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21094
• PDF: https://arxiv.org/pdf/2512.21094
• Project Page: https://nju-link.github.io/T2AV-Compass/
• Github: https://github.com/NJU-LINK/T2AV-Compass/

==================================

For more data science resources:
https://t.me/DataScienceT

#TextToAudioVideo #MultimodalAI #AIEvaluation #GenerativeAI #AIResearch
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

📝 Summary:
VideoRAG introduces the first RAG framework for long videos, using a dual-channel architecture to integrate textual knowledge grounding and multi-modal context encoding. This enables unlimited-length video processing and significantly outperforms existing methods.
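
A dual-channel retrieval step might look like the sketch below: each video chunk is scored through a textual-knowledge channel and a visual-context channel, and the two are fused. The interfaces (embed_text, embed_visual, chunk fields) are assumptions, not the VideoRAG code:

```python
# Dual-channel retrieval sketch (interfaces are assumed, not VideoRAG's).
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

def retrieve(query: str, chunks, embed_text, embed_visual, k: int = 5):
    """chunks: objects with .transcript (str) and .visual_emb (vector)."""
    q_text, q_vis = embed_text(query), embed_visual(query)
    scored = [
        (max(cosine(q_text, embed_text(c.transcript)),  # textual knowledge channel
             cosine(q_vis, c.visual_emb)),              # multimodal context channel
         c)
        for c in chunks
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [c for _, c in scored[:k]]
```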

🔹 Publication Date: Published on Feb 3, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2502.01549
• PDF: https://arxiv.org/pdf/2502.01549
• Github: https://github.com/hkuds/videorag

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoRAG #RAG #LongVideo #AI #MultimodalAI
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints to ensure complete coverage of relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts.
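
Schematically, the two constraints could enter training as below; the loss form, margin, and model.nll/mask interfaces are illustrative assumptions, not the paper's objective:

```python
# Schematic only: scalar losses, assumed model.nll interface.
def bips_like_loss(model, image, question, answer, relevance_mask, margin=2.0):
    keep_view = image * relevance_mask        # question-relevant pixels visible
    drop_view = image * (1 - relevance_mask)  # question-relevant pixels hidden

    # Coverage constraint: the kept view alone must support the answer.
    l_cover = model.nll(keep_view, question, answer)

    # Reliance constraint: with relevant pixels hidden, producing the
    # answer should get harder, which blocks text-only shortcuts.
    l_rely = max(0.0, margin - model.nll(drop_view, question, answer))

    return l_cover + l_rely
```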

🔹 Publication Date: Published on Dec 26, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding

📝 Summary:
Omni-Weather is a new multimodal foundation model that unifies weather generation and understanding in a single architecture. It uses shared self-attention and a Chain-of-Thought dataset for interpretable, high-quality outputs, achieving state-of-the-art performance.

🔹 Publication Date: Published on Dec 25, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21643
• PDF: https://arxiv.org/pdf/2512.21643

==================================

For more data science resources:
https://t.me/DataScienceT

#WeatherGeneration #FoundationModels #MultimodalAI #AIResearch #DeepLearning
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

📝 Summary:
LiveTalk enables real-time multimodal interactive video generation from text, image, and audio by improving on-policy diffusion distillation. It reduces inference latency by 20x while maintaining quality, allowing seamless human-AI interaction.
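
The latency reduction comes from distillation collapsing a long denoising schedule into a handful of student steps; the generic few-step sampler below shows the shape of such inference, not LiveTalk's actual sampler:

```python
# Generic few-step sampler for a distilled diffusion student
# (illustrative; not LiveTalk's implementation).
import torch

def few_step_generate(student, cond, steps=(1.0, 0.66, 0.33, 0.0)):
    """cond bundles text/image/audio conditioning; the student predicts x0,
    so the whole clip costs only len(steps) - 1 network calls."""
    x = torch.randn(student.latent_shape)
    for t_cur, t_next in zip(steps[:-1], steps[1:]):
        x0 = student.predict_x0(x, t_cur, cond)  # one forward pass
        x = x0 + (t_next / t_cur) * (x - x0)     # partial re-noising toward t_next
    return x
```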

🔹 Publication Date: Published on Dec 29, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.23576
• PDF: https://arxiv.org/pdf/2512.23576
• Github: https://github.com/GAIR-NLP/LiveTalk

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoGeneration #AI #DiffusionModels #RealTimeAI #MultimodalAI
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

📝 Summary:
Dolphin is a novel multimodal model for document image parsing. It uses an analyze-then-parse approach with heterogeneous anchor prompting, achieving state-of-the-art performance and superior efficiency.
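
Analyze-then-parse can be sketched as a two-stage pipeline: first propose layout anchors (element type plus region), then parse each region with an element-specific prompt. The prompts and model interface are hypothetical, not Dolphin's:

```python
# Two-stage sketch with hypothetical prompts and model interface.
PROMPTS = {
    "table":     "Parse this table into markdown.",
    "formula":   "Transcribe this formula as LaTeX.",
    "paragraph": "Read out this text block.",
}

def parse_document(page_image, model) -> str:
    # Stage 1: layout analysis yields anchors in reading order.
    anchors = model.analyze_layout(page_image)  # [(elem_type, bbox), ...]
    # Stage 2: parse each anchored region with a type-specific prompt.
    parsed = []
    for elem_type, bbox in anchors:
        crop = page_image.crop(bbox)
        parsed.append(model.parse(crop, PROMPTS.get(elem_type, "Read this region.")))
    return "\n\n".join(parsed)
```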

🔹 Publication Date: Published on May 20, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.14059
• PDF: https://arxiv.org/pdf/2505.14059
• Github: https://github.com/bytedance/dolphin

==================================

For more data science resources:
https://t.me/DataScienceT

#DocumentParsing #MultimodalAI #DeepLearning #ComputerVision #AI
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

📝 Summary:
SenseNova-MARS empowers Vision-Language Models with interleaved visual reasoning and dynamic tool use like search and cropping via reinforcement learning. It achieves state-of-the-art performance on complex visual tasks, outperforming proprietary models on new and existing benchmarks.
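
The RL signal can be pictured as rewarding rollouts that reach a correct answer after interleaved tool calls, as in this generic policy-gradient sketch; the reward, baseline, and policy interface are placeholders, not the paper's algorithm:

```python
# Generic policy-gradient placeholder, not the paper's training loop.
def rl_step(policy, task, n_rollouts: int = 8) -> None:
    rollouts = [policy.rollout(task, tools=["search", "crop"])
                for _ in range(n_rollouts)]
    rewards = [1.0 if r.answer == task.gold else 0.0 for r in rollouts]
    baseline = sum(rewards) / len(rewards)  # mean reward as a simple baseline
    for r, rew in zip(rollouts, rewards):
        policy.reinforce(r.log_prob, advantage=rew - baseline)
```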

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24330
• PDF: https://arxiv.org/pdf/2512.24330
• Github: https://github.com/OpenSenseNova/SenseNova-MARS

Datasets citing this paper:
https://huggingface.co/datasets/sensenova/SenseNova-MARS-Data
https://huggingface.co/datasets/sensenova/HR-MMSearch

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #ReinforcementLearning #VisionLanguageModels #AgenticAI #ComputerVision
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

📝 Summary:
OmniVCus introduces a system for feedforward multi-subject video customization with multimodal controls. It proposes a data pipeline, VideoCus-Factory, and a diffusion Transformer framework with novel embedding mechanisms. This enables more subjects and precise editing, significantly outperforming existing methods.

🔹 Publication Date: Published on Jun 29, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2506.23361
• PDF: https://arxiv.org/pdf/2506.23361
• Project Page: https://caiyuanhao1998.github.io/project/OmniVCus/
• Github: https://github.com/caiyuanhao1998/Open-OmniVCus

Models citing this paper:
https://huggingface.co/CaiYuanhao/OmniVCus

Datasets citing this paper:
https://huggingface.co/datasets/CaiYuanhao/OmniVCus
https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test
https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoGeneration #DiffusionModels #MultimodalAI #DeepLearning #ComputerVision
M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

📝 Summary:
Existing concept erasure methods in diffusion models are vulnerable to non-text inputs. M-ErasureBench is a new multimodal evaluation framework, and IRECE is a module to restore robustness against these attacks, reducing concept reproduction.

🔹 Publication Date: Published on Dec 28, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22877
• PDF: https://arxiv.org/pdf/2512.22877

==================================

For more data science resources:
https://t.me/DataScienceT

#DiffusionModels #ConceptErasure #MultimodalAI #AISafety #MachineLearning
CPPO: Contrastive Perception for Vision Language Policy Optimization

📝 Summary:
CPPO improves vision-language model fine-tuning by detecting perception tokens through entropy shifts. It then applies a Contrastive Perception Loss to enhance multimodal reasoning, outperforming prior methods more efficiently.
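
The entropy-shift detector and contrastive term might be sketched as follows: compare per-token predictive entropy with and without the image, flag tokens whose entropy drops sharply, and push their image-conditioned distribution away from the image-free one. The threshold and loss form are illustrative assumptions:

```python
# Illustrative sketch of entropy-shift detection plus a contrastive
# term; not CPPO's exact objective.
import torch
import torch.nn.functional as F

def perception_token_mask(logits_img, logits_noimg, tau: float = 0.5):
    """Flag tokens whose entropy drops by more than tau once the image is seen."""
    def entropy(logits):
        return -(F.softmax(logits, -1) * F.log_softmax(logits, -1)).sum(-1)
    return (entropy(logits_noimg) - entropy(logits_img)) > tau

def contrastive_perception_loss(logits_img, logits_noimg, mask):
    """Maximize divergence between image-conditioned and image-free
    distributions on perception tokens only."""
    kl = F.kl_div(F.log_softmax(logits_img, -1),
                  F.softmax(logits_noimg, -1),
                  reduction="none").sum(-1)          # per-token KL
    return -(kl * mask).sum() / mask.sum().clamp(min=1)
```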

🔹 Publication Date: Published on Jan 1, 2026

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00501
• PDF: https://arxiv.org/pdf/2601.00501

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #MultimodalAI #ContrastiveLearning #DeepLearning #AIResearch
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

📝 Summary:
This paper introduces IMDD-1M, a large dataset of 1 million industrial defect image-text pairs. It enables training a vision-language foundation model tailored for industrial use. This model achieves comparable performance with less data for specialized tasks, promoting data-efficient quality inspection.

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24160
• PDF: https://arxiv.org/pdf/2512.24160

==================================

For more data science resources:
https://t.me/DataScienceT

#IndustrialAI #VisionLanguageModel #DefectDetection #MultimodalAI #ComputerVision
Afri-MCQA: Multimodal Cultural Question Answering for African Languages

📝 Summary:
Afri-MCQA is the first multimodal cultural QA benchmark for 15 African languages. It shows open-weight LLMs perform poorly, particularly with native language speech and cultural contexts. This highlights the need for speech-first, culturally grounded AI development.

🔹 Publication Date: Published on Jan 9, 2026

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.05699
• PDF: https://arxiv.org/pdf/2601.05699

Datasets citing this paper:
https://huggingface.co/datasets/Atnafu/Afri-MCQA

==================================

For more data science resources:
https://t.me/DataScienceT

#AfricanLanguages #MultimodalAI #LLMs #CulturalAI #SpeechAI