ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap left by existing single-video benchmarks and reveals significant performance limitations of current MLLMs in multi-video scenarios.
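
🔹 Sketch (not from the paper): a minimal way to pull the released benchmark data from the Hugging Face Hub, using the dataset repository linked below. The split and field names printed here are assumptions; inspect the output before building an evaluation loop.
```python
# Minimal sketch: download MVU-Eval data from the Hugging Face Hub.
# Assumes the repository exposes a standard `datasets`-loadable layout.
from datasets import load_dataset

mvu_eval = load_dataset("MVU-Eval-Team/MVU-Eval-Data")  # fetches all available splits
print(mvu_eval)                       # inspect split names and features first

first_split = next(iter(mvu_eval.values()))
print(first_split[0])                 # expected: multi-video references, a question, answer options
```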

🔹 Publication Date: Published on Nov 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

📝 Summary:
TimeSearch-R improves long-form video understanding by optimizing temporal search with reinforcement learning. It uses GRPO-CSV to verify searched frame completeness, leading to improved reasoning. This achieves state-of-the-art performance on multiple video benchmarks.
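
🔹 Sketch (not from the paper): the summary names GRPO-CSV, a group-relative policy optimization variant with completeness self-verification. Below is a minimal Python sketch of the standard group-relative advantage step, plus a hypothetical reward mixing answer correctness with a frame-completeness score; the weighting and reward design are assumptions, not the paper's formulation.
```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward by the
    mean/std over all rollouts sampled for the same query (as in GRPO)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rollout_reward(answer_correct: bool, completeness: float, w: float = 0.5) -> float:
    """Hypothetical reward: answer correctness plus a self-verified score of how
    complete the searched frame set is (both assumed to lie in [0, 1])."""
    return float(answer_correct) + w * completeness

# One query, a group of four temporal-search rollouts.
rewards = np.array([rollout_reward(True, 0.9), rollout_reward(True, 0.4),
                    rollout_reward(False, 0.7), rollout_reward(False, 0.2)])
print(grpo_advantages(rewards))
```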

🔹 Publication Date: Published on Nov 7

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.05489
• PDF: https://arxiv.org/pdf/2511.05489
• Github: https://github.com/Time-Search/TimeSearch-R

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ReinforcementLearning #DeepLearning #AIResearch #ComputerVision
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

📝 Summary:
EmoVid is a new multimodal, emotion-annotated video dataset designed for creative media like cartoons and movies. It bridges emotion understanding with video generation, significantly improving emotional expression and quality in generated videos. EmoVid establishes a new benchmark for affective ...

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11002
• PDF: https://arxiv.org/pdf/2511.11002

==================================

For more data science resources:
https://t.me/DataScienceT

#EmoVid #MultimodalAI #EmotionAI #VideoGeneration #VideoUnderstanding
Dynamic Reflections: Probing Video Representations with Text Alignment

📝 Summary:
This work presents the first comprehensive study of video-text representation alignment. It finds that alignment depends on data richness and correlates with downstream task performance, suggesting its value for general video understanding. This introduces video-text alignment as a zero-shot method ...
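
🔹 Sketch (not from the paper): the zero-shot recipe suggested by the summary is to score text against video by embedding similarity. The pooling choice and the random placeholder features below are assumptions standing in for frozen video and text encoders.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def align_score(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean-pool per-frame embeddings (assumed pooling) and compare with the text embedding."""
    return cosine(frame_embs.mean(axis=0), text_emb)

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 512))    # placeholder: 16 frame features from a video encoder
captions = rng.normal(size=(3, 512))   # placeholder: 3 candidate text descriptions
scores = [align_score(frames, c) for c in captions]
print("best caption index:", int(np.argmax(scores)))
```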

🔹 Publication Date: Published on Nov 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02767
• PDF: https://arxiv.org/pdf/2511.02767
• Project Page: https://video-prh.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #TextAlignment #VideoTextAI #ZeroShotLearning #RepresentationLearning
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework enabling MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning for long videos without extra fine-tuning, achieving stron...

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
VIDEOP2R: Video Understanding from Perception to Reasoning

📝 Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It separately models perception and reasoning processes, using a new CoT dataset and a process-aware RL algorithm. This approach achieves state-of-the-art results on video reasoning benchmarks.
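
🔹 Sketch (not from the paper): the summary says perception and reasoning are modeled as separate stages. A minimal two-stage prompt illustrating that separation is below; the wording and stage tags are assumptions, not the paper's CoT format.
```python
def build_p2r_prompt(question: str) -> str:
    """Hypothetical two-stage prompt: describe what is perceived first,
    then reason over that description before answering."""
    return (
        "Stage 1 (Perception): Describe only what is visible in the sampled frames, "
        "without answering the question.\n"
        "Stage 2 (Reasoning): Using the Stage 1 description, reason step by step and answer.\n\n"
        f"Question: {question}"
    )

print(build_p2r_prompt("Why does the goalkeeper move to the left?"))
```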

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11113v1
• PDF: https://arxiv.org/pdf/2511.11113

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...
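
🔹 Sketch (not from the paper): the summary describes TransV as compressing redundant vision tokens into instruction tokens. A minimal PyTorch cross-attention sketch of that idea follows; the module name, dimensions, and residual form are assumptions.
```python
import torch
import torch.nn as nn

class VisionToInstructionCompressor(nn.Module):
    """Hypothetical sketch: instruction tokens attend to the many vision tokens,
    absorbing their content so the vision tokens can be dropped afterwards."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr_tokens: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        pooled, _ = self.attn(query=instr_tokens, key=vision_tokens, value=vision_tokens)
        return self.norm(instr_tokens + pooled)   # keep only the (short) instruction sequence

instr = torch.randn(1, 32, 256)        # a handful of instruction tokens
vision = torch.randn(1, 10_000, 256)   # thousands of frame tokens from a long video
print(VisionToInstructionCompressor()(instr, vision).shape)   # torch.Size([1, 32, 256])
```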

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================

For more data science resources:
https://t.me/DataScienceT

#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

📝 Summary:
SciEducator is a self-evolving multi-agent system designed for scientific video understanding and education. It integrates professional knowledge and step-wise reasoning to interpret scientific activities and produce multimodal educational content. SciEducator significantly outperforms existing m...
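
🔹 Sketch (not from the paper): a Deming cycle is a Plan-Do-Check-Act loop. The skeleton below shows that control flow with placeholder agent functions; the actual agent roles, prompts, and stopping rule are assumptions.
```python
def plan(task, feedback):        # placeholder planner agent
    return f"plan for '{task}', addressing: {feedback}"

def do(plan_text):               # placeholder executor agent (e.g., drafts lesson content)
    return f"draft based on [{plan_text}]"

def check(draft, round_idx):     # placeholder reviewer agent returning a quality score in [0, 1]
    return min(1.0, 0.4 + 0.3 * round_idx)

def act(score, threshold=0.8):   # placeholder: turn the review into feedback or accept
    return "accept" if score >= threshold else "add step-wise reasoning and cite sources"

def deming_cycle(task: str, max_rounds: int = 4) -> str:
    feedback = "initial attempt"
    for round_idx in range(max_rounds):        # Plan -> Do -> Check -> Act
        draft = do(plan(task, feedback))
        feedback = act(check(draft, round_idx))
        if feedback == "accept":
            break
    return draft

print(deming_cycle("explain a titration experiment video"))
```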

🔹 Publication Date: Published on Nov 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17943
• PDF: https://arxiv.org/pdf/2511.17943

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #AIEducation #VideoUnderstanding #EdTech #AIResearch
Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

📝 Summary:
Click2Graph is an interactive framework for Panoptic Video Scene Graph Generation. It uses a single user click to segment, track, discover interactions, and predict triplets for temporally consistent scene graphs. This enables user-guided, controllable video scene understanding.
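
🔹 Sketch (not from the paper): the output the summary implies is a set of temporally consistent (subject, predicate, object) triplets seeded by one click. The illustrative types below are hypothetical, not the paper's interface.
```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    label: str                                     # e.g., "person", "bicycle"
    masks: dict = field(default_factory=dict)      # frame index -> segmentation mask (omitted)

@dataclass
class Triplet:
    subject: Track
    predicate: str                                 # e.g., "riding", "holding"
    obj: Track
    frames: list = field(default_factory=list)     # frame indices where the relation holds

def click_to_graph(click_xy, num_frames):
    """Placeholder pipeline: click -> segment and track the subject -> find
    interacting objects -> predict predicates over time (real models omitted)."""
    subject = Track(0, "person")
    obj = Track(1, "bicycle")
    return [Triplet(subject, "riding", obj, frames=list(range(num_frames)))]

print(click_to_graph((412, 230), num_frames=8))
```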

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15948
• PDF: https://arxiv.org/pdf/2511.15948

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #SceneGraphs #ComputerVision #InteractiveAI #AIResearch
ViDiC: Video Difference Captioning

📝 Summary:
The ViDiC task and ViDiC-1K dataset evaluate MLLMs' ability to describe differences between video pairs, moving beyond the limits of static image captioning. The benchmark assesses motion and event evolution and reveals significant performance gaps in current models' comparative video understanding.

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03405
• PDF: https://arxiv.org/pdf/2512.03405
• Project Page: https://vidic-1k.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoCaptioning #MLLM #VideoUnderstanding #ComputerVision #AIResearch
OneThinker: All-in-one Reasoning Model for Image and Video

📝 Summary:
OneThinker is an all-in-one model unifying image and video understanding across diverse tasks like QA, captioning, and tracking. It employs a new training corpus and RL method for balanced optimization, achieving strong performance and knowledge transfer across 31 benchmarks. This advances toward...

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03043
• PDF: https://arxiv.org/pdf/2512.03043
• Github: https://github.com/tulerfeng/OneThinker

🔹 Models citing this paper:
https://huggingface.co/OneThink/OneThinker-8B
https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B

🔹 Datasets citing this paper:
https://huggingface.co/datasets/OneThink/OneThinker-train-data
https://huggingface.co/datasets/OneThink/OneThinker-eval

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #MultimodalAI #DeepLearning #VideoUnderstanding
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📝 Summary:
Large Multimodal Models struggle with long video understanding due to context limits. The DIG framework adapts frame selection to query types, using efficient uniform sampling for global queries and specialized selection for localized ones. This approach significantly improves LMM performance on ...
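
🔹 Sketch (not from the paper): the query-adaptive routing described above can be illustrated as uniform sampling for global queries and similarity-based top-k selection for localized ones. The query classifier and the features below are placeholders, not DIG's components.
```python
import numpy as np

def uniform_sample(num_frames: int, budget: int) -> np.ndarray:
    """Evenly spaced frame indices for global queries (e.g., 'summarize the video')."""
    return np.linspace(0, num_frames - 1, budget).round().astype(int)

def topk_by_similarity(frame_embs: np.ndarray, query_emb: np.ndarray, budget: int) -> np.ndarray:
    """Most query-relevant frames for localized queries (e.g., 'when does the dog appear?')."""
    sims = frame_embs @ query_emb
    return np.sort(np.argsort(-sims)[:budget])

def select_frames(query_is_global: bool, frame_embs, query_emb, budget: int = 8):
    if query_is_global:
        return uniform_sample(len(frame_embs), budget)
    return topk_by_similarity(frame_embs, query_emb, budget)

rng = np.random.default_rng(0)
frames, query = rng.normal(size=(1200, 256)), rng.normal(size=256)
print(select_frames(False, frames, query))   # localized query -> most relevant frames, in temporal order
```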

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04000
• PDF: https://arxiv.org/pdf/2512.04000
• Github: https://github.com/Jialuo-Li/DIG

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #LMMs #MultimodalAI #DeepLearning #ComputerVision
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

📝 Summary:
Pyramid Sparse Attention (PSA) introduces multi-level pooled key-value representations to overcome information loss in traditional sparse attention. It dynamically retains critical information, improving efficiency and performance for video understanding and generation tasks.
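
🔹 Sketch (not from the paper): the core idea named above, multi-level pooled key/value representations, can be illustrated with a toy attention that runs over the concatenation of average-pooled K/V levels. PSA's actual sparsity pattern and block selection are not reproduced here.
```python
import torch
import torch.nn.functional as F

def pyramid_pool_kv(k: torch.Tensor, v: torch.Tensor, levels=(2, 8, 32)):
    """Build multi-level pooled key/value representations (average pooling along the sequence)."""
    ks, vs = [], []
    for stride in levels:
        ks.append(F.avg_pool1d(k.transpose(1, 2), stride, stride).transpose(1, 2))
        vs.append(F.avg_pool1d(v.transpose(1, 2), stride, stride).transpose(1, 2))
    return torch.cat(ks, dim=1), torch.cat(vs, dim=1)

def pooled_attention(q, k, v):
    k_p, v_p = pyramid_pool_kv(k, v)             # far fewer K/V entries than the raw sequence
    attn = torch.softmax(q @ k_p.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_p

q = torch.randn(1, 64, 128)        # query block
k = v = torch.randn(1, 4096, 128)  # long video token sequence
print(pooled_attention(q, k, v).shape)   # torch.Size([1, 64, 128])
```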

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04025
• PDF: https://arxiv.org/pdf/2512.04025
• Project Page: https://ziplab.co/PSA/
• Github: https://github.com/ziplab/Pyramid-Sparse-Attention

==================================

For more data science resources:
https://t.me/DataScienceT

#SparseAttention #VideoUnderstanding #VideoGeneration #DeepLearning #ComputerVision
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

📝 Summary:
The SANTA framework addresses object and action hallucinations in multimodal LLM video captions. It uses self-augmented contrastive alignment to identify potential hallucinations and then aligns regional objects and actions with visual phrases, improving factual accuracy. Experiments show SANTA o...
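
🔹 Sketch (not from the paper): the alignment step described above can be illustrated with a symmetric InfoNCE loss between regional visual features and phrase embeddings. SANTA's self-augmentation and pairing strategy are not reproduced; shapes and temperature are assumptions.
```python
import torch
import torch.nn.functional as F

def infonce(region_feats: torch.Tensor, phrase_feats: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: the i-th region should match the i-th phrase and repel the rest.
    Inputs are (N, D) batches from a visual encoder and a text encoder."""
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = r @ p.t() / temperature
    targets = torch.arange(len(r), device=r.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

regions = torch.randn(8, 256)   # e.g., pooled features of detected objects/actions
phrases = torch.randn(8, 256)   # embeddings of the matching caption phrases
print(infonce(regions, phrases).item())
```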

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04356
• PDF: https://arxiv.org/pdf/2512.04356
• Project Page: https://kpc0810.github.io/santa/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalLLMs #AI #Hallucinations #VideoUnderstanding #ContrastiveLearning
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

📝 Summary:
Active Video Perception (AVP) improves long video understanding by actively seeking query-relevant evidence. It uses an iterative plan-observe-reflect process, acquiring compact evidence directly from pixels. This achieves higher accuracy with reduced computational cost.
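
🔹 Sketch (not from the paper): the plan-observe-reflect loop described above, written as a skeleton. The planner, observer, and reflector are placeholders standing in for MLLM calls, and the stopping rule is an assumption.
```python
def plan(question, evidence):      # placeholder: choose the next time window to inspect
    return {"start_s": 30 * len(evidence), "end_s": 30 * len(evidence) + 10}

def observe(window):               # placeholder: sample frames in the window and describe them
    return f"description of frames {window['start_s']}-{window['end_s']}s"

def reflect(question, evidence):   # placeholder: is the collected evidence sufficient?
    return len(evidence) >= 3

def active_video_perception(question: str, max_steps: int = 6):
    evidence = []
    for _ in range(max_steps):               # plan -> observe -> reflect
        window = plan(question, evidence)
        evidence.append(observe(window))
        if reflect(question, evidence):
            break
    return evidence                          # compact, query-relevant evidence for answering

print(active_video_perception("What does the chef add after the onions?"))
```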

🔹 Publication Date: Published on Dec 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05774
• PDF: https://arxiv.org/pdf/2512.05774

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ActiveLearning #ComputerVision #AIResearch #DeepLearning
LongVideoAgent: Multi-Agent Reasoning with Long Videos

📝 Summary:
A multi-agent framework with a master LLM, grounding agent, and vision agent enhances long-video QA by improving temporal grounding and extracting visual details. This RL-trained system outperforms non-agent baselines on new datasets.

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20618
• PDF: https://arxiv.org/pdf/2512.20618
• Project Page: https://longvideoagent.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #LLM #VideoUnderstanding #ComputerVision #AI
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

📝 Summary:
MLLMs struggle with hallucinations on counterfactual videos. To address this, DualityForge synthesizes counterfactual video data and QA pairs through diffusion-based editing, significantly reducing model hallucinations and improving general performance.

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24271
• PDF: https://arxiv.org/pdf/2512.24271
• Project Page: https://amap-ml.github.io/Taming-Hallucinations/
• Github: https://github.com/AMAP-ML/Taming-Hallucinations

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AIHallucinations #GenerativeAI #MachineLearning