ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap left by existing single-video benchmarks and reveals significant performance limitations of current MLLMs in multi-video scenarios.
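
🔹 Sketch (not from the paper): a minimal way to pull the released benchmark data from the Hugging Face Hub, using the dataset repository linked below. The split and field names printed here are assumptions; inspect the output before building an evaluation loop.
```python
# Minimal sketch: download MVU-Eval data from the Hugging Face Hub.
# Assumes the repository exposes a standard `datasets`-loadable layout.
from datasets import load_dataset

mvu_eval = load_dataset("MVU-Eval-Team/MVU-Eval-Data")  # fetches all available splits
print(mvu_eval)                       # inspect split names and features first

first_split = next(iter(mvu_eval.values()))
print(first_split[0])                 # expected: multi-video references, a question, answer options
```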

🔹 Publication Date: Published on Nov 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

📝 Summary:
TimeSearch-R improves long-form video understanding by optimizing temporal search with reinforcement learning. It uses GRPO-CSV to verify searched frame completeness, leading to improved reasoning. This achieves state-of-the-art performance on multiple video benchmarks.
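
🔹 Sketch (not from the paper): the summary names GRPO-CSV, a group-relative policy optimization variant with completeness self-verification. Below is a minimal Python sketch of the standard group-relative advantage step, plus a hypothetical reward mixing answer correctness with a frame-completeness score; the weighting and reward design are assumptions, not the paper's formulation.
```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward by the
    mean/std over all rollouts sampled for the same query (as in GRPO)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rollout_reward(answer_correct: bool, completeness: float, w: float = 0.5) -> float:
    """Hypothetical reward: answer correctness plus a self-verified score of how
    complete the searched frame set is (both assumed to lie in [0, 1])."""
    return float(answer_correct) + w * completeness

# One query, a group of four temporal-search rollouts.
rewards = np.array([rollout_reward(True, 0.9), rollout_reward(True, 0.4),
                    rollout_reward(False, 0.7), rollout_reward(False, 0.2)])
print(grpo_advantages(rewards))
```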

🔹 Publication Date: Published on Nov 7

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.05489
• PDF: https://arxiv.org/pdf/2511.05489
• Github: https://github.com/Time-Search/TimeSearch-R

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ReinforcementLearning #DeepLearning #AIResearch #ComputerVision
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

📝 Summary:
EmoVid is a new multimodal, emotion-annotated video dataset designed for creative media like cartoons and movies. It bridges emotion understanding with video generation, significantly improving emotional expression and quality in generated videos. EmoVid establishes a new benchmark for affective ...

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11002
• PDF: https://arxiv.org/pdf/2511.11002

==================================

For more data science resources:
https://t.me/DataScienceT

#EmoVid #MultimodalAI #EmotionAI #VideoGeneration #VideoUnderstanding
Dynamic Reflections: Probing Video Representations with Text Alignment

📝 Summary:
This work presents the first comprehensive study of video-text representation alignment. It finds that alignment depends on data richness and correlates with downstream task performance, suggesting its value for general video understanding. This introduces video-text alignment as a zero-shot method ...
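
🔹 Sketch (not from the paper): the zero-shot recipe suggested by the summary is to score text against video by embedding similarity. The pooling choice and the random placeholder features below are assumptions standing in for frozen video and text encoders.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def align_score(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean-pool per-frame embeddings (assumed pooling) and compare with the text embedding."""
    return cosine(frame_embs.mean(axis=0), text_emb)

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 512))    # placeholder: 16 frame features from a video encoder
captions = rng.normal(size=(3, 512))   # placeholder: 3 candidate text descriptions
scores = [align_score(frames, c) for c in captions]
print("best caption index:", int(np.argmax(scores)))
```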

🔹 Publication Date: Published on Nov 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02767
• PDF: https://arxiv.org/pdf/2511.02767
• Project Page: https://video-prh.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #TextAlignment #VideoTextAI #ZeroShotLearning #RepresentationLearning
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework enabling MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning for long videos without extra fine-tuning, achieving stron...

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
VIDEOP2R: Video Understanding from Perception to Reasoning

📝 Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It separately models perception and reasoning processes, using a new CoT dataset and a process-aware RL algorithm. This approach achieves state-of-the-art results on video reasoning benchmarks.
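
🔹 Sketch (not from the paper): the summary says perception and reasoning are modeled as separate stages. A minimal two-stage prompt illustrating that separation is below; the wording and stage tags are assumptions, not the paper's CoT format.
```python
def build_p2r_prompt(question: str) -> str:
    """Hypothetical two-stage prompt: describe what is perceived first,
    then reason over that description before answering."""
    return (
        "Stage 1 (Perception): Describe only what is visible in the sampled frames, "
        "without answering the question.\n"
        "Stage 2 (Reasoning): Using the Stage 1 description, reason step by step and answer.\n\n"
        f"Question: {question}"
    )

print(build_p2r_prompt("Why does the goalkeeper move to the left?"))
```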

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11113v1
• PDF: https://arxiv.org/pdf/2511.11113

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...
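
🔹 Sketch (not from the paper): the summary describes TransV as compressing redundant vision tokens into instruction tokens. A minimal PyTorch cross-attention sketch of that idea follows; the module name, dimensions, and residual form are assumptions.
```python
import torch
import torch.nn as nn

class VisionToInstructionCompressor(nn.Module):
    """Hypothetical sketch: instruction tokens attend to the many vision tokens,
    absorbing their content so the vision tokens can be dropped afterwards."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr_tokens: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        pooled, _ = self.attn(query=instr_tokens, key=vision_tokens, value=vision_tokens)
        return self.norm(instr_tokens + pooled)   # keep only the (short) instruction sequence

instr = torch.randn(1, 32, 256)        # a handful of instruction tokens
vision = torch.randn(1, 10_000, 256)   # thousands of frame tokens from a long video
print(VisionToInstructionCompressor()(instr, vision).shape)   # torch.Size([1, 32, 256])
```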

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================

For more data science resources:
https://t.me/DataScienceT

#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

📝 Summary:
SciEducator is a self-evolving multi-agent system designed for scientific video understanding and education. It integrates professional knowledge and step-wise reasoning to interpret scientific activities and produce multimodal educational content. SciEducator significantly outperforms existing m...
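
🔹 Sketch (not from the paper): a Deming cycle is a Plan-Do-Check-Act loop. The skeleton below shows that control flow with placeholder agent functions; the actual agent roles, prompts, and stopping rule are assumptions.
```python
def plan(task, feedback):        # placeholder planner agent
    return f"plan for '{task}', addressing: {feedback}"

def do(plan_text):               # placeholder executor agent (e.g., drafts lesson content)
    return f"draft based on [{plan_text}]"

def check(draft, round_idx):     # placeholder reviewer agent returning a quality score in [0, 1]
    return min(1.0, 0.4 + 0.3 * round_idx)

def act(score, threshold=0.8):   # placeholder: turn the review into feedback or accept
    return "accept" if score >= threshold else "add step-wise reasoning and cite sources"

def deming_cycle(task: str, max_rounds: int = 4) -> str:
    feedback = "initial attempt"
    for round_idx in range(max_rounds):        # Plan -> Do -> Check -> Act
        draft = do(plan(task, feedback))
        feedback = act(check(draft, round_idx))
        if feedback == "accept":
            break
    return draft

print(deming_cycle("explain a titration experiment video"))
```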

🔹 Publication Date: Published on Nov 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17943
• PDF: https://arxiv.org/pdf/2511.17943

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #AIEducation #VideoUnderstanding #EdTech #AIResearch
Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

📝 Summary:
Click2Graph is an interactive framework for Panoptic Video Scene Graph Generation. It uses a single user click to segment, track, discover interactions, and predict triplets for temporally consistent scene graphs. This enables user-guided, controllable video scene understanding.
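
🔹 Sketch (not from the paper): the output the summary implies is a set of temporally consistent (subject, predicate, object) triplets seeded by one click. The illustrative types below are hypothetical, not the paper's interface.
```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    label: str                                     # e.g., "person", "bicycle"
    masks: dict = field(default_factory=dict)      # frame index -> segmentation mask (omitted)

@dataclass
class Triplet:
    subject: Track
    predicate: str                                 # e.g., "riding", "holding"
    obj: Track
    frames: list = field(default_factory=list)     # frame indices where the relation holds

def click_to_graph(click_xy, num_frames):
    """Placeholder pipeline: click -> segment and track the subject -> find
    interacting objects -> predict predicates over time (real models omitted)."""
    subject = Track(0, "person")
    obj = Track(1, "bicycle")
    return [Triplet(subject, "riding", obj, frames=list(range(num_frames)))]

print(click_to_graph((412, 230), num_frames=8))
```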

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15948
• PDF: https://arxiv.org/pdf/2511.15948

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #SceneGraphs #ComputerVision #InteractiveAI #AIResearch
ViDiC: Video Difference Captioning

📝 Summary:
The ViDiC task and ViDiC-1K dataset evaluate MLLMs' ability to describe differences between video pairs, moving beyond the limits of static image captioning. The benchmark assesses motion and event evolution and reveals significant performance gaps in current models' comparative video understanding.

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03405
• PDF: https://arxiv.org/pdf/2512.03405
• Project Page: https://vidic-1k.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoCaptioning #MLLM #VideoUnderstanding #ComputerVision #AIResearch
OneThinker: All-in-one Reasoning Model for Image and Video

📝 Summary:
OneThinker is an all-in-one model unifying image and video understanding across diverse tasks like QA, captioning, and tracking. It employs a new training corpus and RL method for balanced optimization, achieving strong performance and knowledge transfer across 31 benchmarks. This advances toward...

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03043
• PDF: https://arxiv.org/pdf/2512.03043
• Github: https://github.com/tulerfeng/OneThinker

🔹 Models citing this paper:
https://huggingface.co/OneThink/OneThinker-8B
https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B

🔹 Datasets citing this paper:
https://huggingface.co/datasets/OneThink/OneThinker-train-data
https://huggingface.co/datasets/OneThink/OneThinker-eval

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #MultimodalAI #DeepLearning #VideoUnderstanding
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📝 Summary:
Large Multimodal Models struggle with long video understanding due to context limits. The DIG framework adapts frame selection to query types, using efficient uniform sampling for global queries and specialized selection for localized ones. This approach significantly improves LMM performance on ...
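
🔹 Sketch (not from the paper): the query-adaptive routing described above can be illustrated as uniform sampling for global queries and similarity-based top-k selection for localized ones. The query classifier and the features below are placeholders, not DIG's components.
```python
import numpy as np

def uniform_sample(num_frames: int, budget: int) -> np.ndarray:
    """Evenly spaced frame indices for global queries (e.g., 'summarize the video')."""
    return np.linspace(0, num_frames - 1, budget).round().astype(int)

def topk_by_similarity(frame_embs: np.ndarray, query_emb: np.ndarray, budget: int) -> np.ndarray:
    """Most query-relevant frames for localized queries (e.g., 'when does the dog appear?')."""
    sims = frame_embs @ query_emb
    return np.sort(np.argsort(-sims)[:budget])

def select_frames(query_is_global: bool, frame_embs, query_emb, budget: int = 8):
    if query_is_global:
        return uniform_sample(len(frame_embs), budget)
    return topk_by_similarity(frame_embs, query_emb, budget)

rng = np.random.default_rng(0)
frames, query = rng.normal(size=(1200, 256)), rng.normal(size=256)
print(select_frames(False, frames, query))   # localized query -> most relevant frames, in temporal order
```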

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04000
• PDF: https://arxiv.org/pdf/2512.04000
• Github: https://github.com/Jialuo-Li/DIG

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #LMMs #MultimodalAI #DeepLearning #ComputerVision
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

📝 Summary:
Pyramid Sparse Attention (PSA) introduces multi-level pooled key-value representations to overcome information loss in traditional sparse attention. It dynamically retains critical information, improving efficiency and performance for video understanding and generation tasks.
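
🔹 Sketch (not from the paper): the core idea named above, multi-level pooled key/value representations, can be illustrated with a toy attention that runs over the concatenation of average-pooled K/V levels. PSA's actual sparsity pattern and block selection are not reproduced here.
```python
import torch
import torch.nn.functional as F

def pyramid_pool_kv(k: torch.Tensor, v: torch.Tensor, levels=(2, 8, 32)):
    """Build multi-level pooled key/value representations (average pooling along the sequence)."""
    ks, vs = [], []
    for stride in levels:
        ks.append(F.avg_pool1d(k.transpose(1, 2), stride, stride).transpose(1, 2))
        vs.append(F.avg_pool1d(v.transpose(1, 2), stride, stride).transpose(1, 2))
    return torch.cat(ks, dim=1), torch.cat(vs, dim=1)

def pooled_attention(q, k, v):
    k_p, v_p = pyramid_pool_kv(k, v)             # far fewer K/V entries than the raw sequence
    attn = torch.softmax(q @ k_p.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_p

q = torch.randn(1, 64, 128)        # query block
k = v = torch.randn(1, 4096, 128)  # long video token sequence
print(pooled_attention(q, k, v).shape)   # torch.Size([1, 64, 128])
```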

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04025
• PDF: https://arxiv.org/pdf/2512.04025
• Project Page: https://ziplab.co/PSA/
• Github: https://github.com/ziplab/Pyramid-Sparse-Attention

==================================

For more data science resources:
https://t.me/DataScienceT

#SparseAttention #VideoUnderstanding #VideoGeneration #DeepLearning #ComputerVision
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

📝 Summary:
The SANTA framework addresses object and action hallucinations in multimodal LLM video captions. It uses self-augmented contrastive alignment to identify potential hallucinations and then aligns regional objects and actions with visual phrases, improving factual accuracy. Experiments show SANTA o...
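
🔹 Sketch (not from the paper): the alignment step described above can be illustrated with a symmetric InfoNCE loss between regional visual features and phrase embeddings. SANTA's self-augmentation and pairing strategy are not reproduced; shapes and temperature are assumptions.
```python
import torch
import torch.nn.functional as F

def infonce(region_feats: torch.Tensor, phrase_feats: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: the i-th region should match the i-th phrase and repel the rest.
    Inputs are (N, D) batches from a visual encoder and a text encoder."""
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = r @ p.t() / temperature
    targets = torch.arange(len(r), device=r.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

regions = torch.randn(8, 256)   # e.g., pooled features of detected objects/actions
phrases = torch.randn(8, 256)   # embeddings of the matching caption phrases
print(infonce(regions, phrases).item())
```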

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04356
• PDF: https://arxiv.org/pdf/2512.04356
• Project Page: https://kpc0810.github.io/santa/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalLLMs #AI #Hallucinations #VideoUnderstanding #ContrastiveLearning
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

📝 Summary:
Active Video Perception (AVP) improves long video understanding by actively seeking query-relevant evidence. It uses an iterative plan-observe-reflect process, acquiring compact evidence directly from pixels. This achieves higher accuracy with reduced computational cost.
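
🔹 Sketch (not from the paper): the plan-observe-reflect loop described above, written as a skeleton. The planner, observer, and reflector are placeholders standing in for MLLM calls, and the stopping rule is an assumption.
```python
def plan(question, evidence):      # placeholder: choose the next time window to inspect
    return {"start_s": 30 * len(evidence), "end_s": 30 * len(evidence) + 10}

def observe(window):               # placeholder: sample frames in the window and describe them
    return f"description of frames {window['start_s']}-{window['end_s']}s"

def reflect(question, evidence):   # placeholder: is the collected evidence sufficient?
    return len(evidence) >= 3

def active_video_perception(question: str, max_steps: int = 6):
    evidence = []
    for _ in range(max_steps):               # plan -> observe -> reflect
        window = plan(question, evidence)
        evidence.append(observe(window))
        if reflect(question, evidence):
            break
    return evidence                          # compact, query-relevant evidence for answering

print(active_video_perception("What does the chef add after the onions?"))
```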

🔹 Publication Date: Published on Dec 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05774
• PDF: https://arxiv.org/pdf/2512.05774

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ActiveLearning #ComputerVision #AIResearch #DeepLearning
LongVideoAgent: Multi-Agent Reasoning with Long Videos

📝 Summary:
A multi-agent framework with a master LLM, grounding agent, and vision agent enhances long-video QA by improving temporal grounding and extracting visual details. This RL-trained system outperforms non-agent baselines on new datasets.

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20618
• PDF: https://arxiv.org/pdf/2512.20618
• Project Page: https://longvideoagent.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #LLM #VideoUnderstanding #ComputerVision #AI
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

📝 Summary:
MLLMs struggle with hallucinations on counterfactual videos. To address this, DualityForge synthesizes counterfactual video data and QA pairs through diffusion-based editing, significantly reducing model hallucinations and improving general performance.

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24271
• PDF: https://arxiv.org/pdf/2512.24271
• Project Page: https://amap-ml.github.io/Taming-Hallucinations/
• Github: https://github.com/AMAP-ML/Taming-Hallucinations

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AIHallucinations #GenerativeAI #MachineLearning