ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...
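
The token-compression idea is easy to picture with a toy cross-attention module: instruction tokens attend to the much longer vision stream and absorb its content, so the vision tokens can be dropped. A minimal PyTorch sketch inferred from the summary, not the paper's actual TransV code; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class TransVSketch(nn.Module):
    """Toy stand-in for TransV: absorb visual content into instruction
    tokens via cross-attention so the vision tokens can be discarded."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr, vision):
        # Queries = instruction tokens, keys/values = vision tokens.
        out, _ = self.attn(instr, vision, vision)
        return self.norm(instr + out)

vision = torch.randn(1, 10_000, 256)  # e.g. one token per frame
instr  = torch.randn(1, 32, 256)      # short instruction sequence
print(TransVSketch()(instr, vision).shape)  # torch.Size([1, 32, 256])
```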

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================

For more data science resources:
https://t.me/DataScienceT

#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

📝 Summary:
SciEducator is a self-evolving multi-agent system designed for scientific video understanding and education. It integrates professional knowledge and step-wise reasoning to interpret scientific activities and produce multimodal educational content. SciEducator significantly outperforms existing m...
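
The Deming cycle is the classic Plan-Do-Check-Act loop, which maps directly onto an agent workflow. A schematic sketch with every stage stubbed out; in the real system each stage would presumably be an LLM agent with domain knowledge and tools:

```python
# Schematic PDCA (Deming-cycle) agent loop; all stages are stubs.
def plan(task, state):  return {"steps": [f"explain {task}"], **state}
def do(state):          return {**state, "draft": "multimodal lesson draft"}
def check(state):       return {**state, "ok": "draft" in state}
def act(state):         return state if state["ok"] else {"retry": True}

def sci_educator(task, max_cycles=3):
    state = {}
    for _ in range(max_cycles):
        state = act(check(do(plan(task, state))))  # one full PDCA turn
        if state.get("ok"):
            break
    return state

print(sci_educator("titration experiment video"))
```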

🔹 Publication Date: Published on Nov 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17943
• PDF: https://arxiv.org/pdf/2511.17943

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #AIEducation #VideoUnderstanding #EdTech #AIResearch
Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

📝 Summary:
Click2Graph is an interactive framework for Panoptic Video Scene Graph Generation. It uses a single user click to segment, track, discover interactions, and predict triplets for temporally consistent scene graphs. This enables user-guided, controllable video scene understanding.
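
The pipeline reads as: click, segment, track, relate. A hypothetical skeleton with all three components stubbed; the function names and the Triplet layout are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str
    predicate: str
    obj: str
    frame: int

# Stubs; a real system would plug in a promptable segmenter,
# a mask tracker, and a relation-prediction head.
def segment_at(frame, xy):     return {"mask_id": 0}
def track(frames, mask):       return [{"mask_id": 0}] * len(frames)
def relate(frame, track_t):    return [("person", "holds", "cup")]

def click2graph(frames, click_xy):
    subj = segment_at(frames[0], click_xy)   # 1) segment the clicked entity
    tracks = track(frames, subj)             # 2) propagate it through time
    return [Triplet(s, p, o, t)              # 3) predict per-frame triplets
            for t, f in enumerate(frames)
            for (s, p, o) in relate(f, tracks[t])]

print(click2graph(["f0", "f1"], (120, 80)))
```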

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15948
• PDF: https://arxiv.org/pdf/2511.15948

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #SceneGraphs #ComputerVision #InteractiveAI #AIResearch
ViDiC: Video Difference Captioning

📝 Summary:
The ViDiC task and ViDiC-1K dataset evaluate MLLMs' ability to describe differences between video pairs, overcoming static image captioning limits. It assesses motion and event evolution, finding significant performance gaps in current models for comparative video understanding.
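
The task reduces to: given two clips, caption what changed. A toy record format and a crude unigram-F1 scorer, just to make the setup concrete; the benchmark's actual metrics are defined in the paper:

```python
from dataclasses import dataclass

@dataclass
class ViDiCExample:
    video_a: str      # path or URL to the first clip
    video_b: str      # path or URL to the second clip
    reference: str    # ground-truth difference caption

def overlap_score(pred, ref):
    # Naive unigram-F1 stand-in for the benchmark's real metrics.
    p, r = set(pred.lower().split()), set(ref.lower().split())
    if not p or not r:
        return 0.0
    prec, rec = len(p & r) / len(p), len(p & r) / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

ex = ViDiCExample("a.mp4", "b.mp4", "the runner slows down and stops")
print(overlap_score("the runner stops", ex.reference))  # 0.666...
```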

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03405
• PDF: https://arxiv.org/pdf/2512.03405
• Project Page: https://vidic-1k.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoCaptioning #MLLM #VideoUnderstanding #ComputerVision #AIResearch
OneThinker: All-in-one Reasoning Model for Image and Video

📝 Summary:
OneThinker is an all-in-one model unifying image and video understanding across diverse tasks like QA, captioning, and tracking. It employs a new training corpus and RL method for balanced optimization, achieving strong performance and knowledge transfer across 31 benchmarks. This advances toward...
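
One plausible ingredient of "balanced optimization" is making sure small tasks stay represented during training. A generic task-balanced sampler as an assumption of how that could look; the paper's actual RL method is its own contribution:

```python
import random

def balanced_batches(datasets, batch_size=8, steps=3):
    # Sample each batch uniformly over tasks, regardless of dataset size,
    # so small tasks (e.g. tracking) aren't drowned out by large ones (QA).
    tasks = list(datasets)
    for _ in range(steps):
        yield [(t, random.choice(datasets[t]))
               for t in random.choices(tasks, k=batch_size)]

data = {"qa": list(range(1000)), "caption": list(range(200)),
        "tracking": list(range(50))}
for batch in balanced_batches(data):
    print([t for t, _ in batch])  # roughly uniform mix of task names
```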

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03043
• PDF: https://arxiv.org/pdf/2512.03043
• Github: https://github.com/tulerfeng/OneThinker

🔹 Models citing this paper:
https://huggingface.co/OneThink/OneThinker-8B
https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B

🔹 Datasets citing this paper:
https://huggingface.co/datasets/OneThink/OneThinker-train-data
https://huggingface.co/datasets/OneThink/OneThinker-eval

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #MultimodalAI #DeepLearning #VideoUnderstanding
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📝 Summary:
Large Multimodal Models struggle with long video understanding due to context limits. The DIG framework adapts frame selection to query types, using efficient uniform sampling for global queries and specialized selection for localized ones. This approach significantly improves LMM performance on ...
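
The selection rule itself fits in a few lines. A minimal sketch assuming a query classifier and per-frame relevance scores already exist (both are hypothetical here):

```python
import numpy as np

def select_frames(num_frames, query_type, relevance=None, budget=32):
    """Global queries -> uniform temporal coverage; localized queries ->
    the frames a (hypothetical) relevance scorer ranks highest."""
    if query_type == "global" or relevance is None:
        return np.linspace(0, num_frames - 1, budget).astype(int)
    return np.sort(np.argsort(relevance)[-budget:])

rel = np.random.rand(5000)                       # per-frame relevance scores
print(select_frames(5000, "global")[:5])         # evenly spaced indices
print(select_frames(5000, "localized", rel)[:5]) # concentrated on peaks
```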

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04000
• PDF: https://arxiv.org/pdf/2512.04000
• Github: https://github.com/Jialuo-Li/DIG

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #LMMs #MultimodalAI #DeepLearning #ComputerVision
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

📝 Summary:
Pyramid Sparse Attention (PSA) introduces multi-level pooled key-value representations to overcome information loss in traditional sparse attention. It dynamically retains critical information, improving efficiency and performance for video understanding and generation tasks.
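
A toy version of the idea: pool keys and values at several strides and attend over the pooled pyramid plus a small set of full-resolution "critical" tokens, so coarse levels summarize what a plain sparse pattern would drop outright. The strides, the criticality heuristic, and the shapes are all illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def pyramid_kv(q, k, v, strides=(8, 32), keep=256):
    # Coarse levels: average-pooled K/V at increasing strides.
    pool = lambda x, s: F.avg_pool1d(x.transpose(1, 2), s, s).transpose(1, 2)
    ks = [pool(k, s) for s in strides]
    vs = [pool(v, s) for s in strides]
    # Fine level: keep only the `keep` most query-relevant keys at full
    # resolution (a crude stand-in for "critical information").
    scores = (q @ k.transpose(1, 2)).mean(1)
    idx = scores.topk(keep, dim=-1).indices
    take = lambda x: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    return torch.cat(ks + [take(k)], 1), torch.cat(vs + [take(v)], 1)

q = torch.randn(1, 64, 128)
k = v = torch.randn(1, 4096, 128)
pk, pv = pyramid_kv(q, k, v)
out = F.scaled_dot_product_attention(q, pk, pv)
print(pk.shape)  # 512 + 128 + 256 = 896 keys instead of 4096
```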

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04025
• PDF: https://arxiv.org/pdf/2512.04025
• Project Page: https://ziplab.co/PSA/
• Github: https://github.com/ziplab/Pyramid-Sparse-Attention

==================================

For more data science resources:
https://t.me/DataScienceT

#SparseAttention #VideoUnderstanding #VideoGeneration #DeepLearning #ComputerVision
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

📝 Summary:
The SANTA framework addresses object and action hallucinations in multimodal LLM video captions. It uses self-augmented contrastive alignment to identify potential hallucinations and then aligns regional objects and actions with visual phrases, improving factual accuracy. Experiments show SANTA o...
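
The alignment step is, at heart, a contrastive objective over matched (visual region, caption phrase) pairs. A generic symmetric InfoNCE loss as an illustration of the mechanism, not SANTA's exact formulation:

```python
import torch
import torch.nn.functional as F

def alignment_loss(region_emb, phrase_emb, tau=0.07):
    # InfoNCE: the i-th region should match the i-th phrase and
    # repel all other phrases in the batch (and vice versa).
    r = F.normalize(region_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    logits = r @ p.t() / tau
    targets = torch.arange(len(r))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

regions = torch.randn(16, 512)  # pooled features of detected objects/actions
phrases = torch.randn(16, 512)  # embeddings of the phrases describing them
print(alignment_loss(regions, phrases))
```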

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04356
• PDF: https://arxiv.org/pdf/2512.04356
• Project Page: https://kpc0810.github.io/santa/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalLLMs #AI #Hallucinations #VideoUnderstanding #ContrastiveLearning
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

📝 Summary:
Active Video Perception (AVP) improves long video understanding by actively seeking query-relevant evidence. It uses an iterative plan-observe-reflect process, acquiring compact evidence directly from pixels. This achieves higher accuracy with reduced computational cost.
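
The plan-observe-reflect loop translates to pseudocode almost directly. A stubbed sketch; the stop criterion, window format, and evidence representation are assumptions:

```python
# Stubbed plan-observe-reflect loop; each stage would call an MLLM.
def plan(query, evidence):    return {"t0": 10 * len(evidence),
                                      "t1": 10 * len(evidence) + 5}
def observe(video, w):        return f"frames {w['t0']}-{w['t1']}: a door opens"
def reflect(query, evidence): return len(evidence) >= 2  # enough to answer?

def active_perception(video, query, max_iters=8):
    evidence = []
    for _ in range(max_iters):
        window = plan(query, evidence)           # decide where to look next
        evidence.append(observe(video, window))  # read only those pixels
        if reflect(query, evidence):             # stop once evidence suffices
            break
    return evidence

print(active_perception("long_video.mp4", "who opened the door?"))
```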

🔹 Publication Date: Published on Dec 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05774
• PDF: https://arxiv.org/pdf/2512.05774

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ActiveLearning #ComputerVision #AIResearch #DeepLearning
LongVideoAgent: Multi-Agent Reasoning with Long Videos

📝 Summary:
A multi-agent framework with a master LLM, grounding agent, and vision agent enhances long-video QA by improving temporal grounding and extracting visual details. This RL-trained system outperforms non-agent baselines on new datasets.
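
The three-role split can be sketched as a master loop delegating to two specialists. All bodies below are stubs standing in for trained models; in the paper the roles are coordinated and trained with RL:

```python
# Master LLM delegates: grounding agent finds *when*, vision agent reads *what*.
def grounding_agent(question, video):  return (42.0, 48.0)  # relevant span (s)
def vision_agent(video, span):         return "a red car leaves the garage"
def master_llm(question, observation): return f"Answer based on: {observation}"

def long_video_qa(video, question):
    span = grounding_agent(question, video)  # temporal grounding
    detail = vision_agent(video, span)       # extract visual details
    return master_llm(question, detail)      # compose the final answer

print(long_video_qa("movie.mp4", "What does the suspect drive?"))
```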

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20618
• PDF: https://arxiv.org/pdf/2512.20618
• Project Page: https://longvideoagent.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #LLM #VideoUnderstanding #ComputerVision #AI
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

📝 Summary:
MLLMs struggle with hallucinations on counterfactual videos. DualityForge synthesizes counterfactual video data and QA pairs through diffusion-based editing to address this. This method significantly reduces model hallucinations and improves general performance.
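
The data-synthesis loop pairs each real clip with a diffusion-edited counterfactual and matching QA. A schematic sketch with the video editor stubbed out; the edit instruction and QA template are invented for illustration:

```python
# Schematic counterfactual data synthesis; the diffusion editor is a stub.
def diffusion_edit(video, instruction):
    return f"{video}+edited({instruction})"  # stand-in for a video editor

def make_training_pair(video, fact, counterfact):
    cf_video = diffusion_edit(video, counterfact)
    question = "What happens in the video?"
    return [
        {"video": video,    "q": question, "a": fact},         # factual
        {"video": cf_video, "q": question, "a": counterfact},  # counterfactual
    ]

pairs = make_training_pair("clip.mp4",
                           "the glass falls and shatters",
                           "the glass falls and bounces intact")
print(pairs)
```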

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24271
• PDF: https://arxiv.org/pdf/2512.24271
• Project Page: https://amap-ml.github.io/Taming-Hallucinations/
• Github: https://github.com/AMAP-ML/Taming-Hallucinations

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AIHallucinations #GenerativeAI #MachineLearning