ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...
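
The token-compression idea is easy to picture with a toy cross-attention module: instruction tokens attend to the much longer vision stream and absorb its content, so the vision tokens can be dropped. A minimal PyTorch sketch inferred from the summary, not the paper's actual TransV code; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class TransVSketch(nn.Module):
    """Toy stand-in for TransV: absorb visual content into instruction
    tokens via cross-attention so the vision tokens can be discarded."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr, vision):
        # Queries = instruction tokens, keys/values = vision tokens.
        out, _ = self.attn(instr, vision, vision)
        return self.norm(instr + out)

vision = torch.randn(1, 10_000, 256)  # e.g. one token per frame
instr  = torch.randn(1, 32, 256)      # short instruction sequence
print(TransVSketch()(instr, vision).shape)  # torch.Size([1, 32, 256])
```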

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================

For more data science resources:
https://t.me/DataScienceT

#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

📝 Summary:
SciEducator is a self-evolving multi-agent system designed for scientific video understanding and education. It integrates professional knowledge and step-wise reasoning to interpret scientific activities and produce multimodal educational content. SciEducator significantly outperforms existing m...
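
The Deming cycle is the classic Plan-Do-Check-Act loop, which maps directly onto an agent workflow. A schematic sketch with every stage stubbed out; in the real system each stage would presumably be an LLM agent with domain knowledge and tools:

```python
# Schematic PDCA (Deming-cycle) agent loop; all stages are stubs.
def plan(task, state):  return {"steps": [f"explain {task}"], **state}
def do(state):          return {**state, "draft": "multimodal lesson draft"}
def check(state):       return {**state, "ok": "draft" in state}
def act(state):         return state if state["ok"] else {"retry": True}

def sci_educator(task, max_cycles=3):
    state = {}
    for _ in range(max_cycles):
        state = act(check(do(plan(task, state))))  # one full PDCA turn
        if state.get("ok"):
            break
    return state

print(sci_educator("titration experiment video"))
```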

🔹 Publication Date: Published on Nov 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17943
• PDF: https://arxiv.org/pdf/2511.17943

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #AIEducation #VideoUnderstanding #EdTech #AIResearch
Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

📝 Summary:
Click2Graph is an interactive framework for Panoptic Video Scene Graph Generation. It uses a single user click to segment, track, discover interactions, and predict triplets for temporally consistent scene graphs. This enables user-guided, controllable video scene understanding.
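
The pipeline reads as: click, segment, track, relate. A hypothetical skeleton with all three components stubbed; the function names and the Triplet layout are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str
    predicate: str
    obj: str
    frame: int

# Stubs; a real system would plug in a promptable segmenter,
# a mask tracker, and a relation-prediction head.
def segment_at(frame, xy):     return {"mask_id": 0}
def track(frames, mask):       return [{"mask_id": 0}] * len(frames)
def relate(frame, track_t):    return [("person", "holds", "cup")]

def click2graph(frames, click_xy):
    subj = segment_at(frames[0], click_xy)   # 1) segment the clicked entity
    tracks = track(frames, subj)             # 2) propagate it through time
    return [Triplet(s, p, o, t)              # 3) predict per-frame triplets
            for t, f in enumerate(frames)
            for (s, p, o) in relate(f, tracks[t])]

print(click2graph(["f0", "f1"], (120, 80)))
```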

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15948
• PDF: https://arxiv.org/pdf/2511.15948

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #SceneGraphs #ComputerVision #InteractiveAI #AIResearch
ViDiC: Video Difference Captioning

📝 Summary:
The ViDiC task and ViDiC-1K dataset evaluate MLLMs' ability to describe differences between video pairs, overcoming static image captioning limits. It assesses motion and event evolution, finding significant performance gaps in current models for comparative video understanding.
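
The task reduces to: given two clips, caption what changed. A toy record format and a crude unigram-F1 scorer, just to make the setup concrete; the benchmark's actual metrics are defined in the paper:

```python
from dataclasses import dataclass

@dataclass
class ViDiCExample:
    video_a: str      # path or URL to the first clip
    video_b: str      # path or URL to the second clip
    reference: str    # ground-truth difference caption

def overlap_score(pred, ref):
    # Naive unigram-F1 stand-in for the benchmark's real metrics.
    p, r = set(pred.lower().split()), set(ref.lower().split())
    if not p or not r:
        return 0.0
    prec, rec = len(p & r) / len(p), len(p & r) / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

ex = ViDiCExample("a.mp4", "b.mp4", "the runner slows down and stops")
print(overlap_score("the runner stops", ex.reference))  # 0.666...
```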

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03405
• PDF: https://arxiv.org/pdf/2512.03405
• Project Page: https://vidic-1k.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoCaptioning #MLLM #VideoUnderstanding #ComputerVision #AIResearch
OneThinker: All-in-one Reasoning Model for Image and Video

📝 Summary:
OneThinker is an all-in-one model unifying image and video understanding across diverse tasks like QA, captioning, and tracking. It employs a new training corpus and RL method for balanced optimization, achieving strong performance and knowledge transfer across 31 benchmarks. This advances toward...
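
One plausible ingredient of "balanced optimization" is making sure small tasks stay represented during training. A generic task-balanced sampler as an assumption of how that could look; the paper's actual RL method is its own contribution:

```python
import random

def balanced_batches(datasets, batch_size=8, steps=3):
    # Sample each batch uniformly over tasks, regardless of dataset size,
    # so small tasks (e.g. tracking) aren't drowned out by large ones (QA).
    tasks = list(datasets)
    for _ in range(steps):
        yield [(t, random.choice(datasets[t]))
               for t in random.choices(tasks, k=batch_size)]

data = {"qa": list(range(1000)), "caption": list(range(200)),
        "tracking": list(range(50))}
for batch in balanced_batches(data):
    print([t for t, _ in batch])  # roughly uniform mix of task names
```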

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03043
• PDF: https://arxiv.org/pdf/2512.03043
• Github: https://github.com/tulerfeng/OneThinker

🔹 Models citing this paper:
https://huggingface.co/OneThink/OneThinker-8B
https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B

🔹 Datasets citing this paper:
https://huggingface.co/datasets/OneThink/OneThinker-train-data
https://huggingface.co/datasets/OneThink/OneThinker-eval

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #MultimodalAI #DeepLearning #VideoUnderstanding
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📝 Summary:
Large Multimodal Models struggle with long video understanding due to context limits. The DIG framework adapts frame selection to query types, using efficient uniform sampling for global queries and specialized selection for localized ones. This approach significantly improves LMM performance on ...
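
The selection rule itself fits in a few lines. A minimal sketch assuming a query classifier and per-frame relevance scores already exist (both are hypothetical here):

```python
import numpy as np

def select_frames(num_frames, query_type, relevance=None, budget=32):
    """Global queries -> uniform temporal coverage; localized queries ->
    the frames a (hypothetical) relevance scorer ranks highest."""
    if query_type == "global" or relevance is None:
        return np.linspace(0, num_frames - 1, budget).astype(int)
    return np.sort(np.argsort(relevance)[-budget:])

rel = np.random.rand(5000)                       # per-frame relevance scores
print(select_frames(5000, "global")[:5])         # evenly spaced indices
print(select_frames(5000, "localized", rel)[:5]) # concentrated on peaks
```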

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04000
• PDF: https://arxiv.org/pdf/2512.04000
• Github: https://github.com/Jialuo-Li/DIG

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #LMMs #MultimodalAI #DeepLearning #ComputerVision
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

📝 Summary:
Pyramid Sparse Attention (PSA) introduces multi-level pooled key-value representations to overcome information loss in traditional sparse attention. It dynamically retains critical information, improving efficiency and performance for video understanding and generation tasks.
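
A toy version of the idea: pool keys and values at several strides and attend over the pooled pyramid plus a small set of full-resolution "critical" tokens, so coarse levels summarize what a plain sparse pattern would drop outright. The strides, the criticality heuristic, and the shapes are all illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def pyramid_kv(q, k, v, strides=(8, 32), keep=256):
    # Coarse levels: average-pooled K/V at increasing strides.
    pool = lambda x, s: F.avg_pool1d(x.transpose(1, 2), s, s).transpose(1, 2)
    ks = [pool(k, s) for s in strides]
    vs = [pool(v, s) for s in strides]
    # Fine level: keep only the `keep` most query-relevant keys at full
    # resolution (a crude stand-in for "critical information").
    scores = (q @ k.transpose(1, 2)).mean(1)
    idx = scores.topk(keep, dim=-1).indices
    take = lambda x: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    return torch.cat(ks + [take(k)], 1), torch.cat(vs + [take(v)], 1)

q = torch.randn(1, 64, 128)
k = v = torch.randn(1, 4096, 128)
pk, pv = pyramid_kv(q, k, v)
out = F.scaled_dot_product_attention(q, pk, pv)
print(pk.shape)  # 512 + 128 + 256 = 896 keys instead of 4096
```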

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04025
• PDF: https://arxiv.org/pdf/2512.04025
• Project Page: https://ziplab.co/PSA/
• Github: https://github.com/ziplab/Pyramid-Sparse-Attention

==================================

For more data science resources:
https://t.me/DataScienceT

#SparseAttention #VideoUnderstanding #VideoGeneration #DeepLearning #ComputerVision
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

📝 Summary:
The SANTA framework addresses object and action hallucinations in multimodal LLM video captions. It uses self-augmented contrastive alignment to identify potential hallucinations and then aligns regional objects and actions with visual phrases, improving factual accuracy. Experiments show SANTA o...
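
The alignment step is, at heart, a contrastive objective over matched (visual region, caption phrase) pairs. A generic symmetric InfoNCE loss as an illustration of the mechanism, not SANTA's exact formulation:

```python
import torch
import torch.nn.functional as F

def alignment_loss(region_emb, phrase_emb, tau=0.07):
    # InfoNCE: the i-th region should match the i-th phrase and
    # repel all other phrases in the batch (and vice versa).
    r = F.normalize(region_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    logits = r @ p.t() / tau
    targets = torch.arange(len(r))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

regions = torch.randn(16, 512)  # pooled features of detected objects/actions
phrases = torch.randn(16, 512)  # embeddings of the phrases describing them
print(alignment_loss(regions, phrases))
```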

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04356
• PDF: https://arxiv.org/pdf/2512.04356
• Project Page: https://kpc0810.github.io/santa/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalLLMs #AI #Hallucinations #VideoUnderstanding #ContrastiveLearning
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

📝 Summary:
Active Video Perception (AVP) improves long video understanding by actively seeking query-relevant evidence. It uses an iterative plan-observe-reflect process, acquiring compact evidence directly from pixels. This achieves higher accuracy with reduced computational cost.
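
The plan-observe-reflect loop translates to pseudocode almost directly. A stubbed sketch; the stop criterion, window format, and evidence representation are assumptions:

```python
# Stubbed plan-observe-reflect loop; each stage would call an MLLM.
def plan(query, evidence):    return {"t0": 10 * len(evidence),
                                      "t1": 10 * len(evidence) + 5}
def observe(video, w):        return f"frames {w['t0']}-{w['t1']}: a door opens"
def reflect(query, evidence): return len(evidence) >= 2  # enough to answer?

def active_perception(video, query, max_iters=8):
    evidence = []
    for _ in range(max_iters):
        window = plan(query, evidence)           # decide where to look next
        evidence.append(observe(video, window))  # read only those pixels
        if reflect(query, evidence):             # stop once evidence suffices
            break
    return evidence

print(active_perception("long_video.mp4", "who opened the door?"))
```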

🔹 Publication Date: Published on Dec 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05774
• PDF: https://arxiv.org/pdf/2512.05774

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoUnderstanding #ActiveLearning #ComputerVision #AIResearch #DeepLearning
LongVideoAgent: Multi-Agent Reasoning with Long Videos

📝 Summary:
A multi-agent framework with a master LLM, grounding agent, and vision agent enhances long-video QA by improving temporal grounding and extracting visual details. This RL-trained system outperforms non-agent baselines on new datasets.
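
The three-role split can be sketched as a master loop delegating to two specialists. All bodies below are stubs standing in for trained models; in the paper the roles are coordinated and trained with RL:

```python
# Master LLM delegates: grounding agent finds *when*, vision agent reads *what*.
def grounding_agent(question, video):  return (42.0, 48.0)  # relevant span (s)
def vision_agent(video, span):         return "a red car leaves the garage"
def master_llm(question, observation): return f"Answer based on: {observation}"

def long_video_qa(video, question):
    span = grounding_agent(question, video)  # temporal grounding
    detail = vision_agent(video, span)       # extract visual details
    return master_llm(question, detail)      # compose the final answer

print(long_video_qa("movie.mp4", "What does the suspect drive?"))
```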

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20618
• PDF: https://arxiv.org/pdf/2512.20618
• Project Page: https://longvideoagent.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultiAgentSystems #LLM #VideoUnderstanding #ComputerVision #AI
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

📝 Summary:
MLLMs struggle with hallucinations on counterfactual videos. DualityForge synthesizes counterfactual video data and QA pairs through diffusion-based editing to address this. This method significantly reduces model hallucinations and improves general performance.
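
The data-synthesis loop pairs each real clip with a diffusion-edited counterfactual and matching QA. A schematic sketch with the video editor stubbed out; the edit instruction and QA template are invented for illustration:

```python
# Schematic counterfactual data synthesis; the diffusion editor is a stub.
def diffusion_edit(video, instruction):
    return f"{video}+edited({instruction})"  # stand-in for a video editor

def make_training_pair(video, fact, counterfact):
    cf_video = diffusion_edit(video, counterfact)
    question = "What happens in the video?"
    return [
        {"video": video,    "q": question, "a": fact},         # factual
        {"video": cf_video, "q": question, "a": counterfact},  # counterfactual
    ]

pairs = make_training_pair("clip.mp4",
                           "the glass falls and shatters",
                           "the glass falls and bounces intact")
print(pairs)
```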

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24271
• PDF: https://arxiv.org/pdf/2512.24271
• Project Page: https://amap-ml.github.io/Taming-Hallucinations/
• Github: https://github.com/AMAP-ML/Taming-Hallucinations

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AIHallucinations #GenerativeAI #MachineLearning