✨VideoSSR: Video Self-Supervised Reinforcement Learning
📝 Summary:
VideoSSR is a novel self-supervised reinforcement learning framework that leverages intrinsic video information to generate high-quality training data. It uses three pretext tasks and the VideoSSR-30K dataset, improving MLLM performance across 17 benchmarks by over 5%.
🔹 Publication Date: Published on Nov 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06281
• PDF: https://arxiv.org/pdf/2511.06281
• Project Page: https://github.com/lcqysl/VideoSSR
• Github: https://github.com/lcqysl/VideoSSR
🔹 Models citing this paper:
• https://huggingface.co/yhx12/VideoSSR
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ReinforcementLearning #SelfSupervisedLearning #VideoAI #MachineLearning #DeepLearning
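The post does not name the three pretext tasks, so the sketch below only illustrates the general recipe: derive a verifiable question-answer pair from the video itself, with no human labels. Clip-order prediction is an assumed example, not necessarily one of VideoSSR's actual tasks.
```python
# Illustrative sketch only: one hypothetical pretext task in the spirit of
# VideoSSR's self-supervised data generation. Clip-order prediction here is
# an assumption; the paper's three tasks are not named in this post.
import random

def make_clip_order_sample(frames, num_clips=4):
    """Split a frame list into clips, shuffle them, and emit a QA pair
    whose answer (the original order) needs no human annotation."""
    clip_len = len(frames) // num_clips
    clips = [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
    order = list(range(num_clips))
    random.shuffle(order)
    shuffled = [clips[i] for i in order]
    question = ("The video has been cut into {} clips and shuffled. "
                "Give the original temporal order.").format(num_clips)
    # Verifiable answer: the permutation that restores the original order.
    answer = [order.index(i) for i in range(num_clips)]
    return {"clips": shuffled, "question": question, "answer": answer}

sample = make_clip_order_sample(list(range(64)))
print(sample["question"], sample["answer"])
```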
✨UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
📝 Summary:
UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation. It uses a Plan-and-Act architecture with hierarchical memory to enable complex, iterative video workflows. This system aims to advance agentic video intelligence.
🔹 Publication Date: Published on Nov 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08521
• PDF: https://arxiv.org/pdf/2511.08521
• Project Page: https://univa.online/
• Github: https://github.com/univa-agent/univa
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #AIagents #GenerativeAI #ComputerVision #OpenSource
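A toy Plan-and-Act loop with a small hierarchical memory, to make the architecture description above concrete. The tool names, the stand-in planner, and the memory layout are illustrative assumptions, not UniVA's actual API.
```python
# Toy sketch of a Plan-and-Act loop with hierarchical memory, in the spirit
# of the UniVA description above. Everything named here is a placeholder.
from typing import Callable, Dict, List

TOOLS: Dict[str, Callable[[str, dict], str]] = {
    "understand": lambda goal, mem: f"caption for {goal}",
    "segment":    lambda goal, mem: f"masks for {goal}",
    "edit":       lambda goal, mem: f"edited clip for {goal}",
    "generate":   lambda goal, mem: f"new clip for {goal}",
}

def plan(user_goal: str) -> List[str]:
    """Stand-in planner: a real system would call an LLM here."""
    return ["understand", "segment", "edit"]

def run(user_goal: str) -> dict:
    # Hierarchical memory: global goal, per-task trace, user preferences.
    memory = {"global": {"goal": user_goal}, "task": [], "user": {}}
    for step in plan(user_goal):                                  # Plan
        result = TOOLS[step](user_goal, memory)                   # Act
        memory["task"].append({"step": step, "result": result})   # Remember
    return memory

print(run("replace the sky in clip.mp4 with a sunset"))
```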
✨Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
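For readers unfamiliar with GRPO, below is a minimal sketch of a group-relative advantage computed from a joint reward over a text-side and a video-side score. The reward terms and weights are assumptions for illustration; this is not the paper's Joint-GRPO objective.
```python
# Hedged sketch: GRPO-style group-relative advantages over a joint reward
# that mixes a VLM-side semantic score and a diffusion-side visual score.
import statistics

def joint_reward(semantic_score: float, visual_score: float,
                 w_sem: float = 0.5, w_vis: float = 0.5) -> float:
    """Combine a text-side and a video-side score into one scalar reward."""
    return w_sem * semantic_score + w_vis * visual_score

def grpo_advantages(rewards):
    """Normalize rewards within one sampled group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

group = [joint_reward(0.8, 0.6), joint_reward(0.4, 0.9), joint_reward(0.2, 0.3)]
print(grpo_advantages(group))
```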
✨Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
📝 Summary:
QTSplus is a query-aware token selector for long-video multimodal language models. It dynamically selects the most important visual tokens based on a text query, significantly compressing vision data and reducing latency. This method maintains overall accuracy and enhances temporal understanding ...
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11910
• Hugging Face Collection: https://huggingface.co/collections/AlpachinoNLP/qtsplus
• PDF: https://arxiv.org/pdf/2511.11910
• Project Page: https://qtsplus.github.io/
• Github: https://github.com/Siyou-Li/QTSplus
🔹 Models citing this paper:
• https://huggingface.co/AlpachinoNLP/QTSplus-3B
• https://huggingface.co/AlpachinoNLP/QTSplus-3B-FT
✨ Spaces citing this paper:
• https://huggingface.co/spaces/AlpachinoNLP/QTSplus-3B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #VideoAI #LLM #Tokenization #ComputerVision
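A minimal sketch of query-aware token selection as described above: score each vision token against the pooled text query and keep only the top-scoring ones. The dot-product scorer and fixed keep ratio are assumptions, not QTSplus's exact design.
```python
# Minimal sketch of query-aware visual token selection (illustrative only).
import torch

def select_tokens(vision_tokens: torch.Tensor,   # (N, d) patch/frame tokens
                  query_embedding: torch.Tensor, # (d,) pooled text query
                  keep_ratio: float = 0.1):
    """Keep the top keep_ratio fraction of vision tokens by query relevance."""
    scores = vision_tokens @ query_embedding           # (N,) relevance scores
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    top = torch.topk(scores, k).indices.sort().values  # preserve temporal order
    return vision_tokens[top], top

tokens = torch.randn(4096, 256)    # e.g. long-video patch tokens
query = torch.randn(256)
kept, idx = select_tokens(tokens, query, keep_ratio=0.05)
print(kept.shape)                  # torch.Size([204, 256])
```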
✨UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
📝 Summary:
Video diffusion transformers struggle with video length extrapolation due to attention dispersion, causing quality degradation and repetition. UltraViCo suppresses attention for tokens beyond the training window, improving quality and reducing repetition. This extends the extrapolation limit from...
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20123
• PDF: https://arxiv.org/pdf/2511.20123
• Project Page: https://thu-ml.github.io/UltraViCo.github.io/
• Github: https://github.com/thu-ml/DiT-Extrapolation
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #DiffusionModels #Transformers #GenerativeAI #DeepLearning
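A hedged sketch of the core idea: when the model is run past its training length, down-weight attention to key positions beyond the training window. The penalty value and where it is applied are assumptions, not UltraViCo's exact rule.
```python
# Sketch: suppress attention to out-of-window key positions during extrapolation.
import torch
import torch.nn.functional as F

def attention_with_suppression(q, k, v, train_len: int, alpha: float = 4.0):
    """q, k, v: (B, H, T, d). Keys at positions >= train_len get their
    attention logits reduced by alpha before the softmax."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, H, T, T)
    T = k.shape[-2]
    if T > train_len:
        penalty = torch.zeros(T, device=q.device)
        penalty[train_len:] = alpha                    # suppress out-of-window keys
        logits = logits - penalty                      # broadcast over queries
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 2, 96, 64)                  # 96 frames, trained on 64
out = attention_with_suppression(q, k, v, train_len=64)
print(out.shape)                                       # torch.Size([1, 2, 96, 64])
```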
✨Vidi: Large Multimodal Models for Video Understanding and Editing
📝 Summary:
Vidi is a family of Large Multimodal Models for video understanding and editing, excelling at temporal retrieval in long, multimodal videos. It significantly outperforms proprietary models like GPT-4o on the new VUE-TR benchmark, which supports hour-long videos and audio queries.
🔹 Publication Date: Published on Apr 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2504.15681
• PDF: https://arxiv.org/pdf/2504.15681
• Project Page: https://bytedance.github.io/vidi-website/
• Github: https://github.com/bytedance/vidi
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LMMs #VideoAI #MultimodalAI #AIResearch #DeepLearning
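Temporal retrieval is typically scored by interval overlap between predicted and ground-truth time spans. The snippet below shows the standard temporal IoU; the exact metric used by VUE-TR may differ.
```python
# Standard interval IoU between a predicted and a ground-truth time span.
def temporal_iou(pred, gt):
    """pred, gt: (start_sec, end_sec) spans inside the video."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

print(temporal_iou((120.0, 180.0), (130.0, 200.0)))   # 0.625
```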
✨LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
📝 Summary:
LongVT is an agentic framework that improves long-video reasoning. It equips LMMs with native tool calls for global-to-local video cropping and frame resampling, grounding answers in the retrieved frames. This approach consistently outperforms existing baselines.
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20785
• PDF: https://arxiv.org/pdf/2511.20785
• Project Page: https://evolvinglmms-lab.github.io/LongVT/
• Github: https://github.com/EvolvingLMMs-Lab/LongVT
🔹 Models citing this paper:
• https://huggingface.co/longvideotool/LongVT-RFT
• https://huggingface.co/longvideotool/LongVT-SFT
• https://huggingface.co/longvideotool/LongVT-RL
✨ Datasets citing this paper:
• https://huggingface.co/datasets/longvideotool/LongVT-Source
• https://huggingface.co/datasets/longvideotool/LongVT-Parquet
✨ Spaces citing this paper:
• https://huggingface.co/spaces/longvideotool/LongVT-Demo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #LMMs #AgenticAI #ComputerVision #AIResearch
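An illustrative tool-calling loop in the spirit of LongVT's global-to-local strategy: the model may crop a temporal window or resample frames before committing to an answer. The tool names and stopping rule are assumptions, not LongVT's actual interface.
```python
# Illustrative global-to-local tool loop; all names are hypothetical.
def crop_segment(video, start, end):
    """Pretend to cut a temporal window out of the full video."""
    return {"video": video, "span": (start, end)}

def resample_frames(segment, fps):
    """Pretend to re-decode the cropped span at a higher frame rate."""
    return {"segment": segment, "fps": fps}

def answer_with_tools(question, video, lmm_call):
    context = {"question": question, "video": video}
    for _ in range(3):                       # bounded tool-calling loop
        action = lmm_call(context)           # model decides: tool call or answer
        if action["type"] == "answer":
            return action["text"]
        if action["type"] == "crop":
            context["clip"] = crop_segment(video, *action["span"])
        elif action["type"] == "resample":
            context["clip"] = resample_frames(context.get("clip", video), action["fps"])
    return "no grounded answer found"

# Dummy model that zooms in once, then answers.
calls = iter([{"type": "crop", "span": (600, 660)},
              {"type": "answer", "text": "a red car"}])
print(answer_with_tools("What enters the frame at 10:05?", "movie.mp4",
                        lambda ctx: next(calls)))
```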
✨InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
📝 Summary:
InternVideo-Next proposes a two-stage Encoder-Predictor-Decoder framework for general video representation learning without text supervision. It uses a conditional diffusion decoder to bridge pixel fidelity with semantics in Stage 1, then a latent world model in Stage 2 to learn world knowledge a...
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01342
• PDF: https://arxiv.org/pdf/2512.01342
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoFoundationModels #VideoAI #DeepLearning #UnsupervisedLearning #DiffusionModels
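A skeleton reading of the Encoder-Predictor-Decoder layout described above, with placeholder modules. The layer choices and the linear stand-in for the conditional diffusion decoder are assumptions, not InternVideo-Next's architecture.
```python
# Skeleton only: a hedged Encoder-Predictor-Decoder layout with placeholders.
import torch
import torch.nn as nn

class EPD(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)      # video patches -> latents
        self.predictor = nn.GRU(dim, dim, batch_first=True)  # latent dynamics
        self.decoder = nn.Linear(dim, dim)      # stand-in for the conditional
                                                # diffusion decoder back to pixels

    def forward(self, patches):                 # (B, T, dim)
        z = self.encoder(patches)               # Stage-1 style encoding
        z_pred, _ = self.predictor(z)           # Stage-2 style latent prediction
        return self.decoder(z_pred)             # reconstruction target

x = torch.randn(2, 16, 512)
print(EPD()(x).shape)                           # torch.Size([2, 16, 512])
```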
✨A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting ...
🔹 Publication Date: Published on Dec 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot
✨ Datasets citing this paper:
• https://huggingface.co/datasets/MBZUAI/longshot-bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch