✨VideoSSR: Video Self-Supervised Reinforcement Learning
📝 Summary:
VideoSSR is a novel self-supervised reinforcement learning framework that leverages intrinsic video information to generate high-quality training data. It uses three pretext tasks and the VideoSSR-30K dataset, improving MLLM performance across 17 benchmarks by over 5%.
🔹 Publication Date: Published on Nov 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06281
• PDF: https://arxiv.org/pdf/2511.06281
• Project Page: https://github.com/lcqysl/VideoSSR
• Github: https://github.com/lcqysl/VideoSSR
🔹 Models citing this paper:
• https://huggingface.co/yhx12/VideoSSR
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ReinforcementLearning #SelfSupervisedLearning #VideoAI #MachineLearning #DeepLearning
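The post does not name the three pretext tasks, so the sketch below only illustrates the general recipe: derive a verifiable question-answer pair from the video itself, with no human labels. Clip-order prediction is an assumed example, not necessarily one of VideoSSR's actual tasks.
```python
# Illustrative sketch only: one hypothetical pretext task in the spirit of
# VideoSSR's self-supervised data generation. Clip-order prediction here is
# an assumption; the paper's three tasks are not named in this post.
import random

def make_clip_order_sample(frames, num_clips=4):
    """Split a frame list into clips, shuffle them, and emit a QA pair
    whose answer (the original order) needs no human annotation."""
    clip_len = len(frames) // num_clips
    clips = [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
    order = list(range(num_clips))
    random.shuffle(order)
    shuffled = [clips[i] for i in order]
    question = ("The video has been cut into {} clips and shuffled. "
                "Give the original temporal order.").format(num_clips)
    # Verifiable answer: the permutation that restores the original order.
    answer = [order.index(i) for i in range(num_clips)]
    return {"clips": shuffled, "question": question, "answer": answer}

sample = make_clip_order_sample(list(range(64)))
print(sample["question"], sample["answer"])
```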
✨UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
📝 Summary:
UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation. It uses a Plan-and-Act architecture with hierarchical memory to enable complex, iterative video workflows. This system aims to advance agentic video intelligence.
🔹 Publication Date: Published on Nov 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08521
• PDF: https://arxiv.org/pdf/2511.08521
• Project Page: https://univa.online/
• Github: https://github.com/univa-agent/univa
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #AIagents #GenerativeAI #ComputerVision #OpenSource
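A toy Plan-and-Act loop with a small hierarchical memory, to make the architecture description above concrete. The tool names, the stand-in planner, and the memory layout are illustrative assumptions, not UniVA's actual API.
```python
# Toy sketch of a Plan-and-Act loop with hierarchical memory, in the spirit
# of the UniVA description above. Everything named here is a placeholder.
from typing import Callable, Dict, List

TOOLS: Dict[str, Callable[[str, dict], str]] = {
    "understand": lambda goal, mem: f"caption for {goal}",
    "segment":    lambda goal, mem: f"masks for {goal}",
    "edit":       lambda goal, mem: f"edited clip for {goal}",
    "generate":   lambda goal, mem: f"new clip for {goal}",
}

def plan(user_goal: str) -> List[str]:
    """Stand-in planner: a real system would call an LLM here."""
    return ["understand", "segment", "edit"]

def run(user_goal: str) -> dict:
    # Hierarchical memory: global goal, per-task trace, user preferences.
    memory = {"global": {"goal": user_goal}, "task": [], "user": {}}
    for step in plan(user_goal):                                  # Plan
        result = TOOLS[step](user_goal, memory)                   # Act
        memory["task"].append({"step": step, "result": result})   # Remember
    return memory

print(run("replace the sky in clip.mp4 with a sunset"))
```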
✨Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
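For readers unfamiliar with GRPO, below is a minimal sketch of a group-relative advantage computed from a joint reward over a text-side and a video-side score. The reward terms and weights are assumptions for illustration; this is not the paper's Joint-GRPO objective.
```python
# Hedged sketch: GRPO-style group-relative advantages over a joint reward
# that mixes a VLM-side semantic score and a diffusion-side visual score.
import statistics

def joint_reward(semantic_score: float, visual_score: float,
                 w_sem: float = 0.5, w_vis: float = 0.5) -> float:
    """Combine a text-side and a video-side score into one scalar reward."""
    return w_sem * semantic_score + w_vis * visual_score

def grpo_advantages(rewards):
    """Normalize rewards within one sampled group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

group = [joint_reward(0.8, 0.6), joint_reward(0.4, 0.9), joint_reward(0.2, 0.3)]
print(grpo_advantages(group))
```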
✨Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
📝 Summary:
QTSplus is a query-aware token selector for long-video multimodal language models. It dynamically selects the most important visual tokens based on a text query, significantly compressing vision data and reducing latency. This method maintains overall accuracy and enhances temporal understanding ...
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11910
• Hugging Face Collection: https://huggingface.co/collections/AlpachinoNLP/qtsplus
• PDF: https://arxiv.org/pdf/2511.11910
• Project Page: https://qtsplus.github.io/
• Github: https://github.com/Siyou-Li/QTSplus
🔹 Models citing this paper:
• https://huggingface.co/AlpachinoNLP/QTSplus-3B
• https://huggingface.co/AlpachinoNLP/QTSplus-3B-FT
✨ Spaces citing this paper:
• https://huggingface.co/spaces/AlpachinoNLP/QTSplus-3B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #VideoAI #LLM #Tokenization #ComputerVision
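A minimal sketch of query-aware token selection as described above: score each vision token against the pooled text query and keep only the top-scoring ones. The dot-product scorer and fixed keep ratio are assumptions, not QTSplus's exact design.
```python
# Minimal sketch of query-aware visual token selection (illustrative only).
import torch

def select_tokens(vision_tokens: torch.Tensor,   # (N, d) patch/frame tokens
                  query_embedding: torch.Tensor, # (d,) pooled text query
                  keep_ratio: float = 0.1):
    """Keep the top keep_ratio fraction of vision tokens by query relevance."""
    scores = vision_tokens @ query_embedding           # (N,) relevance scores
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    top = torch.topk(scores, k).indices.sort().values  # preserve temporal order
    return vision_tokens[top], top

tokens = torch.randn(4096, 256)    # e.g. long-video patch tokens
query = torch.randn(256)
kept, idx = select_tokens(tokens, query, keep_ratio=0.05)
print(kept.shape)                  # torch.Size([204, 256])
```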
✨UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
📝 Summary:
Video diffusion transformers struggle with video length extrapolation due to attention dispersion, causing quality degradation and repetition. UltraViCo suppresses attention for tokens beyond the training window, improving quality and reducing repetition. This extends the extrapolation limit from...
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20123
• PDF: https://arxiv.org/pdf/2511.20123
• Project Page: https://thu-ml.github.io/UltraViCo.github.io/
• Github: https://github.com/thu-ml/DiT-Extrapolation
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #DiffusionModels #Transformers #GenerativeAI #DeepLearning
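A hedged sketch of the core idea: when the model is run past its training length, down-weight attention to key positions beyond the training window. The penalty value and where it is applied are assumptions, not UltraViCo's exact rule.
```python
# Sketch: suppress attention to out-of-window key positions during extrapolation.
import torch
import torch.nn.functional as F

def attention_with_suppression(q, k, v, train_len: int, alpha: float = 4.0):
    """q, k, v: (B, H, T, d). Keys at positions >= train_len get their
    attention logits reduced by alpha before the softmax."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, H, T, T)
    T = k.shape[-2]
    if T > train_len:
        penalty = torch.zeros(T, device=q.device)
        penalty[train_len:] = alpha                    # suppress out-of-window keys
        logits = logits - penalty                      # broadcast over queries
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 2, 96, 64)                  # 96 frames, trained on 64
out = attention_with_suppression(q, k, v, train_len=64)
print(out.shape)                                       # torch.Size([1, 2, 96, 64])
```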
✨Vidi: Large Multimodal Models for Video Understanding and Editing
📝 Summary:
Vidi is a family of Large Multimodal Models for video understanding and editing, excelling at temporal retrieval in long, multimodal videos. It significantly outperforms proprietary models like GPT-4o on the new VUE-TR benchmark, which supports hour-long videos and audio queries.
🔹 Publication Date: Published on Apr 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2504.15681
• PDF: https://arxiv.org/pdf/2504.15681
• Project Page: https://bytedance.github.io/vidi-website/
• Github: https://github.com/bytedance/vidi
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LMMs #VideoAI #MultimodalAI #AIResearch #DeepLearning
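Temporal retrieval is typically scored by interval overlap between predicted and ground-truth time spans. The snippet below shows the standard temporal IoU; the exact metric used by VUE-TR may differ.
```python
# Standard interval IoU between a predicted and a ground-truth time span.
def temporal_iou(pred, gt):
    """pred, gt: (start_sec, end_sec) spans inside the video."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

print(temporal_iou((120.0, 180.0), (130.0, 200.0)))   # 0.625
```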
✨LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
📝 Summary:
LongVT is an agentic framework that improves long-video reasoning. It equips LMMs with native tool calls for global-to-local video cropping and frame resampling, grounding answers in the retrieved frames. This approach consistently outperforms existing baselines.
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20785
• PDF: https://arxiv.org/pdf/2511.20785
• Project Page: https://evolvinglmms-lab.github.io/LongVT/
• Github: https://github.com/EvolvingLMMs-Lab/LongVT
🔹 Models citing this paper:
• https://huggingface.co/longvideotool/LongVT-RFT
• https://huggingface.co/longvideotool/LongVT-SFT
• https://huggingface.co/longvideotool/LongVT-RL
✨ Datasets citing this paper:
• https://huggingface.co/datasets/longvideotool/LongVT-Source
• https://huggingface.co/datasets/longvideotool/LongVT-Parquet
✨ Spaces citing this paper:
• https://huggingface.co/spaces/longvideotool/LongVT-Demo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #LMMs #AgenticAI #ComputerVision #AIResearch
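An illustrative tool-calling loop in the spirit of LongVT's global-to-local strategy: the model may crop a temporal window or resample frames before committing to an answer. The tool names and stopping rule are assumptions, not LongVT's actual interface.
```python
# Illustrative global-to-local tool loop; all names are hypothetical.
def crop_segment(video, start, end):
    """Pretend to cut a temporal window out of the full video."""
    return {"video": video, "span": (start, end)}

def resample_frames(segment, fps):
    """Pretend to re-decode the cropped span at a higher frame rate."""
    return {"segment": segment, "fps": fps}

def answer_with_tools(question, video, lmm_call):
    context = {"question": question, "video": video}
    for _ in range(3):                       # bounded tool-calling loop
        action = lmm_call(context)           # model decides: tool call or answer
        if action["type"] == "answer":
            return action["text"]
        if action["type"] == "crop":
            context["clip"] = crop_segment(video, *action["span"])
        elif action["type"] == "resample":
            context["clip"] = resample_frames(context.get("clip", video), action["fps"])
    return "no grounded answer found"

# Dummy model that zooms in once, then answers.
calls = iter([{"type": "crop", "span": (600, 660)},
              {"type": "answer", "text": "a red car"}])
print(answer_with_tools("What enters the frame at 10:05?", "movie.mp4",
                        lambda ctx: next(calls)))
```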
✨InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
📝 Summary:
InternVideo-Next proposes a two-stage Encoder-Predictor-Decoder framework for general video representation learning without text supervision. It uses a conditional diffusion decoder to bridge pixel fidelity with semantics in Stage 1, then a latent world model in Stage 2 to learn world knowledge a...
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01342
• PDF: https://arxiv.org/pdf/2512.01342
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoFoundationModels #VideoAI #DeepLearning #UnsupervisedLearning #DiffusionModels
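A skeleton reading of the Encoder-Predictor-Decoder layout described above, with placeholder modules. The layer choices and the linear stand-in for the conditional diffusion decoder are assumptions, not InternVideo-Next's architecture.
```python
# Skeleton only: a hedged Encoder-Predictor-Decoder layout with placeholders.
import torch
import torch.nn as nn

class EPD(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)      # video patches -> latents
        self.predictor = nn.GRU(dim, dim, batch_first=True)  # latent dynamics
        self.decoder = nn.Linear(dim, dim)      # stand-in for the conditional
                                                # diffusion decoder back to pixels

    def forward(self, patches):                 # (B, T, dim)
        z = self.encoder(patches)               # Stage-1 style encoding
        z_pred, _ = self.predictor(z)           # Stage-2 style latent prediction
        return self.decoder(z_pred)             # reconstruction target

x = torch.randn(2, 16, 512)
print(EPD()(x).shape)                           # torch.Size([2, 16, 512])
```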
✨A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting ...
🔹 Publication Date: Published on Dec 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot
✨ Datasets citing this paper:
• https://huggingface.co/datasets/MBZUAI/longshot-bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch