ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
VideoSSR: Video Self-Supervised Reinforcement Learning

📝 Summary:
VideoSSR is a novel self-supervised reinforcement learning framework that leverages intrinsic video information to generate high-quality training data. It uses three pretext tasks and the VideoSSR-30K dataset, improving MLLM performance across 17 benchmarks by over 5%.
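
💡 Illustrative sketch: the post does not spell out the three pretext tasks, so below is one plausible example of how a verifiable, label-free task can be mined from raw video (shuffled-clip ordering with a reward usable for RL). All names are hypothetical and not from the VideoSSR codebase.

```python
import random

def make_clip_order_sample(clip_ids, k=4):
    """Self-supervised sample: shuffle k consecutive clips; the original
    order is the verifiable answer, so no human labels are needed."""
    start = random.randrange(0, len(clip_ids) - k + 1)
    window = clip_ids[start:start + k]
    shuffled = window[:]
    random.shuffle(shuffled)
    return {"clips": shuffled, "answer": [shuffled.index(c) for c in window]}

def order_reward(predicted, answer):
    """Fraction of positions ordered correctly -- a verifiable reward
    signal for reinforcement learning on the pretext task."""
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)

sample = make_clip_order_sample(list(range(12)), k=4)
print(order_reward(sample["answer"], sample["answer"]))  # 1.0 for a perfect prediction
```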

🔹 Publication Date: Published on Nov 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06281
• PDF: https://arxiv.org/pdf/2511.06281
• Project Page: https://github.com/lcqysl/VideoSSR
• Github: https://github.com/lcqysl/VideoSSR

🔹 Models citing this paper:
https://huggingface.co/yhx12/VideoSSR

==================================

For more data science resources:
https://t.me/DataScienceT

#ReinforcementLearning #SelfSupervisedLearning #VideoAI #MachineLearning #DeepLearning
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

📝 Summary:
UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation. It uses a Plan-and-Act architecture with hierarchical memory to enable complex, iterative video workflows. This system aims to advance agentic video intelligence.
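
💡 Illustrative sketch: a minimal Plan-and-Act loop with a two-level memory, only to make the architecture concrete. The planner, tools, and memory layout are stand-ins and not taken from the UniVA codebase.

```python
from typing import Callable, Dict, List

class HierarchicalMemory:
    """Two levels: a global task memory shared across steps, and a per-step
    working memory whose entries are summarized back into the global level."""
    def __init__(self):
        self.global_mem: List[str] = []
        self.working_mem: List[str] = []

    def commit_step(self, note: str):
        self.working_mem.append(note)
        self.global_mem.append(note[:200])  # keep a compressed trace globally

def plan_and_act(task: str,
                 planner: Callable[[str, List[str]], List[str]],
                 tools: Dict[str, Callable[[str], str]],
                 memory: HierarchicalMemory) -> List[str]:
    """Plan-and-Act: the planner decomposes the task into tool calls
    ("tool_name: argument"); each call is executed and logged to memory."""
    results = []
    for step in planner(task, memory.global_mem):
        name, _, arg = step.partition(":")
        out = tools[name.strip()](arg.strip())
        memory.commit_step(f"{step} -> {out}")
        results.append(out)
    return results

# Illustrative wiring: stub planner and stub video tools
tools = {"segment": lambda a: f"masks({a})", "edit": lambda a: f"edited({a})"}
planner = lambda task, mem: ["segment: person in clip.mp4", "edit: blur background"]
print(plan_and_act("blur everyone except the speaker", planner, tools, HierarchicalMemory()))
```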

🔹 Publication Date: Published on Nov 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08521
• PDF: https://arxiv.org/pdf/2511.08521
• Project Page: https://univa.online/
• Github: https://github.com/univa-agent/univa

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #AIagents #GenerativeAI #ComputerVision #OpenSource
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
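
💡 Illustrative sketch: GRPO's group-relative advantage is a standard building block; how Joint-GRPO couples the VLM and the video diffusion model is specific to the paper, so the snippet below only shows the generic advantage computation such an objective rests on. The reward composition in the comment is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward against the
    mean/std of its own group (all rollouts sampled for the same prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G = 4 sampled video continuations, each scored by a joint
# reward (e.g., semantic match judged by the VLM plus visual quality of the
# generated clip) -- this exact composition is illustrative, not the paper's.
rewards = [0.2, 0.9, 0.5, 0.4]
print(group_relative_advantages(rewards).round(3))
```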

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

📝 Summary:
QTSplus is a query-aware token selector for long-video multimodal language models. It dynamically selects the most important visual tokens based on a text query, significantly compressing vision data and reducing latency. This method maintains overall accuracy and enhances temporal understanding ...
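
💡 Illustrative sketch: query-aware token selection in its simplest form, i.e. score every visual token against the text query and keep only the top fraction. The similarity head and the fixed keep-ratio are assumptions for illustration, not QTSplus's exact design.

```python
import torch

def select_visual_tokens(vision_tokens, query_tokens, keep_ratio=0.1):
    """Score each visual token by its max similarity to the query tokens,
    then keep only the highest-scoring fraction (in temporal order)."""
    # vision_tokens: (N, d), query_tokens: (M, d)
    q = torch.nn.functional.normalize(query_tokens, dim=-1)
    v = torch.nn.functional.normalize(vision_tokens, dim=-1)
    scores = (v @ q.T).max(dim=-1).values           # (N,) relevance to the query
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    idx = scores.topk(k).indices.sort().values      # preserve temporal order
    return vision_tokens[idx], idx

# Example: 8,000 frame tokens compressed to 800, conditioned on a short query
vis = torch.randn(8000, 1024)
qry = torch.randn(12, 1024)
kept, idx = select_visual_tokens(vis, qry, keep_ratio=0.1)
print(kept.shape)  # torch.Size([800, 1024])
```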

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11910
• HF Collection: https://huggingface.co/collections/AlpachinoNLP/qtsplus
• PDF: https://arxiv.org/pdf/2511.11910
• Project Page: https://qtsplus.github.io/
• Github: https://github.com/Siyou-Li/QTSplus

🔹 Models citing this paper:
https://huggingface.co/AlpachinoNLP/QTSplus-3B
https://huggingface.co/AlpachinoNLP/QTSplus-3B-FT

🔹 Spaces citing this paper:
https://huggingface.co/spaces/AlpachinoNLP/QTSplus-3B

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VideoAI #LLM #Tokenization #ComputerVision
UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

📝 Summary:
Video diffusion transformers struggle with video length extrapolation due to attention dispersion, causing quality degradation and repetition. UltraViCo suppresses attention for tokens beyond the training window, improving quality and reducing repetition. This extends the extrapolation limit from...
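
💡 Illustrative sketch: the core idea as described above, expressed as an attention-logit bias that leaves in-window keys untouched and suppresses keys beyond the training window. The constant-penalty form is an assumption; the paper's actual suppression rule may differ.

```python
import torch

def extrapolation_suppression_bias(q_len, k_len, train_len, penalty=-4.0):
    """Additive attention bias: zero inside the training window, a negative
    penalty for keys past it, curbing attention dispersion at longer lengths."""
    bias = torch.zeros(q_len, k_len)
    if k_len > train_len:
        bias[:, train_len:] = penalty   # down-weight out-of-window keys
    return bias

# Usage inside attention (logits: (heads, q_len, k_len)), illustrative only
q_len = k_len = 96          # tokens for a video 2x longer than training
train_len = 48
logits = torch.randn(8, q_len, k_len)
logits = logits + extrapolation_suppression_bias(q_len, k_len, train_len)
attn = logits.softmax(dim=-1)
print(attn.shape)
```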

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20123
• PDF: https://arxiv.org/pdf/2511.20123
• Project Page: https://thu-ml.github.io/UltraViCo.github.io/
• Github: https://github.com/thu-ml/DiT-Extrapolation

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #DiffusionModels #Transformers #GenerativeAI #DeepLearning
Vidi: Large Multimodal Models for Video Understanding and Editing

📝 Summary:
Vidi is a family of Large Multimodal Models for video understanding and editing, excelling at temporal retrieval in long, multimodal videos. It significantly outperforms proprietary models like GPT-4o on the new VUE-TR benchmark, which supports hour-long videos and audio queries.
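
💡 Illustrative sketch: temporal retrieval is typically scored by temporal IoU between the predicted and reference time spans; whether VUE-TR uses exactly this metric is an assumption, but the snippet shows what the task asks a model to produce for an hour-long video.

```python
def temporal_iou(pred, gold):
    """IoU between two time spans given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# Query: "when does the speaker mention the budget?" in an hour-long video
print(temporal_iou((1810.0, 1835.0), (1805.0, 1830.0)))  # ~0.67
```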

🔹 Publication Date: Published on Apr 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2504.15681
• PDF: https://arxiv.org/pdf/2504.15681
• Project Page: https://bytedance.github.io/vidi-website/
• Github: https://github.com/bytedance/vidi

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #VideoAI #MultimodalAI #AIResearch #DeepLearning
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

📝 Summary:
LongVT is an agentic framework that improves long-video reasoning. It equips an LMM with native tool calling for global-to-local video cropping and frame resampling, grounding its answers in localized visual evidence. This approach consistently outperforms existing baselines.
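
💡 Illustrative sketch: one way the global-to-local loop can look in code, where the model first answers from coarse, sparsely sampled frames and, when unsure, calls a crop-and-resample tool on a narrower window. Tool names, the reply schema, and the confidence rule are hypothetical, not LongVT's.

```python
def sample_frames(video, start, end, n):
    """Placeholder frame sampler: return n evenly spaced timestamps."""
    step = (end - start) / max(n - 1, 1)
    return [start + i * step for i in range(n)]

def answer_with_tools(model, video, duration, question, max_calls=3):
    """Global-to-local reasoning: start from a coarse global view, then let
    the model request crops of narrower windows until it is confident."""
    frames = sample_frames(video, 0.0, duration, n=32)           # global pass
    for _ in range(max_calls):
        reply = model(question, frames)                          # dict, see stub below
        if reply["confident"]:
            return reply["answer"]
        start, end = reply["crop_window"]                        # tool-call arguments
        frames = sample_frames(video, start, end, n=32)          # local re-sampling
    return reply["answer"]

# Illustrative stub model: asks for one crop, then answers
calls = {"n": 0}
def stub_model(question, frames):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"confident": False, "crop_window": (600.0, 660.0), "answer": None}
    return {"confident": True, "answer": "the chef adds saffron at ~10:05"}

print(answer_with_tools(stub_model, "video.mp4", 3600.0, "When is saffron added?"))
```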

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20785
• PDF: https://arxiv.org/pdf/2511.20785
• Project Page: https://evolvinglmms-lab.github.io/LongVT/
• Github: https://github.com/EvolvingLMMs-Lab/LongVT

🔹 Models citing this paper:
https://huggingface.co/longvideotool/LongVT-RFT
https://huggingface.co/longvideotool/LongVT-SFT
https://huggingface.co/longvideotool/LongVT-RL

🔹 Datasets citing this paper:
https://huggingface.co/datasets/longvideotool/LongVT-Source
https://huggingface.co/datasets/longvideotool/LongVT-Parquet

🔹 Spaces citing this paper:
https://huggingface.co/spaces/longvideotool/LongVT-Demo

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #LMMs #AgenticAI #ComputerVision #AIResearch
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

📝 Summary:
InternVideo-Next proposes a two-stage Encoder-Predictor-Decoder framework for general video representation learning without text supervision. It uses a conditional diffusion decoder to bridge pixel fidelity with semantics in Stage 1, then a latent world model in Stage 2 to learn world knowledge a...
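
💡 Illustrative sketch: a minimal skeleton of the Stage-1 idea above, i.e. encode visible frames to latents, predict latents for held-out frames, and decode them back to pixel space. The conditional diffusion decoder is replaced here by a plain linear head, and all module sizes and the MSE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderPredictorDecoder(nn.Module):
    """Stage-1 skeleton: encoder maps frame features to latents, a predictor
    rolls the context forward, and a (heavily simplified) decoder maps the
    predicted latents back to the pixel-feature space."""
    def __init__(self, frame_dim=768, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, latent_dim), nn.GELU())
        self.predictor = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.decoder = nn.Linear(latent_dim, frame_dim)

    def forward(self, context_frames, target_frames):
        z_ctx = self.encoder(context_frames)                   # (B, T_ctx, latent)
        _, h = self.predictor(z_ctx)                           # context summary (1, B, latent)
        z_pred = h.transpose(0, 1).expand(-1, target_frames.shape[1], -1)
        recon = self.decoder(z_pred)                           # predicted latents -> pixels
        return nn.functional.mse_loss(recon, target_frames)    # pixel-fidelity proxy loss

model = EncoderPredictorDecoder()
ctx = torch.randn(2, 8, 768)     # 8 observed frame embeddings
tgt = torch.randn(2, 4, 768)     # 4 future frames to reconstruct
print(model(ctx, tgt).item())
```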

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01342
• PDF: https://arxiv.org/pdf/2512.01342

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoFoundationModels #VideoAI #DeepLearning #UnsupervisedLearning #DiffusionModels
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting ...

🔹 Publication Date: Published on Dec 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot

🔹 Datasets citing this paper:
https://huggingface.co/datasets/MBZUAI/longshot-bench

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch