✨UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
📝 Summary:
UnityVideo is a unified framework enhancing video generation by integrating multiple modalities and training paradigms. It uses dynamic noising and a modality switcher for comprehensive world understanding. This improves video quality, consistency, and zero-shot generalization to new data.
🔹 Publication Date: Published on Dec 8, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.07831
• PDF: https://arxiv.org/pdf/2512.07831
• Project Page: https://jackailab.github.io/Projects/UnityVideo/
• Github: https://github.com/dvlab-research/UnityVideo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #MultimodalAI #GenerativeAI #DeepLearning #AIResearch
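A minimal PyTorch-style sketch of the two mechanisms named in the UnityVideo summary above, dynamic noising plus a modality switcher, under loose assumptions: each training sample gets its own noise level, and a learned switcher embedding tells a shared denoiser which modality/task it is handling. `ModalitySwitcher`, `backbone`, and the uniform noise schedule are illustrative, not UnityVideo's actual code.

```python
import torch
import torch.nn as nn

class ModalitySwitcher(nn.Module):
    """Maps a modality/task id to an embedding that conditions a shared denoising backbone."""
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, dim)

    def forward(self, modality_id: torch.Tensor) -> torch.Tensor:
        return self.embed(modality_id)

def dynamic_noising_step(backbone, switcher, x0, modality_id, max_sigma=1.0):
    # "Dynamic noising" is read here as drawing a different noise scale per sample;
    # the paper's exact schedule may differ.
    sigma = torch.rand(x0.shape[0], device=x0.device) * max_sigma
    noise = torch.randn_like(x0)
    x_noisy = x0 + sigma.view(-1, *([1] * (x0.dim() - 1))) * noise
    cond = switcher(modality_id)                # modality/task conditioning vector
    pred = backbone(x_noisy, sigma, cond)       # shared backbone predicts the injected noise
    return nn.functional.mse_loss(pred, noise)  # standard denoising objective
```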
✨GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
📝 Summary:
GLM-4.1V-Thinking is a vision-language model built with a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across diverse tasks, including STEM problem solving and long-document understanding, outperforming larger models and competing with closed-source systems such as GPT-4o.
🔹 Publication Date: Published on Jul 1, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.01006
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking
🔹 Models citing this paper:
• https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
• https://huggingface.co/zai-org/GLM-4.5V
• https://huggingface.co/zai-org/GLM-4.6V-Flash
✨ Spaces citing this paper:
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
• https://huggingface.co/spaces/akhaliq/anycoder
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
✨The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
📝 Summary:
This paper highlights the gap between SAM2 and SAM3. SAM2 uses spatial prompts for geometric segmentation, but SAM3 is a concept-driven multimodal model with a unified vision-language architecture. SAM3 represents a new class of foundation model for concept-driven segmentation.
🔹 Publication Date: Published on Dec 4, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06032
• PDF: https://arxiv.org/pdf/2512.06032
• Github: https://github.com/Applied-AI-Research-Lab/The-SAM2-to-SAM3-Gap-in-the-Segment-Anything-Model-Family
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageSegmentation #FoundationModels #ComputerVision #MultimodalAI #AIResearch
✨Thinking with Images via Self-Calling Agent
📝 Summary:
sCoT is a novel visual reasoning paradigm that reformulates interleaved multimodal CoT as a language-only CoT with self-calling subagents. It improves reasoning performance and efficiency by avoiding explicit multimodal interleaving and using group-relative policy optimization.
🔹 Publication Date: Published on Dec 9, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.08511
• PDF: https://arxiv.org/pdf/2512.08511
• Github: https://github.com/YWenxi/think-with-images-through-self-calling
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisualReasoning #MultimodalAI #LLMs #AIagents #AIResearch
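A rough sketch of the self-calling idea described in the sCoT summary above, under assumptions: the main agent reasons in plain text and, instead of interleaving image tokens into its own chain of thought, delegates visual sub-questions to a subagent call. `llm` and `vision_subagent` are hypothetical callables, not the interface of the linked repo.

```python
def solve_with_self_calls(llm, vision_subagent, image, question, max_calls=6):
    # llm(prompt) -> str; vision_subagent(image, sub_question) -> str  (hypothetical interfaces)
    trace = [f"Question: {question}"]
    for _ in range(max_calls):
        step = llm("\n".join(trace) + "\nNext step ('LOOK: <sub-question>' or 'FINAL: <answer>'):")
        trace.append(step)
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip(), trace
        if step.startswith("LOOK:"):  # delegate the visual sub-question to a self-called subagent
            observation = vision_subagent(image, step.removeprefix("LOOK:").strip())
            trace.append(f"Observation: {observation}")  # result comes back as plain text
    return None, trace
```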
✨DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
📝 Summary:
DentalGPT is a specialized dental multimodal LLM. It improves fine-grained visual understanding and reasoning using a large dataset and reinforcement learning. DentalGPT achieves superior performance in dental disease classification and VQA.
🔹 Publication Date: Published on Dec 12, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11558
• PDF: https://arxiv.org/pdf/2512.11558
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DentalGPT #DentistryAI #LLM #MultimodalAI #HealthcareTech
✨Agent S: An Open Agentic Framework that Uses Computers Like a Human
📝 Summary:
Agent S is an open agentic framework enabling autonomous GUI interaction to automate complex tasks. It employs experience-augmented hierarchical planning and an Agent-Computer Interface with MLLMs for enhanced reasoning. Agent S achieves state-of-the-art performance on OSWorld and demonstrates br...
🔹 Publication Date: Published on Oct 10, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.08164
• PDF: https://arxiv.org/pdf/2410.08164
• Hugging Face Collection: https://huggingface.co/collections/ranpox/awesome-computer-use-agents
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AgenticAI #MultimodalAI #HumanComputerInteraction #Automation #AIResearch
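A compressed sketch of the loop the Agent S summary describes, experience-augmented hierarchical planning on top of an Agent-Computer Interface, with hypothetical helpers (`planner`, `grounder`, `aci`, `experience_memory`); it is not the Agent S codebase.

```python
def run_agent(task, planner, grounder, aci, experience_memory, max_steps=50):
    # Stage 1: experience-augmented hierarchical planning into subtasks.
    subtasks = planner(task, experience_memory.retrieve(task))
    for subtask in subtasks:
        screenshot = aci.observe()
        # Stage 2: an MLLM grounds each subtask into concrete GUI actions via the ACI.
        for _ in range(max_steps):
            action = grounder(subtask, screenshot)
            if action == "DONE":
                break
            screenshot = aci.execute(action)
    experience_memory.store(task, subtasks)  # keep the trajectory to augment future plans
```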
✨MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
📝 Summary:
MeViS is a multi-modal dataset for referring motion expression video segmentation, addressing the need to segment and track objects based on their motion descriptions. It provides text and audio annotations for complex videos, enabling research into motion-guided video understanding.
🔹 Publication Date: Published on Dec 11, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.10945
• PDF: https://arxiv.org/pdf/2512.10945
• Project Page: https://henghuiding.com/MeViS/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoSegmentation #MultiModalAI #ComputerVision #Dataset #MotionUnderstanding
✨Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
📝 Summary:
MMRB2 is a new benchmark for multimodal reward models, evaluating them on interleaved image and text tasks using 4,000 expert-annotated preferences. It shows top models like Gemini 3 Pro achieve 75-80% accuracy, still below human performance, highlighting areas for improvement in these models.
🔹 Publication Date: Published on Dec 18, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16899
• PDF: https://arxiv.org/pdf/2512.16899
• Github: https://github.com/facebookresearch/MMRB2/tree/main
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #RewardModels #AIbenchmark #MachineLearning #AIResearch
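For context on the 75-80% figure above, reward-model accuracy on a preference benchmark is usually just pairwise ranking accuracy: the model is correct when it scores the expert-preferred response above the rejected one. A minimal sketch, with `reward_model.score(...)` as a hypothetical interface:

```python
def pairwise_accuracy(reward_model, dataset):
    correct = 0
    for ex in dataset:  # each ex: interleaved text/image prompt, a chosen and a rejected response
        s_chosen = reward_model.score(ex["prompt"], ex["chosen"])
        s_rejected = reward_model.score(ex["prompt"], ex["rejected"])
        correct += int(s_chosen > s_rejected)
    return correct / len(dataset)
```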
✨A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting ...
🔹 Publication Date: Published on Dec 18, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot
✨ Datasets citing this paper:
• https://huggingface.co/datasets/MBZUAI/longshot-bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch
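A bare-bones sketch of the agentic tool-use loop the LongShOT summary refers to: the agent repeatedly chooses between calling a video tool and committing to an answer. The tool set and the `llm` interface are assumptions for illustration, not LongShOTAgent's API.

```python
def answer_long_video_question(llm, tools, question, max_steps=8):
    # tools: dict of callables, e.g. {"search_transcript": ..., "sample_frames": ...} (hypothetical)
    history = [f"Question: {question}", f"Available tools: {sorted(tools)}"]
    for _ in range(max_steps):
        decision = llm("\n".join(history) + "\nReply 'TOOL <name> | <args>' or 'ANSWER: <answer>':")
        history.append(decision)
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        if decision.startswith("TOOL"):
            _, rest = decision.split(" ", 1)
            name, args = (rest.split("|", 1) + [""])[:2]
            history.append(f"Tool result: {tools[name.strip()](args.strip())}")
    return None
```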
✨CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
📝 Summary:
CASA enhances cross-attention for vision-language models by adding local text-to-text interaction. This approach substantially reduces the performance gap with costly token insertion methods on detailed visual tasks. CASA maintains efficiency and scalability for long-context multimodal applicatio...
🔹 Publication Date: Published on Dec 22, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19535
• PDF: https://arxiv.org/pdf/2512.19535
• Project Page: https://kyutai.org/casa
• Github: https://github.com/kyutai-labs/casa
🔹 Models citing this paper:
• https://huggingface.co/kyutai/CASA-Helium1-VL-2B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/kyutai/casa-samples
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguage #MultimodalAI #AttentionMechanisms #EfficientAI #DeepLearning
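One way to read "cross-attention via self-attention with local text-to-text interaction" is that text queries attend jointly to the vision tokens and to a local window of preceding text tokens, rather than to vision tokens alone. The single-head module below is a sketch of that reading, not the kyutai-labs/casa implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CasaStyleAttention(nn.Module):
    def __init__(self, dim: int, window: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv_vis = nn.Linear(dim, 2 * dim)
        self.kv_txt = nn.Linear(dim, 2 * dim)
        self.window = window  # size of the local text-to-text window

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token embeddings, vision: (B, V, D) image features
        q = self.q(text)
        k_vis, v_vis = self.kv_vis(vision).chunk(2, dim=-1)
        k_txt, v_txt = self.kv_txt(text).chunk(2, dim=-1)
        k = torch.cat([k_vis, k_txt], dim=1)
        v = torch.cat([v_vis, v_txt], dim=1)
        T, V = text.shape[1], vision.shape[1]
        idx = torch.arange(T, device=text.device)
        # Each text token sees every vision token plus the last `window` text tokens (causal).
        local = (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < self.window)
        mask = torch.cat([torch.ones(T, V, dtype=torch.bool, device=text.device), local], dim=1)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```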
✨T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
📝 Summary:
T2AV-Compass introduces a unified benchmark for text-to-audio-video generation evaluation. It features 500 diverse prompts and a dual-level framework. Evaluations reveal current T2AV models struggle significantly with realism and cross-modal consistency.
🔹 Publication Date: Published on Dec 24, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21094
• PDF: https://arxiv.org/pdf/2512.21094
• Project Page: https://nju-link.github.io/T2AV-Compass/
• Github: https://github.com/NJU-LINK/T2AV-Compass/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#TextToAudioVideo #MultimodalAI #AIEvaluation #GenerativeAI #AIResearch
✨VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
📝 Summary:
VideoRAG introduces the first RAG framework for long videos, using a dual-channel architecture to integrate textual knowledge grounding and multi-modal context encoding. This enables unlimited-length video processing and significantly outperforms existing methods.
🔹 Publication Date: Published on Feb 3, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2502.01549
• PDF: https://arxiv.org/pdf/2502.01549
• Github: https://github.com/hkuds/videorag
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoRAG #RAG #LongVideo #AI #MultimodalAI
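A toy illustration of a dual-channel index over a long video, in the spirit of the VideoRAG summary: each clip is stored with both a transcript embedding (textual grounding) and a keyframe embedding (visual context), and retrieval fuses the two similarities. `embed_text` and `embed_frame` are hypothetical encoders; this is not the hkuds/videorag code.

```python
import numpy as np

def build_index(clips, embed_text, embed_frame):
    # clips: iterable of {"transcript": str, "keyframe": np.ndarray, "span": (t0, t1)}
    return [{"span": c["span"],
             "text_vec": embed_text(c["transcript"]),
             "vis_vec": embed_frame(c["keyframe"])} for c in clips]

def retrieve(query, index, embed_text, top_k=5, alpha=0.5):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    q = embed_text(query)
    # Fuse the textual and visual channels with a simple weighted similarity.
    score = lambda e: alpha * cos(q, e["text_vec"]) + (1 - alpha) * cos(q, e["vis_vec"])
    return sorted(index, key=score, reverse=True)[:top_k]
```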
✨See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints to ensure complete coverage of relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts...
🔹 Publication Date: Published on Dec 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
✨Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
📝 Summary:
Omni-Weather is a new multimodal foundation model that unifies weather generation and understanding in a single architecture. It uses shared self-attention and a Chain-of-Thought dataset for interpretable, high-quality outputs, achieving state-of-the-art performance.
🔹 Publication Date: Published on Dec 25, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21643
• PDF: https://arxiv.org/pdf/2512.21643
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#WeatherGeneration #FoundationModels #MultimodalAI #AIResearch #DeepLearning
✨LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
📝 Summary:
LiveTalk enables real-time multimodal interactive video generation from text, image, and audio by improving on-policy diffusion distillation. It reduces inference latency by 20x while maintaining quality, allowing seamless human-AI interaction.
🔹 Publication Date: Published on Dec 29, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.23576
• PDF: https://arxiv.org/pdf/2512.23576
• Github: https://github.com/GAIR-NLP/LiveTalk
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #AI #DiffusionModels #RealTimeAI #MultimodalAI
✨Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
📝 Summary:
Dolphin is a novel multimodal model for document image parsing. It uses an analyze-then-parse approach with heterogeneous anchor prompting, achieving state-of-the-art performance and superior efficiency.
🔹 Publication Date: Published on May 20, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.14059
• PDF: https://arxiv.org/pdf/2505.14059
• Github: https://github.com/bytedance/dolphin
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DocumentParsing #MultimodalAI #DeepLearning #ComputerVision #AI
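The "analyze-then-parse" flow in the Dolphin summary reads roughly as below: a first pass lays out the page into typed elements in reading order, and a second pass parses each element with a type-specific (heterogeneous) prompt. `layout_model`, `parse_model`, and `crop` are hypothetical callables, not ByteDance's Dolphin API.

```python
PROMPTS = {  # illustrative element-specific ("heterogeneous anchor") prompts
    "text": "Read this paragraph.",
    "table": "Convert this table to HTML.",
    "formula": "Transcribe this formula as LaTeX.",
}

def parse_document(page_image, layout_model, parse_model, crop):
    # Stage 1: analyze the page into (element_type, bbox) pairs in reading order.
    elements = layout_model(page_image)
    results = []
    for element_type, bbox in elements:
        # Stage 2: parse each element with its own prompt on the cropped region.
        prompt = PROMPTS.get(element_type, "Transcribe this region.")
        results.append((element_type, parse_model(crop(page_image, bbox), prompt)))
    return results
```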
✨SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
📝 Summary:
SenseNova-MARS empowers Vision-Language Models with interleaved visual reasoning and dynamic tool use like search and cropping via reinforcement learning. It achieves state-of-the-art performance on complex visual tasks, outperforming proprietary models on new and existing benchmarks.
🔹 Publication Date: Published on Dec 30, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24330
• PDF: https://arxiv.org/pdf/2512.24330
• Github: https://github.com/OpenSenseNova/SenseNova-MARS
✨ Datasets citing this paper:
• https://huggingface.co/datasets/sensenova/SenseNova-MARS-Data
• https://huggingface.co/datasets/sensenova/HR-MMSearch
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #ReinforcementLearning #VisionLanguageModels #AgenticAI #ComputerVision
✨OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
📝 Summary:
OmniVCus introduces a system for feedforward multi-subject video customization with multimodal controls. It proposes a data pipeline, VideoCus-Factory, and a diffusion Transformer framework with novel embedding mechanisms. This enables more subjects and precise editing, significantly outperformin...
🔹 Publication Date: Published on Jun 29, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2506.23361
• PDF: https://arxiv.org/pdf/2506.23361
• Project Page: https://caiyuanhao1998.github.io/project/OmniVCus/
• Github: https://github.com/caiyuanhao1998/Open-OmniVCus
🔹 Models citing this paper:
• https://huggingface.co/CaiYuanhao/OmniVCus
✨ Datasets citing this paper:
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #DiffusionModels #MultimodalAI #DeepLearning #ComputerVision
✨M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
📝 Summary:
Existing concept erasure methods in diffusion models are vulnerable to non-text inputs. M-ErasureBench is a new multimodal evaluation framework, and IRECE is a module to restore robustness against these attacks, reducing concept reproduction.
🔹 Publication Date: Published on Dec 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22877
• PDF: https://arxiv.org/pdf/2512.22877
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DiffusionModels #ConceptErasure #MultimodalAI #AISafety #MachineLearning
✨CPPO: Contrastive Perception for Vision Language Policy Optimization
📝 Summary:
CPPO improves vision-language model fine-tuning by detecting perception tokens through entropy shifts. It then applies a Contrastive Perception Loss to enhance multimodal reasoning, outperforming prior methods more efficiently.
🔹 Publication Date: Published on Jan 1, 2026
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00501
• PDF: https://arxiv.org/pdf/2601.00501
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #MultimodalAI #ContrastiveLearning #DeepLearning #AIResearch
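One plausible reading of "detecting perception tokens through entropy shifts" in the CPPO summary is to flag tokens whose predictive entropy drops sharply once the image is provided, then apply a contrastive penalty only there. The threshold and the margin-style loss below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum(-1)           # (B, T) per-token entropy

def perception_token_mask(logits_with_image, logits_text_only, threshold=1.0):
    shift = token_entropy(logits_text_only) - token_entropy(logits_with_image)
    return shift > threshold                             # big entropy drop => visually grounded token

def contrastive_perception_loss(logits_with_image, logits_text_only, labels, mask):
    # Encourage the target token to be more likely with the image than without it,
    # but only at the detected perception tokens.
    lp_img = F.log_softmax(logits_with_image, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    lp_txt = F.log_softmax(logits_text_only, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    margin = F.relu(lp_txt - lp_img)                     # penalize tokens where text-only wins
    return (margin * mask).sum() / mask.sum().clamp(min=1)
```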