✨Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines through the diffusion backbone's stronger action generation.
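As a rough illustration of why a diffusion backbone helps here: a masked-diffusion decoder can refine a whole action chunk in parallel over a few steps instead of emitting tokens strictly left-to-right. The toy sketch below shows that sampling pattern only; all names, shapes, and the confidence-based unmasking rule are illustrative assumptions, not the Dream-VLA implementation.
```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LEN, STEPS = 256, 8, 4   # toy vocab size, action-chunk length, refinement steps
MASK = VOCAB                    # reserved id marking still-masked positions

def toy_denoiser(tokens):
    """Stand-in for the VLA network; the real model would condition on
    image and instruction features. Returns per-position logits."""
    return rng.normal(size=(LEN, VOCAB))

def sample_action_chunk():
    tokens = np.full(LEN, MASK)
    for step in range(1, STEPS + 1):
        logits = toy_denoiser(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred, conf = probs.argmax(axis=-1), probs.max(axis=-1)
        # Commit the most confident positions first -- parallel, not left-to-right.
        for i in np.argsort(-conf)[: LEN * step // STEPS]:
            if tokens[i] == MASK:
                tokens[i] = pred[i]
    return tokens

print(sample_action_chunk())   # 8 action tokens produced in 4 parallel steps
```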
🔹 Publication Date: Published on Dec 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision
✨GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
📝 Summary:
GaMO improves sparse-view 3D reconstruction by using geometry-aware multi-view outpainting. It expands existing views to enhance scene coverage and consistency. This achieves state-of-the-art quality 25x faster than prior methods, with reduced computational cost.
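At a high level the pipeline is: outpaint each sparse input view to enlarge its field of view, then hand the expanded, more overlapping view set to a 3D reconstructor. The stub below sketches only that flow; both functions are placeholders, not GaMO's code.
```python
import numpy as np

def outpaint(view, pad=32):
    """Placeholder for the diffusion outpainter: enlarges a view's field of
    view by synthesizing content beyond its borders (here: edge replication)."""
    return np.pad(view, ((pad, pad), (pad, pad), (0, 0)), mode="edge")

def reconstruct(views):
    """Placeholder for the downstream 3D reconstructor run on the
    expanded view set."""
    return {"num_views": len(views), "resolution": views[0].shape[:2]}

# Sparse input: a handful of posed RGB views of the scene.
sparse_views = [np.zeros((128, 128, 3), dtype=np.float32) for _ in range(3)]
expanded = [outpaint(v) for v in sparse_views]   # wider coverage per view
print(reconstruct(expanded))   # {'num_views': 3, 'resolution': (192, 192)}
```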
🔹 Publication Date: Published on Dec 31, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.25073
• PDF: https://arxiv.org/pdf/2512.25073
• Project Page: https://yichuanh.github.io/GaMO/
• Github: https://yichuanh.github.io/GaMO/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DReconstruction #ComputerVision #DiffusionModels #GaMO #AI
✨Guiding a Diffusion Transformer with the Internal Dynamics of Itself
📝 Summary:
This paper introduces Internal Guidance (IG) for diffusion models: auxiliary supervision on intermediate layers during training, plus extrapolation of their outputs during sampling. This simple strategy significantly improves training efficiency and generation quality. IG achieves state-of-the-art F...
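The extrapolation rule below is an assumption written by analogy with classifier-free guidance, not the paper's exact formula: the final prediction is pushed away from the weaker auxiliary prediction made at an intermediate layer of the same network.
```python
import torch

def guided_prediction(x_t, t, model, w=1.5):
    """Hypothetical Internal-Guidance-style sampling step. Assumes `model`
    returns both its final denoising prediction and an auxiliary prediction
    from an intermediate layer, trained with the same objective."""
    eps_final, eps_inner = model(x_t, t)
    # Extrapolate away from the weaker internal prediction.
    return eps_final + w * (eps_final - eps_inner)

class ToyDenoiser(torch.nn.Module):
    """Stand-in two-block denoiser with an auxiliary head on block 1."""
    def __init__(self, dim=16):
        super().__init__()
        self.block1 = torch.nn.Linear(dim, dim)
        self.block2 = torch.nn.Linear(dim, dim)
        self.head_inner = torch.nn.Linear(dim, dim)  # auxiliary supervision target
        self.head_final = torch.nn.Linear(dim, dim)

    def forward(self, x, t):
        h1 = torch.relu(self.block1(x))
        h2 = torch.relu(self.block2(h1))
        return self.head_final(h2), self.head_inner(h1)

x = torch.randn(4, 16)
eps = guided_prediction(x, t=0, model=ToyDenoiser())
print(eps.shape)  # torch.Size([4, 16])
```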
🔹 Publication Date: Published on Dec 30, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24176
• PDF: https://arxiv.org/pdf/2512.24176
• Project Page: https://zhouxingyu13.github.io/Internal-Guidance/
• Github: https://github.com/CVL-UESTC/Internal-Guidance
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DiffusionModels #AI #DeepLearning #GenerativeAI #ComputerVision
✨Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
📝 Summary:
DiffusionGS is a novel single-stage 3D diffusion model that directly generates 3D Gaussian point clouds from a single image. It ensures strong view consistency from any prompt view. This method achieves superior quality and is over 5x faster than state-of-the-art techniques.
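A single-stage image-to-3D denoiser needs an output head that emits Gaussian parameters directly. The sketch below shows one plausible parameterization; the channel split, activations, and shapes are assumptions for illustration, not the paper's exact head.
```python
import torch

class GaussianHead(torch.nn.Module):
    """Hypothetical head mapping denoiser features to per-pixel 3D Gaussian
    parameters (DiffusionGS's actual parameterization may differ)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # 3 pos + 3 scale + 4 rotation quat + 1 opacity + 3 color = 14 channels
        self.proj = torch.nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, feats):
        out = self.proj(feats)
        pos, scale, rot, alpha, rgb = out.split([3, 3, 4, 1, 3], dim=1)
        return {
            "position": pos,
            "scale": torch.exp(scale),                               # keep scales positive
            "rotation": torch.nn.functional.normalize(rot, dim=1),   # unit quaternion
            "opacity": torch.sigmoid(alpha),
            "color": torch.sigmoid(rgb),
        }

feats = torch.randn(1, 64, 32, 32)   # denoiser features for a 32x32 splat map
gaussians = GaussianHead()(feats)
print({k: tuple(v.shape) for k, v in gaussians.items()})
```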
🔹 Publication Date: Published on Nov 21, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2411.14384
• PDF: https://arxiv.org/pdf/2411.14384
• Project Page: https://caiyuanhao1998.github.io/project/DiffusionGS/
• Github: https://github.com/caiyuanhao1998/Open-DiffusionGS
🔹 Models citing this paper:
• https://huggingface.co/CaiYuanhao/DiffusionGS
✨ Datasets citing this paper:
• https://huggingface.co/datasets/CaiYuanhao/DiffusionGS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DGeneration #DiffusionModels #GaussianSplatting #ComputerVision #AIResearch
✨Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
📝 Summary:
Dolphin is a novel multimodal model for document image parsing. It uses an analyze-then-parse approach with heterogeneous anchor prompting, achieving state-of-the-art performance and superior efficiency.
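The analyze-then-parse idea can be pictured as a two-stage pipeline: one pass finds layout elements in reading order, then type-specific ("heterogeneous") prompts parse each element, and the per-element calls can run in parallel. The stub below is a hypothetical sketch, not Dolphin's API.
```python
def analyze_layout(image):
    """Stage 1: return reading-ordered layout elements (stub output)."""
    return [
        {"type": "text",  "bbox": (0, 0, 500, 80)},
        {"type": "table", "bbox": (0, 100, 500, 300)},
    ]

PROMPTS = {  # heterogeneous, type-specific parsing prompts
    "text":  "Transcribe this paragraph.",
    "table": "Convert this table to markdown.",
}

def parse_element(image, element):
    """Stage 2: parse one cropped element (stub: echoes the prompt)."""
    prompt = PROMPTS[element["type"]]
    return f"[{element['type']} parsed with prompt: {prompt!r}]"

def parse_document(image):
    elements = analyze_layout(image)
    # Elements are independent, so stage 2 can run in parallel batches.
    return [parse_element(image, e) for e in elements]

print("\n".join(parse_document(image=None)))
```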
🔹 Publication Date: Published on May 20, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.14059
• PDF: https://arxiv.org/pdf/2505.14059
• Github: https://github.com/bytedance/dolphin
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DocumentParsing #MultimodalAI #DeepLearning #ComputerVision #AI
✨SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
📝 Summary:
SenseNova-MARS empowers Vision-Language Models with interleaved visual reasoning and dynamic tool use like search and cropping via reinforcement learning. It achieves state-of-the-art performance on complex visual tasks, outperforming proprietary models on new and existing benchmarks.
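The sketch below illustrates the interleaved reason-and-act loop such an agent runs: generate, detect a tool call, execute it, append the observation, and repeat until an answer appears. The tag format and the tools themselves are assumptions for illustration, not the paper's protocol.
```python
import re

def fake_vlm(transcript):
    """Stand-in policy: issues one crop, then answers."""
    if "<crop>" not in transcript:
        return "I need a closer look. <crop>10,10,200,200</crop>"
    return "<answer>a red traffic sign</answer>"

TOOL_RE = re.compile(r"<(search|crop)>(.*?)</\1>")

def run_agent(question, max_turns=5):
    transcript = question
    for _ in range(max_turns):
        step = fake_vlm(transcript)
        transcript += "\n" + step
        if "<answer>" in step:
            return step
        match = TOOL_RE.search(step)
        if match:
            tool, arg = match.groups()
            observation = f"[{tool} result for {arg}]"   # tool execution stub
            transcript += "\n" + observation
    return "<answer>unknown</answer>"

print(run_agent("What is in the top-left corner of the image?"))
```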
🔹 Publication Date: Published on Dec 30, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24330
• PDF: https://arxiv.org/pdf/2512.24330
• Github: https://github.com/OpenSenseNova/SenseNova-MARS
✨ Datasets citing this paper:
• https://huggingface.co/datasets/sensenova/SenseNova-MARS-Data
• https://huggingface.co/datasets/sensenova/HR-MMSearch
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #ReinforcementLearning #VisionLanguageModels #AgenticAI #ComputerVision
✨Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
📝 Summary:
Avatar Forcing creates real-time interactive talking-head avatars. It uses diffusion forcing for low-latency reaction to user input and label-free preference optimization for expressive, preferred motion, achieving a 6.8x speedup.
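Diffusion forcing assigns each frame in a sliding window its own noise level, so the oldest frame is fully denoised and can be emitted while newer frames are still being refined; that is what enables low-latency streaming. A toy version follows, with illustrative shapes and schedules that are assumptions, not the paper's system.
```python
import numpy as np

rng = np.random.default_rng(0)
WINDOW = 4   # frames kept "in flight", each at a different noise level

def denoise_one_step(frame, level):
    """Stand-in for one partial denoising step of the avatar diffusion model
    (the real system conditions this on live user audio/video input)."""
    return frame * (1 - level) + rng.normal(size=frame.shape) * level * 0.5

def stream_frames(n_frames, shape=(8, 8)):
    # Pipeline of frames; index 0 is oldest/cleanest, last is newest/noisiest.
    window = [rng.normal(size=shape) for _ in range(WINDOW)]
    for _ in range(n_frames):
        levels = np.linspace(0.0, 1.0, WINDOW)        # per-frame noise levels
        window = [denoise_one_step(f, l) for f, l in zip(window, levels)]
        yield window.pop(0)                           # emit oldest frame now
        window.append(rng.normal(size=shape))         # admit a fresh noisy frame

for i, frame in enumerate(stream_frames(3)):
    print("emitted frame", i, "mean:", round(float(frame.mean()), 3))
```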
🔹 Publication Date: Published on Jan 2, 2026
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00664
• PDF: https://arxiv.org/pdf/2601.00664
• Project Page: https://taekyungki.github.io/AvatarForcing/
• Github: https://github.com/TaekyungKi/AvatarForcing
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AvatarGeneration #RealTimeAI #GenerativeAI #ComputerVision #AIResearch
✨NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
📝 Summary:
NeoVerse is a 4D world model for reconstruction and video generation. It scales to in-the-wild monocular videos using pose-free feed-forward reconstruction and online degradation simulation, achieving state-of-the-art performance.
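"Online degradation simulation" suggests corrupting clean training frames on the fly so the model tolerates in-the-wild footage. The specific degradations NeoVerse uses are not given in the summary; the sketch below applies generic stand-ins (blur, noise, exposure shift).
```python
import numpy as np

rng = np.random.default_rng(0)

def box_blur(img, k=3):
    """Simple box blur as a proxy for motion/defocus degradation."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def degrade(frame):
    """Randomly compose degradations, re-sampled per training frame."""
    if rng.random() < 0.5:
        frame = box_blur(frame)
    if rng.random() < 0.5:
        frame = frame + rng.normal(0, 0.05, frame.shape)       # sensor noise
    if rng.random() < 0.5:
        frame = np.clip(frame * rng.uniform(0.7, 1.3), 0, 1)   # exposure shift
    return frame

clean = rng.uniform(0, 1, (16, 16))   # grayscale frame for brevity
noisy = degrade(clean)
print("L1 difference after degradation:", round(float(abs(noisy - clean).mean()), 4))
```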
🔹 Publication Date: Published on Jan 1, 2026
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00393
• PDF: https://arxiv.org/pdf/2601.00393
• Project Page: https://neoverse-4d.github.io/
• Github: https://neoverse-4d.github.io
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#4DWorldModel #VideoGeneration #ComputerVision #DeepLearning #AI
✨AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
📝 Summary:
AdaGaR reconstructs dynamic 3D scenes from monocular video. It introduces an Adaptive Gabor Representation for detail and stability, and Cubic Hermite Splines for temporal continuity. This method achieves state-of-the-art performance.
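Cubic Hermite splines produce C1-continuous trajectories from keyframe values and tangents, which is what makes them a natural fit for temporally smooth primitive parameters. Below is a minimal evaluation of the Hermite basis; how AdaGaR places keyframes and estimates tangents is not specified here.
```python
import numpy as np

def hermite(p0, p1, m0, m1, t):
    """Cubic Hermite interpolation: value at t in [0, 1] given endpoint
    values p0, p1 and endpoint tangents m0, m1."""
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# Interpolate one scalar attribute (e.g. a primitive coordinate) between keyframes.
ts = np.linspace(0, 1, 5)
print(np.round(hermite(p0=0.0, p1=1.0, m0=0.5, m1=0.5, t=ts), 3))
```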
🔹 Publication Date: Published on Jan 2, 2026
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00796
• PDF: https://arxiv.org/pdf/2601.00796
• Project Page: https://jiewenchan.github.io/AdaGaR/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DReconstruction #ComputerVision #DynamicScenes #MonocularVideo #GaborRepresentation
✨OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
📝 Summary:
OmniVCus introduces a system for feedforward multi-subject video customization with multimodal controls. It proposes a data pipeline, VideoCus-Factory, and a diffusion Transformer framework with novel embedding mechanisms. This enables more subjects and precise editing, significantly outperforming...
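The "novel embedding mechanisms" are not detailed in the summary. A common pattern for multi-subject conditioning, sketched below purely as an assumption, is to add a learned subject-index embedding to each reference subject's tokens so the transformer can bind each subject to its mention.
```python
import torch

class SubjectTagger(torch.nn.Module):
    """Hypothetical subject-binding embedding, not OmniVCus's mechanism."""
    def __init__(self, dim=32, max_subjects=8):
        super().__init__()
        self.subject_emb = torch.nn.Embedding(max_subjects, dim)

    def forward(self, subject_tokens):
        """subject_tokens: list of (num_tokens, dim) tensors, one per subject."""
        tagged = []
        for idx, toks in enumerate(subject_tokens):
            tag = self.subject_emb(torch.tensor(idx))
            tagged.append(toks + tag)        # same tag for all of a subject's tokens
        return torch.cat(tagged, dim=0)      # concatenated conditioning sequence

subjects = [torch.randn(4, 32), torch.randn(4, 32)]  # two reference subjects
print(SubjectTagger()(subjects).shape)                # torch.Size([8, 32])
```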
🔹 Publication Date: Published on Jun 29, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2506.23361
• PDF: https://arxiv.org/pdf/2506.23361
• Project Page: https://caiyuanhao1998.github.io/project/OmniVCus/
• Github: https://github.com/caiyuanhao1998/Open-OmniVCus
🔹 Models citing this paper:
• https://huggingface.co/CaiYuanhao/OmniVCus
✨ Datasets citing this paper:
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test
• https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #DiffusionModels #MultimodalAI #DeepLearning #ComputerVision
✨DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
📝 Summary:
DreamID-V is a novel video face swapping framework that uses diffusion transformers and curriculum learning. It achieves superior identity preservation and visual realism by bridging the image-to-video gap, outperforming existing methods and enhancing temporal consistency.
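One plausible reading of "curriculum learning" for bridging the image-to-video gap, sketched below as an assumption rather than the paper's schedule, is to grow the training clip length over time: first single frames (pure image face swapping), then progressively longer clips for temporal consistency.
```python
def clip_length_schedule(step, total_steps, max_frames=16):
    """Start at 1 frame (pure image regime), ramp linearly to full clips."""
    frac = step / max(total_steps - 1, 1)
    return max(1, round(frac * max_frames))

for step in [0, 2500, 5000, 7500, 9999]:
    print(f"step {step:>5}: train on clips of {clip_length_schedule(step, 10000)} frames")
```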
🔹 Publication Date: Published on Jan 4, 2026
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.01425
• PDF: https://arxiv.org/pdf/2601.01425
• Project Page: https://guoxu1233.github.io/DreamID-V/
• Github: https://guoxu1233.github.io/DreamID-V/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#FaceSwapping #DiffusionModels #ComputerVision #GenerativeAI #VideoAI