✨StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
📝 Summary:
StageVAR accelerates visual autoregressive (VAR) models by recognizing that early generation stages are critical, while later detail-refinement stages can be pruned or approximated (sketch below). This plug-and-play framework achieves up to a 3.4x speedup with minimal quality loss, outperforming existing methods.
🔹 Publication Date: Published on Dec 18, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16483
• PDF: https://arxiv.org/pdf/2512.16483
• Github: https://github.com/sen-mao/StageVAR
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ComputerVision #DeepLearning #ModelAcceleration #AI #NeuralNetworks
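🔹 Illustrative sketch: a minimal, hypothetical version of the stage-aware schedule the summary describes — full compute for early stages, a cheaper pass for late detail-refinement stages. Module names and the stage split are assumptions, not the authors' implementation (see their repo above).

```python
import torch
import torch.nn as nn

full_model = nn.Linear(64, 64)    # stands in for the full VAR transformer
light_model = nn.Linear(64, 64)   # stands in for a pruned/approximated pass

scales = [1, 2, 4, 8, 16]         # coarse-to-fine token-map resolutions
critical = 2                      # assumed number of early stages at full compute

x = torch.randn(1, 64)            # toy stand-in for the token-map state
for stage, _ in enumerate(scales):
    net = full_model if stage < critical else light_model
    x = net(x)                    # early: full pass; late: cheap refinement
```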
✨Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
📝 Summary:
This paper proposes a framework that uses a semantic-pixel reconstruction objective to adapt encoder features for generation (sketch below). It creates a compact, semantically rich latent space, leading to state-of-the-art image reconstruction and improved text-to-image generation and editing.
🔹 Publication Date: Published on Dec 19, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17909
• PDF: https://arxiv.org/pdf/2512.17909
• Project Page: https://jshilong.github.io/PS-VAE-PAGE/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#TextToImage #ImageGeneration #DeepLearning #ComputerVision #AIResearch
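🔹 Illustrative sketch: one plausible reading of a semantic-pixel reconstruction objective — a pixel-fidelity term plus alignment to a frozen semantic encoder's features. The weighting and feature choices are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def semantic_pixel_loss(recon_img, target_img, recon_feat, teacher_feat, w_sem=1.0):
    """Pixel fidelity plus alignment to a frozen semantic encoder's features."""
    pixel_term = F.mse_loss(recon_img, target_img)
    sem_term = 1.0 - F.cosine_similarity(
        recon_feat.flatten(1), teacher_feat.flatten(1)).mean()
    return pixel_term + w_sem * sem_term

# Toy usage with random tensors
img, feat = torch.rand(2, 3, 32, 32), torch.rand(2, 16)
loss = semantic_pixel_loss(img, img.clone(), feat, feat.clone())
```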
✨RadarGen: Automotive Radar Point Cloud Generation from Cameras
📝 Summary:
RadarGen synthesizes realistic automotive radar point clouds from camera images using diffusion models. It incorporates depth, semantic, and motion cues for physical plausibility (sketch below), enabling scalable multimodal simulation and improving perception models.
🔹 Publication Date: Published on Dec 19, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17897
• PDF: https://arxiv.org/pdf/2512.17897
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AutomotiveRadar #PointClouds #DiffusionModels #ComputerVision #AutonomousDriving
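🔹 Illustrative sketch: the conditioning pattern the summary implies — a denoiser that sees camera-derived depth, semantic, and motion cues as extra input channels. The network and channel layout are toy assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# 3 radar channels + depth (1) + semantics (1) + optical flow (2) as conditions
denoiser = nn.Conv2d(3 + 1 + 1 + 2, 3, kernel_size=3, padding=1)

noisy_radar = torch.randn(1, 3, 64, 64)   # rasterized radar point-cloud channels
depth = torch.rand(1, 1, 64, 64)
semantics = torch.rand(1, 1, 64, 64)
flow = torch.rand(1, 2, 64, 64)           # motion cue

pred = denoiser(torch.cat([noisy_radar, depth, semantics, flow], dim=1))
```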
✨3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
📝 Summary:
3D-RE-GEN reconstructs a single image into a modifiable 3D scene of textured meshes with a comprehensive background. It uses a compositional generative framework and a novel optimization scheme to produce artist-ready, physically realistic layouts, achieving state-of-the-art performance.
🔹 Publication Date: Published on Dec 19, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17459
• PDF: https://arxiv.org/pdf/2512.17459
• Project Page: https://3dregen.jdihlmann.com/
• Github: https://github.com/cgtuebingen/3D-RE-GEN
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DReconstruction #GenerativeAI #ComputerVision #DeepLearning #ComputerGraphics
✨The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
📝 Summary:
The Prism Hypothesis posits that semantic encoders capture low-frequency meaning, while pixel encoders retain high-frequency details. Unified Autoencoding (UAE) leverages this with a frequency-band modulator that harmonizes both into a single latent space (toy illustration below), achieving state-of-the-art performance on image reconstruction and generation.
🔹 Publication Date: Published on Dec 22, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19693
• PDF: https://arxiv.org/pdf/2512.19693
• Github: https://github.com/WeichenFan/UAE
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DeepLearning #ComputerVision #Autoencoders #RepresentationLearning #AIResearch
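🔹 Illustrative sketch: a toy frequency-band merge — low frequencies taken from semantic features, high frequencies from pixel features. The fixed FFT mask here is a stand-in assumption for the paper's learned frequency-band modulator.

```python
import torch

def band_merge(sem_feat, pix_feat, cutoff=4.0):
    """Keep low frequencies from semantic features, high from pixel features."""
    S, P = torch.fft.fft2(sem_feat), torch.fft.fft2(pix_feat)
    h, w = sem_feat.shape[-2:]
    fy = torch.fft.fftfreq(h).abs()[:, None] * h
    fx = torch.fft.fftfreq(w).abs()[None, :] * w
    mask = ((fy + fx) <= cutoff).to(S.dtype)   # crude fixed low-frequency mask
    merged = S * mask + P * (1 - mask)
    return torch.fft.ifft2(merged).real

latent = band_merge(torch.rand(1, 8, 32, 32), torch.rand(1, 8, 32, 32))
```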
✨LongVideoAgent: Multi-Agent Reasoning with Long Videos
📝 Summary:
A multi-agent framework with a master LLM, a grounding agent, and a vision agent enhances long-video QA by improving temporal grounding and extracting visual details (schematic below). This RL-trained system outperforms non-agent baselines on new datasets.
🔹 Publication Date: Published on Dec 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20618
• PDF: https://arxiv.org/pdf/2512.20618
• Project Page: https://longvideoagent.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultiAgentSystems #LLM #VideoUnderstanding #ComputerVision #AI
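🔹 Illustrative sketch: a schematic control loop for a master/grounding/vision agent setup as the summary describes. All agent interfaces here are invented placeholders, not the paper's API.

```python
def answer_long_video(question, video, master, grounder, vision, max_steps=4):
    """Schematic master/grounding/vision loop; agent APIs are placeholders."""
    notes = []
    for _ in range(max_steps):
        plan = master.plan(question, notes)            # master LLM picks next step
        if plan.get("done"):
            return plan["answer"]
        segment = grounder.localize(video, plan["query"])  # temporal grounding
        notes.append(vision.describe(video, segment))      # extract visual details
    return master.plan(question, notes).get("answer")
```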
✨Learning to Refocus with Video Diffusion Models
📝 Summary:
A novel method enables realistic post-capture refocusing from a single defocused image. It uses video diffusion models to generate a focal stack for interactive focus adjustment (sketch below). This approach outperforms existing methods, improving focus editing in photography.
🔹 Publication Date: Published on Dec 22, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19823
• PDF: https://arxiv.org/pdf/2512.19823
• Project Page: https://learn2refocus.github.io/
• Github: https://github.com/tedlasai/learn2refocus
🔹 Models citing this paper:
• https://huggingface.co/tedlasai/learn2refocus
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoDiffusionModels #ComputationalPhotography #ImageRefocusing #DeepLearning #ComputerVision
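🔹 Illustrative sketch: once a video diffusion model has produced a focal stack, interactive refocusing reduces to selecting the frame nearest the requested focus distance. The stack size and distances below are assumptions; generating the stack is the paper's actual contribution.

```python
import torch

focal_stack = torch.rand(16, 3, 128, 128)    # 16 frames, near-to-far focus
focus_dists = torch.linspace(0.3, 10.0, 16)  # assumed focus distance per frame

def refocus(target_dist: float) -> torch.Tensor:
    """Return the stack frame whose focus distance is closest to the request."""
    idx = (focus_dists - target_dist).abs().argmin()
    return focal_stack[idx]

image_at_2m = refocus(2.0)
```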
✨Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
📝 Summary:
DSR Suite targets the weak dynamic spatial reasoning of vision-language models (VLMs). It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module. This significantly enhances VLM dynamic spatial reasoning while maintaining general capabilities.
🔹 Publication Date: Published on Dec 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
✨Latent Implicit Visual Reasoning
📝 Summary:
Large Multimodal Models (LMMs) struggle with visual reasoning due to their text-centric nature and the limitations of prior methods. This paper introduces a task-agnostic mechanism for LMMs to discover and use visual reasoning tokens without explicit supervision. The approach achieves state-of-the-art results.
🔹 Publication Date: Published on Dec 24, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21218
• PDF: https://arxiv.org/pdf/2512.21218
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LMMs #VisualReasoning #AI #ComputerVision #DeepLearning
✨Spatia: Video Generation with Updatable Spatial Memory
📝 Summary:
Spatia is a video generation framework that improves long-term consistency by using an updatable 3D scene point cloud as persistent spatial memory. It iteratively generates video clips and updates this memory via visual SLAM (schematic below), enabling realistic videos and 3D-aware interactive editing.
🔹 Publication Date: Published on Dec 17, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.15716
• PDF: https://arxiv.org/pdf/2512.15716
• Project Page: https://zhaojingjing713.github.io/Spatia/
• Github: https://github.com/ZhaoJingjing713/Spatia
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #GenerativeAI #ComputerVision #3DReconstruction #SLAM
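🔹 Illustrative sketch: the generate-then-update loop the summary describes. The generator and SLAM interfaces are placeholders standing in for the paper's components, not real APIs.

```python
def generate_with_spatial_memory(generator, slam, prompt, n_clips):
    """Schematic Spatia-style loop; generator/slam objects are placeholders."""
    memory = None                 # persistent 3D scene point cloud
    clips = []
    for _ in range(n_clips):
        clip = generator.sample(prompt, memory)  # condition on spatial memory
        memory = slam.update(memory, clip)       # fold new geometry into memory
        clips.append(clip)
    return clips
```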
✨How Much 3D Do Video Foundation Models Encode?
📝 Summary:
A new framework quantifies 3D understanding in Video Foundation Models (VidFMs). Although trained only on video, VidFMs show strong 3D awareness, often surpassing expert 3D models, providing insights for 3D AI (probe recipe below).
🔹 Publication Date: Published on Dec 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19949
• PDF: https://arxiv.org/pdf/2512.19949
• Project Page: https://vidfm-3d-probe.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoFoundationModels #3DUnderstanding #ComputerVision #AIResearch #DeepLearning
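🔹 Illustrative sketch: the generic linear-probe recipe such probing frameworks rely on — freeze the video model, regress a 3D target (e.g., depth) from its features. Shapes and the probe head are toy assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

feats = torch.randn(100, 768)   # frozen VidFM features (toy shapes)
depth = torch.rand(100, 1)      # 3D target, e.g. per-patch depth
probe = nn.Linear(768, 1)       # only the probe is trained
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(feats), depth)
    loss.backward()
    opt.step()
```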
✨Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
📝 Summary:
Fast3R is a Transformer-based method for efficient, scalable multi-view 3D reconstruction. It processes many images in parallel in a single forward pass (sketch below), improving speed and accuracy over pairwise approaches such as DUSt3R.
🔹 Publication Date: Published on Jan 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2501.13928
• PDF: https://arxiv.org/pdf/2501.13928
• Github: https://github.com/naver/dust3r/pull/16
🔹 Models citing this paper:
• https://huggingface.co/jedyang97/Fast3R_ViT_Large_512
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DReconstruction #ComputerVision #Transformers #Fast3R #DeepLearning
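🔹 Illustrative sketch: the "all views in one forward pass" idea — tokens from every image form one sequence that attends jointly, then a head emits per-view pointmaps. Dimensions and layers are toy values, not Fast3R's configuration.

```python
import torch
import torch.nn as nn

views, tokens_per_view, dim = 8, 16, 64   # toy sizes
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(dim, 3)                  # xyz per token

# Tokens from every view in one sequence -> joint attention in a single pass
all_tokens = torch.randn(1, views * tokens_per_view, dim)
pointmaps = head(encoder(all_tokens)).view(1, views, tokens_per_view, 3)
```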
✨InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
📝 Summary:
InsertAnywhere is a framework for realistic video object insertion. It uses 4D-aware mask generation for geometric consistency and an extended diffusion model for appearance-faithful synthesis, outperforming existing methods.
🔹 Publication Date: Published on Dec 19, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17504
• PDF: https://arxiv.org/pdf/2512.17504
• Project Page: https://myyzzzoooo.github.io/InsertAnywhere/
• Github: https://github.com/myyzzzoooo/InsertAnywhere
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoEditing #DiffusionModels #ComputerVision #DeepLearning #GenerativeAI
✨UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
📝 Summary:
UniPercept-Bench provides a unified framework and datasets for perceptual image understanding across aesthetics, quality, structure, and texture. The UniPercept model, trained with DAPT and T-ARL, outperforms MLLMs, generalizes across VR and VQA, and acts as a text-to-image reward model.
🔹 Publication Date: Published on Dec 25, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21675
• PDF: https://arxiv.org/pdf/2512.21675
• Project Page: https://thunderbolt215.github.io/Unipercept-project/
• Github: https://github.com/thunderbolt215/UniPercept
🔹 Models citing this paper:
• https://huggingface.co/Thunderbolt215215/UniPercept
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Thunderbolt215215/UniPercept-Bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageUnderstanding #ComputerVision #AIResearch #PerceptualAI #DeepLearning
✨Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
📝 Summary:
Transparent objects are hard for perception systems. Observing that video diffusion models can synthesize transparent phenomena, this work repurposes one for geometry estimation. The resulting DKT model, trained on a new dataset, achieves zero-shot state-of-the-art depth and normal estimation for transparent objects, showing that diffusion knows transparency.
🔹 Publication Date: Published on Dec 29, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.23705
• PDF: https://arxiv.org/pdf/2512.23705
• Project Page: https://daniellli.github.io/projects/DKT/
• Github: https://github.com/Daniellli/DKT
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ComputerVision #DiffusionModels #DepthEstimation #TransparentObjects #AIResearch
✨SpotEdit: Selective Region Editing in Diffusion Transformers
📝 Summary:
SpotEdit is a training-free framework for selective image editing in diffusion transformers. It avoids reprocessing stable regions by reusing their features and combining them with the edited areas (sketch below). This reduces computation and preserves unchanged regions, improving both efficiency and precision.
🔹 Publication Date: Published on Dec 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22323
• PDF: https://arxiv.org/pdf/2512.22323
• Project Page: https://biangbiang0321.github.io/SpotEdit.github.io
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageEditing #DiffusionModels #ComputerVision #AIResearch #DeepLearning
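🔹 Illustrative sketch: a hypothetical version of the caching idea — recompute features only inside the edit mask and reuse cached features elsewhere. In the real method computation is restricted to the masked tokens; here a full recompute plus blend keeps the toy simple.

```python
import torch

cached_feats = torch.randn(1, 8, 32, 32)   # features saved from the source pass
edit_mask = torch.zeros(1, 1, 32, 32)
edit_mask[..., 8:20, 8:20] = 1.0           # region the prompt actually edits

def blend_features(recompute_fn):
    """Recompute inside the mask, reuse the cache everywhere else."""
    fresh = recompute_fn(cached_feats)     # stand-in for the edited-region pass
    return edit_mask * fresh + (1 - edit_mask) * cached_feats

out = blend_features(lambda f: f + 0.1)    # dummy "edited" features
```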
✨Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.
🔹 Publication Date: Published on Dec 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision
✨GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
📝 Summary:
GaMO improves sparse-view 3D reconstruction by using geometry-aware multi-view outpainting. It expands existing views to enhance scene coverage and consistency. This achieves state-of-the-art quality 25x faster than prior methods, with reduced computational cost.
🔹 Publication Date: Published on Dec 31, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.25073
• PDF: https://arxiv.org/pdf/2512.25073
• Project Page: https://yichuanh.github.io/GaMO/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DReconstruction #ComputerVision #DiffusionModels #GaMO #AI
✨Guiding a Diffusion Transformer with the Internal Dynamics of Itself
📝 Summary:
This paper introduces Internal Guidance (IG) for diffusion models, which adds auxiliary supervision to intermediate layers during training and extrapolates their outputs during sampling (sketch below). This simple strategy significantly improves training efficiency and generation quality; IG achieves state-of-the-art FID.
🔹 Publication Date: Published on Dec 30, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24176
• PDF: https://arxiv.org/pdf/2512.24176
• Project Page: https://zhouxingyu13.github.io/Internal-Guidance/
• Github: https://github.com/CVL-UESTC/Internal-Guidance
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DiffusionModels #AI #DeepLearning #GenerativeAI #ComputerVision
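🔹 Illustrative sketch: one reading of the extrapolation step in the summary — push the final-layer prediction away from the weaker intermediate-layer one, in the style of guidance. The weight w and the combination rule are assumptions, not the paper's exact formula.

```python
import torch

def ig_extrapolate(pred_final, pred_intermediate, w=1.5):
    """Extrapolate past the intermediate prediction toward the final one."""
    return pred_intermediate + w * (pred_final - pred_intermediate)

eps_final = torch.randn(4, 3, 8, 8)   # final-layer noise prediction
eps_mid = torch.randn(4, 3, 8, 8)     # auxiliary intermediate-layer prediction
eps = ig_extrapolate(eps_final, eps_mid)
```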
✨Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
📝 Summary:
DiffusionGS is a novel single-stage 3D diffusion model that directly generates 3D Gaussian point clouds from a single image, ensuring strong view consistency from any prompt view (toy sketch below). This method achieves superior quality and is over 5x faster than state-of-the-art techniques.
🔹 Publication Date: Published on Nov 21, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2411.14384
• PDF: https://arxiv.org/pdf/2411.14384
• Project Page: https://caiyuanhao1998.github.io/project/DiffusionGS/
• Github: https://github.com/caiyuanhao1998/Open-DiffusionGS
🔹 Models citing this paper:
• https://huggingface.co/CaiYuanhao/DiffusionGS
✨ Datasets citing this paper:
• https://huggingface.co/datasets/CaiYuanhao/DiffusionGS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#3DGeneration #DiffusionModels #GaussianSplatting #ComputerVision #AIResearch
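🔹 Illustrative sketch: the single-stage idea in miniature — a denoiser head directly predicts per-pixel 3D Gaussian parameters. The 14-channel split below is an illustrative assumption, not DiffusionGS's parameterization.

```python
import torch
import torch.nn as nn

# 3 (xyz) + 3 (scale) + 4 (rotation quaternion) + 1 (opacity) + 3 (color) = 14
gaussian_head = nn.Conv2d(64, 14, kernel_size=1)

feats = torch.randn(1, 64, 32, 32)   # toy denoiser features at some timestep
params = gaussian_head(feats)
xyz, scale, rot, opacity, rgb = params.split([3, 3, 4, 1, 3], dim=1)
```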