ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
SPHINX: A Synthetic Environment for Visual Perception and Reasoning

📝 Summary:
Sphinx is a synthetic environment for visual perception and reasoning, using procedurally generated puzzles to evaluate large vision-language models. It shows that current state-of-the-art models perform poorly, but reinforcement learning with verifiable rewards substantially improves accuracy.
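💡 Quick sketch (not from the paper): because the puzzles are procedurally generated, each one carries its own ground truth, so an RL reward can be verified exactly rather than judged by another model. The toy `generate_puzzle` and the reward shaping below are assumptions for illustration only.

```python
import random

def generate_puzzle(seed: int) -> dict:
    """Hypothetical stand-in for a procedural puzzle generator:
    the generator knows the correct answer by construction."""
    rng = random.Random(seed)
    shapes = [rng.choice(["circle", "square", "triangle"]) for _ in range(rng.randint(3, 9))]
    return {
        "shapes": shapes,
        "question": "How many circles are in the image?",
        "answer": str(shapes.count("circle")),
    }

def verifiable_reward(model_output: str, puzzle: dict) -> float:
    """Binary reward: 1.0 iff the model's final token matches the known answer."""
    tokens = model_output.strip().split()
    predicted = tokens[-1] if tokens else ""
    return 1.0 if predicted == puzzle["answer"] else 0.0

# Usage: score a (mock) model response against a generated puzzle.
puzzle = generate_puzzle(seed=7)
print(puzzle["question"], "->", verifiable_reward("I count 2", puzzle))
```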

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20814
• PDF: https://arxiv.org/pdf/2511.20814
• Github: https://github.com/xashru/sphinx

Datasets citing this paper:
https://huggingface.co/datasets/xashru/sphinx

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #ReinforcementLearning #VisionLanguageModels #SyntheticEnvironments
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

📝 Summary:
G^2VLM integrates 3D geometry learning into vision-language models to overcome their spatial intelligence deficits. It unifies 3D reconstruction and spatial reasoning, leveraging learned 3D features to achieve strong performance in both tasks.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21688
• PDF: https://arxiv.org/pdf/2511.21688
• Project Page: https://gordonhu608.github.io/g2vlm.github.io/
• Github: https://github.com/InternRobotics/G2VLM

🔹 Models citing this paper:
https://huggingface.co/InternRobotics/G2VLM-2B-MoT

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #3DReconstruction #SpatialReasoning #ComputerVision #ArtificialIntelligence
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

📝 Summary:
ENACT is a benchmark evaluating embodied cognition in vision-language models through egocentric world modeling tasks. It reveals a performance gap between VLMs and humans that widens with interaction, and models exhibit anthropocentric biases.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20937
• PDF: https://arxiv.org/pdf/2511.20937

==================================

For more data science resources:
https://t.me/DataScienceT

#EmbodiedCognition #VisionLanguageModels #AIResearch #WorldModeling #CognitiveScience
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

📝 Summary:
LVLMs struggle to preserve cultural identities in mixed visual scenes. The researchers created CultureMix, a VQA benchmark, and find consistent failures and reliance on background cues. Supervised fine-tuning with diverse culture-mixing data significantly improves model consistency and reduces background sensitivity.

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22787
• PDF: https://arxiv.org/pdf/2511.22787

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #CulturalAI #ComputerVision #AIML #AIResearch
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

📝 Summary:
VLASH is an asynchronous inference framework for vision-language-action (VLA) models. It achieves fast, accurate, low-latency robotic control by estimating future robot states to bridge the prediction-execution gap. This enables VLAs to perform high-precision tasks such as ping-pong with significant speedup and reduced latency.
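💡 Quick sketch (not the paper's implementation): the core idea is to condition the policy on the state the robot is expected to reach when the current inference call finishes, so the action is not stale at execution time. The 1-D state, constant-velocity extrapolation, and `policy` stub below are assumptions.

```python
import threading, time
from dataclasses import dataclass

@dataclass
class RobotState:
    position: float   # toy 1-D joint position
    velocity: float

def predict_future_state(state: RobotState, horizon_s: float) -> RobotState:
    """Constant-velocity guess of where the robot will be when inference finishes."""
    return RobotState(state.position + state.velocity * horizon_s, state.velocity)

def policy(state: RobotState) -> float:
    """Stand-in for an expensive VLA forward pass; returns a velocity command."""
    time.sleep(0.05)                    # simulate ~50 ms of inference latency
    return -0.5 * state.position        # toy proportional controller

def async_control_loop(steps: int = 5, latency_s: float = 0.05) -> None:
    state = RobotState(position=1.0, velocity=0.0)
    for _ in range(steps):
        future = predict_future_state(state, latency_s)   # bridge the gap
        result = {}
        worker = threading.Thread(target=lambda: result.update(cmd=policy(future)))
        worker.start()
        state.position += state.velocity * latency_s      # robot keeps moving meanwhile
        worker.join()
        state.velocity = result["cmd"]                     # apply the fresh action
        print(f"pos={state.position:+.3f}  cmd_vel={state.velocity:+.3f}")

async_control_loop()
```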

🔹 Publication Date: Published on Nov 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01031
• PDF: https://arxiv.org/pdf/2512.01031
• Github: https://github.com/mit-han-lab/vlash

==================================

For more data science resources:
https://t.me/DataScienceT

#Robotics #VisionLanguageModels #RealTimeAI #AIResearch #MachineLearning
Structured Extraction from Business Process Diagrams Using Vision-Language Models

📝 Summary:
This paper presents a method using Vision-Language Models to extract structured JSON from BPMN diagram images. It incorporates OCR for text enrichment, demonstrating improved model performance and enabling extraction when source files are unavailable.
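💡 Quick sketch (not the authors' code): the OCR-enrichment step can be pictured as "OCR first, then prompt". The `query_vlm` wrapper and the JSON schema below are assumptions; only the general recipe is taken from the summary.

```python
import json
from PIL import Image
import pytesseract  # requires a local Tesseract installation

def query_vlm(prompt: str, image: Image.Image) -> str:
    """Hypothetical wrapper around any chat-style VLM endpoint."""
    raise NotImplementedError("plug in your VLM client here")

def extract_bpmn_json(image_path: str) -> dict:
    image = Image.open(image_path)
    # 1) OCR enrichment: recover task/gateway labels straight from the pixels.
    ocr_text = pytesseract.image_to_string(image)
    # 2) Ask the VLM for structured output, grounded by the OCR text.
    prompt = (
        "You are given a BPMN diagram image plus OCR text extracted from it.\n"
        f"OCR text:\n{ocr_text}\n"
        "Return JSON with keys 'tasks', 'gateways' and 'flows' "
        "(each flow as {'from': ..., 'to': ...}). Output JSON only."
    )
    return json.loads(query_vlm(prompt, image))
```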

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22448
• PDF: https://arxiv.org/pdf/2511.22448
• Github: https://github.com/pritamdeka/BPMN-VLM

Datasets citing this paper:
https://huggingface.co/datasets/pritamdeka/BPMN-VLM

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #BPMN #InformationExtraction #AI #ComputerVision
CauSight: Learning to Supersense for Visual Causal Discovery

📝 Summary:
CauSight is a novel vision-language model for visual causal discovery, inferring cause-effect relations in images. It uses the VCG-32K dataset and Tree-of-Causal-Thought, significantly outperforming GPT-4.1 with a threefold performance boost.

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01827
• PDF: https://arxiv.org/pdf/2512.01827
• Github: https://github.com/OpenCausaLab/CauSight

🔹 Models citing this paper:
https://huggingface.co/OpenCausaLab/CauSight

Datasets citing this paper:
https://huggingface.co/datasets/OpenCausaLab/VCG-32K

==================================

For more data science resources:
https://t.me/DataScienceT

#VisualCausalDiscovery #VisionLanguageModels #AI #DeepLearning #CausalInference
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

📝 Summary:
Concise Chain-of-Thought steps, specifically minimal visual grounding, are most effective for achieving generalizable visual reasoning in vision-language models. Longer CoT and visual CoT primarily accelerate training but do not improve final performance or generalization across tasks.

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22586
• PDF: https://arxiv.org/pdf/2511.22586

==================================

For more data science resources:
https://t.me/DataScienceT

#ChainOfThought #VisionLanguageModels #VisualReasoning #AIGeneralization #DeepLearning
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

📝 Summary:
TRivia is a self-supervised fine-tuning method for vision-language models to learn table recognition from unlabeled data. It uses a question-answering reward mechanism to autonomously optimize the model. This open-source solution outperforms state-of-the-art systems on popular benchmarks.
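💡 Quick sketch (a toy approximation, not the paper's mechanism): one way to picture a question-answering reward for table recognition is to pose simple cell-lookup questions and reward predictions that answer them correctly. The list-of-lists table format and the scoring rule below are assumptions.

```python
def table_qa_reward(predicted_table: list[list[str]],
                    questions: list[tuple[int, int]],
                    reference_answers: list[str]) -> float:
    """Toy QA reward: each question is a (row, col) lookup; the reward is the
    fraction of lookups on the predicted table that match the reference answers."""
    correct = 0
    for (row, col), answer in zip(questions, reference_answers):
        try:
            cell = predicted_table[row][col].strip()
        except IndexError:                 # structural error: missing row or column
            cell = ""
        correct += int(cell == answer.strip())
    return correct / max(len(questions), 1)

# Usage: a prediction that drops a column scores lower.
questions, answers = [(1, 1), (2, 1)], ["0.91", "0.78"]
good = [["name", "score"], ["alice", "0.91"], ["bob", "0.78"]]
bad = [["name"], ["alice"], ["bob"]]
print(table_qa_reward(good, questions, answers))   # 1.0
print(table_qa_reward(bad, questions, answers))    # 0.0
```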

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01248
• PDF: https://arxiv.org/pdf/2512.01248
• Github: https://github.com/opendatalab/TRivia

🔹 Models citing this paper:
https://huggingface.co/opendatalab/TRivia-3B

Spaces citing this paper:
https://huggingface.co/spaces/opendatalab/TRivia-3B

==================================

For more data science resources:
https://t.me/DataScienceT

#TableRecognition #VisionLanguageModels #SelfSupervisedLearning #AI #DeepLearning
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

📝 Summary:
SpaceTools introduces Double Interactive Reinforcement Learning (DIRL), a two-phase RL framework that enables vision-language models to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world robot tasks.
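💡 Quick sketch (not the paper's interface): setting the RL training aside, the tool-coordination side behaves like a standard tool-calling loop in which the VLM either answers or requests a tool (e.g. a depth probe or detector). `vlm_decide` and the `tools` registry below are hypothetical.

```python
def tool_loop(question: str, image, vlm_decide, tools: dict, max_steps: int = 4) -> str:
    """Generic tool-calling loop: the model either answers or names a tool to call.
    `vlm_decide(question, image, history)` is a hypothetical callable returning
    ("answer", text) or ("tool", name, args); `tools` maps names to functions."""
    history = []
    for _ in range(max_steps):
        decision = vlm_decide(question, image, history)
        if decision[0] == "answer":
            return decision[1]
        _, name, args = decision
        observation = tools[name](image, **args)    # e.g. depth_at(x=..., y=...)
        history.append((name, args, observation))   # feed the result back next turn
    return "no answer within the tool budget"
```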

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04069
• PDF: https://arxiv.org/pdf/2512.04069
• Project Page: https://spacetools.github.io/
• Github: https://spacetools.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#ReinforcementLearning #VisionLanguageModels #Robotics #SpatialReasoning #AI
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

📝 Summary:
AdaptVision is an efficient VLM that adaptively acquires visual tokens through a coarse-to-fine approach, using a bounding box tool. Trained with reinforcement learning to balance accuracy and efficiency, it achieves superior VQA performance using fewer visual tokens.
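💡 Quick sketch (an illustrative assumption, not the released code): the coarse-to-fine loop starts from a downscaled view and only pays for full-resolution tokens inside a bounding box the model itself requests. `vlm_step` is a hypothetical callable.

```python
from PIL import Image

def coarse_to_fine_answer(image_path: str, question: str, vlm_step, max_crops: int = 2) -> str:
    """Iteratively acquire visual detail. `vlm_step(views, question, force_answer=False)`
    is a hypothetical callable returning either ("answer", text) or
    ("crop", (left, top, right, bottom)) in original-image coordinates."""
    image = Image.open(image_path)
    views = [image.resize((image.width // 4, image.height // 4))]   # cheap coarse view
    for _ in range(max_crops):
        kind, payload = vlm_step(views, question)
        if kind == "answer":
            return payload
        views.append(image.crop(payload))        # add full-resolution detail only where asked
    return vlm_step(views, question, force_answer=True)[1]          # budget exhausted
```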

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03794
• PDF: https://arxiv.org/pdf/2512.03794
• Project Page: https://adaptvision.github.io/
• Github: https://github.com/AdaptVision/AdaptVision

🔹 Models citing this paper:
https://huggingface.co/AdaptVision/AdaptVision-7B

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #ReinforcementLearning #ComputerVision #AIResearch #EfficientAI
AutoNeural: Co-Designing Vision-Language Models for NPU Inference

📝 Summary:
AutoNeural is an NPU-native VLM co-designed for efficient edge inference. It uses a MobileNetV5-style vision backbone for stable integer quantization and a hybrid SSM-Transformer language backbone. This design reduces quantization errors and latency, improving real-time performance on edge devices.
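💡 Quick sketch (a generic int8 example, not AutoNeural's scheme): the quantization problem the backbone design targets is easy to see in isolation: activation outliers inflate the quantization scale and blow up the round-trip error.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: q = round(x / scale)."""
    scale = float(np.abs(x).max()) / 127.0 or 1.0   # avoid a zero scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Activations with a large outlier quantize poorly; well-behaved ones do not.
smooth = np.random.randn(1024).astype(np.float32)
spiky = smooth.copy()
spiky[0] = 100.0                                    # one outlier inflates the scale
for name, act in [("smooth", smooth), ("spiky", spiky)]:
    q, s = quantize_int8(act)
    err = float(np.abs(dequantize(q, s) - act).mean())
    print(f"{name}: mean abs quantization error = {err:.4f}")
```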

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02924
• PDF: https://arxiv.org/pdf/2512.02924

🔹 Models citing this paper:
https://huggingface.co/NexaAI/AutoNeural

==================================

For more data science resources:
https://t.me/DataScienceT

#AutoNeural #VisionLanguageModels #EdgeAI #AIHardware #EfficientAI
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

📝 Summary:
NEO is a novel family of native Vision-Language Models built from first principles. It unifies vision and language, aligning pixels and words in a shared semantic space. NEO achieves competitive performance with limited data while efficiently developing visual perception from scratch.

🔹 Publication Date: Published on Oct 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14979
• PDF: https://arxiv.org/pdf/2510.14979
• Github: https://github.com/EvolvingLMMs-Lab/NEO

🔹 Models citing this paper:
https://huggingface.co/Paranioar/NEO1_0-2B-SFT
https://huggingface.co/Paranioar/NEO1_0-9B-SFT
https://huggingface.co/Paranioar/NEO1_0-2B-PT

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #MultimodalAI #DeepLearning #ComputerVision #AIResearch
🤖🧠 Reducing Hallucinations in Vision-Language Models: A Step Forward with VisAlign

🗓️ 24 Nov 2025
📚 AI News & Trends

As artificial intelligence continues to evolve, Large Vision-Language Models (LVLMs) have revolutionized how machines understand and describe the world. These models combine visual perception with natural language understanding to perform tasks such as image captioning, visual question answering and multimodal reasoning. Despite their success, a major problem persists – hallucination. This issue occurs when a ...

#VisAlign #ReducingHallucinations #VisionLanguageModels #LVLMs #MultimodalAI #AISafety
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

📝 Summary:
ReVSeg enhances video object segmentation. It uses sequential reasoning within pretrained vision language models, optimized by reinforcement learning. This achieves state-of-the-art results and provides interpretable reasoning.

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02835
• PDF: https://arxiv.org/pdf/2512.02835
• Project Page: https://clementine24.github.io/ReVSeg/
• Github: https://github.com/Clementine24/ReVSeg

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoSegmentation #ReinforcementLearning #VisionLanguageModels #ComputerVision #DeepLearning
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

📝 Summary:
The TAD benchmark is introduced to evaluate temporal understanding in autonomous driving, addressing a gap where current VLMs perform poorly. It reveals that state-of-the-art models show substandard accuracy in this domain. Two training-free solutions, Scene-CoT and TCogMap, are proposed and shown to improve performance.

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05277
• PDF: https://arxiv.org/pdf/2512.05277
• Github: https://github.com/vbdi/tad_bench

==================================

For more data science resources:
https://t.me/DataScienceT

#AutonomousDriving #VisionLanguageModels #ComputerVision #AIResearch #DeepLearning
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.

🔹 Publication Date: Published on Jul 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.01006
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking

🔹 Models citing this paper:
https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
https://huggingface.co/zai-org/GLM-4.5V
https://huggingface.co/zai-org/GLM-4.6V-Flash

Spaces citing this paper:
https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
https://huggingface.co/spaces/akhaliq/anycoder

==================================

For more data science resources:
https://t.me/DataScienceT

#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

📝 Summary:
DSR Suite addresses vision-language models' weak dynamic spatial reasoning. It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module, significantly enhancing VLM dynamic spatial reasoning while maintaining general capabilities.

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints that ensure complete coverage of question-relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts.
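💡 Quick sketch (an assumption about the general recipe, not the authors' implementation): "question-conditioned masked views" can be pictured as zeroing every image patch outside the regions a question refers to, then training the model to behave consistently on masked and full views instead of leaning on text-only shortcuts. The patch size and box format below are illustrative.

```python
import numpy as np

def mask_irrelevant_patches(image: np.ndarray,
                            keep_boxes: list[tuple[int, int, int, int]],
                            patch: int = 16) -> np.ndarray:
    """Zero every patch whose centre falls outside all question-relevant boxes.
    Boxes are (left, top, right, bottom) in pixels; image is HxWxC uint8."""
    h, w = image.shape[:2]
    masked = np.zeros_like(image)
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            cy, cx = py + patch // 2, px + patch // 2
            if any(l <= cx < r and t <= cy < b for (l, t, r, b) in keep_boxes):
                masked[py:py + patch, px:px + patch] = image[py:py + patch, px:px + patch]
    return masked

# Usage: keep only the 64x64 region the question refers to.
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
masked = mask_irrelevant_patches(img, keep_boxes=[(32, 32, 96, 96)])
print(int(masked.any()), float((masked == 0).mean()))   # region kept; most pixels zeroed
```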

🔹 Publication Date: Published on Dec 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.

🔹 Publication Date: Published on Dec 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision