ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
SPHINX: A Synthetic Environment for Visual Perception and Reasoning

📝 Summary:
Sphinx is a synthetic environment for visual perception and reasoning, using procedurally generated puzzles to evaluate large vision-language models. It shows that current state-of-the-art models perform poorly, but reinforcement learning with verifiable rewards substantially improves accuracy.
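💡 Quick sketch (not from the paper): because the puzzles are procedurally generated, each one carries its own ground truth, so an RL reward can be verified exactly rather than judged by another model. The toy `generate_puzzle` and the reward shaping below are assumptions for illustration only.

```python
import random

def generate_puzzle(seed: int) -> dict:
    """Hypothetical stand-in for a procedural puzzle generator:
    the generator knows the correct answer by construction."""
    rng = random.Random(seed)
    shapes = [rng.choice(["circle", "square", "triangle"]) for _ in range(rng.randint(3, 9))]
    return {
        "shapes": shapes,
        "question": "How many circles are in the image?",
        "answer": str(shapes.count("circle")),
    }

def verifiable_reward(model_output: str, puzzle: dict) -> float:
    """Binary reward: 1.0 iff the model's final token matches the known answer."""
    tokens = model_output.strip().split()
    predicted = tokens[-1] if tokens else ""
    return 1.0 if predicted == puzzle["answer"] else 0.0

# Usage: score a (mock) model response against a generated puzzle.
puzzle = generate_puzzle(seed=7)
print(puzzle["question"], "->", verifiable_reward("I count 2", puzzle))
```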

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20814
• PDF: https://arxiv.org/pdf/2511.20814
• Github: https://github.com/xashru/sphinx

Datasets citing this paper:
https://huggingface.co/datasets/xashru/sphinx

==================================

For more data science resources:
https://t.me/DataScienceT

#AI #ComputerVision #ReinforcementLearning #VisionLanguageModels #SyntheticEnvironments
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

📝 Summary:
G^2VLM integrates 3D geometry learning into vision-language models to overcome their spatial intelligence deficits. It unifies 3D reconstruction and spatial reasoning, leveraging learned 3D features to achieve strong performance in both tasks.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21688
• PDF: https://arxiv.org/pdf/2511.21688
• Project Page: https://gordonhu608.github.io/g2vlm.github.io/
• Github: https://github.com/InternRobotics/G2VLM

🔹 Models citing this paper:
https://huggingface.co/InternRobotics/G2VLM-2B-MoT

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #3DReconstruction #SpatialReasoning #ComputerVision #ArtificialIntelligence
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

📝 Summary:
ENACT is a benchmark evaluating embodied cognition in vision-language models through egocentric world modeling tasks. It reveals a performance gap between VLMs and humans that widens with interaction, and models exhibit anthropocentric biases.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20937
• PDF: https://arxiv.org/pdf/2511.20937

==================================

For more data science resources:
https://t.me/DataScienceT

#EmbodiedCognition #VisionLanguageModels #AIResearch #WorldModeling #CognitiveScience
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

📝 Summary:
LVLMs struggle to preserve cultural identities in mixed visual scenes. The researchers created CultureMix, a VQA benchmark, and find consistent failures and reliance on background cues. Supervised fine-tuning with diverse culture-mixing data significantly improves model consistency and reduces background sensitivity.

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22787
• PDF: https://arxiv.org/pdf/2511.22787

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #CulturalAI #ComputerVision #AIML #AIResearch
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

📝 Summary:
VLASH is an asynchronous inference framework for vision-language-action (VLA) models. It achieves fast, accurate, low-latency robotic control by estimating future robot states to bridge the prediction-execution gap. This enables VLAs to perform high-precision tasks such as ping-pong with significant speedup and reduced latency.
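💡 Quick sketch (not the paper's implementation): the core idea is to condition the policy on the state the robot is expected to reach when the current inference call finishes, so the action is not stale at execution time. The 1-D state, constant-velocity extrapolation, and `policy` stub below are assumptions.

```python
import threading, time
from dataclasses import dataclass

@dataclass
class RobotState:
    position: float   # toy 1-D joint position
    velocity: float

def predict_future_state(state: RobotState, horizon_s: float) -> RobotState:
    """Constant-velocity guess of where the robot will be when inference finishes."""
    return RobotState(state.position + state.velocity * horizon_s, state.velocity)

def policy(state: RobotState) -> float:
    """Stand-in for an expensive VLA forward pass; returns a velocity command."""
    time.sleep(0.05)                    # simulate ~50 ms of inference latency
    return -0.5 * state.position        # toy proportional controller

def async_control_loop(steps: int = 5, latency_s: float = 0.05) -> None:
    state = RobotState(position=1.0, velocity=0.0)
    for _ in range(steps):
        future = predict_future_state(state, latency_s)   # bridge the gap
        result = {}
        worker = threading.Thread(target=lambda: result.update(cmd=policy(future)))
        worker.start()
        state.position += state.velocity * latency_s      # robot keeps moving meanwhile
        worker.join()
        state.velocity = result["cmd"]                     # apply the fresh action
        print(f"pos={state.position:+.3f}  cmd_vel={state.velocity:+.3f}")

async_control_loop()
```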

🔹 Publication Date: Published on Nov 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01031
• PDF: https://arxiv.org/pdf/2512.01031
• Github: https://github.com/mit-han-lab/vlash

==================================

For more data science resources:
https://t.me/DataScienceT

#Robotics #VisionLanguageModels #RealTimeAI #AIResearch #MachineLearning
Structured Extraction from Business Process Diagrams Using Vision-Language Models

📝 Summary:
This paper presents a method using Vision-Language Models to extract structured JSON from BPMN diagram images. It incorporates OCR for text enrichment, demonstrating improved model performance and enabling extraction when source files are unavailable.
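💡 Quick sketch (not the authors' code): the OCR-enrichment step can be pictured as "OCR first, then prompt". The `query_vlm` wrapper and the JSON schema below are assumptions; only the general recipe is taken from the summary.

```python
import json
from PIL import Image
import pytesseract  # requires a local Tesseract installation

def query_vlm(prompt: str, image: Image.Image) -> str:
    """Hypothetical wrapper around any chat-style VLM endpoint."""
    raise NotImplementedError("plug in your VLM client here")

def extract_bpmn_json(image_path: str) -> dict:
    image = Image.open(image_path)
    # 1) OCR enrichment: recover task/gateway labels straight from the pixels.
    ocr_text = pytesseract.image_to_string(image)
    # 2) Ask the VLM for structured output, grounded by the OCR text.
    prompt = (
        "You are given a BPMN diagram image plus OCR text extracted from it.\n"
        f"OCR text:\n{ocr_text}\n"
        "Return JSON with keys 'tasks', 'gateways' and 'flows' "
        "(each flow as {'from': ..., 'to': ...}). Output JSON only."
    )
    return json.loads(query_vlm(prompt, image))
```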

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22448
• PDF: https://arxiv.org/pdf/2511.22448
• Github: https://github.com/pritamdeka/BPMN-VLM

Datasets citing this paper:
https://huggingface.co/datasets/pritamdeka/BPMN-VLM

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #BPMN #InformationExtraction #AI #ComputerVision
CauSight: Learning to Supersense for Visual Causal Discovery

📝 Summary:
CauSight is a novel vision-language model for visual causal discovery, inferring cause-effect relations in images. It uses the VCG-32K dataset and Tree-of-Causal-Thought, significantly outperforming GPT-4.1 with a threefold performance boost.

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01827
• PDF: https://arxiv.org/pdf/2512.01827
• Github: https://github.com/OpenCausaLab/CauSight

🔹 Models citing this paper:
https://huggingface.co/OpenCausaLab/CauSight

Datasets citing this paper:
https://huggingface.co/datasets/OpenCausaLab/VCG-32K

==================================

For more data science resources:
https://t.me/DataScienceT

#VisualCausalDiscovery #VisionLanguageModels #AI #DeepLearning #CausalInference
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

📝 Summary:
Concise Chain-of-Thought steps, specifically minimal visual grounding, are most effective for achieving generalizable visual reasoning in vision-language models. Longer CoT and visual CoT primarily accelerate training but do not improve final performance or generalization across tasks.

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22586
• PDF: https://arxiv.org/pdf/2511.22586

==================================

For more data science resources:
https://t.me/DataScienceT

#ChainOfThought #VisionLanguageModels #VisualReasoning #AIGeneralization #DeepLearning
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

📝 Summary:
TRivia is a self-supervised fine-tuning method for vision-language models to learn table recognition from unlabeled data. It uses a question-answering reward mechanism to autonomously optimize the model. This open-source solution outperforms state-of-the-art systems on popular benchmarks.
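💡 Quick sketch (a toy approximation, not the paper's mechanism): one way to picture a question-answering reward for table recognition is to pose simple cell-lookup questions and reward predictions that answer them correctly. The list-of-lists table format and the scoring rule below are assumptions.

```python
def table_qa_reward(predicted_table: list[list[str]],
                    questions: list[tuple[int, int]],
                    reference_answers: list[str]) -> float:
    """Toy QA reward: each question is a (row, col) lookup; the reward is the
    fraction of lookups on the predicted table that match the reference answers."""
    correct = 0
    for (row, col), answer in zip(questions, reference_answers):
        try:
            cell = predicted_table[row][col].strip()
        except IndexError:                 # structural error: missing row or column
            cell = ""
        correct += int(cell == answer.strip())
    return correct / max(len(questions), 1)

# Usage: a prediction that drops a column scores lower.
questions, answers = [(1, 1), (2, 1)], ["0.91", "0.78"]
good = [["name", "score"], ["alice", "0.91"], ["bob", "0.78"]]
bad = [["name"], ["alice"], ["bob"]]
print(table_qa_reward(good, questions, answers))   # 1.0
print(table_qa_reward(bad, questions, answers))    # 0.0
```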

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01248
• PDF: https://arxiv.org/pdf/2512.01248
• Github: https://github.com/opendatalab/TRivia

🔹 Models citing this paper:
https://huggingface.co/opendatalab/TRivia-3B

Spaces citing this paper:
https://huggingface.co/spaces/opendatalab/TRivia-3B

==================================

For more data science resources:
https://t.me/DataScienceT

#TableRecognition #VisionLanguageModels #SelfSupervisedLearning #AI #DeepLearning
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

📝 Summary:
SpaceTools introduces Double Interactive Reinforcement Learning (DIRL), a two-phase RL framework that enables vision-language models to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world robot tasks.
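💡 Quick sketch (not the paper's interface): setting the RL training aside, the tool-coordination side behaves like a standard tool-calling loop in which the VLM either answers or requests a tool (e.g. a depth probe or detector). `vlm_decide` and the `tools` registry below are hypothetical.

```python
def tool_loop(question: str, image, vlm_decide, tools: dict, max_steps: int = 4) -> str:
    """Generic tool-calling loop: the model either answers or names a tool to call.
    `vlm_decide(question, image, history)` is a hypothetical callable returning
    ("answer", text) or ("tool", name, args); `tools` maps names to functions."""
    history = []
    for _ in range(max_steps):
        decision = vlm_decide(question, image, history)
        if decision[0] == "answer":
            return decision[1]
        _, name, args = decision
        observation = tools[name](image, **args)    # e.g. depth_at(x=..., y=...)
        history.append((name, args, observation))   # feed the result back next turn
    return "no answer within the tool budget"
```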

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04069
• PDF: https://arxiv.org/pdf/2512.04069
• Project Page: https://spacetools.github.io/
• Github: https://spacetools.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#ReinforcementLearning #VisionLanguageModels #Robotics #SpatialReasoning #AI
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

📝 Summary:
AdaptVision is an efficient VLM that adaptively acquires visual tokens through a coarse-to-fine approach, using a bounding box tool. Trained with reinforcement learning to balance accuracy and efficiency, it achieves superior VQA performance using fewer visual tokens.
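💡 Quick sketch (an illustrative assumption, not the released code): the coarse-to-fine loop starts from a downscaled view and only pays for full-resolution tokens inside a bounding box the model itself requests. `vlm_step` is a hypothetical callable.

```python
from PIL import Image

def coarse_to_fine_answer(image_path: str, question: str, vlm_step, max_crops: int = 2) -> str:
    """Iteratively acquire visual detail. `vlm_step(views, question, force_answer=False)`
    is a hypothetical callable returning either ("answer", text) or
    ("crop", (left, top, right, bottom)) in original-image coordinates."""
    image = Image.open(image_path)
    views = [image.resize((image.width // 4, image.height // 4))]   # cheap coarse view
    for _ in range(max_crops):
        kind, payload = vlm_step(views, question)
        if kind == "answer":
            return payload
        views.append(image.crop(payload))        # add full-resolution detail only where asked
    return vlm_step(views, question, force_answer=True)[1]          # budget exhausted
```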

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03794
• PDF: https://arxiv.org/pdf/2512.03794
• Project Page: https://adaptvision.github.io/
• Github: https://github.com/AdaptVision/AdaptVision

🔹 Models citing this paper:
https://huggingface.co/AdaptVision/AdaptVision-7B

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #ReinforcementLearning #ComputerVision #AIResearch #EfficientAI
AutoNeural: Co-Designing Vision-Language Models for NPU Inference

📝 Summary:
AutoNeural is an NPU-native VLM co-designed for efficient edge inference. It uses a MobileNetV5-style vision backbone for stable integer quantization and a hybrid SSM-Transformer language backbone. This design reduces quantization errors and latency, improving real-time performance on edge devices.
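💡 Quick sketch (a generic int8 example, not AutoNeural's scheme): the quantization problem the backbone design targets is easy to see in isolation: activation outliers inflate the quantization scale and blow up the round-trip error.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: q = round(x / scale)."""
    scale = float(np.abs(x).max()) / 127.0 or 1.0   # avoid a zero scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Activations with a large outlier quantize poorly; well-behaved ones do not.
smooth = np.random.randn(1024).astype(np.float32)
spiky = smooth.copy()
spiky[0] = 100.0                                    # one outlier inflates the scale
for name, act in [("smooth", smooth), ("spiky", spiky)]:
    q, s = quantize_int8(act)
    err = float(np.abs(dequantize(q, s) - act).mean())
    print(f"{name}: mean abs quantization error = {err:.4f}")
```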

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02924
• PDF: https://arxiv.org/pdf/2512.02924

🔹 Models citing this paper:
https://huggingface.co/NexaAI/AutoNeural

==================================

For more data science resources:
https://t.me/DataScienceT

#AutoNeural #VisionLanguageModels #EdgeAI #AIHardware #EfficientAI
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

📝 Summary:
NEO is a novel family of native Vision-Language Models built from first principles. It unifies vision and language, aligning pixels and words in a shared semantic space. NEO achieves competitive performance with limited data while efficiently developing visual perception from scratch.

🔹 Publication Date: Published on Oct 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14979
• PDF: https://arxiv.org/pdf/2510.14979
• Github: https://github.com/EvolvingLMMs-Lab/NEO

🔹 Models citing this paper:
https://huggingface.co/Paranioar/NEO1_0-2B-SFT
https://huggingface.co/Paranioar/NEO1_0-9B-SFT
https://huggingface.co/Paranioar/NEO1_0-2B-PT

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #MultimodalAI #DeepLearning #ComputerVision #AIResearch
🤖🧠 Reducing Hallucinations in Vision-Language Models: A Step Forward with VisAlign

🗓️ 24 Nov 2025
📚 AI News & Trends

As artificial intelligence continues to evolve, Large Vision-Language Models (LVLMs) have revolutionized how machines understand and describe the world. These models combine visual perception with natural language understanding to perform tasks such as image captioning, visual question answering and multimodal reasoning. Despite their success, a major problem persists – hallucination. This issue occurs when a ...

#VisAlign #ReducingHallucinations #VisionLanguageModels #LVLMs #MultimodalAI #AISafety
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

📝 Summary:
ReVSeg enhances video object segmentation. It uses sequential reasoning within pretrained vision language models, optimized by reinforcement learning. This achieves state-of-the-art results and provides interpretable reasoning.

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02835
• PDF: https://arxiv.org/pdf/2512.02835
• Project Page: https://clementine24.github.io/ReVSeg/
• Github: https://github.com/Clementine24/ReVSeg

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoSegmentation #ReinforcementLearning #VisionLanguageModels #ComputerVision #DeepLearning
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

📝 Summary:
The TAD benchmark is introduced to evaluate temporal understanding in autonomous driving, addressing a gap where current VLMs perform poorly. It reveals that state-of-the-art models show substandard accuracy in this domain. Two training-free solutions, Scene-CoT and TCogMap, are proposed and shown to improve performance.

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05277
• PDF: https://arxiv.org/pdf/2512.05277
• Github: https://github.com/vbdi/tad_bench

==================================

For more data science resources:
https://t.me/DataScienceT

#AutonomousDriving #VisionLanguageModels #ComputerVision #AIResearch #DeepLearning
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.

🔹 Publication Date: Published on Jul 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.01006
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking

🔹 Models citing this paper:
https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
https://huggingface.co/zai-org/GLM-4.5V
https://huggingface.co/zai-org/GLM-4.6V-Flash

Spaces citing this paper:
https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
https://huggingface.co/spaces/akhaliq/anycoder

==================================

For more data science resources:
https://t.me/DataScienceT

#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

📝 Summary:
DSR Suite addresses vision-language models' weak dynamic spatial reasoning. It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module, significantly enhancing VLM dynamic spatial reasoning while maintaining general capabilities.

🔹 Publication Date: Published on Dec 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints that ensure complete coverage of question-relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts.
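💡 Quick sketch (an assumption about the general recipe, not the authors' implementation): "question-conditioned masked views" can be pictured as zeroing every image patch outside the regions a question refers to, then training the model to behave consistently on masked and full views instead of leaning on text-only shortcuts. The patch size and box format below are illustrative.

```python
import numpy as np

def mask_irrelevant_patches(image: np.ndarray,
                            keep_boxes: list[tuple[int, int, int, int]],
                            patch: int = 16) -> np.ndarray:
    """Zero every patch whose centre falls outside all question-relevant boxes.
    Boxes are (left, top, right, bottom) in pixels; image is HxWxC uint8."""
    h, w = image.shape[:2]
    masked = np.zeros_like(image)
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            cy, cx = py + patch // 2, px + patch // 2
            if any(l <= cx < r and t <= cy < b for (l, t, r, b) in keep_boxes):
                masked[py:py + patch, px:px + patch] = image[py:py + patch, px:px + patch]
    return masked

# Usage: keep only the 64x64 region the question refers to.
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
masked = mask_irrelevant_patches(img, keep_boxes=[(32, 32, 96, 96)])
print(int(masked.any()), float((masked == 0).mean()))   # region kept; most pixels zeroed
```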

🔹 Publication Date: Published on Dec 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.

🔹 Publication Date: Published on Dec 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision