ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

📝 Summary:
LVLMs struggle to preserve cultural identities in mixed visual scenes. Researchers created CultureMix, a VQA benchmark, and found consistent failures and heavy reliance on background cues. Supervised fine-tuning with diverse culture-mixing data significantly improves model consistency and reduces background sensitivity.
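
The background-reliance finding suggests a simple consistency check: show the same cultural item over different backgrounds and see whether the model's answer changes. Below is a hypothetical sketch; the item fields and the majority-vote metric are illustrative assumptions, not the CultureMix format.

```python
from collections import defaultdict

def consistency_by_item(predictions):
    """predictions: dicts with 'item_id', 'background', 'answer'.
    Returns, per item, the fraction of backgrounds on which the
    majority answer holds (1.0 = fully background-invariant)."""
    grouped = defaultdict(list)
    for p in predictions:
        grouped[p["item_id"]].append(p["answer"])
    scores = {}
    for item_id, answers in grouped.items():
        majority = max(set(answers), key=answers.count)
        scores[item_id] = answers.count(majority) / len(answers)
    return scores

# Toy usage: one cultural item rendered over three backgrounds.
preds = [
    {"item_id": "hanbok_01", "background": "seoul_street", "answer": "hanbok"},
    {"item_id": "hanbok_01", "background": "tokyo_street", "answer": "kimono"},
    {"item_id": "hanbok_01", "background": "plain", "answer": "hanbok"},
]
print(consistency_by_item(preds))  # {'hanbok_01': 0.666...}
```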

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22787
• PDF: https://arxiv.org/pdf/2511.22787

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #CulturalAI #ComputerVision #AIML #AIResearch
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

📝 Summary:
VLASH is an asynchronous inference framework for VLAs. It achieves fast, accurate, and low-latency robotic control by estimating future robot states, bridging prediction-execution gaps. This enables VLAs to perform high-precision tasks such as ping-pong with significant speedup and reduced latency.
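
The core idea, conditioning inference on where the robot will be once the slow forward pass finishes, can be illustrated with a toy constant-velocity extrapolation. This is a hedged sketch of the general mechanism, not the authors' implementation; the state model and latency handling are assumptions.

```python
import time

def extrapolate_state(position, velocity, latency_s):
    # Constant-velocity rollout over the expected inference latency.
    return [p + v * latency_s for p, v in zip(position, velocity)]

def async_control_step(policy, position, velocity, expected_latency_s):
    # Query the (slow) policy on the state the robot is expected to be in
    # when the action actually lands, not the state it was in at query time.
    future_state = extrapolate_state(position, velocity, expected_latency_s)
    start = time.perf_counter()
    action = policy(future_state)
    measured_latency = time.perf_counter() - start
    return action, measured_latency  # measured latency can refine the estimate

# Toy policy: drive the (future) state toward the origin.
action, lat = async_control_step(lambda s: [-0.1 * x for x in s],
                                 position=[0.5, 0.2], velocity=[0.1, -0.05],
                                 expected_latency_s=0.05)
print(action, lat)
```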

🔹 Publication Date: Published on Nov 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01031
• PDF: https://arxiv.org/pdf/2512.01031
• Github: https://github.com/mit-han-lab/vlash

==================================

For more data science resources:
https://t.me/DataScienceT

#Robotics #VisionLanguageModels #RealTimeAI #AIResearch #MachineLearning
Structured Extraction from Business Process Diagrams Using Vision-Language Models

📝 Summary:
This paper presents a method using Vision-Language Models to extract structured JSON from BPMN diagram images. It incorporates OCR for text enrichment, demonstrating improved model performance and enabling extraction when source files are unavailable.
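
The recipe described above (OCR text plus the image, prompted into structured JSON) can be sketched generically. `vlm_generate` below is a hypothetical stand-in for whatever model API is used, and the JSON schema is an illustrative assumption.

```python
import json

PROMPT = (
    "You are given a BPMN diagram image and OCR text extracted from it.\n"
    "OCR text:\n{ocr_text}\n\n"
    "Return JSON with keys 'tasks', 'events', 'gateways', 'flows'."
)

def extract_bpmn_json(image, ocr_text, vlm_generate):
    # vlm_generate(image=..., prompt=...) -> raw model text (hypothetical interface).
    raw = vlm_generate(image=image, prompt=PROMPT.format(ocr_text=ocr_text))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # a real pipeline would retry or repair the output

# Toy run with a fake model that returns a fixed structure.
fake_vlm = lambda image, prompt: (
    '{"tasks": ["Review order"], "events": [], "gateways": [], "flows": []}'
)
print(extract_bpmn_json(image=None, ocr_text="Review order", vlm_generate=fake_vlm))
```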

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22448
• PDF: https://arxiv.org/pdf/2511.22448
• Github: https://github.com/pritamdeka/BPMN-VLM

🔹 Datasets citing this paper:
https://huggingface.co/datasets/pritamdeka/BPMN-VLM

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #BPMN #InformationExtraction #AI #ComputerVision
CauSight: Learning to Supersense for Visual Causal Discovery

📝 Summary:
CauSight is a novel vision-language model for visual causal discovery, inferring cause-effect relations in images. It uses the VCG-32K dataset and Tree-of-Causal-Thought, significantly outperforming GPT-4.1 with a threefold performance boost.
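
As a rough illustration of searching over cause-effect hypotheses between entities detected in an image (the enumeration and scoring rule here are stand-ins, not the paper's Tree-of-Causal-Thought procedure):

```python
from itertools import permutations

def causal_edges(entities, score_edge, threshold=0.5):
    # Keep directed (cause, effect) pairs whose plausibility clears the threshold.
    return [(a, b) for a, b in permutations(entities, 2)
            if score_edge(a, b) >= threshold]

# Toy scorer: a hard-coded plausibility table standing in for a VLM judgment.
toy_scores = {
    ("rain", "wet street"): 0.9, ("wet street", "rain"): 0.1,
    ("rain", "umbrella open"): 0.8, ("umbrella open", "rain"): 0.2,
}
print(causal_edges(["rain", "wet street", "umbrella open"],
                   lambda a, b: toy_scores.get((a, b), 0.0)))
# [('rain', 'wet street'), ('rain', 'umbrella open')]
```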

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01827
• PDF: https://arxiv.org/pdf/2512.01827
• Github: https://github.com/OpenCausaLab/CauSight

🔹 Models citing this paper:
https://huggingface.co/OpenCausaLab/CauSight

🔹 Datasets citing this paper:
https://huggingface.co/datasets/OpenCausaLab/VCG-32K

==================================

For more data science resources:
https://t.me/DataScienceT

#VisualCausalDiscovery #VisionLanguageModels #AI #DeepLearning #CausalInference
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

📝 Summary:
Concise Chain-of-Thought steps, specifically minimal visual grounding, are most effective for achieving generalizable visual reasoning in vision-language models. Longer or more heavily visual CoT primarily accelerates training but does not improve final performance or generalization across tasks.

🔹 Publication Date: Published on Nov 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.22586
• PDF: https://arxiv.org/pdf/2511.22586

==================================

For more data science resources:
https://t.me/DataScienceT

#ChainOfThought #VisionLanguageModels #VisualReasoning #AIGeneralization #DeepLearning
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

📝 Summary:
TRivia is a self-supervised fine-tuning method for vision-language models to learn table recognition from unlabeled data. It uses a question-answering reward mechanism to autonomously optimize the model. This open-source solution outperforms state-of-the-art systems on popular benchmarks.
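
The question-answering reward can be illustrated with a toy version: the predicted table is scored by how many auto-generated cell-lookup questions it answers in agreement with reference answers (however those are obtained in the actual method). The cell-addressing scheme and reward form are assumptions for illustration, not TRivia's actual reward.

```python
def table_qa_reward(predicted_table, qa_pairs):
    """predicted_table: dict mapping (row_header, col_header) -> cell value.
    qa_pairs: list of ((row_header, col_header), reference_answer)."""
    if not qa_pairs:
        return 0.0
    correct = sum(predicted_table.get(cell) == answer for cell, answer in qa_pairs)
    return correct / len(qa_pairs)

pred = {("2023", "revenue"): "1.2M", ("2023", "profit"): "0.3M"}
qa = [(("2023", "revenue"), "1.2M"), (("2023", "profit"), "0.2M")]
print(table_qa_reward(pred, qa))  # 0.5
```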

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01248
• PDF: https://arxiv.org/pdf/2512.01248
• Github: https://github.com/opendatalab/TRivia

🔹 Models citing this paper:
https://huggingface.co/opendatalab/TRivia-3B

🔹 Spaces citing this paper:
https://huggingface.co/spaces/opendatalab/TRivia-3B

==================================

For more data science resources:
https://t.me/DataScienceT

#TableRecognition #VisionLanguageModels #SelfSupervisedLearning #AI #DeepLearning
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

📝 Summary:
SpaceTools introduces Double Interactive Reinforcement Learning (DIRL). This two-phase RL framework enables Vision-Language Models to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world robot tasks.
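
A minimal sketch of the kind of tool-coordination loop the summary describes: the policy looks at the accumulated context and either calls a tool or commits to an answer. The tool names and policy interface are illustrative assumptions, not the SpaceTools API.

```python
def run_episode(policy, tools, question, max_steps=4):
    context = [("question", question)]
    for _ in range(max_steps):
        kind, payload = policy(context)   # e.g. ("call", "depth") or ("answer", text)
        if kind == "answer":
            return payload, context
        context.append((payload, tools[payload](question)))
    return None, context

tools = {
    "detect": lambda q: ["mug at (120, 80)", "book at (300, 90)"],
    "depth": lambda q: {"mug": 0.6, "book": 1.1},
}

def toy_policy(context):
    if len(context) == 1:
        return ("call", "detect")
    if len(context) == 2:
        return ("call", "depth")
    return ("answer", "the mug is closer")

print(run_episode(toy_policy, tools, "Which object is closer to the camera?"))
```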

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04069
• PDF: https://arxiv.org/pdf/2512.04069
• Project Page: https://spacetools.github.io/
• Github: https://spacetools.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#ReinforcementLearning #VisionLanguageModels #Robotics #SpatialReasoning #AI
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

📝 Summary:
AdaptVision is an efficient VLM that adaptively acquires visual tokens through a coarse-to-fine approach, using a bounding box tool. Trained with reinforcement learning to balance accuracy and efficiency, it achieves superior VQA performance using fewer visual tokens.
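
The coarse-to-fine acquisition idea can be sketched as: answer on a low-resolution view first, and only request a full-resolution bounding-box crop when confidence is low. The confidence gate and model interface below are assumptions for illustration, not AdaptVision's trained policy.

```python
from PIL import Image  # Pillow

def coarse_to_fine_answer(image, question, model, conf_threshold=0.8):
    # First pass: heavily downsampled view (few visual tokens).
    coarse = image.resize((image.width // 4, image.height // 4))
    answer, conf, bbox = model(coarse, question)   # model may propose a crop box
    if conf >= conf_threshold or bbox is None:
        return answer
    # Second pass: full-resolution crop of the proposed region only.
    answer, conf, _ = model(image.crop(bbox), question)
    return answer

# Toy model: unsure on the coarse view, confident on the crop.
calls = []
def toy_model(img, q):
    calls.append(img.size)
    if len(calls) == 1:
        return "unsure", 0.4, (100, 100, 300, 300)
    return "a red mug", 0.95, None

print(coarse_to_fine_answer(Image.new("RGB", (512, 512)),
                            "What is on the table?", toy_model))
```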

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03794
• PDF: https://arxiv.org/pdf/2512.03794
• Project Page: https://adaptvision.github.io/
• Github: https://github.com/AdaptVision/AdaptVision

🔹 Models citing this paper:
https://huggingface.co/AdaptVision/AdaptVision-7B

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #ReinforcementLearning #ComputerVision #AIResearch #EfficientAI
AutoNeural: Co-Designing Vision-Language Models for NPU Inference

📝 Summary:
AutoNeural is an NPU-native VLM co-designed for efficient edge inference. It uses a MobileNetV5-style vision backbone for stable integer quantization and a hybrid SSM-Transformer language backbone. This design reduces quantization errors and latency, improving real-time performance on edge devices.
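
The motivation for a quantization-friendly vision backbone can be seen in a small numeric experiment: symmetric int8 quantization error grows sharply when activations contain outliers. The numbers below are made up and unrelated to AutoNeural's actual layers.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor int8 quantization.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quant_error(x):
    q, scale = quantize_int8(x)
    return np.abs(q.astype(np.float32) * scale - x).mean()

rng = np.random.default_rng(0)
stable = rng.normal(0, 1, 10_000)                     # narrow, stable activation range
outliers = np.concatenate([stable, [60.0, -55.0]])    # a few large outliers
print(quant_error(stable), quant_error(outliers))     # error grows with outliers
```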

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02924
• PDF: https://arxiv.org/pdf/2512.02924

🔹 Models citing this paper:
https://huggingface.co/NexaAI/AutoNeural

==================================

For more data science resources:
https://t.me/DataScienceT

#AutoNeural #VisionLanguageModels #EdgeAI #AIHardware #EfficientAI
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

📝 Summary:
NEO is a novel family of native Vision-Language Models built from first principles. It unifies vision and language, aligning pixels and words in a shared semantic space. NEO achieves competitive performance with limited data while efficiently developing visual perception from scratch.
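
The "shared semantic space" idea, projecting patches and word tokens to one width and processing them as a single sequence, can be sketched with a tiny PyTorch module. Dimensions and layer counts are arbitrary; this is a generic sketch, not NEO's actual architecture.

```python
import torch
import torch.nn as nn

class TinyNativeVLM(nn.Module):
    def __init__(self, d_model=256, vocab=1000, patch_dim=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # pixels -> shared space
        self.tok_embed = nn.Embedding(vocab, d_model)     # words  -> shared space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, token_ids):
        # Patch features and word tokens are concatenated into one sequence
        # and processed jointly by the same backbone.
        seq = torch.cat([self.patch_proj(patches), self.tok_embed(token_ids)], dim=1)
        return self.backbone(seq)

model = TinyNativeVLM()
out = model(torch.randn(1, 16, 768), torch.randint(0, 1000, (1, 8)))
print(out.shape)  # torch.Size([1, 24, 256])
```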

🔹 Publication Date: Published on Oct 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14979
• PDF: https://arxiv.org/pdf/2510.14979
• Github: https://github.com/EvolvingLMMs-Lab/NEO

🔹 Models citing this paper:
https://huggingface.co/Paranioar/NEO1_0-2B-SFT
https://huggingface.co/Paranioar/NEO1_0-9B-SFT
https://huggingface.co/Paranioar/NEO1_0-2B-PT

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguageModels #MultimodalAI #DeepLearning #ComputerVision #AIResearch