✨TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
📝 Summary:
TRivia is a self-supervised fine-tuning method for vision-language models to learn table recognition from unlabeled data. It uses a question-answering reward mechanism to autonomously optimize the model. This open-source solution outperforms state-of-the-art systems on popular benchmarks.
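🧪 Illustrative sketch (assumptions, not the paper's code): a toy question-answering reward in which a predicted table is scored by how many auto-generated cell-lookup questions it answers consistently; TRivia's actual question generation and reward shaping are documented in the GitHub repo.

def qa_reward(pred_table, qa_pairs):
    # pred_table: list of rows (list[str]); qa_pairs: list of ((row, col), answer)
    if not qa_pairs:
        return 0.0
    correct = 0
    for (row, col), answer in qa_pairs:
        try:
            cell = pred_table[row][col].strip()
        except IndexError:
            cell = ""
        correct += int(cell == answer.strip())
    return correct / len(qa_pairs)

# Example: a tiny 2x2 prediction checked against two hypothetical lookups.
pred = [["Model", "Score"], ["A", "1"]]
print(qa_reward(pred, [((1, 0), "A"), ((1, 1), "1")]))  # 1.0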
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01248
• PDF: https://arxiv.org/pdf/2512.01248
• Github: https://github.com/opendatalab/TRivia
🔹 Models citing this paper:
• https://huggingface.co/opendatalab/TRivia-3B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/opendatalab/TRivia-3B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#TableRecognition #VisionLanguageModels #SelfSupervisedLearning #AI #DeepLearning
✨SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
📝 Summary:
SpaceTools introduces Double Interactive Reinforcement Learning (DIRL), a two-phase RL framework that enables vision-language models to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world robot tasks.
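🧪 Illustrative sketch (assumed interfaces, not the paper's API): a minimal tool-coordination loop in the spirit of DIRL, where a policy repeatedly picks a spatial tool or emits an answer; the two interactive training phases are omitted here.

def run_episode(policy, image, question, tools, max_steps=4):
    # The policy sees the question plus tool observations gathered so far
    # and either calls another tool or returns a final answer.
    context = [("question", question)]
    for _ in range(max_steps):
        action, argument = policy(image, context)
        if action == "answer":
            return argument
        context.append((action, tools[action](image, argument)))
    return None  # ran out of tool-call budget

# Tiny scripted stand-ins so the loop runs end to end.
tools = {"detect": lambda img, obj: f"{obj} at (120, 80)",
         "depth": lambda img, obj: f"{obj} ~2.3 m away"}

def scripted_policy(image, context):
    if len(context) == 1:
        return "detect", "mug"
    if len(context) == 2:
        return "depth", "mug"
    return "answer", "The mug is about 2.3 m in front of the robot."

print(run_episode(scripted_policy, None, "How far is the mug?", tools))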
🔹 Publication Date: Published on Dec 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04069
• PDF: https://arxiv.org/pdf/2512.04069
• Project Page: https://spacetools.github.io/
• Github: https://spacetools.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ReinforcementLearning #VisionLanguageModels #Robotics #SpatialReasoning #AI
✨AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
📝 Summary:
AdaptVision is an efficient VLM that adaptively acquires visual tokens through a coarse-to-fine approach, using a bounding box tool. Trained with reinforcement learning to balance accuracy and efficiency, it achieves superior VQA performance using fewer visual tokens.
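🧪 Illustrative sketch (assumptions, not the released code): a coarse-to-fine acquisition helper that starts from a downscaled view and adds a high-resolution crop only when a bounding box is requested, which is the intuition behind spending fewer visual tokens.

from PIL import Image

def acquire_views(image_path, box=None, coarse_size=(448, 448)):
    # Coarse pass: a small resized view keeps the visual token count low.
    image = Image.open(image_path).convert("RGB")
    views = [image.resize(coarse_size)]
    # Fine pass: only if the model requests a region via a bounding-box tool.
    if box is not None:
        left, top, right, bottom = box
        views.append(image.crop((left, top, right, bottom)))
    return views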
🔹 Publication Date: Published on Dec 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03794
• PDF: https://arxiv.org/pdf/2512.03794
• Project Page: https://adaptvision.github.io/
• Github: https://github.com/AdaptVision/AdaptVision
🔹 Models citing this paper:
• https://huggingface.co/AdaptVision/AdaptVision-7B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #ReinforcementLearning #ComputerVision #AIResearch #EfficientAI
✨AutoNeural: Co-Designing Vision-Language Models for NPU Inference
📝 Summary:
AutoNeural is an NPU-native VLM co-designed for efficient edge inference. It uses a MobileNetV5-style vision backbone for stable integer quantization and a hybrid SSM-Transformer language backbone. This design reduces quantization errors and latency, improving real-time performance on edge devices.
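🧪 Illustrative sketch (generic, not AutoNeural's scheme): symmetric int8 quantization and the round-trip error it introduces, the kind of effect a quantization-friendly backbone is designed to keep small on integer NPUs.

import numpy as np

def quantize_int8(x):
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)   # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(x)
roundtrip = q.astype(np.float32) * scale
print("max abs quantization error:", float(np.max(np.abs(x - roundtrip))))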
🔹 Publication Date: Published on Dec 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02924
• PDF: https://arxiv.org/pdf/2512.02924
🔹 Models citing this paper:
• https://huggingface.co/NexaAI/AutoNeural
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AutoNeural #VisionLanguageModels #EdgeAI #AIHardware #EfficientAI
✨From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
📝 Summary:
NEO is a novel family of native Vision-Language Models built from first principles. It unifies vision and language, aligning pixels and words in a shared semantic space. NEO achieves competitive performance with limited data while efficiently developing visual perception from scratch.
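🧪 Illustrative sketch (illustrative dimensions, not NEO's architecture): scoring pixel-word alignment by cosine similarity after projecting patch and token features into one shared space.

import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 512))   # stand-in visual primitives
words = rng.standard_normal((32, 512))      # stand-in word embeddings

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

alignment = l2norm(patches) @ l2norm(words).T   # (196, 32) cosine similarities
print(alignment.shape)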
🔹 Publication Date: Published on Oct 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14979
• PDF: https://arxiv.org/pdf/2510.14979
• Github: https://github.com/EvolvingLMMs-Lab/NEO
🔹 Models citing this paper:
• https://huggingface.co/Paranioar/NEO1_0-2B-SFT
• https://huggingface.co/Paranioar/NEO1_0-9B-SFT
• https://huggingface.co/Paranioar/NEO1_0-2B-PT
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #MultimodalAI #DeepLearning #ComputerVision #AIResearch
🤖🧠 Reducing Hallucinations in Vision-Language Models: A Step Forward with VisAlign
🗓️ 24 Nov 2025
📚 AI News & Trends
As artificial intelligence continues to evolve, Large Vision-Language Models (LVLMs) have revolutionized how machines understand and describe the world. These models combine visual perception with natural language understanding to perform tasks such as image captioning, visual question answering, and multimodal reasoning. Despite their success, a major problem persists: hallucination. This issue occurs when a ...
#VisAlign #ReducingHallucinations #VisionLanguageModels #LVLMs #MultimodalAI #AISafety
✨ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
📝 Summary:
ReVSeg enhances video object segmentation by eliciting sequential reasoning within pretrained vision-language models, optimized with reinforcement learning. It achieves state-of-the-art results and provides interpretable reasoning.
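🧪 Illustrative sketch (assumption: a mask-quality signal, not necessarily the paper's reward): an intersection-over-union score of the kind RL pipelines for segmentation typically optimize.

import numpy as np

def iou_reward(pred_mask, ref_mask):
    pred, ref = pred_mask.astype(bool), ref_mask.astype(bool)
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union else 0.0

pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1
ref = np.zeros((4, 4)); ref[1:4, 1:4] = 1
print(iou_reward(pred, ref))  # 4 / 9 ≈ 0.44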
🔹 Publication Date: Published on Dec 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02835
• PDF: https://arxiv.org/pdf/2512.02835
• Project Page: https://clementine24.github.io/ReVSeg/
• Github: https://github.com/Clementine24/ReVSeg
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoSegmentation #ReinforcementLearning #VisionLanguageModels #ComputerVision #DeepLearning
✨From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
📝 Summary:
The TAD benchmark is introduced to evaluate temporal understanding in autonomous driving, addressing a gap where current VLMs perform poorly. It reveals that state-of-the-art models show substandard accuracy in this domain. Two training-free solutions, Scene-CoT and TCogMap, are proposed, improvi...
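🧪 Illustrative sketch (a guess at the idea, not the released template): a training-free, scene-level chain-of-thought prompt that asks the VLM to describe segments in order before answering a temporal question.

def scene_cot_prompt(question, num_segments=4):
    steps = "\n".join(f"{i}. Describe what happens in segment {i}."
                      for i in range(1, num_segments + 1))
    return ("Reason about the driving scene over time before answering.\n"
            f"{steps}\n"
            "Now order the events and answer.\n"
            f"Question: {question}")

print(scene_cot_prompt("Did the pedestrian cross before the light turned green?"))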
🔹 Publication Date: Published on Dec 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05277
• PDF: https://arxiv.org/pdf/2512.05277
• Github: https://github.com/vbdi/tad_bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AutonomousDriving #VisionLanguageModels #ComputerVision #AIResearch #DeepLearning
✨GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.
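🧪 Hedged usage sketch (assumes a recent transformers release with GLM-4.1V support; the GitHub repo documents the exact recipe):

from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.1V-9B-Thinking"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
# Inputs are built from image+text chat messages via the processor, then decoded
# with model.generate(...); see the repository for complete examples.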
🔹 Publication Date: Published on Jul 1
🔹 Paper Links:
• arXiv Page: https://arxivexplained.com/papers/glm-41v-thinking-towards-versatile-multimodal-reasoning-with-scalable-reinforcement-learning
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking
🔹 Models citing this paper:
• https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
• https://huggingface.co/zai-org/GLM-4.5V
• https://huggingface.co/zai-org/GLM-4.6V-Flash
✨ Spaces citing this paper:
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
• https://huggingface.co/spaces/akhaliq/anycoder
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
✨Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
📝 Summary:
DSR Suite improves vision-language models' weak dynamic spatial reasoning. It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module. This significantly enhances VLM dynamic spatial reasoning capability while maintaining gen...
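🧪 Illustrative sketch (a stand-in for the kind of sample an automated 4D pipeline could derive from tracked video, not the paper's pipeline): turning a 3D object track over time into a dynamic-spatial question-answer pair.

def motion_qa(track):
    # track: list of (t, x, y, z) samples for one object, in camera coordinates.
    t0, x0, _, _ = track[0]
    t1, x1, _, _ = track[-1]
    answer = "to the right" if x1 > x0 else "to the left"
    question = "In which horizontal direction does the object move over the clip?"
    return question, answer

print(motion_qa([(0.0, 0.2, 1.0, 3.0), (2.0, 1.4, 1.0, 2.8)]))  # ... 'to the right'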
🔹 Publication Date: Published on Dec 23
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
✨See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints to ensure complete coverage of relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts...
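🧪 Illustrative sketch (the box here is hand-given; in BiPS the masked views are question-conditioned and produced during training): keeping only a presumed-relevant region of the image and zeroing the rest.

import numpy as np

def masked_view(image, keep_box):
    # image: HxWxC array; keep_box: (top, left, bottom, right) in pixels.
    top, left, bottom, right = keep_box
    out = np.zeros_like(image)
    out[top:bottom, left:right] = image[top:bottom, left:right]
    return out

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
print(masked_view(img, (50, 60, 150, 180)).shape)  # (224, 224, 3)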
🔹 Publication Date: Published on Dec 26
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
✨Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.
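🧪 Toy sketch of iterative unmasking, the decoding style of masked diffusion language models like the Dream backbone (the confidences below are random stand-ins for real model predictions):

import numpy as np

rng = np.random.default_rng(0)
length, vocab = 8, 100
tokens = np.full(length, -1)                      # -1 marks a masked position

while (tokens == -1).any():
    masked = np.flatnonzero(tokens == -1)
    confidence = rng.random(masked.size)          # stand-in per-position confidence
    reveal = masked[np.argsort(confidence)[-2:]]  # unmask the 2 most confident slots
    tokens[reveal] = rng.integers(0, vocab, size=reveal.size)

print(tokens)  # fully decoded sequence of token ids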
🔹 Publication Date: Published on Dec 27
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision