✨TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
📝 Summary:
TRivia is a self-supervised fine-tuning method for vision-language models to learn table recognition from unlabeled data. It uses a question-answering reward mechanism to autonomously optimize the model. This open-source solution outperforms state-of-the-art systems on popular benchmarks.
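🧪 Illustrative sketch (assumptions, not the paper's code): a toy question-answering reward in which a predicted table is scored by how many auto-generated cell-lookup questions it answers consistently; TRivia's actual question generation and reward shaping are documented in the GitHub repo.

def qa_reward(pred_table, qa_pairs):
    # pred_table: list of rows (list[str]); qa_pairs: list of ((row, col), answer)
    if not qa_pairs:
        return 0.0
    correct = 0
    for (row, col), answer in qa_pairs:
        try:
            cell = pred_table[row][col].strip()
        except IndexError:
            cell = ""
        correct += int(cell == answer.strip())
    return correct / len(qa_pairs)

# Example: a tiny 2x2 prediction checked against two hypothetical lookups.
pred = [["Model", "Score"], ["A", "1"]]
print(qa_reward(pred, [((1, 0), "A"), ((1, 1), "1")]))  # 1.0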
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01248
• PDF: https://arxiv.org/pdf/2512.01248
• Github: https://github.com/opendatalab/TRivia
🔹 Models citing this paper:
• https://huggingface.co/opendatalab/TRivia-3B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/opendatalab/TRivia-3B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#TableRecognition #VisionLanguageModels #SelfSupervisedLearning #AI #DeepLearning
✨SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
📝 Summary:
SpaceTools introduces Double Interactive Reinforcement Learning (DIRL), a two-phase RL framework that enables vision-language models to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world robot tasks.
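🧪 Illustrative sketch (assumed interfaces, not the paper's API): a minimal tool-coordination loop in the spirit of DIRL, where a policy repeatedly picks a spatial tool or emits an answer; the two interactive training phases are omitted here.

def run_episode(policy, image, question, tools, max_steps=4):
    # The policy sees the question plus tool observations gathered so far
    # and either calls another tool or returns a final answer.
    context = [("question", question)]
    for _ in range(max_steps):
        action, argument = policy(image, context)
        if action == "answer":
            return argument
        context.append((action, tools[action](image, argument)))
    return None  # ran out of tool-call budget

# Tiny scripted stand-ins so the loop runs end to end.
tools = {"detect": lambda img, obj: f"{obj} at (120, 80)",
         "depth": lambda img, obj: f"{obj} ~2.3 m away"}

def scripted_policy(image, context):
    if len(context) == 1:
        return "detect", "mug"
    if len(context) == 2:
        return "depth", "mug"
    return "answer", "The mug is about 2.3 m in front of the robot."

print(run_episode(scripted_policy, None, "How far is the mug?", tools))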
🔹 Publication Date: Published on Dec 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04069
• PDF: https://arxiv.org/pdf/2512.04069
• Project Page: https://spacetools.github.io/
• Github: https://spacetools.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ReinforcementLearning #VisionLanguageModels #Robotics #SpatialReasoning #AI
✨AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
📝 Summary:
AdaptVision is an efficient VLM that adaptively acquires visual tokens through a coarse-to-fine approach, using a bounding box tool. Trained with reinforcement learning to balance accuracy and efficiency, it achieves superior VQA performance using fewer visual tokens.
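🧪 Illustrative sketch (assumptions, not the released code): a coarse-to-fine acquisition helper that starts from a downscaled view and adds a high-resolution crop only when a bounding box is requested, which is the intuition behind spending fewer visual tokens.

from PIL import Image

def acquire_views(image_path, box=None, coarse_size=(448, 448)):
    # Coarse pass: a small resized view keeps the visual token count low.
    image = Image.open(image_path).convert("RGB")
    views = [image.resize(coarse_size)]
    # Fine pass: only if the model requests a region via a bounding-box tool.
    if box is not None:
        left, top, right, bottom = box
        views.append(image.crop((left, top, right, bottom)))
    return views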
🔹 Publication Date: Published on Dec 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.03794
• PDF: https://arxiv.org/pdf/2512.03794
• Project Page: https://adaptvision.github.io/
• Github: https://github.com/AdaptVision/AdaptVision
🔹 Models citing this paper:
• https://huggingface.co/AdaptVision/AdaptVision-7B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #ReinforcementLearning #ComputerVision #AIResearch #EfficientAI
✨AutoNeural: Co-Designing Vision-Language Models for NPU Inference
📝 Summary:
AutoNeural is an NPU-native VLM co-designed for efficient edge inference. It uses a MobileNetV5-style vision backbone for stable integer quantization and a hybrid SSM-Transformer language backbone. This design reduces quantization errors and latency, improving real-time performance on edge devices.
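🧪 Illustrative sketch (generic, not AutoNeural's scheme): symmetric int8 quantization and the round-trip error it introduces, the kind of effect a quantization-friendly backbone is designed to keep small on integer NPUs.

import numpy as np

def quantize_int8(x):
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)   # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(x)
roundtrip = q.astype(np.float32) * scale
print("max abs quantization error:", float(np.max(np.abs(x - roundtrip))))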
🔹 Publication Date: Published on Dec 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02924
• PDF: https://arxiv.org/pdf/2512.02924
🔹 Models citing this paper:
• https://huggingface.co/NexaAI/AutoNeural
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AutoNeural #VisionLanguageModels #EdgeAI #AIHardware #EfficientAI
✨From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
📝 Summary:
NEO is a novel family of native Vision-Language Models built from first principles. It unifies vision and language, aligning pixels and words in a shared semantic space. NEO achieves competitive performance with limited data while efficiently developing visual perception from scratch.
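🧪 Illustrative sketch (illustrative dimensions, not NEO's architecture): scoring pixel-word alignment by cosine similarity after projecting patch and token features into one shared space.

import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 512))   # stand-in visual primitives
words = rng.standard_normal((32, 512))      # stand-in word embeddings

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

alignment = l2norm(patches) @ l2norm(words).T   # (196, 32) cosine similarities
print(alignment.shape)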
🔹 Publication Date: Published on Oct 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14979
• PDF: https://arxiv.org/pdf/2510.14979
• Github: https://github.com/EvolvingLMMs-Lab/NEO
🔹 Models citing this paper:
• https://huggingface.co/Paranioar/NEO1_0-2B-SFT
• https://huggingface.co/Paranioar/NEO1_0-9B-SFT
• https://huggingface.co/Paranioar/NEO1_0-2B-PT
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #MultimodalAI #DeepLearning #ComputerVision #AIResearch
🤖🧠 Reducing Hallucinations in Vision-Language Models: A Step Forward with VisAlign
🗓️ 24 Nov 2025
📚 AI News & Trends
As artificial intelligence continues to evolve, Large Vision-Language Models (LVLMs) have revolutionized how machines understand and describe the world. These models combine visual perception with natural language understanding to perform tasks such as image captioning, visual question answering, and multimodal reasoning. Despite their success, a major problem persists: hallucination. This issue occurs when a ...
#VisAlign #ReducingHallucinations #VisionLanguageModels #LVLMs #MultimodalAI #AISafety
✨ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
📝 Summary:
ReVSeg enhances video object segmentation by eliciting sequential reasoning within pretrained vision-language models, optimized with reinforcement learning. It achieves state-of-the-art results and provides interpretable reasoning.
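🧪 Illustrative sketch (assumption: a mask-quality signal, not necessarily the paper's reward): an intersection-over-union score of the kind RL pipelines for segmentation typically optimize.

import numpy as np

def iou_reward(pred_mask, ref_mask):
    pred, ref = pred_mask.astype(bool), ref_mask.astype(bool)
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union else 0.0

pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1
ref = np.zeros((4, 4)); ref[1:4, 1:4] = 1
print(iou_reward(pred, ref))  # 4 / 9 ≈ 0.44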
🔹 Publication Date: Published on Dec 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02835
• PDF: https://arxiv.org/pdf/2512.02835
• Project Page: https://clementine24.github.io/ReVSeg/
• Github: https://github.com/Clementine24/ReVSeg
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoSegmentation #ReinforcementLearning #VisionLanguageModels #ComputerVision #DeepLearning
✨From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
📝 Summary:
The TAD benchmark is introduced to evaluate temporal understanding in autonomous driving, addressing a gap where current VLMs perform poorly. It reveals that state-of-the-art models show substandard accuracy in this domain. Two training-free solutions, Scene-CoT and TCogMap, are proposed, improvi...
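🧪 Illustrative sketch (a guess at the idea, not the released template): a training-free, scene-level chain-of-thought prompt that asks the VLM to describe segments in order before answering a temporal question.

def scene_cot_prompt(question, num_segments=4):
    steps = "\n".join(f"{i}. Describe what happens in segment {i}."
                      for i in range(1, num_segments + 1))
    return ("Reason about the driving scene over time before answering.\n"
            f"{steps}\n"
            "Now order the events and answer.\n"
            f"Question: {question}")

print(scene_cot_prompt("Did the pedestrian cross before the light turned green?"))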
🔹 Publication Date: Published on Dec 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.05277
• PDF: https://arxiv.org/pdf/2512.05277
• Github: https://github.com/vbdi/tad_bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AutonomousDriving #VisionLanguageModels #ComputerVision #AIResearch #DeepLearning
✨GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
📝 Summary:
GLM-4.1V-Thinking is a vision-language model using a reasoning-centric training framework. It achieves state-of-the-art multimodal reasoning across various tasks like STEM and long document understanding. The model outperforms larger models and competes with closed-source systems like GPT-4o.
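🧪 Hedged usage sketch (assumes a recent transformers release with GLM-4.1V support; the GitHub repo documents the exact recipe):

from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.1V-9B-Thinking"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
# Inputs are built from image+text chat messages via the processor, then decoded
# with model.generate(...); see the repository for complete examples.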
🔹 Publication Date: Published on Jul 1
🔹 Paper Links:
• arXiv Page: https://arxivexplained.com/papers/glm-41v-thinking-towards-versatile-multimodal-reasoning-with-scalable-reinforcement-learning
• PDF: https://arxiv.org/pdf/2507.01006
• Github: https://github.com/THUDM/GLM-4.1V-Thinking
🔹 Models citing this paper:
• https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
• https://huggingface.co/zai-org/GLM-4.5V
• https://huggingface.co/zai-org/GLM-4.6V-Flash
✨ Spaces citing this paper:
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-Demo
• https://huggingface.co/spaces/zai-org/GLM-4.1V-9B-Thinking-API-Demo
• https://huggingface.co/spaces/akhaliq/anycoder
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GLM41VThinking #MultimodalAI #VisionLanguageModels #ReinforcementLearning #AIResearch
✨Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
📝 Summary:
DSR Suite improves vision-language models' weak dynamic spatial reasoning. It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module. This significantly enhances VLM dynamic spatial reasoning capability while maintaining gen...
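🧪 Illustrative sketch (a stand-in for the kind of sample an automated 4D pipeline could derive from tracked video, not the paper's pipeline): turning a 3D object track over time into a dynamic-spatial question-answer pair.

def motion_qa(track):
    # track: list of (t, x, y, z) samples for one object, in camera coordinates.
    t0, x0, _, _ = track[0]
    t1, x1, _, _ = track[-1]
    answer = "to the right" if x1 > x0 else "to the left"
    question = "In which horizontal direction does the object move over the clip?"
    return question, answer

print(motion_qa([(0.0, 0.2, 1.0, 3.0), (2.0, 1.4, 1.0, 2.8)]))  # ... 'to the right'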
🔹 Publication Date: Published on Dec 23
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
✨See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
📝 Summary:
Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using question-conditioned masked views to shape perception during training. It employs two constraints to ensure complete coverage of relevant pixels and enforce fine-grained visual reliance, preventing text-only shortcuts...
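🧪 Illustrative sketch (the box here is hand-given; in BiPS the masked views are question-conditioned and produced during training): keeping only a presumed-relevant region of the image and zeroing the rest.

import numpy as np

def masked_view(image, keep_box):
    # image: HxWxC array; keep_box: (top, left, bottom, right) in pixels.
    top, left, bottom, right = keep_box
    out = np.zeros_like(image)
    out[top:bottom, left:right] = image[top:bottom, left:right]
    return out

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
print(masked_view(img, (50, 60, 150, 180)).shape)  # (224, 224, 3)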
🔹 Publication Date: Published on Dec 26
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22120
• PDF: https://arxiv.org/pdf/2512.22120
• Github: https://github.com/zss02/BiPS
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #VisionLanguageModels #MachineLearning #AIResearch #DeepLearning
✨Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.
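🧪 Toy sketch of iterative unmasking, the decoding style of masked diffusion language models like the Dream backbone (the confidences below are random stand-ins for real model predictions):

import numpy as np

rng = np.random.default_rng(0)
length, vocab = 8, 100
tokens = np.full(length, -1)                      # -1 marks a masked position

while (tokens == -1).any():
    masked = np.flatnonzero(tokens == -1)
    confidence = rng.random(masked.size)          # stand-in per-position confidence
    reveal = masked[np.argsort(confidence)[-2:]]  # unmask the 2 most confident slots
    tokens[reveal] = rng.integers(0, vocab, size=reveal.size)

print(tokens)  # fully decoded sequence of token ids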
🔹 Publication Date: Published on Dec 27
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision