🤖🧠 PaddleOCR-VL: Redefining Multilingual Document Parsing with a 0.9B Vision-Language Model
🗓️ 20 Oct 2025
📚 AI News & Trends
In an era where information is predominantly digital, the ability to extract, interpret, and organize data from documents is crucial. From invoices and research papers to multilingual contracts and handwritten notes, document parsing stands at the intersection of vision and language. Traditional Optical Character Recognition (OCR) systems have made impressive strides, but they often fall ...
#PaddleOCR-VL #Multilingual #DocumentParsing #VisionLanguageModel #OCR #AI
✨PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
📝 Summary:
PaddleOCR-VL is a new 0.9B-parameter vision-language model for document parsing. It pairs a NaViT-style visual encoder with the ERNIE-4.5 language model, achieving state-of-the-art performance across 109 languages with minimal resources and fast inference, which makes it well suited to practical deployment.
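A minimal usage sketch of the released checkpoint, assuming it exposes a standard transformers interface via trust_remote_code; the processor classes and the prompt string are assumptions, and the official PaddleOCR toolkit may offer a different, higher-level API:

```python
# Hedged sketch: load the PaddleOCR-VL checkpoint through Hugging Face transformers.
# Assumes the repo ships remote code compatible with AutoProcessor/AutoModelForCausalLM;
# the prompt below is a hypothetical task instruction, not the documented format.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "PaddlePaddle/PaddleOCR-VL"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png")  # any scanned or photographed document page
inputs = processor(text="OCR:", images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```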
🔹 Publication Date: Published on Oct 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• Github: https://github.com/PaddlePaddle/PaddleOCR
🔹 Models citing this paper:
• https://huggingface.co/PaddlePaddle/PaddleOCR-VL
• https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
• https://huggingface.co/lvyufeng/PaddleOCR-VL-0.9B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/markobinario/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#OCR #VisionLanguageModel #DocumentAI #DeepLearning #AI
✨MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
📝 Summary:
MinerU2.5 is a new 1.2B-parameter vision-language model (VLM) for document parsing. It uses a coarse-to-fine, two-stage strategy: global layout analysis on downsampled images, followed by targeted content recognition on native-resolution crops, achieving state-of-the-art accuracy on high-resolution documents while remaining computationally efficient.
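The coarse-to-fine idea can be sketched as follows; detect_layout and recognize_region are placeholder stubs standing in for the model's two passes, not MinerU2.5 APIs:

```python
# Schematic of the two-stage, coarse-to-fine parsing strategy described above.
from PIL import Image

def detect_layout(thumbnail: Image.Image) -> list[dict]:
    # Placeholder: a real pipeline would run the VLM's layout-analysis pass here.
    return [{"bbox": (0, 0, thumbnail.width, thumbnail.height), "type": "text"}]

def recognize_region(crop: Image.Image, region_type: str) -> str:
    # Placeholder: a real pipeline would run the VLM's recognition pass here.
    return f"<{region_type} content from a {crop.width}x{crop.height} crop>"

def parse_document(page_path: str, thumb_size=(1024, 1024)) -> list[dict]:
    page = Image.open(page_path)

    # Stage 1: global layout analysis on a downsampled copy of the page.
    thumbnail = page.copy()
    thumbnail.thumbnail(thumb_size)  # in-place downsample, preserves aspect ratio
    regions = detect_layout(thumbnail)

    # Stage 2: targeted recognition on native-resolution crops of each detected region.
    sx, sy = page.width / thumbnail.width, page.height / thumbnail.height
    results = []
    for region in regions:
        x0, y0, x1, y1 = region["bbox"]
        crop = page.crop((int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy)))
        results.append({"type": region["type"],
                        "content": recognize_region(crop, region["type"])})
    return results
```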
🔹 Publication Date: Published on Sep 26
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• Github: https://github.com/opendatalab/MinerU
🔹 Models citing this paper:
• https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
• https://huggingface.co/freakynit/MinerU2.5-2509-1.2B
• https://huggingface.co/Mungert/MinerU2.5-2509-1.2B-GGUF
✨ Spaces citing this paper:
• https://huggingface.co/spaces/opendatalab/MinerU
• https://huggingface.co/spaces/xiaoye-winters/MinerU-API
• https://huggingface.co/spaces/ApeAITW/MinerU_2.5_Test
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModel #DocumentAI #DeepLearning #ComputerVision #AIResearch
✨Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
📝 Summary:
Lumine introduces an open recipe for building generalist agents in 3D open worlds. This vision-language-model-based agent processes pixels to perform complex, hours-long missions with human-level efficiency and demonstrates strong zero-shot generalization across diverse games like Genshin Impact and Honkai Star...
🔹 Publication Date: Published on Nov 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08892
• PDF: https://arxiv.org/pdf/2511.08892
• Project Page: https://www.lumine-ai.org/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#GeneralistAI #VisionLanguageModel #3DWorlds #AIagents #GamingAI
✨Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
📝 Summary:
Researchers introduce Instruction-Guided Lesion Segmentation (ILS) for chest X-rays (CXRs), enabling segmentation of diverse lesions from simple instructions. They develop MIMIC-ILS, a large-scale automatically generated dataset, and ROSALIA, a vision-language model. ROSALIA accurately segments various lesions and provides textual explan...
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15186
• PDF: https://arxiv.org/pdf/2511.15186
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MedicalAI #LesionSegmentation #ChestXray #VisionLanguageModel #DeepLearning
✨Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
📝 Summary:
Hulu-Med is a transparent medical vision-language model unifying diverse data modalities like text, 2D/3D images, and video. It achieves state-of-the-art performance across 30 clinical benchmarks with efficient training, promoting accessible AI.
🔹 Publication Date: Published on Oct 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.08668
• PDF: https://arxiv.org/pdf/2510.08668
• Github: https://github.com/ZJUI-AI4H/Hulu-Med
🔹 Models citing this paper:
• https://huggingface.co/ZJU-AI4H/Hulu-Med-32B
• https://huggingface.co/ZJU-AI4H/Hulu-Med-7B
• https://huggingface.co/ZJU-AI4H/Hulu-Med-14B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MedicalAI #VisionLanguageModel #MultimodalAI #HealthcareAI #AIResearch
✨HunyuanOCR Technical Report
📝 Summary:
HunyuanOCR is a lightweight vision-language model for OCR, using a unified end-to-end ViT + LLM architecture. It achieves state-of-the-art performance across diverse tasks, outperforming larger models and commercial APIs, powered by data-driven and reinforcement learning (RL) strategies.
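A toy schematic of the ViT-encoder-plus-LLM-decoder pattern the summary refers to; module choices and dimensions are illustrative only, not the HunyuanOCR implementation:

```python
# Toy "ViT + LLM" OCR schematic: patch embeddings are encoded, projected into the
# language model's embedding space, concatenated with text tokens, and decoded.
# All modules and sizes are illustrative stand-ins, not HunyuanOCR code.
import torch
import torch.nn as nn

class ViTPlusLLMOCR(nn.Module):
    def __init__(self, vit_dim=768, llm_dim=1024, vocab_size=32000, num_layers=4):
        super().__init__()
        self.vit = nn.TransformerEncoder(   # stands in for the vision transformer encoder
            nn.TransformerEncoderLayer(d_model=vit_dim, nhead=12, batch_first=True),
            num_layers=num_layers)
        self.projector = nn.Linear(vit_dim, llm_dim)   # maps visual tokens into LLM space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(   # stands in for the (decoder-only) language model
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=16, batch_first=True),
            num_layers=num_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_embeds, text_ids):
        visual_tokens = self.projector(self.vit(patch_embeds))   # (B, P, llm_dim)
        text_tokens = self.text_embed(text_ids)                  # (B, T, llm_dim)
        hidden = self.llm(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(hidden[:, patch_embeds.size(1):])    # logits for text positions

model = ViTPlusLLMOCR()
logits = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```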
🔹 Publication Date: Published on Nov 24
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19575
• PDF: https://arxiv.org/pdf/2511.19575
• Github: https://github.com/Tencent-Hunyuan/HunyuanOCR
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#OCR #VisionLanguageModel #LLM #AI #MachineLearning
✨Qwen3-VL Technical Report
📝 Summary:
Qwen3-VL is a highly capable vision-language model, achieving superior performance across multimodal benchmarks. It supports interleaved multimodal contexts of up to 256K tokens and offers strong text understanding, robust long-context comprehension, and advanced multimodal reasoning through key architectural upgrades.
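For context, "interleaved" here means inputs that mix image and text segments in one sequence; a hedged sketch of such an input is below. The message schema follows the convention used by earlier Qwen-VL releases and is an assumption for Qwen3-VL:

```python
# Hedged sketch of an interleaved image/text conversation, the kind of long
# multimodal context referred to above. The schema mirrors earlier Qwen-VL
# releases and is an assumption for Qwen3-VL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "report_page_001.png"},
            {"type": "text", "text": "Summarize the table on this page."},
            {"type": "image", "image": "report_page_002.png"},
            {"type": "text", "text": "Then compare it with the chart on the next page."},
        ],
    }
]
# A processor with the model's chat template would flatten these interleaved
# segments into one token sequence (up to 256K tokens) before generation.
```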
🔹 Publication Date: Published on Nov 26
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21631
• PDF: https://arxiv.org/pdf/2511.21631
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModel #MultimodalAI #AI #DeepLearning #LLM
✨MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
📝 Summary:
MomaGraph-R1, a vision-language model trained with reinforcement learning, achieves state-of-the-art performance in predicting task-oriented scene graphs and zero-shot task planning in household envir...
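As a small illustration, a state-aware scene graph can be thought of as objects carrying mutable states plus relations between them; the field names below are illustrative assumptions, not the MomaGraph schema:

```python
# Minimal illustration of a state-aware scene graph: nodes hold object states,
# edges hold relations. Field names are illustrative, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str
    states: dict = field(default_factory=dict)   # e.g. {"open": False, "powered": True}

@dataclass
class Relation:
    subject: str
    predicate: str                               # e.g. "inside", "on_top_of"
    obj: str

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)    # name -> ObjectNode
    edges: list = field(default_factory=list)    # list[Relation]

    def add(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        self.edges.append(Relation(subject, predicate, obj))

# Example state a planner might reason over: a mug inside a closed microwave.
graph = SceneGraph()
graph.add(ObjectNode("microwave", {"open": False}))
graph.add(ObjectNode("mug", {"filled": False}))
graph.relate("mug", "inside", "microwave")
```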
🔹 Publication Date: Published on Dec 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16909
• PDF: https://arxiv.org/pdf/2512.16909
• Project Page: https://hybridrobotics.github.io/MomaGraph/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguageModel #EmbodiedAI #ReinforcementLearning #SceneGraphs #Robotics