✨WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
📝 Summary:
WebWatcher is a multimodal agent that strengthens vision-language reasoning for complex information retrieval. It is trained on synthetic multi-step trajectories with tool use and reinforcement learning, and it outperforms existing agents on multimodal information-seeking tasks. A toy sketch of the tool-augmented agent loop is included below.
🔹 Publication Date: Published on Aug 7
🔹 Paper Links:
• arXiv Page: https://arxivexplained.com/papers/webwatcher-breaking-new-frontier-of-vision-language-deep-research-agent
• PDF: https://arxiv.org/pdf/2508.05748
• Project Page: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
• Github: https://github.com/Alibaba-NLP/WebAgent
🔹 Models citing this paper:
• https://huggingface.co/Alibaba-NLP/WebWatcher-32B
• https://huggingface.co/Alibaba-NLP/WebWatcher-7B
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguage #MultimodalAI #DeepLearning #AIagents #InformationRetrieval
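The recipe above (synthetic tool-use trajectories plus RL) is easiest to picture as an agent loop that interleaves reasoning with tool calls. Below is a minimal, hypothetical Python sketch of such a loop; the tool names, the policy stub, and the action format are illustrative assumptions, not WebWatcher's actual interface.
```python
# Toy sketch of a tool-augmented agent loop (assumed interface, not WebWatcher's real API).
from typing import Callable, Dict, List

# Hypothetical tools; a real agent would call a search API, an OCR/captioning model, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"[stub search results for: {q}]",
    "read_image": lambda path: f"[stub caption/OCR for: {path}]",
}

def policy(question: str, history: List[str]) -> str:
    """Placeholder for the trained VLM policy.
    It emits either 'CALL <tool> <arg>' or 'ANSWER <text>'."""
    if not history:                       # first step: gather evidence
        return f"CALL web_search {question}"
    return "ANSWER [stub answer grounded in the collected evidence]"

def run_agent(question: str, max_steps: int = 5) -> str:
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(question, history)
        if action.startswith("ANSWER"):
            return action[len("ANSWER "):]
        _, tool, arg = action.split(" ", 2)
        observation = TOOLS[tool](arg)    # execute the tool, keep the observation in context
        history.append(f"{action} -> {observation}")
    return "no answer within step budget"

print(run_agent("Which landmark appears in the photo, and when was it built?"))
```
In an RL setup the returned answer would be scored against a reference and the reward used to update the policy; the stubs here only show the control flow.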
✨Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
📝 Summary:
ReVeL converts multiple-choice questions into verifiable open-form questions, addressing unreliable MCQA metrics and answer guessing. The framework improves data efficiency and robustness for multimodal language models and reveals significant score inflation in MCQA benchmarks. A minimal conversion-and-verification sketch is included below.
🔹 Publication Date: Published on Nov 21
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17405
• PDF: https://arxiv.org/pdf/2511.17405
• Github: https://flageval-baai.github.io/ReVeL/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#OpenQA #VisionLanguage #LanguageModels #AIEvaluation #MachineLearning
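The core move, as summarized above, is to drop the answer options and keep only a verifiable gold answer, so a free-form response can be checked directly instead of being guessed among choices. A minimal sketch under that reading (the normalization and exact-match rule are assumptions for illustration, not ReVeL's actual verifier):
```python
import re
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: dict          # e.g. {"A": "Paris", "B": "Rome", ...}
    correct: str           # key of the correct option, e.g. "A"

def to_openqa(item: MCQ) -> tuple:
    """Strip the options; keep the question and the gold answer text."""
    return item.question, item.options[item.correct]

def normalize(text: str) -> str:
    # Lowercase and drop punctuation so trivial formatting differences don't matter.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def verify(prediction: str, gold: str) -> bool:
    """Toy verifier: exact match after normalization (a real one would be more forgiving)."""
    return normalize(prediction) == normalize(gold)

mcq = MCQ("Which city is shown on the map?", {"A": "Paris", "B": "Rome"}, "A")
question, gold = to_openqa(mcq)
print(question, "->", verify("  PARIS. ", gold))   # True: no options needed to grade the answer
```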
✨Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
📝 Summary:
Agent0-VL is a self-evolving vision-language agent that integrates tool use into both reasoning and self-evaluation. A Solver and a Verifier interact in a self-evolving cycle, enabling continuous improvement without human annotation or external rewards and yielding a 12.5% performance gain. A schematic of the loop is included below.
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19900
• PDF: https://arxiv.org/pdf/2511.19900
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AIAgents #VisionLanguage #SelfEvolvingAI #ToolAugmentedAI #AIResearch
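The Solver/Verifier cycle can be read as: the Solver proposes answers, the Verifier scores them, and only self-verified rollouts feed back into training, with no human labels. The sketch below is a schematic of that loop with stubbed components; none of the names or thresholds come from the paper.
```python
import random
from typing import List, Tuple

random.seed(0)

def solver(task: str) -> str:
    """Stub for the Solver: would run tool-integrated reasoning on the task."""
    return f"candidate answer for: {task}"

def verifier(task: str, answer: str) -> float:
    """Stub for the Verifier: would critique the answer (possibly with tools), score in [0, 1]."""
    return random.random()

def self_evolve(tasks: List[str], rounds: int = 3, threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    replay_buffer = []                       # self-generated, self-verified training data
    for _ in range(rounds):
        for task in tasks:
            answer = solver(task)
            score = verifier(task, answer)
            if score >= threshold:           # keep only rollouts the Verifier trusts
                replay_buffer.append((task, answer, score))
        # In the real system, both Solver and Verifier would be updated on replay_buffer here.
    return replay_buffer

data = self_evolve(["chart question", "diagram question"])
print(f"collected {len(data)} self-verified examples")
```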
✨Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
📝 Summary:
DoGe is a framework that addresses data scarcity in vision-language reasoning. It decouples context learning from problem solving and uses a curriculum to improve reward signals and data diversity, which enhances generalization and performance. A rough two-stage sketch of this idea is included below.
🔹 Publication Date: Published on Dec 7
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06835
• PDF: https://arxiv.org/pdf/2512.06835
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguage #DataScarcity #MachineLearning #AIResearch #DeepLearning
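One way to read the decoupling is as a two-stage curriculum: first reward the model for grounding itself in the visual context, then reward it for solving the problem given that context. The pseudo-training loop below is only an interpretation of that idea with stubbed reward functions, not DoGe's actual algorithm.
```python
from typing import List

def context_reward(image_desc: str) -> float:
    """Stub: score how faithfully the model described the visual context (stage 1 signal)."""
    return float(len(image_desc) > 0)

def answer_reward(answer: str, gold: str) -> float:
    """Stub: score the final answer against the reference (stage 2 signal)."""
    return float(answer.strip().lower() == gold.strip().lower())

def train_step(sample: dict, stage: str) -> float:
    """One hypothetical update: which reward is optimized depends on the curriculum stage."""
    if stage == "context":
        return context_reward(sample["model_description"])
    return answer_reward(sample["model_answer"], sample["gold_answer"])

def curriculum(dataset: List[dict], context_epochs: int = 1, solve_epochs: int = 1) -> None:
    # Stage 1 trains context understanding, stage 2 trains problem solving.
    for stage, epochs in (("context", context_epochs), ("solve", solve_epochs)):
        for _ in range(epochs):
            rewards = [train_step(s, stage) for s in dataset]
            print(f"stage={stage} mean_reward={sum(rewards) / len(rewards):.2f}")

toy_data = [{
    "model_description": "a bar chart comparing two models",
    "model_answer": "Model B",
    "gold_answer": "model b",
}]
curriculum(toy_data)
```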
✨An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
📝 Summary:
This survey offers a structured guide to Vision-Language-Action (VLA) models in robotics. It breaks the field down into five key challenges: representation, execution, generalization, safety, and datasets, and serves as a roadmap for researchers.
🔹 Publication Date: Published on Dec 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11362
• PDF: https://arxiv.org/pdf/2512.11362
• Project Page: https://suyuz1.github.io/Survery/
• Github: https://suyuz1.github.io/VLA-Survey-Anatomy/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VLAModels #Robotics #ArtificialIntelligence #VisionLanguage #AIResearch
✨CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
📝 Summary:
CASA enhances cross-attention for vision-language models by adding local text-to-text interaction. This substantially narrows the performance gap with costly token-insertion methods on detailed visual tasks while preserving efficiency and scalability for long-context multimodal applications. A mask-based sketch of the idea is included below.
🔹 Publication Date: Published on Dec 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19535
• PDF: https://arxiv.org/pdf/2512.19535
• Project Page: https://kyutai.org/casa
• Github: https://github.com/kyutai-labs/casa
🔹 Models citing this paper:
• https://huggingface.co/kyutai/CASA-Helium1-VL-2B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/kyutai/casa-samples
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VisionLanguage #MultimodalAI #AttentionMechanisms #EfficientAI #DeepLearning
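CASA's stated recipe keeps cross-attention from text queries to image keys and adds a local text-to-text window on top. One concrete way to express that is an attention mask over a concatenated [image; text] key sequence, as in the PyTorch sketch below; the window size and the mask-based formulation are assumptions, not necessarily the paper's implementation.
```python
import torch

def casa_style_mask(num_img: int, num_txt: int, window: int = 2) -> torch.Tensor:
    """Boolean mask of shape (num_txt, num_img + num_txt).
    Text queries may attend to every image key (cross-attention)
    plus text keys within a causal local window (local text-to-text)."""
    mask = torch.zeros(num_txt, num_img + num_txt, dtype=torch.bool)
    mask[:, :num_img] = True                                   # text -> all image tokens
    for q in range(num_txt):
        lo, hi = max(0, q - window), min(num_txt, q + 1)       # local window over preceding text
        mask[q, num_img + lo : num_img + hi] = True
    return mask

num_img, num_txt, dim = 4, 6, 8
queries = torch.randn(num_txt, dim)                # text-token queries
keys = torch.randn(num_img + num_txt, dim)         # image keys followed by text keys
values = torch.randn(num_img + num_txt, dim)

scores = queries @ keys.T / dim ** 0.5
scores = scores.masked_fill(~casa_style_mask(num_img, num_txt), float("-inf"))
out = torch.softmax(scores, dim=-1) @ values       # fused image + local-text context per text token
print(out.shape)                                   # torch.Size([6, 8])
```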