ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

📝 Summary:
WebWatcher is a multimodal agent that strengthens vision-language reasoning for complex information retrieval. Trained on synthetic multi-step trajectories with tool use and reinforcement learning, it outperforms existing agents on multimodal information-seeking tasks.
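
🔹 Illustrative sketch (not from the paper): a minimal tool-augmented agent loop of the kind such a vision-language research agent rolls out. `policy`, `tools`, and the stop action are hypothetical placeholders; WebWatcher's actual architecture, tool set, and RL training are described in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str
    tool: str
    tool_input: str
    observation: str

@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_agent(question, policy, tools, max_steps=8):
    """Roll out one reasoning trajectory: the policy alternates between
    thinking, calling a tool (web search, image zoom, OCR, ...), and
    reading the observation, until it emits a final answer."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        thought, tool, tool_input = policy.decide(question, traj.steps)
        if tool == "final_answer":
            traj.answer = tool_input
            break
        observation = tools[tool](tool_input)  # e.g. a search-result snippet
        traj.steps.append(Step(thought, tool, tool_input, observation))
    return traj
```

Trajectories of this shape, generated synthetically, are what the paper uses as training data before RL fine-tuning.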

🔹 Publication Date: Published on Aug 7, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.05748
• Explainer: https://arxivexplained.com/papers/webwatcher-breaking-new-frontier-of-vision-language-deep-research-agent
• PDF: https://arxiv.org/pdf/2508.05748
• Project Page: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
• Github: https://github.com/Alibaba-NLP/WebAgent

🔹 Models citing this paper:
https://huggingface.co/Alibaba-NLP/WebWatcher-32B
https://huggingface.co/Alibaba-NLP/WebWatcher-7B

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguage #MultimodalAI #DeepLearning #AIagents #InformationRetrieval
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

📝 Summary:
ReVeL converts multiple-choice questions into verifiable open-form questions, countering answer guessing and the unreliable MCQA metrics it produces. The framework improves data efficiency and robustness for multimodal language models, and reveals significant score inflation in existing MCQA benchmarks.
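
🔹 Illustrative sketch (not from the paper): a toy version of the MCQA-to-OpenQA conversion and answer verification that ReVeL performs. The real framework uses model-based rewriting and verification; these helper names are hypothetical.

```python
import re

def to_open_form(mcq_question: str, options: dict, gold_key: str):
    """Drop the options so the model cannot guess by elimination;
    keep the gold option's text as the verifiable reference answer."""
    stem = re.sub(r"\n[A-D]\).*", "", mcq_question, flags=re.S)
    return stem.strip(), options[gold_key]

def verify(prediction: str, reference: str) -> bool:
    """Toy verifier: exact match after normalization. ReVeL's verifier
    is stronger; this only illustrates the open-form interface."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    return norm(prediction) == norm(reference)

mcq = "What is the capital of France?\nA) Berlin\nB) Paris\nC) Rome\nD) Madrid"
opts = {"A": "Berlin", "B": "Paris", "C": "Rome", "D": "Madrid"}
open_q, ref = to_open_form(mcq, opts, "B")
print(open_q)                 # question stem without options
print(verify("Paris.", ref))  # True
```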

🔹 Publication Date: Published on Nov 21, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17405
• PDF: https://arxiv.org/pdf/2511.17405
• Project Page: https://flageval-baai.github.io/ReVeL/

==================================

For more data science resources:
https://t.me/DataScienceT

#OpenQA #VisionLanguage #LanguageModels #AIEvaluation #MachineLearning
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

📝 Summary:
Agent0-VL is a self-evolving vision-language agent that integrates tool usage into both reasoning and self-evaluation. It uses a Solver and Verifier in a self-evolving cycle for continuous improvement without human annotation or external rewards, achieving a 12.5% performance gain.
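
🔹 Illustrative sketch (not from the paper): the shape of a Solver/Verifier self-evolution loop. The sampling, scoring threshold, and update rules below are placeholder assumptions; Agent0-VL grounds verification in tool use and trains with RL rather than this naive best-of-n imitation step.

```python
def self_evolve(tasks, solver, verifier, rounds=3, n_samples=4):
    """Alternate: the solver proposes reasoning traces, the verifier
    scores them (no human labels, no external reward model), and both
    roles are updated on the self-verified traces."""
    for _ in range(rounds):
        selected = []
        for task in tasks:
            candidates = [solver.rollout(task) for _ in range(n_samples)]
            scored = [(verifier.score(task, c), c) for c in candidates]
            best_score, best = max(scored, key=lambda s: s[0])
            if best_score > 0.5:       # keep only self-verified traces
                selected.append((task, best))
        solver.update(selected)        # imitate its own best traces
        verifier.update(selected)      # verifier co-evolves with the solver
    return solver, verifier
```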

🔹 Publication Date: Published on Nov 25, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19900
• PDF: https://arxiv.org/pdf/2511.19900

==================================

For more data science resources:
https://t.me/DataScienceT

#AIAgents #VisionLanguage #SelfEvolvingAI #ToolAugmentedAI #AIResearch
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

📝 Summary:
DoGe is a framework that addresses data scarcity in vision-language reasoning. It decouples context learning from problem solving and uses a curriculum to improve reward signals and data diversity, enhancing generalization and performance.
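
🔹 Illustrative sketch (not from the paper): one way to read the context-first decoupling. The stage split and reward shapes here are assumptions; DoGe's actual curriculum and reward design are specified in the paper.

```python
def doge_reward(model, scorer, image, question, answer, stage):
    """Stage 'context': reward the model for describing the visual
    context alone, decoupled from the question. Later stage: reward
    solving the question conditioned on its self-generated context."""
    context = model.describe(image)
    if stage == "context":
        # Hypothetical scorer of description faithfulness/diversity.
        return scorer.context_quality(image, context)
    prediction = model.solve(question, context)
    return float(prediction.strip() == answer.strip())
```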

🔹 Publication Date: Published on Dec 7, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06835
• PDF: https://arxiv.org/pdf/2512.06835

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguage #DataScarcity #MachineLearning #AIResearch #DeepLearning
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

📝 Summary:
This survey offers a structured guide to Vision-Language-Action (VLA) models in robotics. It breaks the field down into five key challenges: representation, execution, generalization, safety, and datasets, and serves as a roadmap for researchers.

🔹 Publication Date: Published on Dec 12, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11362
• PDF: https://arxiv.org/pdf/2512.11362
• Project Page: https://suyuz1.github.io/Survery/
• Github: https://suyuz1.github.io/VLA-Survey-Anatomy/

==================================

For more data science resources:
https://t.me/DataScienceT

#VLAModels #Robotics #ArtificialIntelligence #VisionLanguage #AIResearch
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

📝 Summary:
CASA enhances cross-attention for vision-language models by adding local text-to-text interaction. This approach substantially narrows the performance gap with costly token-insertion methods on detailed visual tasks while maintaining efficiency and scalability for long-context multimodal applications.
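
🔹 Illustrative sketch (not from the paper): a rough reading of the abstract, in which text queries cross-attend to image tokens plus a local window of preceding text tokens, instead of having image tokens inserted into the text stream. The windowing rule and shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def casa_attention(text_q, text_kv, img_kv, window=8):
    """text_q, text_kv: (T, d) text queries and keys/values;
    img_kv: (I, d) image keys/values. Each text position attends to
    all image tokens plus the last `window` text tokens."""
    T, d = text_q.shape
    out = torch.empty_like(text_q)
    for t in range(T):
        lo = max(0, t - window + 1)
        kv = torch.cat([text_kv[lo:t + 1], img_kv], dim=0)
        attn = F.softmax(text_q[t] @ kv.T / d ** 0.5, dim=-1)
        out[t] = attn @ kv
    return out

# Toy usage: 6 text tokens, 4 image tokens, 16-dim features.
print(casa_attention(torch.randn(6, 16), torch.randn(6, 16),
                     torch.randn(4, 16)).shape)  # torch.Size([6, 16])
```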

🔹 Publication Date: Published on Dec 22, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19535
• PDF: https://arxiv.org/pdf/2512.19535
• Project Page: https://kyutai.org/casa
• Github: https://github.com/kyutai-labs/casa

🔹 Models citing this paper:
https://huggingface.co/kyutai/CASA-Helium1-VL-2B

🔹 Spaces citing this paper:
https://huggingface.co/spaces/kyutai/casa-samples

==================================

For more data science resources:
https://t.me/DataScienceT

#VisionLanguage #MultimodalAI #AttentionMechanisms #EfficientAI #DeepLearning