✨GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
📝 Summary:
GUI-360 is a large dataset and benchmark for computer-using agents, addressing gaps in real-world tasks and unified evaluation. It contains over 1.2M action steps in Windows apps for GUI grounding, screen parsing, and action prediction. Benchmarking reveals significant shortcomings in current mod...
🔹 Publication Date: Published on Nov 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.04307
• PDF: https://arxiv.org/pdf/2511.04307
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AI #ComputerAgents #GUIAgents #Dataset #Benchmark
✨ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
📝 Summary:
ATLAS is a new, high-difficulty, multidisciplinary benchmark for LLMs, featuring 800 original problems across seven scientific fields. It addresses current benchmark limitations with complex, open-ended answers and aims to differentiate advanced scientific reasoning, serving as a ruler for AGI pr...
🔹 Publication Date: Published on Nov 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14366
• PDF: https://arxiv.org/pdf/2511.14366
#LLM #AGI #AIResearch #ScientificReasoning #Benchmark
✨UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
📝 Summary:
This paper addresses performance gaps in image editing models caused by data scarcity. It introduces UnicEdit-10M, a 10M-scale high-quality dataset built with a lightweight, verified pipeline, and proposes UnicBench, a new benchmark with novel metrics for diagnosing reasoning limitations in models.
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02790
• PDF: https://arxiv.org/pdf/2512.02790
#ImageEditing #AI #Dataset #Benchmark #ComputerVision
✨OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
📝 Summary:
OmniSafeBench-MM is a unified toolbox for evaluating multi-modal jailbreak attacks and defenses in MLLMs. It integrates various attacks, defense strategies, and a diverse dataset to provide a comprehensive, standardized, and reproducible platform for research.
🔹 Publication Date: Published on Dec 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06589
• PDF: https://arxiv.org/pdf/2512.06589
#MLLMs #AISafety #AIsecurity #Benchmark #DeepLearning