✨MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
📝 Summary:
MME-CC is a new vision-grounded benchmark for evaluating multimodal large language models' cognitive capacity on spatial, geometric, and knowledge-based reasoning tasks. It reveals that while some models lead, spatial and geometric reasoning remain broadly weak. This highlights the need for better ev...
🔹 Publication Date: Published on Nov 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03146
• PDF: https://arxiv.org/pdf/2511.03146
• Project Page: https://randomtutu.github.io/MME-CC/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #LLMs #Benchmarking #CognitiveAI #ComputerVision
✨LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
📝 Summary:
The paper introduces LEGO-Eval, a tool-augmented framework, and LEGO-Bench, a detailed instruction benchmark, to improve 3D scene evaluation. It shows that LEGO-Eval accurately assesses scene-instruction alignment, outperforming VLMs, and that current generation methods largely fail to create realistic sce...
🔹 Publication Date: Published on Nov 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03001
• PDF: https://arxiv.org/pdf/2511.03001
• Project Page: https://gyeomh.github.io/LEGO-Eval/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#EmbodiedAI #3DGeneration #EvaluationMetrics #VLMs #Benchmarking
✨MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap in existing single-video benchmarks and reveals significant performance limitations in current MLLMs for multi-video scenarios.
🔹 Publication Date: Published on Nov 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
✨SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
📝 Summary:
SWE-fficiency is a new benchmark evaluating how language models optimize real-world software repositories for performance on actual workloads. Agents must identify bottlenecks and generate correct code patches matching expert speedup. Current agents significantly underperform, struggling with loc...
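Not from the paper itself, but to make the evaluation target concrete: the quantity an agent is judged on is the speedup of a fixed workload after its patch, subject to correctness. A minimal sketch under that assumption, where `workload_baseline` and `workload_patched` are hypothetical stand-ins for a repository code path before and after optimization:

```python
# Minimal sketch (not the benchmark's harness): measure wall-clock speedup of a
# workload before and after an optimization. The two workload functions are
# hypothetical stand-ins for real repository code paths.
import statistics
import time


def workload_baseline(n: int = 200_000) -> int:
    # Deliberately slow: repeated string concatenation.
    s = ""
    for _ in range(n):
        s += "x"
    return len(s)


def workload_patched(n: int = 200_000) -> int:
    # Optimized equivalent: build once with join.
    return len("".join("x" for _ in range(n)))


def median_runtime(fn, repeats: int = 5) -> float:
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)


if __name__ == "__main__":
    # Correctness gate: the patch must preserve behavior before speedup counts.
    assert workload_baseline(1_000) == workload_patched(1_000)
    baseline = median_runtime(workload_baseline)
    patched = median_runtime(workload_patched)
    print(f"speedup: {baseline / patched:.2f}x")
```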
🔹 Publication Date: Published on Nov 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06090
• PDF: https://arxiv.org/pdf/2511.06090
• Project Page: https://swefficiency.com/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LLM #SoftwareOptimization #PerformanceTuning #AIagents #Benchmarking
✨Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
📝 Summary:
This paper introduces a framework to robustly evaluate diversity in text-to-image models. It uses a novel human evaluation template, curated prompts with variation factors, and systematic analysis of image embeddings to rank models and identify diversity weaknesses.
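The summary does not specify the paper's exact embedding analysis, but a common way to turn image embeddings into a diversity score is mean pairwise cosine distance over the images generated for one prompt. A small numpy sketch under that assumption, with random vectors standing in for real encoder features:

```python
# Toy sketch: mean pairwise cosine distance as a diversity proxy over image
# embeddings for a single prompt. The embeddings are random placeholders; in
# practice they would come from a vision encoder such as CLIP.
import numpy as np


def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    # Normalize rows, then average (1 - cosine similarity) over distinct pairs.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.mean(1.0 - sims[iu]))


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(8, 512))  # 8 images per prompt, 512-d features
print(f"diversity score: {mean_pairwise_cosine_distance(fake_embeddings):.3f}")
```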
🔹 Publication Date: Published on Nov 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.10547
• PDF: https://arxiv.org/pdf/2511.10547
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageGeneration #TextToImage #AIDiversity #Benchmarking #HumanEvaluation
✨MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
📝 Summary:
MM-CRITIC is a new benchmark evaluating Large Multimodal Models' critique abilities across various dimensions and tasks. It uses expert-informed ground answers and GPT-4o for reliable scoring. This benchmark provides a comprehensive assessment of leading LMMs' critique capabilities.
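For illustration only, here is the general shape of GPT-4o-as-judge scoring of a critique against an expert reference, using the OpenAI Python SDK. The prompt, rubric, and example inputs are placeholders, not MM-CRITIC's actual protocol:

```python
# Illustrative sketch of LLM-as-judge scoring: grade a model-written critique
# against an expert reference answer. Prompt and rubric are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_critique(task: str, reference: str, critique: str) -> str:
    prompt = (
        "You are grading a model-written critique.\n"
        f"Task: {task}\n"
        f"Expert reference answer: {reference}\n"
        f"Critique to grade: {critique}\n"
        "Return a single integer score from 1 (poor) to 5 (excellent)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


print(judge_critique(
    task="Does the caption match the chart?",
    reference="The caption overstates the trend; growth is 3%, not 30%.",
    critique="The caption looks fine to me.",
))
```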
🔹 Publication Date: Published on Nov 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.09067
• PDF: https://arxiv.org/pdf/2511.09067
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LMMs #MultimodalAI #AIEvaluation #Benchmarking #AIResearch
🤖🧠 OpenAI Evals: The Framework Transforming LLM Evaluation and Benchmarking
🗓️ 16 Nov 2025
📚 AI News & Trends
As large language models (LLMs) continue to reshape industries, from education and healthcare to marketing and software development, the need for reliable evaluation methods has never been greater. With new models constantly emerging, developers and researchers require a standardized system to test, compare, and understand model performance across real-world scenarios. This is where OpenAI ...
#OpenAIEvals #LLMEvaluation #Benchmarking #LargeLanguageModels #AIResearch #ModelEvaluation
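The pattern behind frameworks like OpenAI Evals is a simple loop: run a model over a fixed dataset, score each completion against a reference, and aggregate. A minimal sketch of that pattern with the OpenAI Python SDK; this is a generic illustration, not the Evals library's own API, and the dataset and model name are placeholders:

```python
# Generic eval-loop sketch (not the OpenAI Evals API): run a model over a tiny
# fixed dataset, score completions by exact match, and report accuracy.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder dataset; a real eval would load samples from a registry or JSONL file.
SAMPLES = [
    {"prompt": "What is the capital of France? Answer with one word.", "ideal": "Paris"},
    {"prompt": "What is 12 * 12? Answer with a number only.", "ideal": "144"},
]


def run_eval(model: str = "gpt-4o-mini") -> float:
    correct = 0
    for sample in SAMPLES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": sample["prompt"]}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip()
        correct += int(answer == sample["ideal"])
    return correct / len(SAMPLES)


if __name__ == "__main__":
    print(f"exact-match accuracy: {run_eval():.2%}")
```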
✨DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
📝 Summary:
A new benchmark, DiscoX, and evaluation system, Metric-S, are introduced for discourse-level, expert Chinese-English translation. Findings show advanced LLMs still fall short of human performance, underscoring challenges in professional machine translation.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.10984
• PDF: https://arxiv.org/pdf/2511.10984
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MachineTranslation #NLP #LLM #Benchmarking #AI
✨V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
📝 Summary:
V-ReasonBench is a new benchmark to evaluate generative video models' reasoning across structured problem-solving, spatial cognition, pattern inference, and physical dynamics. It uses diverse tasks to reveal dimension-wise differences in models, aiming to support development of human-aligned reas...
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16668
• PDF: https://arxiv.org/pdf/2511.16668
• Project Page: https://oahzxl.github.io/VReasonBench/
• Github: https://github.com/yangluo7/V-ReasonBench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #AIReasoning #GenerativeAI #Benchmarking #MachineLearning
✨TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
📝 Summary:
TurkColBERT, the first benchmark for Turkish information retrieval, shows that late-interaction models significantly outperform dense encoders. They offer superior parameter efficiency, faster indexing, and better performance on Turkish retrieval tasks.
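For readers unfamiliar with the distinction: a dense bi-encoder scores a query-document pair with a single dot product between pooled vectors, while late-interaction (ColBERT-style) models keep per-token embeddings and score with MaxSim. A small torch sketch of MaxSim scoring, with random tensors standing in for real encoder outputs:

```python
# ColBERT-style MaxSim scoring sketch; random tensors stand in for per-token
# embeddings produced by a real encoder.
import torch


def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (q_tokens, dim), doc_emb: (d_tokens, dim), both L2-normalized."""
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) cosine similarities
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed


torch.manual_seed(0)
dim = 128
query = torch.nn.functional.normalize(torch.randn(8, dim), dim=-1)
docs = [torch.nn.functional.normalize(torch.randn(n, dim), dim=-1) for n in (40, 60)]

# Late interaction ranks documents by MaxSim; a dense bi-encoder would instead
# pool each side to one vector and take a single dot product.
scores = [maxsim_score(query, d).item() for d in docs]
print(scores)
```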
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16528
• PDF: https://arxiv.org/pdf/2511.16528
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#InformationRetrieval #TurkishNLP #MachineLearning #DeepLearning #Benchmarking
✨M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
📝 Summary:
M3-Bench is a new benchmark evaluating multimodal LLM agent tool use in complex, multi-hop workflows requiring visual grounding and tool dependencies. It introduces a similarity-driven alignment method and interpretable metrics. Evaluations show significant gaps in current MLLMs, especially in ar...
🔹 Publication Date: Published on Nov 21
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17729
• PDF: https://arxiv.org/pdf/2511.17729
• Github: https://github.com/EtaYang10th/Open-M3-Bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MLLM #LLMAgents #AI #Benchmarking #ToolUse
✨Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
📝 Summary:
Benchmarking LLMs on subjective tasks like emotional intelligence is challenging. The Language Model Council (LMC) uses a democratic process with 20 LLMs to formulate, administer, and evaluate tests. This yields more robust, less biased rankings that align better with human leaderboards.
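The summary does not spell out the aggregation rule, but the council idea reduces to combining rankings from many LLM judges into one consensus ordering. A toy Borda-count sketch under that assumption, with hypothetical judge rankings:

```python
# Toy consensus ranking via Borda count: each council member (an LLM judge)
# submits a ranking of respondent models; positions become points and are summed.
# This illustrates rank aggregation in general, not the paper's exact rule.
from collections import defaultdict

# Hypothetical rankings from three judges over four respondent models (best first).
judge_rankings = [
    ["model_a", "model_c", "model_b", "model_d"],
    ["model_c", "model_a", "model_d", "model_b"],
    ["model_a", "model_b", "model_c", "model_d"],
]


def borda_consensus(rankings: list[list[str]]) -> list[tuple[str, int]]:
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] += n - 1 - position  # top rank earns the most points
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(borda_consensus(judge_rankings))
```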
🔹 Publication Date: Published on Jun 12, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2406.08598
• PDF: https://arxiv.org/pdf/2406.08598
• Github: https://github.com/llm-council/llm-council
✨ Datasets citing this paper:
• https://huggingface.co/datasets/llm-council/emotional_application
✨ Spaces citing this paper:
• https://huggingface.co/spaces/llm-council/llm-council
• https://huggingface.co/spaces/llm-council/sandbox
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LLM #Benchmarking #AIEvaluation #FoundationModels #ConsensusAI
✨PAI-Bench: A Comprehensive Benchmark For Physical AI
📝 Summary:
PAI-Bench is a new benchmark evaluating multi-modal LLMs and video generative models for physical AI perception and prediction. It reveals current models struggle with physical coherence, forecasting, and causal reasoning in real-world dynamics. This highlights significant gaps for future physica...
🔹 Publication Date: Published on Dec 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01989
• PDF: https://arxiv.org/pdf/2512.01989
• Github: https://github.com/SHI-Labs/physical-ai-bench
✨ Spaces citing this paper:
• https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#PhysicalAI #LLMs #Benchmarking #GenerativeAI #ComputerVision
✨Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
📝 Summary:
VideoScience-Bench introduces a new benchmark evaluating video models' scientific reasoning. It assesses their ability to generate phenomena consistent with undergraduate physics and chemistry, filling a critical gap. It is the first benchmark to evaluate video models as scientific reasoners.
🔹 Publication Date: Published on Dec 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02942
• PDF: https://arxiv.org/pdf/2512.02942
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VideoGeneration #AIResearch #ScientificReasoning #AIModels #Benchmarking
✨AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
📝 Summary:
AlignBench is a new benchmark for fine-grained image-text alignment, using detailed synthetic image-caption pairs. It reveals that CLIP-based models struggle with compositional reasoning and shows detector self-preference.
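As background, the kind of CLIP-based image-text matching that AlignBench stresses looks like the sketch below: score one image against several captions and pick the best match. The model name, placeholder image, and captions are illustrative; fine-grained compositional cases are exactly where such scores become unreliable.

```python
# Sketch of CLIP image-text matching with Hugging Face transformers. The image
# and captions are placeholders used only to show the scoring pattern.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="red")  # placeholder image
captions = [
    "a red square to the left of a blue circle",
    "a blue square to the left of a red circle",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_captions)
best = logits_per_image.softmax(dim=-1).argmax().item()
print("best caption:", captions[best])
```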
🔹 Publication Date: Published on Nov 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20515
• PDF: https://arxiv.org/pdf/2511.20515
• Project Page: https://dahlian00.github.io/AlignBench/
• Github: https://dahlian00.github.io/AlignBench/
✨ Datasets citing this paper:
• https://huggingface.co/datasets/omron-sinicx/AlignBench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#ImageTextAlignment #MultimodalAI #ComputerVision #Benchmarking #CLIPModels
✨DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
📝 Summary:
DAComp is a benchmark with 210 tasks for data engineering and analysis workflows. It reveals significant deficiencies in state-of-the-art agents, with success rates under 20% for engineering and below 40% for analysis, highlighting critical gaps.
🔹 Publication Date: Published on Dec 3
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04324
• PDF: https://arxiv.org/pdf/2512.04324
• Project Page: https://da-comp.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#DataAgents #Benchmarking #DataEngineering #DataAnalysis #AIResearch
✨IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
📝 Summary:
IF-Bench is introduced as the first benchmark to evaluate multimodal large language models on infrared images using diverse assessment strategies. It includes varied infrared images and question-answer pairs for systematic evaluation of over 40 models. The paper also proposes GenViP, a training-f...
🔹 Publication Date: Published on Dec 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.09663
• PDF: https://arxiv.org/pdf/2512.09663
🔹 Models citing this paper:
• https://huggingface.co/casiatao/Qwen-Edit-2509-FT
✨ Datasets citing this paper:
• https://huggingface.co/datasets/casiatao/IF-Bench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MLLMs #InfraredImaging #Benchmarking #GenerativeAI #AIResearch
✨GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
📝 Summary:
GroundingME is a new benchmark revealing significant visual grounding gaps in MLLMs, which often hallucinate instead of rejecting ungroundable queries. State-of-the-art models only reach 45.1% accuracy, raising safety concerns. Data-mixture training shows promise in improving their ability to rec...
🔹 Publication Date: Published on Dec 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17495
• PDF: https://arxiv.org/pdf/2512.17495
• Project Page: https://groundingme.github.io/
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MLLMs #VisualGrounding #AISafety #AIResearch #Benchmarking
✨SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
📝 Summary:
SWE-EVO is a new benchmark for AI coding agents that evaluates them on long-horizon, multi-step software evolution tasks spanning many files. It reveals a significant gap in current models' abilities, with even top models achieving only 21 percent resolution. This highlights their struggle with sust...
🔹 Publication Date: Published on Dec 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.18470
• PDF: https://arxiv.org/pdf/2512.18470
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Fsoft-AIC/SWE-EVO
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AICoding #SoftwareEvolution #Benchmarking #LLMs #AIResearch
✨InfoSynth: Information-Guided Benchmark Synthesis for LLMs
📝 Summary:
InfoSynth automatically generates novel and diverse coding benchmarks for LLMs. It uses information-theoretic metrics and genetic algorithms to create scalable, self-verifying problems, avoiding manual curation effort and training-data contamination.
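As a toy illustration of "genetic search against an information-theoretic objective" (not the paper's actual metric or pipeline), the sketch below evolves short candidate strings toward higher character-level Shannon entropy used as a stand-in score:

```python
# Toy genetic loop: evolve short strings toward higher Shannon entropy, a
# stand-in for InfoSynth's information-theoretic objective, which this is not.
import math
import random
from collections import Counter

ALPHABET = "abcdefghij+-*() "


def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutate(s: str, rng: random.Random) -> str:
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


def evolve(generations: int = 200, pop_size: int = 30, length: int = 16) -> str:
    rng = random.Random(0)
    population = ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=shannon_entropy, reverse=True)
        survivors = population[: pop_size // 2]                              # selection
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]   # mutation
        population = survivors + children
    return max(population, key=shannon_entropy)


best = evolve()
print(best, f"entropy={shannon_entropy(best):.2f}")
```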
🔹 Publication Date: Published on Jan 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00575
• PDF: https://arxiv.org/pdf/2601.00575
• Project Page: https://ishirgarg.github.io/infosynth_web/
• Github: https://github.com/ishirgarg/infosynth
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LLM #AI #Benchmarking #GenerativeAI #DeepLearning