ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

📝 Summary:
MME-CC is a new vision-grounded benchmark for evaluating multimodal large language models' cognitive capacity on spatial, geometric, and knowledge-based reasoning tasks. It reveals that while some models lead, spatial and geometric reasoning remain broadly weak. This highlights the need for better evaluation of these capabilities.

🔹 Publication Date: Published on Nov 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03146
• PDF: https://arxiv.org/pdf/2511.03146
• Project Page: https://randomtutu.github.io/MME-CC/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #LLMs #Benchmarking #CognitiveAI #ComputerVision
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

📝 Summary:
The paper introduces LEGO-Eval, a tool-augmented framework, and LEGO-Bench, a detailed instruction benchmark, to improve 3D scene evaluation. It shows that LEGO-Eval accurately assesses scene-instruction alignment, outperforming VLMs, and that current generation methods largely fail to create realistic scenes.

🔹 Publication Date: Published on Nov 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03001
• PDF: https://arxiv.org/pdf/2511.03001
• Project Page: https://gyeomh.github.io/LEGO-Eval/

==================================

For more data science resources:
https://t.me/DataScienceT

#EmbodiedAI #3DGeneration #EvaluationMetrics #VLMs #Benchmarking
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap in existing single-video benchmarks and reveals significant performance limitations in current MLLMs for multi-video scenarios.

🔹 Publication Date: Published on Nov 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

📝 Summary:
SWE-fficiency is a new benchmark evaluating how language models optimize real-world software repositories for performance on actual workloads. Agents must identify bottlenecks and generate correct code patches that match expert speedups. Current agents significantly underperform, struggling with localization.
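
As an illustration of the kind of measurement such a benchmark implies, here is a minimal sketch that times a workload before and after a candidate patch and compares the speedup to an expert reference; the helper names, repeat count, and toy workloads are hypothetical, not taken from the paper.

```python
import statistics
import time


def time_workload(workload, repeats=5):
    """Run a workload callable several times and return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup_ratio(baseline_workload, patched_workload, repeats=5):
    """Speedup of the patched code relative to the baseline (>1.0 means faster)."""
    baseline = time_workload(baseline_workload, repeats)
    patched = time_workload(patched_workload, repeats)
    return baseline / patched


if __name__ == "__main__":
    # Toy stand-ins for "repository before/after an agent's patch".
    slow = lambda: sum(i * i for i in range(1_000_000))
    fast = lambda: sum(i * i for i in range(500_000))  # pretend optimization
    ratio = speedup_ratio(slow, fast)
    expert_speedup = 2.0  # hypothetical expert reference value
    print(f"agent speedup {ratio:.2f}x vs expert {expert_speedup:.2f}x")
```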

🔹 Publication Date: Published on Nov 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06090
• PDF: https://arxiv.org/pdf/2511.06090
• Project Page: https://swefficiency.com/

==================================

For more data science resources:
https://t.me/DataScienceT

#LLM #SoftwareOptimization #PerformanceTuning #AIagents #Benchmarking
Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

📝 Summary:
This paper introduces a framework to robustly evaluate diversity in text-to-image models. It uses a novel human evaluation template, curated prompts with variation factors, and systematic analysis of image embeddings to rank models and identify diversity weaknesses.
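
As a rough illustration of the "systematic analysis of image embeddings" mentioned above, the sketch below scores diversity as the mean pairwise cosine distance among embeddings of images generated for one prompt; the metric and the random toy embeddings are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np


def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Average pairwise cosine distance over a set of image embeddings.

    Higher values mean the images generated for one prompt are more spread
    out in embedding space, i.e. more diverse.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Exclude the diagonal (self-similarity) from the average.
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(np.mean(1.0 - off_diag))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tight_cluster = rng.normal(0, 0.01, size=(8, 512)) + 1.0  # near-duplicate images
    spread_out = rng.normal(0, 1.0, size=(8, 512))            # varied images
    print("low diversity :", mean_pairwise_cosine_distance(tight_cluster))
    print("high diversity:", mean_pairwise_cosine_distance(spread_out))
```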

🔹 Publication Date: Published on Nov 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.10547
• PDF: https://arxiv.org/pdf/2511.10547

==================================

For more data science resources:
https://t.me/DataScienceT

#ImageGeneration #TextToImage #AIDiversity #Benchmarking #HumanEvaluation
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

📝 Summary:
MM-CRITIC is a new benchmark evaluating Large Multimodal Models' critique abilities across various dimensions and tasks. It uses expert-informed ground-truth answers and GPT-4o for reliable scoring. This benchmark provides a comprehensive assessment of leading LMMs' critique capabilities.
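
The "GPT-4o for reliable scoring" step follows the familiar LLM-as-judge pattern. The sketch below shows that pattern with the official openai Python SDK (an API key is assumed); the rubric, prompt, and function name are invented for illustration and are not MM-CRITIC's actual protocol.

```python
from openai import OpenAI  # assumes the official openai SDK and an API key in the environment

client = OpenAI()

# Hypothetical grading rubric, not taken from the paper.
JUDGE_PROMPT = """You are grading a model-written critique of an answer to a multimodal task.
Reference (expert-informed) answer:
{reference}

Critique to grade:
{critique}

Score the critique from 1 (useless) to 10 (expert-level) and reply with only the number."""


def score_critique(reference: str, critique: str, judge_model: str = "gpt-4o") -> int:
    """Ask a judge model to score a critique against a reference answer."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reference=reference, critique=critique)}],
        temperature=0,
    )
    # In a real pipeline you would validate the reply before parsing it.
    return int(response.choices[0].message.content.strip())
```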

🔹 Publication Date: Published on Nov 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.09067
• PDF: https://arxiv.org/pdf/2511.09067

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #MultimodalAI #AIEvaluation #Benchmarking #AIResearch
🤖🧠 OpenAI Evals: The Framework Transforming LLM Evaluation and Benchmarking

🗓️ 16 Nov 2025
📚 AI News & Trends

As large language models (LLMs) continue to reshape industries, from education and healthcare to marketing and software development, the need for reliable evaluation methods has never been greater. With new models constantly emerging, developers and researchers require a standardized system to test, compare, and understand model performance across real-world scenarios. This is where OpenAI Evals comes in.
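
Frameworks like this standardize simple evaluation templates such as exact match against a reference answer. The snippet below is a generic sketch of that pattern, not the OpenAI Evals API itself; the sample data and dummy model are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalSample:
    prompt: str
    ideal: str  # expected answer


def run_exact_match_eval(model_fn: Callable[[str], str], samples: List[EvalSample]) -> float:
    """Score a model by exact match against ideal answers, the simplest eval template."""
    correct = 0
    for sample in samples:
        prediction = model_fn(sample.prompt).strip().lower()
        correct += prediction == sample.ideal.strip().lower()
    return correct / len(samples)


if __name__ == "__main__":
    samples = [
        EvalSample("Capital of France?", "paris"),
        EvalSample("2 + 2 = ?", "4"),
    ]
    dummy_model = lambda prompt: "Paris" if "France" in prompt else "5"
    print(f"accuracy: {run_exact_match_eval(dummy_model, samples):.2f}")
```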

#OpenAIEvals #LLMEvaluation #Benchmarking #LargeLanguageModels #AIResearch #ModelEvaluation
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

📝 Summary:
A new benchmark, DiscoX, and an evaluation system, Metric-S, are introduced for discourse-level Chinese-English translation in expert domains. Findings show advanced LLMs still fall short of human performance, underscoring the challenges of professional machine translation.

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.10984
• PDF: https://arxiv.org/pdf/2511.10984

==================================

For more data science resources:
https://t.me/DataScienceT

#MachineTranslation #NLP #LLM #Benchmarking #AI
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

📝 Summary:
V-ReasonBench is a new benchmark to evaluate generative video models' reasoning across structured problem-solving, spatial cognition, pattern inference, and physical dynamics. It uses diverse tasks to reveal dimension-wise differences between models, aiming to support the development of human-aligned reasoning.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16668
• PDF: https://arxiv.org/pdf/2511.16668
• Project Page: https://oahzxl.github.io/VReasonBench/
• Github: https://github.com/yangluo7/V-ReasonBench

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoGeneration #AIReasoning #GenerativeAI #Benchmarking #MachineLearning
TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

📝 Summary:
TurkColBERT, the first benchmark for Turkish IR, shows late-interaction models significantly outperform dense encoders. They offer superior parameter efficiency, faster indexing, and better performance for Turkish retrieval tasks.
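
For readers unfamiliar with the distinction, the sketch below contrasts dense bi-encoder scoring (one vector per query and document) with ColBERT-style late interaction (MaxSim over token embeddings); the random toy embeddings are stand-ins, not outputs of any TurkColBERT model.

```python
import numpy as np


def dense_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Dense bi-encoder: one vector per query/document, scored by a single dot product."""
    return float(query_vec @ doc_vec)


def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: each query token matches its best document token,
    and the per-token maxima are summed."""
    sims = query_tokens @ doc_tokens.T      # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())    # best document token per query token


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_tokens = rng.normal(size=(4, 128))    # toy token embeddings for a query
    d_tokens = rng.normal(size=(50, 128))   # toy token embeddings for a document
    print("dense           :", dense_score(q_tokens.mean(axis=0), d_tokens.mean(axis=0)))
    print("late interaction:", late_interaction_score(q_tokens, d_tokens))
```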

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16528
• PDF: https://arxiv.org/pdf/2511.16528

==================================

For more data science resources:
https://t.me/DataScienceT

#InformationRetrieval #TurkishNLP #MachineLearning #DeepLearning #Benchmarking
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

📝 Summary:
M3-Bench is a new benchmark evaluating multimodal LLM agent tool use in complex, multi-hop workflows requiring visual grounding and tool dependencies. It introduces a similarity-driven alignment method and interpretable metrics. Evaluations show significant gaps in current MLLMs, especially in ar...

🔹 Publication Date: Published on Nov 21

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17729
• PDF: https://arxiv.org/pdf/2511.17729
• Github: https://github.com/EtaYang10th/Open-M3-Bench

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLM #LLMAgents #AI #Benchmarking #ToolUse
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

📝 Summary:
Benchmarking LLMs on subjective tasks like emotional intelligence is challenging. The Language Model Council (LMC) uses a democratic process with 20 LLMs to formulate, administer, and evaluate tests. This yields more robust, less biased rankings that align better with human leaderboards.
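
One simple way to picture the council's consensus step is rank aggregation across judges. The sketch below uses Borda counting, which is an illustrative choice rather than necessarily the LMC's actual aggregation rule; the judge and model names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def council_ranking(judge_rankings: Dict[str, List[str]]) -> List[str]:
    """Aggregate per-judge rankings with Borda counting: a model ranked r-th
    by a judge earns (n - r) points, and the point totals decide the final order."""
    scores = defaultdict(int)
    for ranking in judge_rankings.values():
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] += n - position
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical judges and candidate models, for illustration only.
    rankings = {
        "judge_a": ["model_x", "model_y", "model_z"],
        "judge_b": ["model_y", "model_x", "model_z"],
        "judge_c": ["model_x", "model_z", "model_y"],
    }
    print(council_ranking(rankings))
```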

🔹 Publication Date: Published on Jun 12, 2024

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2406.08598
• PDF: https://arxiv.org/pdf/2406.08598
• Github: https://github.com/llm-council/llm-council

Datasets citing this paper:
https://huggingface.co/datasets/llm-council/emotional_application

Spaces citing this paper:
https://huggingface.co/spaces/llm-council/llm-council
https://huggingface.co/spaces/llm-council/sandbox

==================================

For more data science resources:
https://t.me/DataScienceT

#LLM #Benchmarking #AIEvaluation #FoundationModels #ConsensusAI
PAI-Bench: A Comprehensive Benchmark For Physical AI

📝 Summary:
PAI-Bench is a new benchmark evaluating multi-modal LLMs and video generative models for physical AI perception and prediction. It reveals that current models struggle with physical coherence, forecasting, and causal reasoning in real-world dynamics. This highlights significant gaps for future physical AI systems.

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01989
• PDF: https://arxiv.org/pdf/2512.01989
• Github: https://github.com/SHI-Labs/physical-ai-bench

Spaces citing this paper:
https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard

==================================

For more data science resources:
https://t.me/DataScienceT

#PhysicalAI #LLMs #Benchmarking #GenerativeAI #ComputerVision
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

📝 Summary:
VideoScience-Bench introduces a new benchmark evaluating video models' scientific reasoning. It assesses their ability to generate phenomena consistent with undergraduate physics and chemistry, filling a critical gap. It is the first to evaluate models as scientific reasoners.

🔹 Publication Date: Published on Dec 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02942
• PDF: https://arxiv.org/pdf/2512.02942

==================================

For more data science resources:
https://t.me/DataScienceT

#VideoGeneration #AIResearch #ScientificReasoning #AIModels #Benchmarking
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

📝 Summary:
AlignBench is a new benchmark for fine-grained image-text alignment, using detailed synthetic image-caption pairs. It reveals that CLIP-based models struggle with compositional reasoning and shows detector self-preference.
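
Since the findings concern CLIP-based models, a quick way to reproduce the coarse image-caption scoring they rely on is the Hugging Face transformers CLIP checkpoint shown below. This is plain CLIP similarity for illustration, not AlignBench's fine-grained evaluation protocol, and the image path and captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # assumes transformers + torch installed

# Downloads the public CLIP checkpoint on first use.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_caption_scores(image_path: str, captions: list[str]) -> list[float]:
    """Score how well each caption matches the image with plain CLIP similarity."""
    image = Image.open(image_path)
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image  # shape (1, num_captions)
    return logits_per_image.softmax(dim=-1).squeeze(0).tolist()


# Example call (placeholder path and compositional caption pair):
# clip_caption_scores("scene.jpg", ["a red cube on a blue sphere", "a blue cube on a red sphere"])
```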

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20515
• PDF: https://arxiv.org/pdf/2511.20515
• Project Page: https://dahlian00.github.io/AlignBench/
• Github: https://dahlian00.github.io/AlignBench/

Datasets citing this paper:
https://huggingface.co/datasets/omron-sinicx/AlignBench

==================================

For more data science resources:
https://t.me/DataScienceT

#ImageTextAlignment #MultimodalAI #ComputerVision #Benchmarking #CLIPModels
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

📝 Summary:
DAComp is a benchmark with 210 tasks for data engineering and analysis workflows. It reveals significant deficiencies in state-of-the-art agents, with success rates under 20% for engineering and below 40% for analysis, highlighting critical gaps.

🔹 Publication Date: Published on Dec 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04324
• PDF: https://arxiv.org/pdf/2512.04324
• Project Page: https://da-comp.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#DataAgents #Benchmarking #DataEngineering #DataAnalysis #AIResearch
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

📝 Summary:
IF-Bench is introduced as the first benchmark to evaluate multimodal large language models on infrared images using diverse assessment strategies. It includes varied infrared images and question-answer pairs for the systematic evaluation of over 40 models. The paper also proposes GenViP, a training-free generative visual prompting method.

🔹 Publication Date: Published on Dec 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.09663
• PDF: https://arxiv.org/pdf/2512.09663

Models citing this paper:
https://huggingface.co/casiatao/Qwen-Edit-2509-FT

Datasets citing this paper:
https://huggingface.co/datasets/casiatao/IF-Bench

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #InfraredImaging #Benchmarking #GenerativeAI #AIResearch
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

📝 Summary:
GroundingME is a new benchmark revealing significant visual grounding gaps in MLLMs, which often hallucinate instead of rejecting ungroundable queries. State-of-the-art models reach only 45.1% accuracy, raising safety concerns. Data-mixture training shows promise in improving their ability to reject ungroundable queries.

🔹 Publication Date: Published on Dec 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17495
• PDF: https://arxiv.org/pdf/2512.17495
• Project Page: https://groundingme.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MLLMs #VisualGrounding #AISafety #AIResearch #Benchmarking
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

📝 Summary:
SWE-EVO is a new benchmark for AI coding agents that evaluates them on long-horizon, multi-step software evolution tasks spanning many files. It reveals a significant gap in current models' abilities, with even top models achieving only 21 percent resolution. This highlights their struggle with sustained, long-horizon changes.

🔹 Publication Date: Published on Dec 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.18470
• PDF: https://arxiv.org/pdf/2512.18470

Datasets citing this paper:
https://huggingface.co/datasets/Fsoft-AIC/SWE-EVO

==================================

For more data science resources:
https://t.me/DataScienceT

#AICoding #SoftwareEvolution #Benchmarking #LLMs #AIResearch
InfoSynth: Information-Guided Benchmark Synthesis for LLMs

📝 Summary:
InfoSynth automatically generates novel and diverse coding benchmarks for LLMs. It uses information-theoretic metrics and genetic algorithms to create scalable, self-verifying problems, avoiding manual curation effort and training-data contamination.
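
To make the "genetic algorithms plus a novelty metric" idea concrete, here is a toy sketch of the general score-mutate-keep loop; the novelty score, mutation operator, and seed problems are placeholders invented for illustration and are not InfoSynth's actual information-theoretic metrics.

```python
import random


def novelty_score(problem: str, pool: list[str]) -> float:
    """Toy stand-in for an information-theoretic novelty metric:
    reward problems whose token set overlaps little with the existing pool."""
    tokens = set(problem.split())
    if not pool:
        return 1.0
    overlaps = [len(tokens & set(p.split())) / max(len(tokens), 1) for p in pool]
    return 1.0 - max(overlaps)


def mutate(problem: str, vocabulary: list[str]) -> str:
    """Toy mutation operator: swap one word for a random vocabulary word."""
    words = problem.split()
    words[random.randrange(len(words))] = random.choice(vocabulary)
    return " ".join(words)


def evolve_benchmark(seeds: list[str], vocabulary: list[str], generations: int = 20) -> list[str]:
    """Greedy evolutionary loop: mutate candidates and keep those that add novelty."""
    pool = list(seeds)
    for _ in range(generations):
        candidate = mutate(random.choice(pool), vocabulary)
        if novelty_score(candidate, pool) > 0.5:
            pool.append(candidate)
    return pool


if __name__ == "__main__":
    seeds = ["reverse a linked list", "sort an array of integers"]
    vocab = ["graph", "tree", "string", "matrix", "queue", "heap"]
    print(evolve_benchmark(seeds, vocab))
```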

🔹 Publication Date: Published on Jan 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.00575
• PDF: https://arxiv.org/pdf/2601.00575
• Project Page: https://ishirgarg.github.io/infosynth_web/
• Github: https://github.com/ishirgarg/infosynth

==================================

For more data science resources:
https://t.me/DataScienceT

#LLM #AI #Benchmarking #GenerativeAI #DeepLearning