ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

📝 Summary:
MM-CRITIC is a new benchmark evaluating Large Multimodal Models' critique abilities across various dimensions and tasks. It uses expert-informed ground answers and GPT-4o for reliable scoring. The benchmark provides a comprehensive assessment of leading LMMs' critique capabilities.
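
A minimal sketch of the judge-scoring step described above, assuming an OpenAI-style client and an illustrative 1-10 rubric (the prompt wording, scale, and function name are assumptions, not the paper's exact protocol):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_critique(question: str, candidate_critique: str, reference_answer: str) -> int:
    """Ask GPT-4o to grade a model's critique against an expert-informed reference."""
    prompt = (
        f"Question: {question}\n\n"
        f"Expert-informed reference: {reference_answer}\n\n"
        f"Candidate critique: {candidate_critique}\n\n"
        "Score how well the candidate critique matches the reference in correctness "
        "and completeness, from 1 (poor) to 10 (excellent). Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```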

🔹 Publication Date: Published on Nov 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.09067
• PDF: https://arxiv.org/pdf/2511.09067

==================================

For more data science resources:
https://t.me/DataScienceT

#LMMs #MultimodalAI #AIEvaluation #Benchmarking #AIResearch
Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

📝 Summary:
This paper introduces the RFxG taxonomy to categorize saliency map explanations by reference-frame and granularity. It proposes novel faithfulness metrics to improve evaluation, aiming to align explanations with diverse user intent and human understanding.
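
For intuition on what a faithfulness metric measures, here is a generic deletion-style check, a common baseline rather than the RFxG metrics proposed in the paper; `predict` and the HxWxC image layout are placeholder assumptions:

```python
import numpy as np

def deletion_curve(image: np.ndarray, saliency: np.ndarray, predict, steps: int = 10) -> float:
    """Zero out the most salient pixels first and track the model's class
    probability; a faster drop (smaller area under the curve) suggests the
    saliency map is more faithful to the model's decision."""
    h, w, c = image.shape
    order = np.argsort(saliency.ravel())[::-1]   # most salient pixels first
    flat = image.reshape(h * w, c).copy()
    probs = [predict(flat.reshape(h, w, c))]     # baseline class probability
    chunk = max(len(order) // steps, 1)
    for i in range(steps):
        flat[order[i * chunk:(i + 1) * chunk]] = 0.0
        probs.append(predict(flat.reshape(h, w, c)))
    return float(np.trapz(probs, dx=1.0 / steps))
```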

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13081
• PDF: https://arxiv.org/pdf/2511.13081

==================================

For more data science resources:
https://t.me/DataScienceT

#ExplainableAI #SaliencyMaps #CognitiveScience #AIEvaluation #AIResearch
Computer-Use Agents as Judges for Generative User Interface

📝 Summary:
This paper introduces a framework where Computer-Use Agents (CUA) act as judges for coding language models (Coder) to automatically design GUIs. The goal is to optimize interfaces for CUA efficiency and task solvability, rather than human aesthetics, using a new benchmark called AUI-Gym.
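
A hypothetical sketch of that judge loop; the two callables stand in for a coder model and a computer-use agent, and none of the names below are the AUI-Gym API:

```python
from typing import Callable, Sequence

def judge_interface(generate_gui: Callable[[str], str],
                    run_agent_task: Callable[[str, str], bool],
                    spec: str,
                    tasks: Sequence[str]) -> float:
    """Score a generated GUI by the fraction of tasks a computer-use agent completes on it."""
    gui_code = generate_gui(spec)                     # e.g. HTML/JS emitted by a coder LLM
    successes = sum(run_agent_task(gui_code, task) for task in tasks)
    return successes / max(len(tasks), 1)
```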

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15567
• PDF: https://arxiv.org/pdf/2511.15567
• Project Page: https://showlab.github.io/AUI/
• Github: https://github.com/showlab/AUI/

==================================

For more data science resources:
https://t.me/DataScienceT

#AIAgents #GUIDesign #GenerativeAI #AIEvaluation #LanguageModels
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

📝 Summary:
ReVeL converts multiple-choice questions to verifiable open-form questions to address unreliable MCQA metrics and answer guessing. This framework improves data efficiency and robustness for multimodal language models, revealing significant score inflation in MCQA benchmarks.
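
An illustrative sketch of the MCQ-to-open-form idea, assuming a simple "A) ... B) ..." option layout and normalized string matching for verification (ReVeL's actual rewriting and verification are more sophisticated):

```python
import re

def to_open_form(question: str) -> str:
    """Strip the trailing answer options so the model cannot guess among them."""
    return re.split(r"\n\s*A[).]", question, maxsplit=1)[0].strip()

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def verify(free_text_answer: str, gold_answer: str) -> bool:
    """Accept the free-form answer if the gold string appears after normalization."""
    return normalize(gold_answer) in normalize(free_text_answer)
```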

🔹 Publication Date: Published on Nov 21

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.17405
• PDF: https://arxiv.org/pdf/2511.17405
• Github: https://flageval-baai.github.io/ReVeL/

==================================

For more data science resources:
https://t.me/DataScienceT

#OpenQA #VisionLanguage #LanguageModels #AIEvaluation #MachineLearning
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

📝 Summary:
Benchmarking LLMs on subjective tasks like emotional intelligence is challenging. The Language Model Council (LMC) uses a democratic process with 20 LLMs to formulate, administer, and evaluate tests. This yields more robust, less biased rankings that align better with human leaderboards.
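
The consensus step can be pictured as a simple Borda-count aggregation over per-judge rankings; this is only a voting sketch, while the council's full protocol also covers test formulation, administration, and judging:

```python
from collections import defaultdict
from typing import Dict, List

def borda_consensus(rankings: Dict[str, List[str]]) -> List[str]:
    """`rankings` maps each judge LLM to its ordered list of candidates (best first)."""
    scores = defaultdict(int)
    for judge, order in rankings.items():
        n = len(order)
        for position, candidate in enumerate(order):
            scores[candidate] += n - position      # best rank earns the most points
    return sorted(scores, key=scores.get, reverse=True)

# Example: three council members ranking three models
print(borda_consensus({
    "judge_a": ["model_x", "model_y", "model_z"],
    "judge_b": ["model_y", "model_x", "model_z"],
    "judge_c": ["model_x", "model_z", "model_y"],
}))  # -> ['model_x', 'model_y', 'model_z']
```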

🔹 Publication Date: Published on Jun 12, 2024

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2406.08598
• PDF: https://arxiv.org/pdf/2406.08598
• Github: https://github.com/llm-council/llm-council

Datasets citing this paper:
https://huggingface.co/datasets/llm-council/emotional_application

Spaces citing this paper:
https://huggingface.co/spaces/llm-council/llm-council
https://huggingface.co/spaces/llm-council/sandbox

==================================

For more data science resources:
https://t.me/DataScienceT

#LLM #Benchmarking #AIEvaluation #FoundationModels #ConsensusAI
Multimodal Evaluation of Russian-language Architectures

📝 Summary:
Mera Multi is the first open multimodal evaluation framework for Russian-language AI, addressing a lack of such benchmarks. It introduces 18 new instruction-based tasks across text, image, audio, and video, created with Russian cultural specificity and a leakage prevention methodology.

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15552
• PDF: https://arxiv.org/pdf/2511.15552
• Project Page: https://mera.a-ai.ru/en/multi
• Github: https://github.com/MERA-Evaluation/MERA_MULTIMODAL/tree/main

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #RussianAI #AIEvaluation #Benchmarks #AIResearch
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

📝 Summary:
Multi-Crit evaluates multimodal models as judges on following diverse criteria, using novel metrics. Findings reveal that current models struggle to adhere consistently and flexibly to pluralistic criteria. This highlights capability gaps and lays a foundation for building reliable AI evaluators.
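
A toy sketch of a per-criterion adherence measurement in this spirit; the `judge` callable and data layout are assumptions, not the benchmark's interface or metrics:

```python
from typing import Callable, Dict, List, Tuple

def criteria_following_rate(examples: List[Tuple[str, str, str, Dict[str, str]]],
                            judge: Callable[[str, str, str, str], str]) -> Dict[str, float]:
    """Each example is (prompt, response_a, response_b, {criterion: expert_choice}),
    where choices are "A" or "B". Returns per-criterion agreement with the experts."""
    hits: Dict[str, List[int]] = {}
    for prompt, a, b, expert in examples:
        for criterion, gold_choice in expert.items():
            verdict = judge(prompt, a, b, criterion)   # judge compares A vs B under one criterion
            hits.setdefault(criterion, []).append(int(verdict == gold_choice))
    return {c: sum(v) / len(v) for c, v in hits.items()}
```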

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21662
• PDF: https://arxiv.org/pdf/2511.21662
• Project Page: https://multi-crit.github.io/
• Github: https://multi-crit.github.io/

==================================

For more data science resources:
https://t.me/DataScienceT

#MultimodalAI #AIEvaluation #BenchmarkingAI #AIJudges #MachineLearning
CaptionQA: Is Your Caption as Useful as the Image Itself?

📝 Summary:
CaptionQA assesses whether AI-generated captions can adequately substitute for images in downstream tasks. The benchmark uses over 33,000 visual questions across 4 domains. It reveals large utility gaps: MLLMs perform up to 32% worse with captions than with images.
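
The core comparison can be sketched as answering each question twice, once from the image and once from the caption alone, then reporting the accuracy gap; the two answer callables stand in for an MLLM and are not the benchmark's code:

```python
from typing import Callable, Iterable, Tuple

def caption_utility_gap(items: Iterable[Tuple[str, str, str, str]],
                        answer_from_image: Callable[[str, str], str],
                        answer_from_caption: Callable[[str, str], str]) -> float:
    """Each item is (image_path, caption, question, gold). Returns acc(image) - acc(caption)."""
    img_correct = cap_correct = total = 0
    for image_path, caption, question, gold in items:
        img_correct += answer_from_image(image_path, question).strip() == gold
        cap_correct += answer_from_caption(caption, question).strip() == gold
        total += 1
    return (img_correct - cap_correct) / max(total, 1)
```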

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21025
• PDF: https://arxiv.org/pdf/2511.21025
• Github: https://github.com/bronyayang/CaptionQA

Datasets citing this paper:
https://huggingface.co/datasets/Borise/CaptionQA

==================================

For more data science resources:
https://t.me/DataScienceT

#AICaptions #MultimodalAI #ComputerVision #AIEvaluation #NLP
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

📝 Summary:
CJE improves LLM-as-judge evaluation by fixing statistical issues like uncalibrated scores and poor confidence intervals. It achieves 99% ranking accuracy at 14x lower cost by calibrating a cheaper judge with 5% oracle labels.
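
A minimal sketch of the calibration idea on toy data: fit a monotone map from cheap-judge scores to oracle labels on a small labelled slice (~5%), then apply it everywhere. scikit-learn's IsotonicRegression stands in here; the paper's estimator and confidence intervals go further:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
judge_scores = rng.uniform(0, 1, 2000)                                     # cheap judge's raw scores
oracle_labels = (judge_scores ** 2 + rng.normal(0, 0.1, 2000)).clip(0, 1)  # toy "oracle" target

idx = rng.choice(2000, size=100, replace=False)        # ~5% oracle-labelled subset
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores[idx], oracle_labels[idx])

calibrated = calibrator.predict(judge_scores)          # calibrated scores for the full set
print(calibrated[:5])
```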

🔹 Publication Date: Published on Dec 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11150
• PDF: https://arxiv.org/pdf/2512.11150
• Project Page: https://www.cimolabs.com/cje
• Github: https://github.com/cimo-labs/cje

==================================

For more data science resources:
https://t.me/DataScienceT

#LLMs #AIEvaluation #MachineLearning #DataScience #NLP
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

📝 Summary:
T2AV-Compass introduces a unified benchmark for text-to-audio-video generation evaluation. It features 500 diverse prompts and a dual-level framework. Evaluations reveal current T2AV models struggle significantly with realism and cross-modal consistency.
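
One ingredient of cross-modal consistency can be sketched as averaged pairwise cosine similarity between prompt, audio, and video embeddings; the embeddings would come from CLIP/CLAP-style encoders outside this snippet, and T2AV-Compass's dual-level framework goes well beyond this:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def consistency_score(text_emb: np.ndarray, audio_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Average pairwise alignment between the prompt and the two generated modalities."""
    return float(np.mean([cosine(text_emb, audio_emb),
                          cosine(text_emb, video_emb),
                          cosine(audio_emb, video_emb)]))
```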

🔹 Publication Date: Published on Dec 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21094
• PDF: https://arxiv.org/pdf/2512.21094
• Project Page: https://nju-link.github.io/T2AV-Compass/
• Github: https://github.com/NJU-LINK/T2AV-Compass/

==================================

For more data science resources:
https://t.me/DataScienceT

#TextToAudioVideo #MultimodalAI #AIEvaluation #GenerativeAI #AIResearch
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

📝 Summary:
SciEvalKit is an open-source toolkit for evaluating AI models in science. It assesses scientific intelligence across diverse domains and competencies using expert-grade benchmarks and a flexible pipeline. This provides a standardized platform for scientific AI evaluation.
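
A hypothetical sketch of what a registry-style, standardized evaluation pipeline looks like; the benchmark names, items, and interface below are illustrative, not SciEvalKit's actual API:

```python
from typing import Callable, Dict, List, Tuple

# Toy registry of expert-style QA benchmarks (illustrative entries only)
BENCHMARKS: Dict[str, List[Tuple[str, str]]] = {
    "chem_qa": [("What is the approximate molar mass of water in g/mol?", "18")],
    "physics_qa": [("How many joules are in one kilowatt-hour?", "3600000")],
}

def evaluate(model: Callable[[str], str]) -> Dict[str, float]:
    """Run a model callable over every registered benchmark and report accuracy."""
    results = {}
    for name, items in BENCHMARKS.items():
        correct = sum(gold in model(question) for question, gold in items)
        results[name] = correct / len(items)
    return results
```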

🔹 Publication Date: Published on Dec 26, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22334
• PDF: https://arxiv.org/pdf/2512.22334

==================================

For more data science resources:
https://t.me/DataScienceT

#AIEvaluation #ScientificAI #OpenSource #AIBenchmarks #AIResearch