ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
RiddleBench: A New Generative Reasoning Benchmark for LLMs

📝 Summary:
RiddleBench, a new benchmark of 1,737 puzzles, reveals fundamental weaknesses in state-of-the-art LLMs, including hallucination cascades and poor self-correction. Models achieve only about 60% accuracy, underscoring the need for more robust and reliable reasoning capabilities.

🔹 Publication Date: Published on Oct 28

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24932
• PDF: https://arxiv.org/pdf/2510.24932

Datasets citing this paper:
https://huggingface.co/datasets/ai4bharat/RiddleBench
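Benchmarks like this are typically scored by exact-match accuracy over gold answers. A minimal sketch of such scoring; the record fields ("answer", "prediction") are illustrative assumptions, not the actual RiddleBench schema:

```python
def normalize(text: str) -> str:
    """Lowercase and strip whitespace so formatting differences don't count as errors."""
    return text.strip().lower()

def exact_match_accuracy(records: list[dict]) -> float:
    """Fraction of records where the model's prediction matches the gold answer."""
    if not records:
        return 0.0
    hits = sum(
        normalize(r["prediction"]) == normalize(r["answer"])
        for r in records
    )
    return hits / len(records)

# Toy usage with made-up riddles:
demo = [
    {"answer": "an echo", "prediction": "An echo"},
    {"answer": "a map", "prediction": "a globe"},
]
print(exact_match_accuracy(demo))  # 0.5
```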

==================================

For more data science resources:
https://t.me/DataScienceT

#LLMs #GenerativeAI #AIResearch #Benchmarks #NLP
miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

📝 Summary:
An analysis of miniF2F found that errors in the problem statements limited AI systems to 36% accuracy. Correcting these errors produced miniF2F-v2, raising accuracy to 70%. High-quality benchmarks like miniF2F-v2 are crucial for evaluating progress in formal reasoning.

🔹 Publication Date: Published on Nov 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03108
• PDF: https://arxiv.org/pdf/2511.03108
• Github: https://github.com/roozbeh-yz/miniF2F_v2

Datasets citing this paper:
https://huggingface.co/datasets/roozbeh-yz/miniF2F_v2
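miniF2F entries are formal mathematics statements in Lean that a system must prove. A toy illustration of the shape of such a statement, not taken from the benchmark itself:

```lean
-- Illustrative only: a trivially simple statement in the style of a
-- formal benchmark entry, together with its proof term.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```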

#AI #FormalReasoning #Benchmarks #MachineLearning #Dataset
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

📝 Summary:
Current video-model benchmarks fail to assess Chain-of-Frames (CoF) reasoning, which is crucial for world simulators. Gen-ViRe is a new benchmark that decomposes CoF reasoning into cognitive subtasks, offering the first quantitative assessment. It reveals poor reasoning depth despite impressive visual quality.

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13853
• PDF: https://arxiv.org/pdf/2511.13853

#AI #WorldSimulators #VisualReasoning #GenerativeAI #Benchmarks
Multimodal Evaluation of Russian-language Architectures

📝 Summary:
Mera Multi is the first open multimodal evaluation framework for Russian-language AI, addressing the lack of such benchmarks. It introduces 18 new instruction-based tasks across text, image, audio, and video, built with attention to Russian cultural specifics and a methodology for preventing benchmark leakage.

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15552
• PDF: https://arxiv.org/pdf/2511.15552
• Project Page: https://mera.a-ai.ru/en/multi
• Github: https://github.com/MERA-Evaluation/MERA_MULTIMODAL/tree/main

#MultimodalAI #RussianAI #AIEvaluation #Benchmarks #AIresearch
Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views

📝 Summary:
Iceberg is a new benchmark for vector similarity search (VSS) that evaluates methods from a task-centric view. It uncovers hidden performance degradation, re-ranks VSS algorithms by application-level metrics, and offers guidance to practitioners.
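A task-centric view means judging an index by what the application sees, not just query latency. A minimal sketch of one such check, recall@k of an approximate method against brute-force cosine search; the vectors and the "ANN returned" ids are toy assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def exact_top_k(query, vectors, k):
    """Indices of the k vectors most similar to the query (brute force)."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return set(order[:k])

def recall_at_k(approx_ids, query, vectors, k):
    """Fraction of the exact top-k that an approximate method returned."""
    truth = exact_top_k(query, vectors, k)
    return len(truth & set(approx_ids)) / k

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]
# Suppose a hypothetical ANN index returned ids 0 and 2 for k=2:
print(recall_at_k([0, 2], query, vectors, k=2))  # 0.5
```

Application-level metrics (e.g. downstream answer quality) extend this idea further: two indexes with the same recall can still rank the surviving neighbors differently, which is the kind of degradation a latency-only comparison hides.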

🔹 Publication Date: Published on Dec 15

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.12980
• PDF: https://arxiv.org/pdf/2512.12980

#VectorSimilaritySearch #MachineLearning #DataScience #Benchmarks #Algorithms