✨RiddleBench: A New Generative Reasoning Benchmark for LLMs
📝 Summary:
RiddleBench, a new benchmark of 1,737 puzzles, reveals fundamental weaknesses in state-of-the-art LLMs, including hallucination cascades and poor self-correction. Models achieve only about 60% accuracy, underscoring the need for more robust and reliable reasoning capabilities.
🔹 Publication Date: Published on Oct 28
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24932
• PDF: https://arxiv.org/pdf/2510.24932
✨ Datasets citing this paper:
• https://huggingface.co/datasets/ai4bharat/RiddleBench
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#LLMs #GenerativeAI #AIResearch #Benchmarks #NLP
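As a rough illustration of how an accuracy figure like the ~60% reported above might be computed, here is a minimal exact-match scoring sketch. The normalization rule and the toy answers are assumptions for demonstration only, not RiddleBench's official evaluation protocol.

```python
# Minimal exact-match accuracy scoring sketch for puzzle-style benchmarks.
# The normalization here (lowercase, strip whitespace and a trailing period)
# is an illustrative assumption, not RiddleBench's actual protocol.

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace/trailing period for lenient matching."""
    return answer.strip().lower().rstrip(".")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    assert len(predictions) == len(references) and predictions
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(predictions)

# Toy model outputs vs. gold answers (hypothetical, not from the dataset):
preds = ["A mirror", "42", "the letter E.", "a shadow"]
refs  = ["a mirror", "43", "The letter E", "an echo"]
print(f"accuracy = {accuracy(preds, refs):.2f}")  # 2 of 4 match -> 0.50
```

Real benchmark harnesses typically layer task-specific answer extraction on top of a matcher like this, since free-form model output rarely contains only the final answer.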
✨miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward
📝 Summary:
An analysis of miniF2F showed that AI systems reached only 36% accuracy, largely due to errors in the problems themselves. Correcting these errors produced miniF2F-v2, raising accuracy to 70%. High-quality benchmarks like miniF2F-v2 are crucial for evaluating progress in formal reasoning.
🔹 Publication Date: Published on Nov 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03108
• PDF: https://arxiv.org/pdf/2511.03108
• Github: https://github.com/roozbeh-yz/miniF2F_v2
✨ Datasets citing this paper:
• https://huggingface.co/datasets/roozbeh-yz/miniF2F_v2
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AI #FormalReasoning #Benchmarks #MachineLearning #Dataset
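Benchmarks in the miniF2F family pose competition-math problems as formal theorem statements to be proved in a proof assistant such as Lean. A toy illustration of that format (not an actual miniF2F problem) might look like:

```lean
-- A toy Lean 4 statement in the spirit of miniF2F-style problems
-- (invented for illustration, not drawn from the benchmark).
theorem toy_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An evaluated system receives only the statement and must produce a proof term or tactic script that the Lean kernel accepts, which is why statement errors of the kind the paper reports can silently cap achievable accuracy.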
✨Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
📝 Summary:
Current video-model benchmarks fail to assess Chain-of-Frames (CoF) reasoning, which is crucial for world simulators. Gen-ViRe is a new benchmark that decomposes CoF reasoning into cognitive subtasks, offering the first quantitative assessment. It reveals poor reasoning depth despite impressive visual quality.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13853
• PDF: https://arxiv.org/pdf/2511.13853
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AI #WorldSimulators #VisualReasoning #GenerativeAI #Benchmarks
✨Multimodal Evaluation of Russian-language Architectures
📝 Summary:
Mera Multi is the first open multimodal evaluation framework for Russian-language AI, addressing a lack of such benchmarks. It introduces 18 new instruction-based tasks across text, image, audio, and video, created with Russian cultural specificity and a leakage prevention methodology.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15552
• PDF: https://arxiv.org/pdf/2511.15552
• Project Page: https://mera.a-ai.ru/en/multi
• Github: https://github.com/MERA-Evaluation/MERA_MULTIMODAL/tree/main
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#MultimodalAI #RussianAI #AIEvaluation #Benchmarks #AIresearch
✨Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views
📝 Summary:
Iceberg is a new benchmark for vector similarity search (VSS) that evaluates methods from a task-centric view. It uncovers hidden performance degradation, re-ranks VSS algorithms by application-level metrics, and offers practical guidance to practitioners.
🔹 Publication Date: Published on Dec 15
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.12980
• PDF: https://arxiv.org/pdf/2512.12980
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#VectorSimilaritySearch #MachineLearning #DataScience #Benchmarks #Algorithms
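A common application-level metric in VSS evaluation is recall@k: how many of the true nearest neighbors an approximate index actually returns. The sketch below computes it against a brute-force ground truth; the "approximate" search is a stand-in that merely prunes candidates, purely to show how the metric works, and is not how Iceberg or any real index operates.

```python
# Sketch of recall@k, an application-level metric for vector similarity
# search: compare an approximate method's results to exact brute force.
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def exact_knn(query, vectors, k):
    """Indices of the k nearest vectors, found by brute-force scan."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return set(order[:k])

def subset_knn(query, vectors, candidate_ids, k):
    """Toy 'approximate' search: brute force over a pruned candidate set."""
    order = sorted(candidate_ids, key=lambda i: euclidean(query, vectors[i]))
    return set(order[:k])

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors the approximate method found."""
    return len(approx_ids & exact_ids) / k

vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (1.0, 1.0)]
query = (0.1, 0.1)
exact = exact_knn(query, vectors, k=3)            # {0, 1, 2}
approx = subset_knn(query, vectors, [0, 1, 3, 4], k=3)  # index "missed" vector 2
print(f"recall@3 = {recall_at_k(approx, exact, 3):.2f}")  # prints 0.67
```

A task-centric evaluation of the kind the paper advocates would go further, measuring how such recall losses propagate to downstream quality (e.g. retrieval-augmented answer accuracy) rather than stopping at recall itself.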