Daily AI Papers 📚
Summaries generated from HuggingFace's Daily Papers. All credits go to the researchers and HF. Summaries by Gemini and audio by OpenAI.

🌐 gabrielchua.me/daily-ai-papers

📦 http://github.com/gabrielchua/daily-ai-papers

ℹ️ https://huggingface.co/papers
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety


by Edify-Kd2024, yaozixin, YimingWang, ChrisJuan, yinghuihe

EmoAgent is a multi-agent AI framework for evaluating and mitigating mental health risks in human-AI interactions with character-based chatbots. The research aims to assess and safeguard these interactions for mental health safety, particularly for vulnerable users. EmoAgent pairs a simulated evaluation environment (EmoEval), built on clinically validated psychological assessment tools, with a real-time safeguard agent (EmoGuard) that monitors conversations and provides corrective feedback. Experiments show that emotionally engaging dialogues lead to mental state deterioration in vulnerable users in more than 34.4% of simulations, and that EmoGuard significantly reduces these deterioration rates. For AI practitioners, the implication is that emotionally engaging AI dialogues can harm vulnerable users, and that real-time monitoring with corrective feedback is crucial for safe human-AI interaction.
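
A toy sketch of the monitor-and-intervene pattern described above, with every component a hypothetical stand-in (the paper's EmoGuard is itself an LLM agent; this only illustrates the control flow):

```python
from typing import Callable, Iterable, Iterator, Optional

def safeguarded_chat(
    respond: Callable[[str, Optional[str]], str],  # chatbot: (user_msg, guidance) -> reply
    assess: Callable[[str, str], float],           # rater: (user_msg, reply) -> risk in [0, 1]
    advise: Callable[[float], str],                # safeguard: risk -> corrective guidance
    user_turns: Iterable[str],
    risk_threshold: float = 0.5,
) -> Iterator[str]:
    """Toy monitor-and-intervene loop in the spirit of EmoGuard: after
    each chatbot reply, a safeguard agent rates the user's apparent
    mental state and, when risk crosses a threshold, injects corrective
    guidance into the chatbot's next turn. All callables are
    hypothetical stand-ins for the paper's LLM-based agents."""
    guidance: Optional[str] = None
    for turn in user_turns:
        reply = respond(turn, guidance)
        risk = assess(turn, reply)
        guidance = advise(risk) if risk >= risk_threshold else None
        yield reply
```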

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
ReZero: Enhancing LLM search ability by trying one-more-time

by Thinh Le, alandao

ReZero introduces a reinforcement learning framework to enhance LLM search persistence within Retrieval-Augmented Generation (RAG) by rewarding query retries. The main objective is to improve LLM robustness in information retrieval by explicitly incentivizing the model to attempt subsequent searches if the initial one fails. The key methodology utilizes Group Relative Policy Optimization (GRPO) to fine-tune an LLM, incorporating a specific `reward_retry` function that rewards additional search attempts conditional on generating a correct final answer. The primary result showed the ReZero model achieved 46.88% peak accuracy on the evaluation dataset, nearly doubling the 25.00% peak accuracy of a baseline model trained without the retry incentive. For AI practitioners, this implies that designing RL rewards to explicitly encourage persistence can significantly improve RAG system performance, especially for tasks where initial information retrieval attempts are likely insufficient.
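
A toy sketch of the retry incentive: `reward_retry` is the paper's function name, but the body below and its shaping parameters are illustrative, assuming the reward only pays out when the rollout ends in a correct final answer:

```python
def reward_retry(num_searches: int, answer_correct: bool,
                 base: float = 0.1, cap: int = 4) -> float:
    """Pay a small bonus per search attempt beyond the first, but only
    when the final answer is correct, so the model cannot farm reward
    by retrying forever. `base` and `cap` are hypothetical values."""
    if not answer_correct:
        return 0.0  # no credit for persistence that never pays off
    retries = max(0, num_searches - 1)  # attempts beyond the first query
    return base * min(retries, cap)
```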

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting


by Yiyi Liao, BangBnag Yang, yuewenma, shengmiao, JaceyH919

Vivid4D enhances 4D reconstruction from monocular video by reformulating view augmentation as a video inpainting task integrating geometric and generative priors. The primary research objective is to improve the quality and completeness of 4D dynamic scene reconstruction from sparse monocular video inputs. Key methodology involves warping observed views to novel viewpoints using monocular depth priors, training a video diffusion model on unposed web videos with synthetic occlusion masks to inpaint missing regions, and employing an iterative view augmentation strategy with a robust reconstruction loss. Results demonstrate improved reconstruction quality, achieving an overall PSNR of 19.45 on the HyperNeRF dataset, outperforming baselines like 4D GS (18.24) and Shape of Motion (18.82). For AI practitioners, this work presents a practical method using video inpainting to generate richer supervision signals from monocular video, thereby enhancing the fidelity of 4D scene reconstructions for applications like VR/AR content creation.
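
A simplified sketch of the geometric warping step (pinhole model; the names and exact formulation are illustrative, not the paper's code): unproject each pixel with its monocular depth, transform the points into the novel camera, and reproject. Target pixels that no source point lands on form the occlusion mask the diffusion model then inpaints.

```python
import numpy as np

def warp_to_novel_view(depth: np.ndarray, K: np.ndarray,
                       T_src2tgt: np.ndarray):
    """Unproject an (H, W) depth map with intrinsics K, move the points
    with a 4x4 relative pose, and reproject into the target view.
    Returns target-view pixel coords and a validity mask (illustrative)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # back-project: X = D * K^-1 [u, v, 1]^T
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (T_src2tgt @ pts_h.T).T[:, :3]
    proj = (K @ pts_tgt.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    valid = pts_tgt[:, 2] > 0  # keep points in front of the target camera
    return uv.reshape(H, W, 2), valid.reshape(H, W)
```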

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?


by Zhaokai Wang, Andrew Zhao, Rui Lu, Zhiqi Chen, Yang Yue

This paper demonstrates that Reinforcement Learning with Verifiable Rewards (RLVR) primarily enhances sampling efficiency for existing reasoning paths within LLMs, rather than fundamentally expanding reasoning capacity beyond the base model. The study critically investigates if RLVR enables LLMs to acquire novel reasoning abilities exceeding their base models' intrinsic capabilities. Using the `pass@k` metric with large `k` values across math, coding, and visual reasoning benchmarks, alongside perplexity analysis and manual Chain-of-Thought checks, the researchers compared the reasoning boundaries of base and RL-trained models. Key findings reveal that while RL models excel at low `k` (pass@1), base models consistently match or surpass RL models at high `k` (e.g., base Minerva 32B outperformed its RL counterpart by ~9% pass@128), indicating RL primarily learns to sample pre-existing correct reasoning paths more efficiently, rather than discovering new ones. For AI practitioners, this implies current RLVR mainly optimizes known reasoning patterns rather than fostering new skills, suggesting that achieving breakthroughs in reasoning might require complementary methods like distillation or fundamentally different training paradigms that overcome RL's observed limitation in narrowing exploration.
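
For reference, this kind of evaluation typically uses the standard unbiased pass@k estimator (popularized by the Codex paper): given n sampled solutions of which c are correct, it computes the chance that a random subset of k contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations (c correct) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```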

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation


by Shikai Li, yanweifuture, Alex-snow, theFoxofSky, ewrfcas

Uni3C introduces a unified framework for precise 3D-enhanced camera and human motion control in video generation using foundational video diffusion models (VDMs). The objective is to enable joint, precise control over both camera trajectories and human motions in video generation, overcoming limitations of separate controls and reliance on jointly annotated data. Key methodologies include PCDController, a lightweight, plug-and-play module trained with a frozen VDM backbone using unprojected point clouds for camera control, and a global 3D world guidance system aligning scenic point clouds and SMPL-X characters for unified inference. Uni3C significantly improves joint control, achieving an Absolute Trajectory Error (ATE) of 0.251 on the unified benchmark, substantially outperforming the baseline RealisDance-DiT's ATE of 0.549 while maintaining visual quality. For AI practitioners, the PCDController offers a robust, parameter-efficient module for adding precise camera control to existing VDMs with minimal training overhead and without needing joint annotations, while the global alignment enables unified multi-modal control.
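
For context, the headline Absolute Trajectory Error metric is conventionally the RMSE of camera positions after rigidly aligning the estimated trajectory to ground truth; a minimal sketch with that alignment step assumed done:

```python
import numpy as np

def absolute_trajectory_error(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE between corresponding camera positions, shapes (N, 3).
    Assumes the estimated trajectory is already rigidly aligned to the
    ground truth (e.g., via Umeyama), as standard ATE toolkits do
    before computing residuals."""
    residuals = est - gt
    return float(np.sqrt((residuals ** 2).sum(axis=1).mean()))
```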

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
TTRL: Test-Time Reinforcement Learning

by Xuekai Zhu, Li Sheng, Shang Qu, Yuxin Zuo, iseesaw

This paper introduces Test-Time Reinforcement Learning (TTRL), a method for improving Large Language Models (LLMs) on reasoning tasks using unlabeled test data. The objective is to enable LLM self-evolution using Reinforcement Learning (RL) during inference without access to ground-truth labels, addressing the challenge of reward estimation in this setting. TTRL employs repeated sampling to generate multiple outputs, uses majority voting to estimate a consensus label, and computes rule-based rewards based on this estimate to drive RL training. Experiments show TTRL boosted Qwen-2.5-Math-7B pass@1 performance on AIME 2024 by approximately 159% using only unlabeled test data, and consistently surpassed the performance upper limit implied by the initial model's majority voting accuracy. For AI practitioners, TTRL demonstrates a method for adapting and improving LLMs on new tasks using unlabeled data alone, suggesting a potential pathway for continuous learning and reduced reliance on extensive labeled datasets for RL fine-tuning.
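
A minimal sketch of the reward construction the summary describes: sample many answers, take the majority as a pseudo-label, and reward agreement with it (the string normalization here is a naive placeholder; math answers would need real canonicalization):

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Estimate a pseudo-label by majority vote over sampled answers,
    then assign a rule-based reward of 1.0 to samples that match it."""
    normalized = [a.strip().lower() for a in answers]
    pseudo_label, _ = Counter(normalized).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in normalized]
```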

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Decoupled Global-Local Alignment for Improving Compositional Understanding


by Ziyong Feng, Jun Wang, haoranxu, Kaichengalex, xiaoxing2001

This paper introduces DeGLA, a framework enhancing vision-language models' compositional understanding while maintaining general capabilities by decoupling global self-distillation alignment from local contrastive alignment using LLM-generated hard negatives. The main objective is to overcome the limitation where improving compositional reasoning in models like CLIP often degrades their general performance due to catastrophic forgetting during fine-tuning. DeGLA utilizes self-distillation with an EMA teacher for global alignment and introduces Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses with ~2M LLM-generated negative captions for local alignment. Compared to the CE-CLIP baseline, DeGLA shows an average 3.5% improvement across VALSE, SugarCrepe, and ARO compositional benchmarks and a 13.0% average improvement across 11 zero-shot classification datasets. For AI practitioners, DeGLA offers a method to fine-tune vision-language models for improved nuanced understanding (e.g., attribute binding, relations) in multimodal tasks without significantly sacrificing their robust zero-shot transfer abilities.
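
The global branch hinges on an exponential-moving-average (EMA) teacher, and the update itself is tiny; a generic sketch, not the paper's code (the decay value is illustrative):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """One EMA step, t <- decay * t + (1 - decay) * s. The slowly moving
    teacher supplies self-distillation targets that anchor the student's
    general (global) alignment while the contrastive losses sharpen
    local compositional understanding."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```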

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation


by Leonidas Guibas, Mikaela Angelina Uy, Chanho Park, Jihyeon Je, Phillip Y. Lee

This paper introduces Abstract Perspective Change (APC), a framework enabling vision-language models (VLMs) to perform spatial reasoning from arbitrary viewpoints by simulating mental imagery. The main objective is to overcome the inherent egocentric bias in VLMs and equip them with robust allocentric reasoning capabilities necessary for understanding scenes from perspectives other than the camera's. APC utilizes vision foundation models (object detection, segmentation, orientation estimation) to build a coarse 3D scene abstraction, transforms this abstraction into the reference viewer's egocentric coordinate frame via coordinate transformation, and then prompts the VLM with this transformed representation (either numerically or visually). Experiments show APC significantly outperforms existing VLMs and reconstruction-based approaches, achieving 72.78% accuracy on the challenging 3DSRBench left/right spatial reasoning task using its visual prompt variant (APC-Vis), compared to significantly lower scores for baselines on real images. For AI practitioners, APC provides a concrete methodology to enhance VLM spatial intelligence for tasks requiring perspective shifts (like robotics or embodied AI) by effectively converting allocentric problems into egocentric ones that VLMs can readily solve.
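
The pivotal step the summary describes is an ordinary rigid change of coordinates into the reference viewer's frame; a minimal sketch under the usual convention that `viewer_rot` maps viewer axes to world axes (all names illustrative):

```python
import numpy as np

def to_viewer_frame(points_world: np.ndarray,
                    viewer_pos: np.ndarray,
                    viewer_rot: np.ndarray) -> np.ndarray:
    """Re-express world-frame object positions (N, 3) in a reference
    viewer's egocentric frame: x_ego = R^T (x_world - p). With the
    viewer at the origin, 'left/right of the viewer' reduces to the
    sign of one coordinate, which a VLM can answer egocentrically."""
    return (points_world - viewer_pos) @ viewer_rot
```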

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Towards Understanding Camera Motions in Any Video

by Jay Karhade, Daniel Jiang, Stephen624, syCen, zhiqiulin

This paper introduces CameraBench, a large-scale dataset and benchmark for understanding camera motion primitives in diverse internet videos. The main objective is to evaluate how well current Structure-from-Motion (SfM) and Video-Language Models (VLMs) understand a comprehensive taxonomy of camera motions and to improve this capability. Methodology involves creating a detailed taxonomy with cinematographers, collecting and annotating ~3,000 videos, conducting human studies, and benchmarking 20 diverse models (SfM/SLAM and VLMs) on tasks like classification, VQA, captioning, and retrieval. Primary results show classic SfM struggles with semantic/dynamic content, while VLMs struggle with precise geometry; the best baseline method (MegaSAM) achieves ~50% overall Average Precision (AP) on primitive classification, while fine-tuning a generative VLM (Qwen2.5-VL-7B) roughly doubles its own pre-fine-tuning performance, reaching 59.3% AP. AI practitioners can utilize CameraBench's dataset and taxonomy to fine-tune VLMs, substantially improving their ability to interpret both geometric and semantic camera movements for enhanced video understanding applications.
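
The reported numbers are Average Precision over the motion-primitive taxonomy; a generic sketch of that multi-label metric (not the benchmark's actual evaluation harness), using scikit-learn:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged AP over camera-motion primitive labels.
    y_true: (N, L) binary ground truth; y_score: (N, L) confidences."""
    return float(np.mean([
        average_precision_score(y_true[:, j], y_score[:, j])
        for j in range(y_true.shape[1])
    ]))
```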

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.
Clinical knowledge in LLMs does not translate to human interactions

by cynddl, sahimo, Chronoszoldyck11, HannahRoseKirk, ambean

Large language models (LLMs) demonstrate high clinical knowledge on benchmarks but fail to improve lay users' medical assessment accuracy in realistic interactive scenarios compared to controls. The study investigated whether providing laypeople with access to high-performing LLMs (GPT-4o, Llama 3, Command R+) improves their ability to identify appropriate medical dispositions and relevant conditions in simulated health scenarios. A randomized controlled trial (N=1298) assigned participants to receive assistance from one of three LLMs or a control group (using typical resources) to assess ten medical vignettes against physician-defined gold standards. While LLMs alone identified relevant conditions in over 90% of cases, participants using LLMs identified relevant conditions in less than 34.5% of cases, significantly underperforming the control group (47.0%, p<0.001), and showed no significant improvement in disposition accuracy. For AI practitioners, this study critically demonstrates that strong performance on static or simulated benchmarks does not predict real-world interactive utility; robust human-user testing focused on interaction dynamics is essential before deploying LLMs for public health applications.

Read more on arXiv or HuggingFace
Audio
Listen to the summary of the paper.