ml4se
Machine Learning for Software Engineering
CYCLE: Learning to Self-Refine the Code Generation

The authors propose CYCLE, a framework that teaches code LMs to self-refine based on their past faulty generations and the accompanying execution feedback. The framework is evaluated on three popular programming benchmarks: HumanEval, MBPP-Sanitized, and APPS. The results show that CYCLE is consistently effective at self-refinement, boosting code generation performance by up to 63.5% relative improvement.
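
A minimal sketch of the generate-execute-refine loop that CYCLE trains code LMs to perform (the prompt format and `generate_fn` are illustrative stand-ins, not the paper's actual interfaces):

```python
def run_tests(code: str, tests: str):
    """Execute candidate code with its tests; return (passed, feedback)."""
    try:
        exec(code + "\n" + tests, {})
        return True, ""
    except Exception as e:
        return False, repr(e)  # execution feedback for the next round

def self_refine(generate_fn, problem: str, tests: str, max_rounds: int = 3) -> str:
    code = generate_fn(problem)
    for _ in range(max_rounds):
        passed, feedback = run_tests(code, tests)
        if passed:
            return code
        # Re-prompt with the faulty attempt plus the execution feedback --
        # the signal CYCLE fine-tunes the model to exploit.
        code = generate_fn(f"{problem}\n# previous attempt:\n{code}\n# feedback:\n{feedback}")
    return code
```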
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

The authors compare large language models with smaller models under fixed budget constraints (i.e., FLOPs and wall-time). The models are used for execution-based code-generation tasks with access to unit tests.

The findings reveal that generating multiple outputs from a 13B model may lead to gains of up to 15% over a single generation from a 70B model across five tasks. This highlights the potential of using smaller models instead of larger ones.
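
A hedged sketch of the budget-reallocation recipe under the paper's setting (unit tests available at inference time): spend the fixed budget on many samples from a small model and keep one that passes the tests, rather than on a single sample from a large model. `generate_fn` and the assertion-style tests are assumptions for illustration:

```python
def best_of_n(generate_fn, problem: str, tests: str, n: int):
    """Sample n candidates from a smaller model; return the first that passes."""
    for _ in range(n):
        candidate = generate_fn(problem)  # e.g. one 13B-model sample
        try:
            exec(candidate + "\n" + tests, {})  # unit tests as assertions
            return candidate
        except Exception:
            continue  # failed candidate; spend more of the budget
    return None  # budget exhausted without a passing sample
```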
πŸ‘4😁1
Large Language Model Evaluation Via Multi AI Agents

Despite extensive efforts to examine LLMs from various perspectives, there is a lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs.

The authors introduce a multi-agent AI model that aims to assess and compare the performance of various LLMs. The model consists of 8 distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.

RQs:
- How do different LLMs compare in efficiency and accuracy when generating code from a common description within a multi-agent AI system?
- What criteria and methods does the verification agent employ to evaluate the performance of various language models in code generation tasks?

The results:
GPT-3.5 Turbo outperformed other models, delivering correct and efficient code solutions in 7 out of 10 tasks. Other models like GPT-4 and GPT-4 Turbo showed varying levels of performance, but none matched the consistency of GPT-3.5 Turbo.
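
An illustrative sketch of the setup as described: one code-retrieval agent per model, all given the same description, plus a verification agent that runs the tests. The model list, `query_fn`, and assertion-style tests are assumptions, not the paper's implementation:

```python
MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]  # the paper uses 8 agents in total

def evaluate(description: str, tests: str, query_fn) -> dict:
    # Code-retrieval agents: each queries its own LLM with the shared description.
    solutions = {m: query_fn(m, description) for m in MODELS}
    # Verification agent: runs each solution against the same unit tests.
    results = {}
    for m, code in solutions.items():
        try:
            exec(code + "\n" + tests, {})
            results[m] = True
        except Exception:
            results[m] = False
    return results
```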
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

Despite the promising performance, LLMs are prone to generating hallucinations: they might produce outputs that deviate from users’ intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications.

The study established a taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories of hallucinations based on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, the authors systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness.

Based on the results, the authors proposed HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.

RQs:
- Distribution of hallucinations
- Correlation between hallucinations and functional correctness
Self-Organized Agents (SoA): A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization

In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. Evaluation on the HumanEval benchmark demonstrates superior performance compared to Reflexion, a state-of-the-art single-agent system: SoA achieves a 5% improvement in Pass@1 accuracy.
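
A rough sketch of this division of labor, assuming a top-level agent that decomposes the problem and child agents that each implement one part in isolation (all names are illustrative, not SoA's actual interfaces):

```python
def solve(decompose_fn, implement_fn, problem: str) -> str:
    stubs = decompose_fn(problem)  # e.g. function signatures with docstrings
    # Each child agent works independently, seeing only its own stub --
    # no single agent ever holds the whole codebase in context.
    parts = [implement_fn(stub) for stub in stubs]
    return "\n\n".join(parts)  # assemble the overall codebase
```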
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

A new paper by researchers at Google claims to give LLMs the ability to work with text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their context window while keeping memory and compute requirements constant.

A 1B model fine-tuned on passkey instances with sequence lengths of up to 5K solved the 1M-token passkey retrieval problem.
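
A hedged NumPy sketch of the core mechanism, per head and per segment, following the paper's linear-attention formulation (the ELU+1 kernel and the memory update match the paper; everything else, including names, is simplified):

```python
import numpy as np

def elu1(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # sigma(x) = ELU(x) + 1

def infini_attention_segment(Q, K, V, M, z, A_local, beta):
    """Q, K, V: (seg_len, d). M: (d, d) compressive memory, z: (d,) normalizer
    (initialize z to a small epsilon to avoid division by zero). A_local is the
    segment's ordinary dot-product attention output; beta is a learned scalar."""
    sQ, sK = elu1(Q), elu1(K)
    A_mem = (sQ @ M) / (sQ @ z)[:, None]  # retrieve from compressive memory
    M = M + sK.T @ V                      # constant-size memory update
    z = z + sK.sum(axis=0)
    gate = 1.0 / (1.0 + np.exp(-beta))    # learned gate blends long-term and local
    return gate * A_mem + (1.0 - gate) * A_local, M, z
```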
CodeGemma: Open Code Models Based on Gemma

CodeGemma is a collection of open models specialized for coding applications, built on top of Gemma, an openly available family of language models.
WizardLM-2

WizardLM-2 is a new generation of WizardLM models announced by Microsoft AI. The models have improved performance on complex chat, multilingual, reasoning, and agent tasks. The new family includes three models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.

- WizardLM-2 8x22B is the most advanced model; it demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
- WizardLM-2 70B reaches top-tier reasoning capabilities and is the first choice among models of its size.
- WizardLM-2 7B is the fastest and achieves performance comparable to that of leading open-source models that are 10x larger.
πŸ‘1
NExT: Teaching Large Language Models to Reason about Code Execution

Understanding and reasoning about program execution is a basic skill for human developers. For instance, to debug and correct code, a developer can mentally simulate code execution in natural language. However, LLMs are commonly trained on the textual structure of programs, which can lead to a lack of semantic understanding of how programs execute at run-time. In the paper, the authors present NExT, a self-training method that fine-tunes LLMs to reason about program execution given its traces. PaLM 2-L trained with NExT yields high-quality natural language rationales and achieves stronger success rates on two program repair tasks.
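
A minimal sketch of the kind of signal involved: collecting a line-level execution trace of a buggy function that can then be inlined into a repair prompt. The trace collection below is a simplification, not the paper's exact trace format:

```python
import sys

def trace_lines(fn, *args):
    """Record (line number, local variables) at each executed line of fn(*args)."""
    events = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    except Exception as e:
        events.append(("exception", repr(e)))  # crashes become part of the trace
    finally:
        sys.settrace(None)
    return events  # inlined into the prompt so the model can reason over execution
```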
LoRA+: Efficient Low Rank Adaptation of Large Models

LoRA fine-tuning as it is currently used in practice is not efficient. The authors propose LoRA+, which resolves this issue by setting different learning rates for the LoRA adapter matrices. The analysis is supported by empirical results confirming the benefits of LoRA+ for both training speed and performance.
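
In practice the change is tiny: give the adapter's B matrices a learning rate lambda times that of the A matrices (the paper suggests lambda around 16 as a default). A hedged PyTorch sketch; the "lora_A"/"lora_B" name matching follows common PEFT layouts and is an assumption, not the paper's code:

```python
import torch

def loraplus_optimizer(model, lr=2e-4, lr_ratio=16.0):
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # only the adapter weights are trainable
        (b_params if "lora_B" in name else a_params).append(p)
    return torch.optim.AdamW([
        {"params": a_params, "lr": lr},             # A matrices: base lr
        {"params": b_params, "lr": lr * lr_ratio},  # B matrices: lambda * lr
    ])
```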
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

The very first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline.
- weights: https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Prometheus 2 (8x7B) is an open-source evaluator language model. Compared to Prometheus 1 (13B), Prometheus 2 (8x7B) shows improved evaluation performance and also supports assessment in the pairwise ranking (relative grading) format. It scores 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.

Prometheus 2 (7B) is a lighter version of the Prometheus 2 (8x7B) model with reasonable performance (outperforming Llama-2-70B and on par with Mixtral-8x7B). It achieves at least 80% of the evaluation performance of Prometheus 2 (8x7B).

GitHub: https://github.com/prometheus-eval/prometheus-eval
Better & Faster Large Language Models via Multi-token Prediction

LLMs are trained with a next-token prediction loss. The authors propose multi-token prediction as an improvement over next-token prediction when training language models for generative or reasoning tasks. The experiments (up to 7B parameters and 1T tokens) show that the approach is increasingly useful at larger model sizes and, in particular, yields strong improvements on code tasks.
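
A hedged sketch of the objective: a shared trunk feeds n output heads, with head i predicting the token i+1 steps ahead, and the per-head losses averaged. The single-linear heads and slicing convention are simplifications, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def multi_token_loss(trunk, heads, tokens):
    """tokens: (batch, seq) token ids; trunk maps ids to hidden states;
    heads: list of n vocab projections, head i predicting i+1 steps ahead."""
    n = len(heads)
    h = trunk(tokens[:, :-n])  # shared hidden states, shape (batch, seq-n, d)
    loss = 0.0
    for i, head in enumerate(heads):
        logits = head(h)  # (batch, seq-n, vocab)
        target = tokens[:, i + 1 : tokens.size(1) - n + i + 1]  # i+1 steps ahead
        loss = loss + F.cross_entropy(logits.transpose(1, 2), target)
    return loss / n
```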
πŸ‘2
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as PPO.

However, in academic benchmarks, the SotA results are often achieved via reward-free methods, such as DPO.

Is DPO truly superior to PPO?

Through theoretical and experimental analysis, the authors explore the limitations of DPO and find that it is sensitive to the distribution shift between the base model outputs and the preference data. DPO also fails to improve performance on challenging tasks such as code generation. PPO, by contrast, demonstrates robust effectiveness across diverse tasks and achieves state-of-the-art results in challenging code competition tasks.
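
For reference, a minimal sketch of the reward-free DPO objective being compared against PPO: a single classification-style loss over (chosen, rejected) preference pairs, with no reward model or rollouts (per-sequence log-probabilities are assumed to be precomputed):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All inputs: per-sequence log-probs under the policy / frozen reference."""
    chosen_reward = beta * (logp_chosen - ref_chosen)        # implicit reward
    rejected_reward = beta * (logp_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```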