ml4se
Machine Learning for Software Engineering
DevBench: A Comprehensive Benchmark for Software Development

DevBench is a benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.

The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.

github
πŸ‘2
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

LongRoPE is a method that extends the context length of LLMs to 2048k tokens while maintaining their capabilities within the original, shorter context window.

Code will be available at https://github.com/microsoft/
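
Until then, here is a minimal sketch of the general idea behind RoPE-rescaling methods: positions are compressed so that a longer sequence maps into the frequency range the model saw during pre-training. Note that LongRoPE itself searches for non-uniform, per-dimension rescale factors; the uniform factor and the helper below are simplifications, not the paper's method.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Rotary-embedding angles with positions rescaled by `scale`.
    # scale < 1 compresses positions into the range seen in training.
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)  # (dim/2,)
    return np.outer(positions * scale, inv_freq)           # (seq, dim/2)

# Pre-trained window: 4k positions; target: 2048k. A uniform rescale
# squeezes all target positions into the trained range (LongRoPE would
# instead search for a non-uniform per-dimension scaling).
orig_window, target = 4096, 2048 * 1024
sample_positions = np.array([0, 4095, 1_000_000, target - 1])
angles = rope_angles(sample_positions, dim=128, scale=orig_window / target)
```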
πŸ‘4
Open Release of Grok-1

xAI is releasing the base model weights and network architecture of Grok-1. Grok-1 is a 314-billion-parameter Mixture-of-Experts model trained from scratch by xAI. The released checkpoint is the raw base model from the Grok-1 pre-training phase, which concluded in October 2023. This means that the model is not fine-tuned for any specific application, such as dialogue.

The weights and the architecture are released under the Apache 2.0 license.

JAX example code for loading and running the Grok-1 open-weights model: https://github.com/xai-org/grok-1
πŸ‘1
LLM4Decompile: Decompiling Binary Code with Large Language Models

The authors released open-access decompilation LLMs ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. Experiments indicate that LLM4Decompile can accurately decompile 21% of the assembly code, a 50% improvement over GPT-4.

Code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
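
A minimal sketch of querying such a model with Hugging Face transformers. The checkpoint id and the prompt format below are placeholders, not the released interface; check the repository above for the actual model names and the exact prompt the models were trained with.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "llm4decompile-1b"  # placeholder id; see the repo for real ones
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

asm = "...disassembled function goes here..."
# Assumed prompt shape: assembly in, C source out.
prompt = f"# This is the assembly code:\n{asm}\n# What is the source code?\n"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the recovered C source)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```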
πŸ”₯2πŸ‘1
Let's create a Tree-sitter grammar

- How to use an external scanner
- Using Tree-sitter's built-in conflict resolution
- Syntax highlighting with language injection
- Use the grammar from Neovim for syntax highlighting and textobjects
- Embed the grammar into this blog for syntax highlighting
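
As a companion sketch, here is one way to consume a compiled grammar from Python with py-tree-sitter. This uses the `build_library` API of py-tree-sitter ≤ 0.21 (newer releases ship per-language wheels instead), and the grammar path and language name are placeholders.

```python
from tree_sitter import Language, Parser

# Compile the grammar repo into a shared library (py-tree-sitter <= 0.21).
# "path/to/tree-sitter-mylang" and "mylang" are placeholders.
Language.build_library("build/langs.so", ["path/to/tree-sitter-mylang"])
MYLANG = Language("build/langs.so", "mylang")

parser = Parser()
parser.set_language(MYLANG)
tree = parser.parse(b"some source text")
print(tree.root_node.sexp())  # S-expression view of the parse tree
```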
πŸ‘1
HiRoPE: Length Extrapolation for Code Models

The authors propose HiRoPE, a training-free solution to the context length limitation of LLMs for long code modeling: the hierarchical structure of source code is integrated into the positional encoding. Experiments demonstrate that HiRoPE achieves stable improvements on diverse code-related tasks.
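
The post does not spell out the paper's exact formulation; as a rough sketch of the general idea, one can split the rotary dimensions across hierarchy levels, so that some rotate with the token index within its enclosing function and others with the function index. The even split and the inputs below are illustrative assumptions, not the paper's schedule.

```python
import numpy as np

def hirope_angles(token_pos, func_pos, dim, base=10000.0):
    # Toy hierarchical RoPE: the first half of the rotary dimensions
    # rotates with the token index *within* its function, the second
    # half with the function index. Illustrative split, not the paper's.
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    half = len(inv_freq) // 2
    low = np.outer(token_pos, inv_freq[:half])   # fine-grained level
    high = np.outer(func_pos, inv_freq[half:])   # coarse-grained level
    return np.concatenate([low, high], axis=-1)  # (seq, dim/2)

# Token indices restart inside each function; the function index is coarse.
token_pos = np.array([0, 1, 2, 3, 0, 1])
func_pos = np.array([0, 0, 0, 0, 1, 1])
angles = hirope_angles(token_pos, func_pos, dim=64)
```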
πŸ‘4
CYCLE: Learning to Self-Refine the Code Generation

The authors propose the CYCLE framework, which teaches code LMs to self-refine based on their past faulty generations and the corresponding execution feedback. The framework is evaluated on three popular programming benchmarks: HumanEval, MBPP-Sanitized, and APPS. The results show that CYCLE is effective at self-refinement, consistently boosting code generation performance by up to 63.5% relative improvement.
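
The gist is a generate-execute-refine loop; below is a minimal sketch in which `generate` stands in for any code LM and the prompt format is an invented placeholder, not CYCLE's actual interface.

```python
import subprocess
import sys
import tempfile

def run_tests(code: str, tests: str) -> str:
    # Run candidate code against unit tests; return "" on success,
    # otherwise the captured error output.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return "" if proc.returncode == 0 else proc.stderr

def self_refine(generate, spec: str, tests: str, rounds: int = 3) -> str:
    # generate(prompt) -> code is a placeholder for any code LM.
    code = generate(spec)
    for _ in range(rounds):
        feedback = run_tests(code, tests)
        if not feedback:  # all tests pass, stop refining
            break
        # Condition the next attempt on the faulty code and the feedback
        code = generate(f"{spec}\n# Faulty attempt:\n{code}\n# Feedback:\n{feedback}")
    return code
```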
😁2πŸ‘1
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

The authors compare large language models with smaller models under fixed budget constraints (i.e., FLOPs and wall-time). The models are used for execution-based code-generation tasks with access to unit tests.

The findings reveal that generating multiple outputs from a 13B model can yield gains of up to 15% over a single generation from a 70B model across five tasks. This highlights the potential of using smaller models instead of larger ones.
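
In other words, the same budget buys either one sample from the large model or n unit-test-filtered samples from the small one. A minimal sketch of the small-model side, with all callables as placeholders:

```python
def best_of_n(generate, passes, spec, tests, n):
    # Sample up to n candidates from the smaller model and return the
    # first one that passes the unit tests. generate(spec) -> code and
    # passes(code, tests) -> bool are placeholder callables.
    for _ in range(n):
        code = generate(spec)
        if passes(code, tests):  # execution-based filtering
            return code
    return None  # budget exhausted without a passing candidate

# Pick n so that n small-model calls cost roughly as much (in FLOPs or
# wall-time) as a single large-model call.
```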
πŸ‘4😁1
Large Language Model Evaluation Via Multi AI Agents

Despite extensive efforts to examine LLMs from various perspectives, there is a lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs.

The authors introduce a multi-agent AI model that aims to assess and compare the performance of various LLMs. The model consists of 8 distinct AI agents, each responsible for retrieving code for a common description from a different advanced language model.

RQs:
- How does a multi-agent AI system compare in terms of efficiency and accuracy in generating code from a common description?
- What criteria and methods does the verification agent employ to evaluate the performance of various language models in code generation tasks?

The results:
GPT-3.5 Turbo outperformed other models, delivering correct and efficient code solutions in 7 out of 10 tasks. Other models like GPT-4 and GPT-4 Turbo showed varying levels of performance, but none matched the consistency of GPT-3.5 Turbo.
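
As described, the architecture reduces to one retrieval agent per backend model plus a verification agent. A minimal sketch, where all names are placeholders rather than the paper's implementation:

```python
def evaluate_models(description, backends, tests, verify):
    # Each agent queries one backend LLM with the shared description;
    # the verification agent scores every candidate against the tests.
    # backends maps model name -> callable; verify(code, tests) -> score.
    candidates = {name: ask(description) for name, ask in backends.items()}
    return {name: verify(code, tests) for name, code in candidates.items()}
```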
❀1
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

Despite their promising performance, LLMs are prone to hallucinations: they might produce outputs that deviate from the user's intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications.

The study establishes a taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, the authors systematically analyze the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness.

Based on the results, the authors propose HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.

RQs:
- What is the distribution of hallucinations in LLM-generated code?
- How do hallucinations correlate with functional correctness?
😱1
Self-Organized Agents (SoA): A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization

In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. The evaluation on the HumanEval benchmark demonstrates superior performance compared to Reflexion, a state-of-the-art single-agent system: SoA achieves a 5% improvement in Pass@1 accuracy.
🔥4
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

A new paper by researchers at Google claims to give LLMs the ability to work with text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their context window while keeping memory and compute requirements constant.

A 1B model fine-tuned on passkey instances of up to 5K sequence length solved the passkey retrieval problem at 1M tokens.
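
A rough sketch of the mechanism: each segment is processed with ordinary local attention, while a linear-attention-style compressive memory carries information across segments at constant size. The fixed gate, the single head, and the omitted causal masking below are simplifications of the paper's formulation.

```python
import numpy as np

def infini_attention_segment(q, k, v, mem, z, beta=0.5):
    # One segment, single head, simplified. mem: (d_k, d_v) compressive
    # memory; z: (d_k,) normalizer. beta is a fixed gate standing in
    # for the paper's learned gating scalar.
    sigma = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # ELU(x) + 1
    # Retrieve what previous segments wrote into the memory
    a_mem = (sigma(q) @ mem) / (sigma(q) @ z + 1e-6)[:, None]
    # Ordinary softmax attention within the local segment
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    a_local = (weights / weights.sum(-1, keepdims=True)) @ v
    # Write this segment into the memory, then mix the two outputs
    mem = mem + sigma(k).T @ v
    z = z + sigma(k).sum(axis=0)
    return beta * a_mem + (1 - beta) * a_local, mem, z

d_k, d_v, seg = 16, 16, 8
mem, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(4):  # memory stays (d_k, d_v) no matter how many segments
    q, k, v = (np.random.randn(seg, d) for d in (d_k, d_k, d_v))
    out, mem, z = infini_attention_segment(q, k, v, mem, z)
```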
πŸ€”3πŸ‘1
CodeGemma: Open Code Models Based on Gemma

CodeGemma is a collection of open models specialized for coding applications, built on top of Gemma, an openly available family of language models.

WizardLM-2

WizardLM-2 is a new generation of WizardLM models announced by Microsoft AI. The models have improved performance on complex chat, multilingual, reasoning, and agent tasks. The new family includes three models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.

- WizardLM-2 8x22B is the most advanced model; it demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
- WizardLM-2 70B reaches top-tier reasoning capabilities and is the first choice among models of the same size.
- WizardLM-2 7B is the fastest and achieves performance comparable to existing leading open-source models 10x its size.
πŸ‘1