DevBench: A Comprehensive Benchmark for Software Development
DevBench is a benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.
The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.
github
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE is a method that extends the context window of LLMs to 2048k tokens while maintaining their capabilities within the original, shorter context window.
Code will be available at https://github.com/microsoft/
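The core idea is non-uniform positional interpolation: rather than dividing every RoPE frequency by a single ratio, each rotary dimension gets its own rescale factor found by a search. Below is a minimal, hedged sketch of per-dimension RoPE rescaling; the factor values and function names are illustrative, not the searched values from the paper.

```python
import torch

def scaled_rope_freqs(head_dim, rescale_factors, base=10000.0):
    # Standard RoPE inverse frequencies, then a per-dimension rescale
    # (LongRoPE searches these factors; here they are placeholders).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq / rescale_factors

def rotary_angles(positions, inv_freq):
    # One rotation angle per (position, rotary dimension pair).
    return torch.outer(positions.float(), inv_freq)

factors = torch.linspace(1.0, 64.0, 64)   # placeholder per-dimension factors
angles = rotary_angles(torch.arange(4096), scaled_rope_freqs(128, factors))
print(angles.shape)  # torch.Size([4096, 64])
```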
Open Release of Grok-1
xAI is releasing the base model weights and network architecture of Grok-1. Grok-1 is a 314 billion parameter MoE model trained from scratch by xAI. The released checkpoint is the raw base model from the Grok-1 pre-training phase, which concluded in October 2023. This means that the model is not fine-tuned for any specific application, such as dialogue.
The weights and the architecture are released under the Apache 2.0 license.
JAX example code for loading and running the Grok-1 open-weights model: https://github.com/xai-org/grok-1
LLM4Decompile: Decompiling Binary Code with Large Language Models
The authors released open-access decompilation LLMs ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code paired with the corresponding assembly code. Experiments indicate that LLM4Decompile can accurately decompile 21% of the assembly code, a 50% improvement over GPT-4.
Code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
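As a rough illustration of how such a model would be used, here is a hedged sketch of prompting a decompilation checkpoint with Hugging Face transformers; the model id and prompt template are placeholders and should be taken from the repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id and prompt format; see the LLM4Decompile repo for the
# released checkpoints and the exact prompt used during training.
model_id = "albertan017/llm4decompile-1b"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nadd eax, esi\npop rbp\nret"
prompt = f"# Assembly:\n{asm}\n# Equivalent C source:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```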
Let's create a Tree-sitter grammar
- How to use an external scanner
- Using Tree-sitter's built-in conflict resolution
- Syntax highlighting with language injection
- Use the grammar from Neovim for syntax highlighting and textobjects
- Embed the grammar into this blog for syntax highlighting
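For a flavor of what the finished grammar buys you outside Neovim, here is a minimal sketch that consumes a compiled grammar through the older py-tree-sitter API; the grammar path, language name, source snippet, and query are placeholders.

```python
from tree_sitter import Language, Parser

# Compile the grammar repo into a shared library (older py-tree-sitter API).
Language.build_library("build/langs.so", ["./tree-sitter-mylang"])  # placeholder path
MYLANG = Language("build/langs.so", "mylang")

parser = Parser()
parser.set_language(MYLANG)
tree = parser.parse(b"let answer = 42")

# Highlight-style query: capture identifiers, as a highlights.scm rule would.
query = MYLANG.query("(identifier) @variable")
for node, capture_name in query.captures(tree.root_node):
    print(capture_name, node.text)
```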
HiRoPE: Length Extrapolation for Code Models
The authors propose HiRoPE, a training-free approach to the context length limitation of LLMs in long code modeling: the hierarchical structure of source code is integrated into the models' position encoding. Experiments demonstrate that HiRoPE achieves stable improvements on diverse code-related tasks.
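A hedged sketch of what "hierarchical" position encoding can look like: part of the rotary dimensions encodes which function a token belongs to, and the rest encodes the position within that function. The dimension split and hierarchy definition below are illustrative, not the paper's exact configuration.

```python
import torch

def hirope_angles(token_pos, func_pos, head_dim, split=32, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    fine, coarse = inv_freq[:split], inv_freq[split:]
    return torch.cat([
        torch.outer(token_pos.float(), fine),    # position within the function
        torch.outer(func_pos.float(), coarse),   # which function (coarse level)
    ], dim=-1)

# Eight tokens: the first four belong to function 0, the next four to function 1.
angles = hirope_angles(torch.arange(8),
                       torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]),
                       head_dim=128)
print(angles.shape)  # torch.Size([8, 64])
```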
CYCLE: Learning to Self-Refine the Code Generation
The authors propose the CYCLE framework, which teaches code LMs to self-refine based on their past faulty generations and the corresponding execution feedback. The framework is evaluated on three popular programming benchmarks: HumanEval, MBPP-Sanitized, and APPS. The results show that CYCLE is effective at self-refinement, consistently boosting code generation performance by up to a 63.5% relative improvement.
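For intuition, a minimal sketch of an execution-feedback refinement loop in the spirit of CYCLE; `generate` and `run_tests` are hypothetical stand-ins for a code LM call and a sandboxed test runner, and the actual framework trains the model on faulty-code/feedback pairs rather than only prompting it.

```python
def self_refine(problem, tests, generate, run_tests, max_rounds=3):
    code = generate(prompt=problem)
    for _ in range(max_rounds):
        ok, feedback = run_tests(code, tests)  # e.g. failing test plus traceback
        if ok:
            return code
        # Re-prompt with the faulty attempt and the execution feedback.
        code = generate(prompt=f"{problem}\n\n# Previous attempt:\n{code}\n"
                               f"# Execution feedback:\n{feedback}\n# Fixed solution:\n")
    return code
```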
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
The authors compare larger and smaller language models under fixed budget constraints (i.e., FLOPs and wall-time). The models are used for execution-based code-generation tasks with access to unit tests.
The findings reveal that generating multiple outputs from a 13B model can yield gains of up to 15% over a single generation from a 70B model across five tasks, highlighting the potential of using smaller models instead of larger ones when the budget is spent on repeated sampling.
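A minimal sketch of the reallocation idea under these assumptions: spend the same budget on several samples from the smaller model and return one that passes the unit tests; `generate` and `passes_unit_tests` are hypothetical helpers.

```python
def best_of_n(problem, tests, generate, passes_unit_tests, n=10, temperature=0.8):
    candidates = [generate(problem, temperature=temperature) for _ in range(n)]
    for code in candidates:
        if passes_unit_tests(code, tests):
            return code
    return candidates[0]  # fall back to the first sample if none passes
```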
Large Language Model Evaluation Via Multi AI Agents
Despite extensive efforts to examine LLMs from various perspectives, there is a lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs.
The authors introduce a multi-agent AI model that aims to assess and compare the performance of various LLMs. The model consists of 8 distinct AI agents, each responsible for obtaining code for a common task description from a different advanced language model.
Research questions:
- How does a multi-agent AI system compare in terms of efficiency and accuracy in generating code from a common description?
- What criteria and methods does the verification agent employ to evaluate the performance of various language models in code generation tasks?
The results:
GPT-3.5 Turbo outperformed other models, delivering correct and efficient code solutions in 7 out of 10 tasks. Other models like GPT-4 and GPT-4 Turbo showed varying levels of performance, but none matched the consistency of GPT-3.5 Turbo.
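A hedged sketch of the setup: one agent per model fetches a solution for the same task description, and a verification step scores the candidates against tests. `ask_model` and `run_tests` are hypothetical wrappers, and the model list is only an illustrative subset.

```python
MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]  # illustrative subset

def evaluate(task_description, tests, ask_model, run_tests):
    # Each "agent" retrieves code from one model for the shared description.
    candidates = {m: ask_model(m, task_description) for m in MODELS}
    # The verification agent checks correctness (and could also time the code).
    return {m: run_tests(code, tests) for m, code in candidates.items()}
```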
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
Despite their promising performance, LLMs are prone to hallucinations: they may produce outputs that deviate from the user's intent, exhibit internal inconsistencies, or conflict with factual knowledge, which makes deploying them risky in a wide range of applications.
The study established a taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories depending on the conflicting objectives and the degree of deviation observed in code generation. The authors also systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness.
Based on the results, the authors proposed HalluCode, a benchmark for evaluating how well code LLMs recognize hallucinations.
Research questions:
- Distribution of hallucinations
- Correlation between hallucinations and functional correctness
Self-Organized Agents (SoA): A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization
In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. Evaluation on the HumanEval benchmark demonstrates superior performance compared to Reflexion, a state-of-the-art single-agent system: SoA achieves a 5% improvement in Pass@1 accuracy.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
A new paper by researchers at Google claims to give LLMs the ability to work with text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their context window while keeping memory and compute requirements constant.
A 1B model fine-tuned on passkey instances with sequence lengths of up to 5K solved the passkey retrieval task at a length of 1M tokens.
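The mechanism pairs standard local attention within each segment with a compressive memory that is read and then updated segment by segment. Below is a hedged sketch of just the memory read/update rule as described in the paper (a linear-attention-style associative matrix); gating with local attention and multi-head details are omitted.

```python
import torch

def infini_memory_step(q, k, v, mem, norm):
    """One segment of compressive-memory retrieval and update.

    retrieval ~ sigma(q) @ mem / (sigma(q) @ norm), then mem += sigma(k)^T v,
    with sigma = ELU + 1. Shapes: q, k, v are (seq, d); mem is (d, d); norm is (d, 1).
    """
    sigma = lambda x: torch.nn.functional.elu(x) + 1.0
    retrieved = (sigma(q) @ mem) / (sigma(q) @ norm).clamp_min(1e-6)
    mem = mem + sigma(k).transpose(-2, -1) @ v
    norm = norm + sigma(k).sum(dim=-2, keepdim=True).transpose(-2, -1)
    return retrieved, mem, norm

# One 8-token segment with head dimension 64; the memory persists across segments.
d = 64
q = k = v = torch.randn(8, d)
mem, norm = torch.zeros(d, d), torch.zeros(d, 1)
out, mem, norm = infini_memory_step(q, k, v, mem, norm)
print(out.shape)  # torch.Size([8, 64])
```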
CodeGemma: Open Code Models Based on Gemma
CodeGemma is a collection of open models specialized for coding applications, built on top of Gemma, an openly available family of language models.
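A hedged sketch of trying a CodeGemma checkpoint with transformers for fill-in-the-middle completion; the model id and the FIM token format used below should be double-checked against the official model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/codegemma-2b"  # verify against the released checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Infilling-style prompt: complete the function body between prefix and suffix.
prompt = "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```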
WizardLM-2
WizardLM-2 is a new generation of WizardLM models announced by Microsoft AI, with improved performance on complex chat, multilingual, reasoning, and agent tasks. The family includes three models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.
- WizardLM-2 8x22B is the most advanced model; it demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
- WizardLM-2 70B reaches top-tier reasoning capabilities and is the first choice at its size.
- WizardLM-2 7B is the fastest and achieves performance comparable to leading open-source models 10x its size.