PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLM
The authors conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. The evaluation reveals a critical bias towards a limited set of programming concepts.
To address these limitations, the authors propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with balanced coverage of 38 programming concepts across diverse difficulty levels.
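Benchmarks of this kind are typically scored with the unbiased pass@k estimator introduced with HumanEval; the sketch below is a generic reference implementation, not code from the PythonSaga paper.

```python
# Unbiased pass@k estimator (Chen et al., HumanEval); shown as a generic
# reference, not as code from the PythonSaga paper.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples that passed, k: evaluation budget."""
    if n - c < k:
        return 1.0          # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))   # 0.15: estimated chance one sample solves it
```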
Introducing Devin, the first AI software engineer
Devin is equipped with common developer tools, including the shell, code editor, and browser, within a sandboxed compute environment: everything a human would need to do their work. Devin can actively collaborate with the user, reporting on its progress in real time and accepting feedback.
Devin correctly resolves 13.86%* of the issues end-to-end (SWE-bench), exceeding the previous state-of-the-art of 1.96%.
To hire Devin for engineering work, join the waitlist.
CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
CAM (Classes and Metrics) is open-source software that clones Java repositories from GitHub, filters out unnecessary files, parses Java classes, and computes metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based metrics.
The latest archive (2.2 GB) is published on Amazon S3 and includes 532K Java classes with 48 metrics for each class.
github
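A minimal sketch of exploring the published dataset, assuming the archive unpacks into a single CSV of per-class metrics; the file path and column names below are hypothetical, not taken from the CAM documentation.

```python
# Hypothetical exploration of the CAM snapshot; the path and column names
# are assumptions for illustration, not the actual CAM schema.
import pandas as pd

df = pd.read_csv("cam_dataset/java_classes_metrics.csv")  # hypothetical path

print(df.shape)        # expect on the order of 532K rows and ~48 metric columns
print(df.describe())   # summary statistics per metric

# Example: rank classes by cyclomatic complexity (hypothetical column name)
top = df.sort_values("CyclomaticComplexity", ascending=False).head(10)
print(top[["class_name", "CyclomaticComplexity"]])
```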
DevBench: A Comprehensive Benchmark for Software Development
DevBench is a benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.
The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.
github
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE is a method that extends the context window of LLMs to 2048k tokens while maintaining performance within the original shorter context window.
Code will be available at https://github.com/microsoft/
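The sketch below shows plain linear position interpolation for rotary position embeddings (RoPE), the general mechanism that context-extension methods like LongRoPE build on; it is not the paper's searched, non-uniform per-dimension rescaling, and the sizes are illustrative.

```python
# Plain linear RoPE position interpolation, for intuition only; LongRoPE
# instead searches for non-uniform, per-dimension rescaling factors.
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotation angles per (position, frequency pair); scale > 1 compresses
    positions so a longer sequence maps into the rotation range seen during
    pre-training."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(np.asarray(positions) / scale, inv_freq)

dim = 64
orig_ctx, new_ctx = 4096, 65536        # e.g. extending a 4k window to 64k
scale = new_ctx / orig_ctx             # uniform interpolation factor

angles = rope_angles(np.arange(new_ctx), dim, scale=scale)
print(angles.shape)                    # (65536, 32)
```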
Open Release of Grok-1
xAI is releasing the base model weights and network architecture of Grok-1. Grok-1 is a 314-billion-parameter Mixture-of-Experts (MoE) model trained from scratch by xAI. The released checkpoint is the raw base model from the Grok-1 pre-training phase, which concluded in October 2023, meaning the model is not fine-tuned for any specific application, such as dialogue.
The weights and the architecture are released under the Apache 2.0 license.
JAX example code for loading and running the Grok-1 open-weights model: https://github.com/xai-org/grok-1
LLM4Decompile: Decompiling Binary Code with Large Language Models
The authors released open-access decompilation LLMs ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. Experiments indicate that LLM4Decompile can accurately decompile 21% of the assembly code, a 50% improvement over GPT-4.
Code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
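A minimal sketch of prompting a decompilation model with Hugging Face transformers; the checkpoint name and prompt format are assumptions for illustration, so check the repository above for the exact usage.

```python
# Hypothetical usage sketch; the model id and prompt template are assumptions,
# not the documented LLM4Decompile interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "albertan017/llm4decompile-1.3b"   # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

asm = """
func0:
    endbr64
    lea eax, [rdi+rsi]
    ret
"""  # assembly obtained e.g. via `objdump -d` on a compiled binary

prompt = f"# This is the assembly code:\n{asm}\n# What is the source code?\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```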
Let's create a Tree-sitter grammar
- How to use an external scanner
- Using Tree-sitter's built-in conflict resolution
- Syntax highlighting with language injection
- Use the grammar from Neovim for syntax highlighting and textobjects
- Embed the grammar into this blog for syntax highlighting
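Once a grammar is built, it can also be driven from scripts; below is a minimal sketch using the pre-0.22 py-tree-sitter API, with a hypothetical grammar path and language name.

```python
# Parse with a custom Tree-sitter grammar via py-tree-sitter (pre-0.22 API);
# the grammar path and language name are hypothetical.
from tree_sitter import Language, Parser

Language.build_library(
    "build/languages.so",
    ["vendor/tree-sitter-mylang"],      # hypothetical checkout of the grammar
)
MYLANG = Language("build/languages.so", "mylang")

parser = Parser()
parser.set_language(MYLANG)

tree = parser.parse(b"example source in the new language")
print(tree.root_node.sexp())            # inspect the parse tree as an S-expression
```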
HiRoPE: Length Extrapolation for Code Models
The authors propose HiRoPE, a training-free solution to the context length limitation of LLMs for long code modeling. The hierarchical structure of source code is integrated into the position encoding of the LLM. Experiments demonstrate that HiRoPE achieves stable improvements on diverse code-related tasks.
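An illustrative sketch of the underlying idea of hierarchical positions for code (not the paper's exact formulation): each token gets a (function index, token-within-function) pair instead of a single flat position, and the two levels can then drive different parts of the rotary encoding.

```python
# Toy hierarchical position ids for source code; illustrative only, not the
# HiRoPE formulation itself.
def hierarchical_positions(tokens, function_starts):
    """tokens: token list; function_starts: indices where functions begin."""
    starts = set(function_starts)
    positions, func_idx, local_pos = [], -1, 0
    for i, _tok in enumerate(tokens):
        if i in starts:                  # entering a new top-level function
            func_idx += 1
            local_pos = 0
        positions.append((max(func_idx, 0), local_pos))
        local_pos += 1
    return positions

tokens = ["def", "f", ":", "pass", "def", "g", ":", "return", "1"]
print(hierarchical_positions(tokens, function_starts=[0, 4]))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4)]
```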
CYCLE: Learning to Self-Refine the Code Generation
The authors propose the CYCLE framework, which teaches code LMs to self-refine based on their own past faulty generations and the corresponding execution feedback. The framework is evaluated on three popular programming benchmarks: HumanEval, MBPP-Sanitized, and APPS. The results show that CYCLE is effective at self-refinement, consistently boosting code generation performance by up to 63.5% relative improvement.
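A minimal sketch of an execution-feedback refinement loop in the spirit of CYCLE (not the paper's training procedure): generate code, run the tests, and feed any failure back into the next prompt. `generate` stands in for any code LLM call and is a hypothetical helper.

```python
# Illustrative refine-on-failure loop; `generate` is a hypothetical code-LLM call.
import subprocess
import sys
import tempfile


def run_tests(code: str, tests: str) -> str:
    """Execute candidate code plus its tests; return '' on success, else stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return "" if proc.returncode == 0 else proc.stderr


def refine(generate, problem: str, tests: str, max_rounds: int = 3) -> str:
    code = generate(problem)
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if not error:
            return code                  # all tests pass
        prompt = (
            f"{problem}\n"
            f"Previous attempt:\n{code}\n"
            f"It failed with:\n{error}\n"
            "Please fix the code."
        )
        code = generate(prompt)
    return code
```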
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
The authors compare large language models with smaller ones under fixed budget constraints (i.e., FLOPs and wall-clock time). The models are used for execution-based code-generation tasks with access to unit tests.
The findings reveal that generating multiple outputs from a 13B model may lead to gains of up to 15% over a single generation from a 70B model across five tasks. This highlights the potential of using smaller models instead of larger ones.
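A minimal sketch of the sample-then-filter strategy the paper studies: draw several candidates from the smaller model and keep one that passes the unit tests. `sample_code` and `passes_tests` are hypothetical stand-ins for the model call and the execution harness.

```python
# Best-of-k with unit-test filtering; the two callables are hypothetical.
from typing import Callable, Optional


def best_of_k(
    sample_code: Callable[[str], str],     # one generation per call (e.g. a 13B model)
    passes_tests: Callable[[str], bool],   # executes the task's unit tests
    problem: str,
    k: int = 16,
) -> Optional[str]:
    for _ in range(k):                     # k cheap samples instead of one large-model call
        candidate = sample_code(problem)
        if passes_tests(candidate):
            return candidate
    return None                            # no candidate passed within the budget
```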
Large Language Model Evaluation Via Multi AI Agents
Despite extensive efforts to examine LLMs from various perspectives, there is a lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs.
The authors introduce a multi-agent AI model that aims to assess and compare the performance of various LLMs. The model consists of 8 distinct AI agents, each responsible for obtaining code for a common task description from a different advanced language model.
RQs:
- How does a multi-agent AI system compare in terms of efficiency and accuracy in generating code from a common description?
- What criteria and methods does the verification agent employ to evaluate the performance of various language models in code generation tasks?
The results:
GPT-3.5 Turbo outperformed other models, delivering correct and efficient code solutions in 7 out of 10 tasks. Other models like GPT-4 and GPT-4 Turbo showed varying levels of performance, but none matched the consistency of GPT-3.5 Turbo.
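An illustrative sketch of the setup described above, not the authors' implementation: one agent per model produces code for the same task description, and a verification agent runs shared unit tests to score each model. `call_llm` and `run_unit_tests` are hypothetical helpers.

```python
# Toy multi-agent evaluation harness; `call_llm` and `run_unit_tests` are
# hypothetical placeholders for the model-backed agents and the verifier.
def evaluate_models(call_llm, run_unit_tests, task_description, model_names, tests):
    results = {}
    for name in model_names:                          # one code-generation agent per model
        code = call_llm(model=name, prompt=task_description)
        results[name] = run_unit_tests(code, tests)   # verification agent's verdict
    return results

scores = evaluate_models(
    call_llm=lambda model, prompt: "def reverse(s):\n    return s[::-1]\n",
    run_unit_tests=lambda code, tests: True,          # placeholder verifier
    task_description="Write a function that reverses a string.",
    model_names=["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"],
    tests="assert reverse('ab') == 'ba'",
)
print(scores)
```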
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
Despite their promising performance, LLMs are prone to generating hallucinations: outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications.
The study establishes a taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, the authors systematically analyze the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness.
Based on the results, the authors propose HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.
RQs:
- Distribution of hallucinations
- Correlation between hallucinations and the functional correctness
Self-Organized Agents (SoA): A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization
In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. The evaluation on the HumanEval benchmark demonstrates superior performance compared to Reflexion, a state-of-the-art single-agent system: SoA achieves a 5% improvement in Pass@1 accuracy.
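An illustrative sketch of the divide-and-conquer idea behind self-organized agents, not the authors' framework: a parent agent splits a specification into function stubs, child agents implement each stub independently, and the pieces are assembled into the codebase. `decompose` and `implement` are hypothetical LLM-backed helpers.

```python
# Toy divide-and-conquer code generation; `decompose` and `implement` are
# hypothetical LLM-backed helpers, not part of the SoA framework.
def build_codebase(decompose, implement, specification: str) -> str:
    stubs = decompose(specification)                # e.g. function signatures + docstrings
    parts = [implement(stub) for stub in stubs]     # each child agent handles one stub
    return "\n\n".join(parts)                       # assemble the overall codebase
```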