PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLM
The authors conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. The evaluation reveals a critical bias towards a limited set of programming concepts.
To address these limitations, the authors propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with balanced coverage of 38 programming concepts across diverse difficulty levels.
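Benchmarks of this kind are typically scored with the unbiased pass@k estimator introduced with HumanEval; the sketch below is a generic reference implementation, not code from the PythonSaga paper.

```python
# Unbiased pass@k estimator (Chen et al., HumanEval); shown as a generic
# reference, not as code from the PythonSaga paper.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples that passed, k: evaluation budget."""
    if n - c < k:
        return 1.0          # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))   # 0.15: estimated chance one sample solves it
```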
Introducing Devin, the first AI software engineer
Devin is equipped with common developer tools, including the shell, code editor, and browser, within a sandboxed compute environment: everything a human would need to do their work. Devin can actively collaborate with the user, reporting on its progress in real time and accepting feedback.
Devin correctly resolves 13.86%* of the issues end-to-end (SWE-bench), exceeding the previous state-of-the-art of 1.96%.
To hire Devin for engineering work, join the waitlist.
CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
CAM (Classes and Metrics) is open-source software that clones Java repositories from GitHub, filters out unnecessary files, parses Java classes, and computes metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based metrics.
The latest archive (2.2 GB) is published on Amazon S3 and includes 532K Java classes with 48 metrics for each class.
github
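A minimal sketch of exploring the published dataset, assuming the archive unpacks into a single CSV of per-class metrics; the file path and column names below are hypothetical, not taken from the CAM documentation.

```python
# Hypothetical exploration of the CAM snapshot; the path and column names
# are assumptions for illustration, not the actual CAM schema.
import pandas as pd

df = pd.read_csv("cam_dataset/java_classes_metrics.csv")  # hypothetical path

print(df.shape)        # expect on the order of 532K rows and ~48 metric columns
print(df.describe())   # summary statistics per metric

# Example: rank classes by cyclomatic complexity (hypothetical column name)
top = df.sort_values("CyclomaticComplexity", ascending=False).head(10)
print(top[["class_name", "CyclomaticComplexity"]])
```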
DevBench: A Comprehensive Benchmark for Software Development
DevBench is a benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.
The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.
github
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE is a method that extends the context window of LLMs to 2048k tokens while maintaining performance within the original shorter context window.
Code will be available at https://github.com/microsoft/
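The sketch below shows plain linear position interpolation for rotary position embeddings (RoPE), the general mechanism that context-extension methods like LongRoPE build on; it is not the paper's searched, non-uniform per-dimension rescaling, and the sizes are illustrative.

```python
# Plain linear RoPE position interpolation, for intuition only; LongRoPE
# instead searches for non-uniform, per-dimension rescaling factors.
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotation angles per (position, frequency pair); scale > 1 compresses
    positions so a longer sequence maps into the rotation range seen during
    pre-training."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(np.asarray(positions) / scale, inv_freq)

dim = 64
orig_ctx, new_ctx = 4096, 65536        # e.g. extending a 4k window to 64k
scale = new_ctx / orig_ctx             # uniform interpolation factor

angles = rope_angles(np.arange(new_ctx), dim, scale=scale)
print(angles.shape)                    # (65536, 32)
```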
Open Release of Grok-1
xAI is releasing the base model weights and network architecture of Grok-1. Grok-1 is a 314-billion-parameter Mixture-of-Experts (MoE) model trained from scratch by xAI. The released checkpoint is the raw base model from the Grok-1 pre-training phase, which concluded in October 2023, meaning the model is not fine-tuned for any specific application, such as dialogue.
The weights and the architecture are released under the Apache 2.0 license.
JAX example code for loading and running the Grok-1 open-weights model: https://github.com/xai-org/grok-1
LLM4Decompile: Decompiling Binary Code with Large Language Models
The authors released open-access decompilation LLMs ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. Experiments indicate that LLM4Decompile can accurately decompile 21% of the assembly code, a 50% improvement over GPT-4.
Code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
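A minimal sketch of prompting a decompilation model with Hugging Face transformers; the checkpoint name and prompt format are assumptions for illustration, so check the repository above for the exact usage.

```python
# Hypothetical usage sketch; the model id and prompt template are assumptions,
# not the documented LLM4Decompile interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "albertan017/llm4decompile-1.3b"   # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

asm = """
func0:
    endbr64
    lea eax, [rdi+rsi]
    ret
"""  # assembly obtained e.g. via `objdump -d` on a compiled binary

prompt = f"# This is the assembly code:\n{asm}\n# What is the source code?\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```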
Let's create a Tree-sitter grammar
- How to use an external scanner
- Using Tree-sitter's built-in conflict resolution
- Syntax highlighting with language injection
- Use the grammar from Neovim for syntax highlighting and textobjects
- Embed the grammar into this blog for syntax highlighting
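Once a grammar is built, it can also be driven from scripts; below is a minimal sketch using the pre-0.22 py-tree-sitter API, with a hypothetical grammar path and language name.

```python
# Parse with a custom Tree-sitter grammar via py-tree-sitter (pre-0.22 API);
# the grammar path and language name are hypothetical.
from tree_sitter import Language, Parser

Language.build_library(
    "build/languages.so",
    ["vendor/tree-sitter-mylang"],      # hypothetical checkout of the grammar
)
MYLANG = Language("build/languages.so", "mylang")

parser = Parser()
parser.set_language(MYLANG)

tree = parser.parse(b"example source in the new language")
print(tree.root_node.sexp())            # inspect the parse tree as an S-expression
```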
HiRoPE: Length Extrapolation for Code Models
The authors propose HiRoPE, a training-free solution to the context length limitation of LLMs for long code modeling. The hierarchical structure of source code is integrated into the position encoding of the LLM. Experiments demonstrate that HiRoPE achieves stable improvements on diverse code-related tasks.
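An illustrative sketch of the underlying idea of hierarchical positions for code (not the paper's exact formulation): each token gets a (function index, token-within-function) pair instead of a single flat position, and the two levels can then drive different parts of the rotary encoding.

```python
# Toy hierarchical position ids for source code; illustrative only, not the
# HiRoPE formulation itself.
def hierarchical_positions(tokens, function_starts):
    """tokens: token list; function_starts: indices where functions begin."""
    starts = set(function_starts)
    positions, func_idx, local_pos = [], -1, 0
    for i, _tok in enumerate(tokens):
        if i in starts:                  # entering a new top-level function
            func_idx += 1
            local_pos = 0
        positions.append((max(func_idx, 0), local_pos))
        local_pos += 1
    return positions

tokens = ["def", "f", ":", "pass", "def", "g", ":", "return", "1"]
print(hierarchical_positions(tokens, function_starts=[0, 4]))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4)]
```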
CYCLE: Learning to Self-Refine the Code Generation
The authors propose the CYCLE framework, which teaches code LMs to self-refine based on their own past faulty generations and the corresponding execution feedback. The framework is evaluated on three popular programming benchmarks: HumanEval, MBPP-Sanitized, and APPS. The results show that CYCLE is effective at self-refinement, consistently boosting code generation performance by up to 63.5% relative improvement.
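A minimal sketch of an execution-feedback refinement loop in the spirit of CYCLE (not the paper's training procedure): generate code, run the tests, and feed any failure back into the next prompt. `generate` stands in for any code LLM call and is a hypothetical helper.

```python
# Illustrative refine-on-failure loop; `generate` is a hypothetical code-LLM call.
import subprocess
import sys
import tempfile


def run_tests(code: str, tests: str) -> str:
    """Execute candidate code plus its tests; return '' on success, else stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return "" if proc.returncode == 0 else proc.stderr


def refine(generate, problem: str, tests: str, max_rounds: int = 3) -> str:
    code = generate(problem)
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if not error:
            return code                  # all tests pass
        prompt = (
            f"{problem}\n"
            f"Previous attempt:\n{code}\n"
            f"It failed with:\n{error}\n"
            "Please fix the code."
        )
        code = generate(prompt)
    return code
```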
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
The authors compare large language models with smaller ones under fixed budget constraints (i.e., FLOPs and wall-clock time). The models are used for execution-based code-generation tasks with access to unit tests.
The findings reveal that generating multiple outputs from a 13B model may lead to gains of up to 15% over a single generation from a 70B model across five tasks. This highlights the potential of using smaller models instead of larger ones.
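A minimal sketch of the sample-then-filter strategy the paper studies: draw several candidates from the smaller model and keep one that passes the unit tests. `sample_code` and `passes_tests` are hypothetical stand-ins for the model call and the execution harness.

```python
# Best-of-k with unit-test filtering; the two callables are hypothetical.
from typing import Callable, Optional


def best_of_k(
    sample_code: Callable[[str], str],     # one generation per call (e.g. a 13B model)
    passes_tests: Callable[[str], bool],   # executes the task's unit tests
    problem: str,
    k: int = 16,
) -> Optional[str]:
    for _ in range(k):                     # k cheap samples instead of one large-model call
        candidate = sample_code(problem)
        if passes_tests(candidate):
            return candidate
    return None                            # no candidate passed within the budget
```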
Large Language Model Evaluation Via Multi AI Agents
Despite extensive efforts to examine LLMs from various perspectives, there is a lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs.
The authors introduce a multi-agent AI model that aims to assess and compare the performance of various LLMs. The model consists of 8 distinct AI agents, each responsible for obtaining code for a common task description from a different advanced language model.
RQs:
- How does a multi-agent AI system compare in terms of efficiency and accuracy in generating code from a common description?
- What criteria and methods does the verification agent employ to evaluate the performance of various language models in code generation tasks?
The results:
GPT-3.5 Turbo outperformed other models, delivering correct and efficient code solutions in 7 out of 10 tasks. Other models like GPT-4 and GPT-4 Turbo showed varying levels of performance, but none matched the consistency of GPT-3.5 Turbo.
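An illustrative sketch of the setup described above, not the authors' implementation: one agent per model produces code for the same task description, and a verification agent runs shared unit tests to score each model. `call_llm` and `run_unit_tests` are hypothetical helpers.

```python
# Toy multi-agent evaluation harness; `call_llm` and `run_unit_tests` are
# hypothetical placeholders for the model-backed agents and the verifier.
def evaluate_models(call_llm, run_unit_tests, task_description, model_names, tests):
    results = {}
    for name in model_names:                          # one code-generation agent per model
        code = call_llm(model=name, prompt=task_description)
        results[name] = run_unit_tests(code, tests)   # verification agent's verdict
    return results

scores = evaluate_models(
    call_llm=lambda model, prompt: "def reverse(s):\n    return s[::-1]\n",
    run_unit_tests=lambda code, tests: True,          # placeholder verifier
    task_description="Write a function that reverses a string.",
    model_names=["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"],
    tests="assert reverse('ab') == 'ba'",
)
print(scores)
```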
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
Despite their promising performance, LLMs are prone to generating hallucinations: outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications.
The study establishes a taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, the authors systematically analyze the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness.
Based on the results, the authors propose HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.
RQs:
- Distribution of hallucinations
- Correlation between hallucinations and the functional correctness
Self-Organized Agents (SoA): A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization
In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. The evaluation on the HumanEval benchmark demonstrates superior performance compared to Reflexion, a state-of-the-art single-agent system: SoA achieves a 5% improvement in Pass@1 accuracy.
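An illustrative sketch of the divide-and-conquer idea behind self-organized agents, not the authors' framework: a parent agent splits a specification into function stubs, child agents implement each stub independently, and the pieces are assembled into the codebase. `decompose` and `implement` are hypothetical LLM-backed helpers.

```python
# Toy divide-and-conquer code generation; `decompose` and `implement` are
# hypothetical LLM-backed helpers, not part of the SoA framework.
def build_codebase(decompose, implement, specification: str) -> str:
    stubs = decompose(specification)                # e.g. function signatures + docstrings
    parts = [implement(stub) for stub in stubs]     # each child agent handles one stub
    return "\n\n".join(parts)                       # assemble the overall codebase
```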