ml4se
Machine Learning for Software Engineering
CodeGemma: Open Code Models Based on Gemma

CodeGemma is a collection of open models specialized for coding applications, built on top of Gemma, an openly available family of language models.
WizardLM-2

WizardLM-2 is a new generation of WizardLM models announced by Microsoft AI. The models show improved performance on complex chat, multilingual, reasoning, and agent tasks. The new family includes three models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.

- WizardLM-2 8x22B is the most advanced model; it demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
- WizardLM-2 70B reaches top-tier reasoning capabilities and is the first choice among models of its size.
- WizardLM-2 7B is the fastest and achieves performance comparable to existing leading open-source models that are 10x larger.
NExT: Teaching Large Language Models to Reason about Code Execution

Understanding and reasoning about program execution is a basic skill for human developers. For instance, to debug and repair code, a developer can mentally simulate code execution in natural language. However, LLMs are commonly trained only on the textual form of programs, which can leave them with a weak semantic understanding of how programs execute at run time. In the paper, the authors present NExT, a self-training method that fine-tunes LLMs to reason about program execution using execution traces. PaLM 2-L trained with NExT produces high-quality natural language rationales and achieves higher success rates on two program repair tasks.
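A minimal sketch of the underlying idea: collect a line-by-line execution trace with Python's sys.settrace and pack it into a repair prompt. The prompt layout and function names here are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: capture an execution trace and format it into a repair prompt
# (assumed layout; not NExT's real implementation).
import sys

def trace_program(fn, *args):
    """Run fn(*args) and record (line number, local variables) at each step."""
    events = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, events

def buggy_abs_diff(a, b):
    d = a - b          # bug: should be abs(a - b)
    return d

result, trace = trace_program(buggy_abs_diff, 2, 5)
trace_text = "\n".join(f"line {ln}: locals={lv}" for ln, lv in trace)

# Hypothetical prompt: code + failing test + execution trace, asking the model
# to explain the failure in natural language and propose a fix.
prompt = (
    "Code:\n"
    "def buggy_abs_diff(a, b):\n    d = a - b\n    return d\n\n"
    f"Failing test: buggy_abs_diff(2, 5) == 3, got {result}\n\n"
    f"Execution trace:\n{trace_text}\n\n"
    "Explain step by step why the test fails and propose a fix."
)
print(prompt)
```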
LoRA+: Efficient Low Rank Adaptation of Large Models

LoRA fine-tuning, as it is currently used in practice, is not efficient. The authors propose LoRA+, which resolves this issue by setting different learning rates for the two LoRA adapter matrices. The analysis is supported by empirical results confirming the benefits of LoRA+ for both fine-tuning speed and performance.
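A minimal sketch of the idea in PyTorch, assuming a hand-rolled LoRA layer; the learning-rate ratio of 16 between the B and A matrices is only an illustrative value, not a recommendation from the paper for every setting.

```python
# Sketch: LoRA layer with separate learning rates for the A and B adapters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x):
        # base output plus low-rank update B @ A applied to x
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

model = LoRALinear(512, 512)

# LoRA+ idea: give B a larger learning rate than A instead of one shared lr.
lr_A = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [model.lora_A], "lr": lr_A},
    {"params": [model.lora_B], "lr": lr_A * 16},  # illustrative lr_B / lr_A ratio
])
```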
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

The very first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline.
- weights: https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Prometheus 2 (8x7B) is an open-source evaluator language model. Compared to Prometheus 1 (13B), Prometheus 2 (8x7B) shows improved evaluation performance and also supports assessment in pairwise ranking (relative grading) formats. It scores 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.

Prometheus 2 (7B) is a lighter version of the Prometheus 2 (8x7B) model with reasonable performance (outperforming Llama-2-70B and on par with Mixtral-8x7B). It achieves at least 80% of the evaluation performance of Prometheus 2 (8x7B).

GitHub: https://github.com/prometheus-eval/prometheus-eval
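An illustrative sketch of what a pairwise (relative grading) evaluation query can look like; this is not the prometheus-eval library's actual API, just a generic prompt-construction helper.

```python
# Sketch: framing a pairwise "relative grading" query for an evaluator LM
# (hypothetical helper, not the prometheus-eval package interface).
def build_pairwise_prompt(instruction: str, response_a: str, response_b: str, rubric: str) -> str:
    return (
        "You are an impartial evaluator.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        f"Rubric: {rubric}\n"
        "Compare the two responses against the rubric and answer with 'A' or 'B'."
    )
```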
Better & Faster Large Language Models via Multi-token Prediction

LLMs are trained with a next-token prediction loss. The authors propose multi-token prediction as an improvement over next-token prediction when training language models for generative or reasoning tasks. The experiments (up to 7B parameters and 1T tokens) show that the approach becomes increasingly useful at larger model sizes and, in particular, yields strong improvements on code tasks.
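A minimal sketch of the training objective with a shared trunk and several output heads, one per future token offset; module names and shapes are assumptions, not the paper's exact architecture.

```python
# Sketch: multi-token prediction loss with one linear head per future offset.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.n_future = n_future
        # one output head per future token offset, sharing the trunk's hidden states
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) token ids."""
        total = hidden.new_zeros(())
        for k, head in enumerate(self.heads, start=1):
            # head k predicts the token k positions ahead of each trunk position
            logits = head(hidden[:, :-k, :])             # (batch, seq-k, vocab)
            targets = tokens[:, k:]                      # (batch, seq-k)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / self.n_future
```

With n_future = 1 this reduces to the standard next-token loss; the extra heads only add a small amount of compute on top of the shared trunk.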
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as PPO.

However, in academic benchmarks, the SotA results are often achieved via reward-free methods, such as DPO.

Is DPO truly superior to PPO?

Through theoretical and experimental analysis, the authors explore the limitations of DPO and find that DPO is sensitive to the distribution shift between the base model outputs and the preference data. DPO also fails to improve performance on challenging tasks such as code generation. PPO, in contrast, demonstrates robust effectiveness across diverse tasks and achieves state-of-the-art results on challenging code competition tasks.
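For reference, the reward-free objective the study contrasts with PPO is the standard DPO loss, which optimizes the policy directly on preference pairs (y_w preferred over y_l) without fitting an explicit reward model:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}
    \right)\right]
```

The reward-based pipeline instead first fits a reward model on the same preference data and then maximizes the expected reward with a KL penalty to the reference policy, typically via PPO.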
Open sourcing IBM’s Granite code models

IBM is releasing a family of Granite code models to the open-source community.

- paper
- github: https://github.com/ibm-granite
- models: https://huggingface.co/ibm-granite
Large Language Models Cannot Self-Correct Reasoning Yet

The research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.
AgentBench: Evaluating LLMs as Agents

AgentBench is a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities.

github: https://github.com/THUDM/AgentBench
AutoDev: Automated AI-Driven Development

One more agent-based framework for software engineering tasks (Microsoft). AutoDev enables AI Agents to autonomously interact with repositories, perform actions, and tackle complex software engineering tasks.

RQs:
- How effective is AutoDev in a code generation task?
- How effective is AutoDev in a test generation task?
- How efficient is AutoDev in completing tasks?

The evaluation on the HumanEval dataset for code and test generation showed strong results: a Pass@1 score of 91.5% for code generation, the second-best result on the leaderboard at the time of writing and the best among approaches requiring no extra training data. AutoDev also performs well in test generation, with a Pass@1 score of 87.8% and 99.3% coverage from passing tests.
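The Pass@1 figures follow the standard unbiased pass@k estimator used with HumanEval; a minimal sketch is below (the evaluation harness around it, i.e. how AutoDev samples and tests solutions, is an assumption).

```python
# Unbiased pass@k estimator from the HumanEval benchmark:
# n = total samples per problem, c = number of samples that pass, k = budget.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k = 1 - C(n - c, k) / C(n, k) in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 13 pass -> pass@1 is simply c / n = 0.65.
print(pass_at_k(20, 13, 1))
```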
From Human-to-Human to Human-to-Bot Conversations in Software Engineering

Similarities and differences between human-to-human and human-to-bot conversations. The paper compares conversations between a software developer and:
1. a fellow software developer
2. an NLU-based chatbot
3. an LLM-based chatbot