Salesforce’s CodeT5 system can understand and generate code

Posted Tuesday, Sep 14, 2021 by Jeff Safire

By Kyle Wiggers, Sep 7, 2021

[Image: Salesforce CodeT5. Image Credit: VeniThePooh via Getty]

AI-powered coding tools, which generate code using machine learning algorithms, have attracted increasing attention over the last decade. In theory, systems like OpenAI’s Codex could reduce the time people spend writing software as well as computational and operational costs. But existing systems have major limitations that lead to undesirable results, such as errors in the code they produce.

In search of a better approach, researchers at Salesforce open-sourced a machine learning system called CodeT5, which can understand and generate code in real time. The team claims that CodeT5 achieves state-of-the-art performance on coding tasks including code defect detection, which predicts whether code is vulnerable to exploits, and clone detection, which predicts whether two code snippets have the same functionality.

Novel design

As the Salesforce researchers explain in a blog post and paper, existing AI-powered coding tools often rely on model architectures that are “suboptimal” for generation and understanding tasks. These tools adapt conventional natural language processing pretraining techniques to source code, ignoring the structural information in programming languages that’s important to comprehending the code’s semantics.

By contrast, CodeT5 incorporates code-specific knowledge, taking code and its accompanying comments to endow the model with better code understanding. As a kind of guidepost, the model draws on both the documentation and developer-assigned identifiers in codebases (e.g., “binarySearch”) that make code more understandable while preserving its semantics.
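To make the two signals concrete, here is a minimal sketch that pulls developer-assigned identifiers and comments out of a snippet using Python's standard tokenize module. It is an illustration only, not CodeT5's actual preprocessing: the function name split_code_signals and the sample snippet are hypothetical, and filtering out keywords and builtins is just one plausible way to approximate "developer-assigned."

```python
import builtins
import io
import keyword
import tokenize


def split_code_signals(source: str) -> tuple[list[str], list[str]]:
    """Return (identifiers, comments) found in Python source, in first-seen order.

    Identifiers are NAME tokens that are neither keywords nor builtins,
    a rough stand-in for names a developer chose.
    """
    identifiers: dict[str, None] = {}  # dict keeps insertion order and dedupes
    comments: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
        elif (
            tok.type == tokenize.NAME
            and not keyword.iskeyword(tok.string)
            and not hasattr(builtins, tok.string)
        ):
            identifiers.setdefault(tok.string, None)
    return list(identifiers), comments


# Hypothetical sample input, echoing the article's "binarySearch" example.
SNIPPET = '''
def binary_search(items, target):
    # Return the index of target in sorted items, or -1.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
'''

identifiers, comments = split_code_signals(SNIPPET)
print(identifiers)
print(comments)
```

Renaming every identifier in the snippet would leave its behavior unchanged, which is why names and comments work as hints about intent rather than as part of the program's logic.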

CodeT5 builds on Google’s T5 (Text-to-Text Transfer Transformer) framework, which was first detailed in a paper published in 2020. T5 reframes natural language processing tasks into a unified text-to-text format, where the input and output data are always strings of text, allowing the same model to be applied to virtually any natural language processing task.
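A small sketch can show what "always strings of text" means in practice. The code below casts three of the coding tasks mentioned earlier as plain input and output strings. The task prefixes, the "[SEP]" separator, the "true" labels, and the names TextToTextExample and make_example are all assumptions made for illustration; they are not taken from the CodeT5 release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    source: str  # the string the model reads
    target: str  # the string the model is trained to emit


def make_example(task_prefix: str, payload: str, target: str) -> TextToTextExample:
    """Prepend a task name so a single model can tell tasks apart."""
    return TextToTextExample(source=f"{task_prefix}: {payload}", target=target)


# Hypothetical prefixes and labels; every task reduces to string in, string out.
examples = [
    # Generation-style task: code in, natural language out.
    make_example("summarize", "def add(a, b): return a + b", "Add two numbers."),
    # Defect detection: the class label is itself emitted as text.
    make_example("detect defect", "char buf[8]; strcpy(buf, user_input);", "true"),
    # Clone detection: two snippets joined by an assumed separator.
    make_example("detect clone", "return a + b [SEP] return b + a", "true"),
]

for example in examples:
    print(f"{example.source!r} -> {example.target!r}")
```

Because even the classification answers are emitted as text, the same model and the same training loop can, in principle, serve both understanding and generation tasks.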

This article originally appeared at on Sep 7, 2021