This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
Programmers have spent decades writing code for AI models, and now, in a full-circle moment, AI is being used to write code. But how does an AI code generator compare to a human programmer?
A study published in the June issue of IEEE Transactions on Software Engineering evaluated code generated by OpenAI's ChatGPT for functionality, complexity, and security. The results show that ChatGPT has an extremely wide range of success when it comes to producing functional code—with success rates anywhere from as poor as 0.66 percent to as good as 89 percent—depending on the difficulty of the task, the programming language, and a number of other factors.
While in some cases an AI generator can generate better code than humans, the analysis also reveals some security concerns with AI-generated code.
Yutian Tang is a lecturer at the University of Glasgow who was involved in the study. He notes that AI-based code generation can provide some benefits in terms of increasing productivity and automating software development tasks—but it's important to understand the strengths and limitations of these models.
“By conducting a comprehensive analysis, we can uncover potential problems and limitations in ChatGPT-based code generation… [and] improve generation techniques,” explains Tang.
To explore these limitations in more detail, his team tested GPT-3.5's ability to solve 728 coding problems from the LeetCode testing platform in five programming languages: C, C++, Java, JavaScript, and Python.
“A reasonable hypothesis for why the ChatGPT algorithm might do better with problems before 2021 is that these problems are seen more frequently in the training dataset.” Yutian Tang, University of Glasgow
Overall, ChatGPT was pretty good at solving problems in the various coding languages—but especially when attempting problems that existed on LeetCode before 2021. For instance, it was able to produce functional code for easy, medium, and hard problems with success rates of about 89, 71, and 40 percent, respectively.
“However, when it comes to the algorithmic problems after 2021, ChatGPT's ability to generate functionally correct code is affected. It sometimes fails to understand the meaning of questions, even for easy-level problems,” Tang notes.
For example, ChatGPT's ability to produce functional code for “easy” coding problems dropped from 89 percent to 52 percent after 2021. And its ability to generate functional code for “hard” problems dropped from 40 percent to 0.66 percent after this time as well.
“A reasonable hypothesis for the ChatGPT algorithm to perform better with problems before 2021 is that these problems are frequently seen in the training dataset,” says Tang.
Essentially, as coding problems evolve, ChatGPT has not yet been exposed to the new problems and solutions. It lacks the critical thinking skills of a human and can only address problems it has previously encountered. This could explain why it is so much better at solving older coding problems than newer ones.
“ChatGPT can generate incorrect code because it doesn't understand the meaning of algorithmic problems.” Yutian Tang, University of Glasgow
Interestingly, ChatGPT is able to generate code with smaller runtime and memory overheads than at least 50 percent of human solutions to the same LeetCode problems.
The researchers also explored the ability of ChatGPT to fix its own coding errors after receiving feedback from LeetCode. They randomly selected 50 coding scenarios where ChatGPT initially generated incorrect code, either because it didn't understand the content or the problem at hand.
While ChatGPT was good at fixing compilation errors, it was generally not good at fixing its own errors.
“ChatGPT may generate incorrect code because it does not understand the meaning of algorithmic problems, thus, this simple error feedback information is not enough,” explains Tang.
The researchers also found that the code generated by ChatGPT did have a fair number of vulnerabilities, such as a missing null test, but many of these were easily fixable. Their results also show that generated code in C was the most complex, followed by C++ and Python, which has a complexity similar to that of human-written code.
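As an illustration of the kind of flaw the researchers describe—this is a hypothetical sketch, not code from the study—a “missing null test” happens when generated code assumes a lookup always succeeds and dereferences the result without checking it first:

```python
# Hypothetical example of a "missing null test" vulnerability (not from the study).

def shipping_cost_unsafe(orders, order_id):
    # dict.get returns None when order_id is absent; indexing into None
    # then raises a TypeError at runtime -- the missing null test.
    order = orders.get(order_id)
    return order["weight"] * order["rate"]

def shipping_cost_safe(orders, order_id):
    # The easy fix: test for None before dereferencing the result.
    order = orders.get(order_id)
    if order is None:
        return None
    return order["weight"] * order["rate"]
```

The fix is a single guard clause, which is consistent with the researchers' observation that many of the vulnerabilities they found were easy to repair.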
Tang says that, based on these findings, it's important for developers using ChatGPT to provide additional information to help ChatGPT better understand problems or avoid vulnerabilities.
“For example, when encountering more complex programming problems, developers can provide relevant knowledge as much as possible, and tell ChatGPT in the prompt which potential vulnerabilities to be aware of,” says Tang.