“Since the rise of major language models like ChatGPT, there have been many anecdotal reports of students submitting AI-generated work as exam assignments and getting good grades. So, we tested our university's examination system against AI cheating in a controlled experiment,” says researcher Peter Scarfe at the University of Reading's School of Psychology and Clinical Language Sciences.
His team created more than 30 fake student accounts and used them to submit ChatGPT-4-generated answers to exam questions. The anecdotal reports were correct: the use of AI went largely undetected, and, on average, ChatGPT scored better than the human students.
Rules of Engagement
Scarfe's team submitted AI-generated work in five undergraduate modules, covering classes taken during all three years of study for a bachelor's degree in psychology. Assignments were either 200-word answers to short questions or more extensive essays of roughly 1,500 words. “The markers of the exams were not aware of the experiment. In a way, the participants of the study did not know they were participating in the study, but we got the necessary permissions to go ahead with it,” Scarfe says.
Short submissions were prepared by simply copying and pasting the exam questions into ChatGPT-4 along with a prompt stating that the answer should be under 160 words. Essays were solicited the same way, but the requested word count was increased to 2,000. Setting the limits this way was how Scarfe's team got ChatGPT-4 to produce content close enough to the desired length. “We had a standardized range of prompts, and these responses were submitted unedited, apart from some minimal formatting we applied,” says Scarfe.
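The prompting recipe described here can be sketched in a few lines of Python. The helper name and the exact prompt wording below are illustrative assumptions, not the team's actual script; only the word targets come from the article.

```python
def build_prompt(question: str, kind: str = "short") -> str:
    """Build a ChatGPT-4 prompt following the study's recipe:
    paste the exam question verbatim, then append a word target.

    The targets deliberately differ from the assignments' nominal
    lengths (200 / 1,500 words): asking for under 160 words tended
    to yield about 200, and asking for 2,000 yielded about 1,500,
    because the model does not hit a stated length exactly.
    """
    targets = {"short": 160, "essay": 2000}
    if kind not in targets:
        raise ValueError(f"unknown submission kind: {kind!r}")
    return f"{question}\nAnswer in no more than {targets[kind]} words."
```

From there, per the quote, the model's output would go into the exam system unedited apart from formatting.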
In total, Scarfe and his colleagues fed 63 AI-generated submissions into the examination system. Even without any editing or attempt to hide the use of AI, 94 percent of them went undetected, and about 84 percent received better grades than a randomly selected group of students taking the same exam (about half a grade boundary better, on average).
“We had a series of detailed meetings with the people who marked these exams, and they were quite surprised,” says Scarfe. Part of the reason for their surprise was that most of the AI submissions that were detected weren't flagged for being repetitive or robotic; they were flagged for being too good.
Which raises the question: what do we do about it?
AI-hunting software
“During this study we did a lot of research on AI-generated content detection techniques,” says Scarfe. One such tool is GPTZero; others include AI-writing-detection systems such as the one made by Turnitin, a company that specializes in plagiarism-detection tools.
“The problem with tools like this is that they usually perform well in the lab, but their performance drops significantly in the real world,” Scarfe explained. OpenAI itself acknowledged that its own AI classifier flagged AI-written text as “likely AI” only 26 percent of the time, with a rather alarming 9 percent false-positive rate. Turnitin's system, on the other hand, was advertised as detecting 97 percent of ChatGPT- and GPT-3-authored writing in the lab, with only one false positive per hundred attempts. But according to Scarfe's team, the released beta version of the system performed significantly worse.
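The gap between lab and real-world performance largely comes down to base rates. A quick sketch of the expected outcomes, using the published 26 percent detection rate and 9 percent false-positive rate; the total submission count and the share of AI-written work below are illustrative assumptions, not figures from the study:

```python
def detector_outcomes(n_submissions: int, ai_share: float,
                      tpr: float = 0.26, fpr: float = 0.09):
    """Expected flag counts for a detector with the given
    true-positive rate (tpr) and false-positive rate (fpr)."""
    ai = n_submissions * ai_share       # submissions actually AI-written
    human = n_submissions - ai          # genuinely human submissions
    caught = ai * tpr                   # AI work correctly flagged
    falsely_accused = human * fpr       # human work wrongly flagged
    precision = caught / (caught + falsely_accused)
    return caught, falsely_accused, precision

# Illustrative assumption: 1,000 submissions, 10% of them AI-generated.
caught, accused, precision = detector_outcomes(1000, 0.10)
```

Under these assumed numbers, roughly 81 honest students would be flagged against only about 26 cheaters caught, so fewer than a quarter of all flags would point at actual AI use, which is why a detector that looks acceptable in the lab can be unusable for real disciplinary decisions.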