Large language models (LLMs), such as the GPT-4 model that underpins the widely used conversational platform ChatGPT, have surprised users with their ability to understand written prompts and generate appropriate responses in different languages. Some of us may then wonder: are the texts and responses generated by these models so realistic that they could be confused with those written by humans?
Researchers at the University of California San Diego recently attempted to answer this question by running a Turing test, a well-known method named after computer scientist Alan Turing that is designed to assess the extent to which a machine exhibits human-like intelligence.
The results of this test, presented in a paper posted to the preprint server arXiv, suggest that people have difficulty distinguishing between the GPT-4 model and a human agent when interacting with them in a two-person conversation.
“The idea for this paper actually came from a class that Ben was running on LLMs,” Cameron Jones, co-author of the paper, told Tech Xplore.
“In the first week, we read classic papers about the Turing test and discussed whether an LLM could pass it and whether or not it would matter. As far as I know, no one had tried it at that point, so I decided to build an experiment to test this as part of my class project, and we then ran the first public exploratory experiment.”
The first study, run by Jones and supervised by Bergen, a professor of cognitive science at UC San Diego, yielded some interesting results, suggesting that GPT-4 could pass as human in about 50 percent of interactions. However, their exploratory experiment did not control well for certain variables that could influence the results, so they decided to conduct a second experiment, the results of which are presented in their recent paper.
“As we conducted the studies, we discovered that other people were also doing great work in this area, including the ‘human or not’ game by Jannai et al.,” Jones said. “We created an online version of the two-player game in which human participants would face either another human or an AI model.”
During each trial of the two-player game used by Jones and his colleagues, a human interrogator interacts with a “witness,” who may be either a human or an AI agent. The interrogator asks the witness a series of questions to try to determine whether or not they are human.
“The conversations lasted up to five minutes, and at the end the interrogator gave a verdict on whether they thought the witness was a human or an AI,” Jones explained. “Within those five minutes, participants could talk about whatever they wanted (apart from saying offensive things, which we used a filter to prevent).”
In this two-player online game, the researchers deployed three AI witnesses: the LLMs GPT-4 and GPT-3.5, plus the classic rule-based chatbot ELIZA, included as a baseline. They found that while users could often tell that ELIZA and GPT-3.5 were machines, their ability to determine whether GPT-4 was a human or a machine was no better than chance (that is, random guessing).
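As a rough illustration of what “no better than chance” means statistically, the sketch below runs an exact two-sided binomial test of an interrogator accuracy rate against the 50 percent guessing baseline. This is not the authors’ analysis, and the counts used are hypothetical placeholders rather than the study’s data.

```python
# Hedged sketch: is an observed accuracy distinguishable from 50% guessing?
# The counts below are illustrative placeholders, not figures from the paper.
from math import comb

def two_sided_binomial_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are no more likely than the observed one."""
    probs = [comb(trials, k) * p**k * (1 - p)**(trials - k) for k in range(trials + 1)]
    observed = probs[successes]
    return sum(pr for pr in probs if pr <= observed + 1e-12)

# Hypothetical example: 52 correct human/AI verdicts out of 100 GPT-4 conversations.
print(two_sided_binomial_p(52, 100))  # ~0.76, i.e., consistent with random guessing
```

With a p-value that large, such an accuracy rate would give no evidence that interrogators can do better than flipping a coin.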
“Although real human witnesses were more successful, persuading interrogators that they were human two-thirds of the time, our results suggest that in the real world people might not be able to reliably tell whether they are talking to a human or an AI system,” Jones said.
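To put those two figures side by side, here is a back-of-the-envelope sketch, not taken from the paper: assuming, purely for illustration, that an interrogator faces AI and human witnesses equally often, the reported rates imply only modest overall accuracy.

```python
# Rates reported in the article: GPT-4 was judged human in about 50% of its
# conversations, while real humans were judged human about two-thirds of the time.
# The 50/50 witness mix below is an assumption made only for this illustration.
p_correct_on_ai = 1 - 0.50       # the AI is correctly labelled "AI" only half the time
p_correct_on_human = 2 / 3       # humans are correctly labelled "human" two-thirds of the time
overall_accuracy = 0.5 * p_correct_on_ai + 0.5 * p_correct_on_human
print(f"Expected overall accuracy: {overall_accuracy:.0%}")  # roughly 58%
```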
“In fact, in the real world, people might be less aware of the possibility that they are talking to an AI system, so the rate of deception might be even higher. I think this could have implications for the kinds of things AI systems will be used for, whether automating customer-facing jobs or being used for fraud or misinformation.”
Jones and Bergen’s Turing test results suggest that LLMs, particularly GPT-4, are barely distinguishable from humans during brief conversations. These observations suggest that people may soon become increasingly suspicious of others they interact with online, as they can no longer be sure whether they are humans or bots.
The researchers now plan to update and reopen the public Turing test they designed for this study, to test some additional hypotheses. Their future work may gather other interesting insights into the extent to which people can distinguish between humans and LLMs.
“We want to offer a three-person version of the game, where the interrogator talks to a human and an AI system simultaneously and has to figure out who is who,” Jones added.
“We also want to test other types of AI setups, for example allowing agents to access live news and weather, or giving them a ‘notepad’ where they can take notes before responding. Finally, we want to test whether AI’s persuasive abilities extend to other areas, like convincing people to believe lies, vote for specific policies, or donate money to a cause.”
More information:
Cameron R. Jones et al, People cannot tell GPT-4 apart from a human in a Turing test, arXiv (2024). DOI: 10.48550/arxiv.2405.08007
© 2024 Science X Network
Citation: People struggle to distinguish humans from ChatGPT in five-minute chat conversations, tests show (June 16, 2024) retrieved June 17, 2024 from https://techxplore.com/news/2024-06-people-struggle-humans-chatgpt-minute.html
This document is subject to copyright. Except for fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.