GPT-4 has passed the Turing test, researchers claim
We are interacting with artificial intelligence (AI) online more than ever, and more than we realize, so researchers asked people to converse with four agents, including one human and three different kinds of AI models, to see whether they could tell the difference.
The “Turing test,” first proposed as “the imitation game” by computer scientist Alan Turing in 1950, judges whether a machine’s ability to show intelligence is indistinguishable from a human’s. For a machine to pass the Turing test, it must be able to talk to somebody and fool them into thinking it is human.
Scientists decided to replicate this test by asking 500 people to speak with four respondents: a human, the 1960s-era AI program ELIZA, and both GPT-3.5 and GPT-4, the AI that powers ChatGPT. The conversations lasted five minutes, after which participants had to say whether they believed they were talking to a human or an AI. In the study, published May 9 on the preprint server arXiv, the scientists found that participants judged GPT-4 to be human 54% of the time.
ELIZA, a system pre-programmed with responses but with no large language model (LLM) or neural network architecture, was judged to be human just 22% of the time. GPT-3.5 scored 50% while the human participant scored 67%.
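To make the contrast with language models concrete, here is a minimal Python sketch of a rule-based responder in the general style of a system "pre-programmed with responses." The patterns, reply templates and the `respond` function are hypothetical illustrations, not ELIZA's original script or the setup used in the study.

```python
import re

# Hypothetical rules for illustration only; these are NOT ELIZA's original script.
# Each rule pairs a pattern with a canned reply template.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father|family)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]

# Used when no rule matches; the program has nothing else to fall back on.
FALLBACK = "Please go on."


def respond(message: str) -> str:
    """Return the canned reply for the first matching rule, else the fallback."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured fragment of the user's message back in the reply.
            return template.format(*match.groups())
    return FALLBACK


if __name__ == "__main__":
    for line in ["I feel tired.", "My mother called", "What is the weather?"]:
        print(f"> {line}\n{respond(line)}")
```

Anything outside the fixed rules gets the same fallback line, which hints at why such a program may struggle to hold up over a five-minute conversation.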
“Machines can confabulate, mashing together plausible ex-post-facto justifications for things, as humans do,” Nell Watson, an AI researcher at the Institute of Electrical and Electronics Engineers (IEEE), told Live Science.
“They can be subject to cognitive biases, bamboozled and manipulated, and are becoming increasingly deceptive. All these elements mean human-like foibles and quirks are being expressed in AI systems, which makes them more human-like than previous approaches that had little more than a list of canned responses.”
The study — which builds on decades of attempts to get AI agents to pass the Turing test — echoed common concerns that AI systems deemed human will have “widespread social and economic consequences.”
The scientists also argued that there are valid criticisms that the Turing test is too simplistic in its approach, saying “stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.” This suggests that we have been looking in the wrong place for machine intelligence.
“Raw intellect only goes so far. What really matters is being sufficiently intelligent to understand a situation, the skills of others and to have the empathy to plug those elements together. Capabilities are only a small part of AI’s value — their ability to understand the values, preferences and boundaries of others is also essential. It’s these qualities that will let AI serve as a faithful and reliable concierge for our lives.”
Watson added that the study represented a challenge for future human-machine interaction and that we will become increasingly paranoid about the true nature of interactions, especially in sensitive matters. She added that the study highlights how AI has changed during the GPT era.
“ELIZA was limited to canned responses, which greatly limited its capabilities. It might fool someone for five minutes, but soon the limitations would become clear,” she said. “Language models are endlessly flexible, able to synthesize responses to a broad range of topics, speak in particular languages or sociolects and portray themselves with character-driven personality and values. It’s an enormous step forward from something hand-programmed by a human being, no matter how cleverly and carefully.”