ChatGPT struggles to evaluate heart risk—but it could still help cardiologists
The authors had ChatGPT-4 review each randomized case five times, asking it for a risk assessment based on the patient variables provided. They wanted to learn how closely ChatGPT’s responses would correlate with TIMI and HEART scores, and how consistent its answers would be across the five reviews of the same case.
Overall, the team found that ChatGPT-4 “showed high correlation” with the two risk scores. However, the LLM frequently delivered different risk scores when reviewing the same patient case multiple times. In addition, when repeatedly reviewing data from the third dataset, which featured 44 health variables, ChatGPT-4 often disagreed with its own previous responses.
“ChatGPT was not acting in a consistent manner,” Heston said in a statement. “Given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally, it would go as far as giving a high risk.”
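For readers who want a concrete picture of what “consistent” could mean here, the following is a minimal Python sketch of one way to quantify agreement across repeated reviews of the same case. It is not the authors’ method. The function names and the example labels are hypothetical and made up for illustration; none of the values come from the study.

```python
from collections import Counter


def agreement_rate(runs):
    """Fraction of repeated runs that match the most common label for one case."""
    counts = Counter(runs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(runs)


def fully_consistent_share(cases):
    """Share of cases in which every repeated run returned the same label."""
    consistent = sum(1 for runs in cases if len(set(runs)) == 1)
    return consistent / len(cases)


# Hypothetical risk labels for three made-up cases, five runs each.
# These are illustrative values only, NOT data from the study.
example_cases = [
    ["low", "low", "low", "low", "low"],
    ["low", "intermediate", "low", "high", "low"],
    ["intermediate", "intermediate", "high", "intermediate", "intermediate"],
]

if __name__ == "__main__":
    for index, runs in enumerate(example_cases, start=1):
        print(f"case {index}: agreement rate = {agreement_rate(runs):.2f}")
    print(f"fully consistent cases: {fully_consistent_share(example_cases):.2f}")
```

In this sketch, a case that is rated low, intermediate and high across its five runs gets a low agreement rate, which is the kind of variation Heston describes.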
According to the authors, this inconsistency could be seen as a good thing when turning to ChatGPT for other uses. In medicine, however, consistent answers are vital.
“We found there was a lot of variation, and that variation in approach can be dangerous,” Heston said. “It can be a useful tool, but I think the technology is going a lot faster than our understanding of it, so it’s critically important that we do a lot of research, especially in these high-stakes clinical situations.”
Even so, the group concluded its study with a positive perspective on the potential of LLMs such as ChatGPT.
“ChatGPT could be excellent at creating a differential diagnosis and that’s probably one of its greatest strengths,” Heston said. “If you don’t quite know what’s going on with a patient, you could ask it to give the top five diagnoses and the reasoning behind each one. So it could be good at helping you think through a problem, but it’s not good at giving the answer.”
Lawrence M. Lewis, MD, an emergency medicine specialist with Washington University in St. Louis, served as the study’s co-author.
ChatGPT’s latest update promises improvements in ‘logical reasoning’
In April, OpenAI announced the launch of a new GPT-4 Turbo with “improved capabilities in writing, math, logical reasoning and coding.” Would this latest iteration deliver better heart risk assessments? Researchers are likely already working toward an answer.
Paid ChatGPT users now have access to GPT-4 Turbo.