GPT-4 didn’t ace the bar exam after all, MIT research suggests — it didn’t even break the 70th percentile
GPT-4 didn’t actually score in the top 10% on the bar exam after all, new research suggests.
OpenAI made the claim in March last year for GPT-4, the large language model (LLM) that powers its chatbot ChatGPT, and the announcement sent shock waves through the web and the legal profession.
Now, a new study has revealed that the much-hyped 90th-percentile figure was skewed toward repeat test takers who had already failed the exam one or more times, a group that tends to score much lower than test takers overall. The researcher published his findings March 30 in the journal Artificial Intelligence and Law.
“It seems the most accurate comparison would be against first-time test takers, or, to the extent that you think that the percentile should reflect GPT-4’s performance as compared to an actual lawyer, then the most accurate comparison would be to those who pass the exam,” study author Eric Martínez, a doctoral student at MIT’s Department of Brain and Cognitive Sciences, said at a New York State Bar Association continuing legal education course.
To arrive at its claim, OpenAI cited a 2023 study in which researchers had GPT-4 answer questions from the Uniform Bar Examination (UBE). The model’s results were impressive: It scored 298 out of 400, which placed it in the top tenth of exam takers.
But it turns out the artificial intelligence (AI) model scored in the top 10% only when compared with repeat test takers. When Martínez compared the model’s performance against broader groups, the LLM scored in the 69th percentile of all test takers and in the 48th percentile of those taking the test for the first time.
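To see why the reference population matters so much, here is a toy calculation. The distributions below are invented for illustration (none of the means, spreads, or mixture weights come from the NCBE or from the study); the sketch only shows the mechanics of how one fixed score lands at different percentiles depending on the comparison group.

```python
from statistics import NormalDist

score = 298  # GPT-4's reported UBE score (out of 400)

# Hypothetical cohorts: repeat takers tend to score lower than first-timers.
repeat_takers = NormalDist(mu=255, sigma=25)
first_timers = NormalDist(mu=285, sigma=25)

def pct(dist: NormalDist) -> float:
    """Percentile of `score` in a cohort: the share scoring at or below it."""
    return 100 * dist.cdf(score)

# "All takers" modeled as a mixture (say, 30% repeat / 70% first-time);
# a mixture's CDF is the weighted sum of its components' CDFs.
all_takers = 0.3 * pct(repeat_takers) + 0.7 * pct(first_timers)

print(f"vs. repeat takers: {pct(repeat_takers):.0f}th percentile")
print(f"vs. all takers:    {all_takers:.0f}th percentile")
print(f"vs. first-timers:  {pct(first_timers):.0f}th percentile")
```

The ordering mirrors the article’s figures qualitatively: the weaker the comparison cohort, the more impressive the same raw score appears.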
Martínez’s study also suggested that the model’s results ranged from mediocre to below average in the essay-writing section of the test. It landed in the 48th percentile of all test takers and in the 15th percentile of those taking the test for the first time.
To investigate the results further, Martínez had GPT-4 retake the test according to the parameters set by the authors of the original study. The UBE consists of three components: the multiple-choice Multistate Bar Examination (MBE); the Multistate Performance Test (MPT), which asks examinees to perform various lawyering tasks; and the written Multistate Essay Examination (MEE).
Martínez was able to replicate GPT-4’s score on the multiple-choice MBE but spotted “several methodological issues” in the grading of the MPT and MEE sections of the exam. He noted that the original study did not use the essay-grading guidelines set by the National Conference of Bar Examiners, which administers the bar exam. Instead, the researchers simply compared GPT-4’s answers to “good answers” from the state of Maryland.
This matters because, Martínez said, the essay-writing section is the bar exam’s closest proxy for the tasks a practicing lawyer performs, and it was the section in which the AI performed worst.
“Although the leap from GPT-3.5 was undoubtedly impressive and very much worthy of attention, the fact that GPT-4 particularly struggled on essay writing compared to practicing lawyers indicates that large language models, at least on their own, struggle on tasks that more closely resemble what a lawyer does on a daily basis,” Martínez said.
With a total of 298, GPT-4 sat comfortably above the minimum passing score, which varies from state to state between 260 and 272, so its essay score would have to drop disastrously for the model to fail the overall exam. But an essay-score drop of just nine points would drag the model into the bottom quarter of MBE takers and beneath the fifth percentile of licensed attorneys, according to the study.
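That outsized percentile swing is less surprising once you note that percentiles compress near the middle of a tightly clustered distribution, where a small raw-score change steps past many test takers. The short sketch below makes the point with an invented cohort (the mean and spread are hypothetical, not figures from the NCBE or the study):

```python
from statistics import NormalDist

# Invented score distribution for licensed attorneys; illustrative only.
attorneys = NormalDist(mu=305, sigma=20)

for score in (298, 289):  # before and after a hypothetical nine-point drop
    percentile = 100 * attorneys.cdf(score)
    print(f"score {score}: {percentile:.0f}th percentile of this cohort")
```

In this toy cohort, the nine-point drop costs roughly 15 percentile points; the more tightly packed the real distribution, the larger that loss becomes.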
Martínez said his findings revealed that, while undoubtedly still impressive, current AI systems should be carefully evaluated before they are relied on in legal settings, to avoid deploying them “in an unintentionally harmful or catastrophic manner.”
The warning appears to be timely. Despite their tendency to produce hallucinations — fabricating facts or connections that don’t exist — AI systems are being considered for multiple applications in the legal world. For example, on May 29, a federal appeals court judge suggested that AI programs could help interpret the contents of legal texts.
In response to an email about the study’s findings, an OpenAI spokesperson referred Live Science to “Appendix A on page 24” of the GPT-4 technical report. The relevant line there reads: “The Uniform Bar Exam was run by our collaborators at CaseText and Stanford CodeX.”