
Which generative AI model did best on the CPA Exam? Depends on section


ChatGPT (after bombing as 3.5 and then passing as 4.0) is no longer the only large language model to pass the CPA Exam, but it remains the top performer overall. Like any human accountant, however, it has its strengths and weaknesses.

These were part of the findings of a recent paper from Case Western Reserve University and accounting automation solutions provider AIGENCY. The researchers systematically evaluated the performance of Google Gemini, ChatGPT-4, Claude, Mixtral, and Llama-2b on multiple-choice questions from CPA test preparation tools.

Overall, they found that ChatGPT-4 scored the best, with Claude 3-opus coming in a close second, followed by Google Gemini Advanced, then Mixtral-8x7b-32768. Llama 2B-70b-4096 did the worst.

Source: William Zacher Jr. & Sanmukh Kuppannagari

However, as the results show, not every model did uniformly well on all sections. ChatGPT, while a strong performer overall, was especially good at BAR. Meanwhile, though its weakest section was REG, it still did better on that section than any other model. Claude was the best performer in the AUD section. Though its weakest section was FAR, even there its performance was second only to ChatGPT. Gemini was the second-strongest performer in BAR, but did not do so well on REG. Mixtral, overall, had decent enough scores compared to a human but would only pass BAR, making it a mediocre player versus its peers. Llama was the only one that would not pass any section, and it did especially poorly on REG. It was also the only one that did worse than a human: the average score for human test takers on REG was 59.19%, according to the paper.

“The study revealed that while some LLMs have made significant advances in mimicking the complex decision-making skills required for CPA exams, there remains variability in performance across different sections of the test. This variability underlines the importance of tailored training and specialization in developing LLMs for professional applications such as the CPA exams,” said the paper.

To perform the test, the researchers drew their multiple-choice questions from the Becker CPA test preparation suite. Google Gemini, Claude, and ChatGPT-4 were accessed via their online platforms. Mixtral and Llama-2b models were accessed through the Groq platform, an advanced computational infrastructure for high-speed AI processing. The questions were copied and pasted directly into the AI platforms from Becker’s test preparation material without any additional prompting or modification, to ensure that each AI model received the questions in their original form as they would appear in a CPA exam context.

Becker’s platform randomized the questions in batches of 15, which the researchers said further mitigated potential selection bias. The tester, responsible for inputting the questions into the AI models, deliberately refrained from reading or evaluating the questions beforehand to prevent any unconscious bias in the prompting process. For each question, the tester selected the AI model’s first response marked as ‘correct,’ irrespective of any variations in the explanations or outputs provided by different models.

Each AI model was subjected to each MCQ section of the CPA test three times, allowing for a comprehensive assessment of its performance across multiple attempts. The criterion for determining an AI model’s success in this study was achieving a passing score, defined as an average score of 75 or higher, on any given section.
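The pass criterion described above comes down to simple arithmetic: average a model's three attempts on a section and compare the result with 75. The Python sketch below illustrates that reading. The function names, the section labels, and all of the scores are hypothetical placeholders, not figures or code from the paper.

```python
from statistics import mean

# Passing threshold as described in the article: an average of 75 or higher.
PASSING_SCORE = 75.0


def section_average(attempt_scores):
    """Return the mean percentage score across a model's attempts on one section."""
    if not attempt_scores:
        raise ValueError("at least one attempt score is required")
    return mean(attempt_scores)


def passes_section(attempt_scores, threshold=PASSING_SCORE):
    """Return True if the average across attempts meets the passing threshold."""
    return section_average(attempt_scores) >= threshold


# Hypothetical scores for illustration only; these are NOT results from the paper.
example_scores = {
    "SECTION_A": [80.0, 73.3, 86.7],  # one attempt below 75, average above it
    "SECTION_B": [60.0, 66.7, 73.3],  # every attempt below 75
}

results = {section: passes_section(scores) for section, scores in example_scores.items()}

for section, passed in results.items():
    print(section, "pass" if passed else "fail")
```

Under this reading, a single attempt below 75 does not sink a model on a section as long as the three-attempt average still reaches the threshold.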

The researchers said the data indicates there is no one universal model for all tasks, so it is important to use the right model for the right applications. For example, the paper concluded that ChatGPT is “The only real option for zero-shot BAR automation,” as “No other model came close to its performance, and it had a relatively narrow variance,” meaning that ChatGPT-4 could be used to help with automated financial statement preparation or additional forecasting. On the other hand, the researchers said Claude was probably better on auditing-related tasks, which the paper said “is a solid indication that it can be used for fraud detection and internal control validation.”

“It is apparent from the results that there is no clear-cut winner. Most companies utilizing AI to perform financial administration functions should use a software infrastructure that allows them to use multiple task-dependent AI models,” said the paper’s conclusion.

However, the researchers did advise that “Model selection for AI in an applied accounting setting should avoid Llama-2B, which performed worse than any other model in every section.”


