Are AI Benchmarks Reliable?
- AI companies have been using benchmarks to market their products and services as the best in the business, claiming to have one-upped their competitors.
- While AI benchmarks offer a measure of large language models’ technical prowess, are they reliable differentiators for the models that form the basis of generative AI tools?
The advent of the age of generative AI raises a pertinent question: Which large language model (LLM) is the best of the bunch? More importantly, how do we measure it?
AI benchmarks can be tricky for LLM tools, considering they need to be tested for accuracy, truthfulness, relevance, context, and other subjective parameters, as opposed to hardware, where compute speed is a defining criterion.
Over the years, several AI benchmarks have been created as technical tests designed to assess specific functions, such as question-answering, reasoning, coding, text generation, image generation, etc.
AI benchmarks are also meant to enable objective comparison, capability assessment (such as summarization and inference), generalization testing, robustness checks on complex language constructs, and progress tracking.
AI companies have been using these tests to market their products and services as the best in the business, claiming to have one-upped their competitors. LLMs released recently have already surpassed humans on several benchmarks. On others, they have yet to match up to us.
For instance, Gemini Ultra topped the Massive Multitask Language Understanding (MMLU) benchmark with a score of 90%, followed by Claude 3 Opus (88.2%), Leeroo (86.64%), and GPT-4 (86.4%). MMLU is a knowledge test spanning 57 subjects, including elementary mathematics, U.S. history, computer science, and law.
Meanwhile, Claude 3 Opus scored just over 50% in scientific reasoning on the graduate-level GPQA benchmark. GPT-4 Turbo (April 2024 version) scored 46.5%, and GPT-4 Turbo (January 2024 version) scored over 43%.
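For context, a headline figure like the 90% above is usually plain accuracy over a set of multiple-choice questions. The sketch below is only an illustration of how such a number is typically computed, not the official MMLU harness; the ask_model() helper is a hypothetical stand-in for a call to the LLM under test.

```python
def ask_model(question: str, options: list[str]) -> str:
    """Hypothetical call to the LLM under test; expected to return 'A', 'B', 'C', or 'D'."""
    raise NotImplementedError  # swap in a real model or API call here


def score_benchmark(items: list[dict]) -> float:
    """items: [{'question': ..., 'options': [...], 'answer': 'C'}, ...]"""
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)  # e.g., 0.90 is reported as "90%"
```

Everything else in a benchmark claim, such as how the question is prompted and how the answer is extracted, sits outside this simple loop, which is where many of the inconsistencies discussed below creep in.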
So, there is some truth to the claim that AI tools are approaching what we envisioned. However, since AI benchmarks offer task-specific evaluations, their usefulness for comparing domain-agnostic, general-purpose applications remains limited. Can they really tell us whether LLMs are where we expected them to be?
Spiceworks News & Insights examines why AI benchmarks are inconsistent in their assessments and inappropriate for comparison.
Limitations of AI Benchmarks
AI benchmarks have multiple challenges associated with providing a general comparison of LLMs. These include:
1. Lack of standardization
Ralph Meier, manager of engines and algorithms at Hyland, told Spiceworks that AI benchmarks lack appropriate standardization because of their diverse applications and requirements, a lack of consensus on evaluation criteria (especially for responsible AI capabilities such as transparency, explainability, and data privacy), and resource constraints.
“AI systems are being applied to a wide range of domains and tasks, each with their own specific requirements and nuances. Developing standardized benchmarks that can accurately capture the performance and limitations of AI models across all these diverse applications is a significant challenge,” Meier said.
“Evaluating cutting-edge AI models can be prohibitively expensive and time-consuming, especially for independent researchers or smaller organizations. There is a tendency for broader adoption of open-source benchmarks (e.g., the ones listed above), but this comes with the added risk of training data set contamination with information used or associated with certain benchmarks.”
Rakesh Yadav, founder and CEO of Aidaptive, expects AI benchmarking standardization in some domains. “In the coming years, I expect AI benchmarks to be established for at least a narrow set of use cases and that, eventually, there will be a standard process for adapting benchmarks continuously with innovation.”
2. Most AI benchmarks are outdated
The breakneck speed of LLM development over the past few years has made “it difficult for benchmarks to keep up with the latest advancements and capabilities. By the time a benchmark is developed and adopted, newer models may have already surpassed its scope, leading to inconsistencies in evaluation,” Meier added.
For instance, a report co-authored by the state-run Institute of Scientific and Technical Information of China noted that U.S. organizations released 11 LLMs in 2020, 30 in 2021, and 37 in 2022. Chinese companies, meanwhile, released 2, 30, and 28 LLMs in 2020, 2021, and 2022, respectively.
By May 2023, U.S. companies had rolled out 18 LLMs, while Chinese companies had launched 18 of their own.
“There is a need for updated benchmarks that can assess the end-to-end performance of AI systems in real-world applications, including pre-processing, post-processing, and interactions with other systems and humans. This will help bridge the gap between narrow task-specific benchmarks and the broader requirements of deploying AI solutions in complex, dynamic environments,” Meier said.
“Overall, while existing benchmarks have played a crucial role in advancing AI research and development, the rapid progress in the field, particularly in generative AI, necessitates the creation of new, more comprehensive, and transparent benchmarks that can better evaluate the capabilities and limitations of the latest AI models.”
3. Vested interests
Yadav repeatedly highlighted that current AI benchmarks are created by organizations with a specific profit-making agenda. Most of the more prominent technology companies have invested billions into AI research or in companies that build AI tools and services. “Currently, these benchmarks are being built by corporations that have profit-based motivations and are inherently biased by their own business needs (rightfully so),” Yadav said.
“Ideally, there would be government-funded benchmarks or standards established by a consortium of large corporations, without any biases, that are under constant research to ensure these standards are updated alongside new developments. That said, this is a developing field under heavy innovation.”
4. Benchmark-specific issues
The picture AI benchmarks paint is often skewed, considering that specific prompt engineering techniques can manipulate the results. An LLM’s response, measured as its performance, is contingent on how the prompt is formulated.
Google was criticized for claiming Gemini Ultra outperformed OpenAI’s GPT-4. The criticism (and ridicule by some) stemmed from the company’s use of a chain-of-thought prompt engineering technique (CoT@32) to obtain its headline MMLU score, rather than the standard 5-shot prompting under which GPT-4’s score was reported.
“This is pretty weird. Usually when you benchmark… you compare the results of the same exact test… Took someone else mentioning this for me to notice,” Bryan Kyritz (@kyritzb) posted on X on December 6, 2023.
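To see why scores obtained under different prompting regimes are hard to compare, here is a rough illustration of the two styles at issue. The example questions and helper names are hypothetical; CoT@32 additionally samples 32 reasoning chains and takes a majority vote, which the standard 5-shot setup does not.

```python
FEW_SHOT_EXAMPLES = [
    ("What is 7 * 8?", "56"),
    ("Which planet is known as the Red Planet?", "Mars"),
    # ... a true 5-shot prompt would include five solved examples
]


def five_shot_prompt(question: str) -> str:
    """Standard k-shot prompting: show solved examples, then pose the new question."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{demos}\nQ: {question}\nA:"


def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought prompting: ask the model to reason step by step first.

    CoT@32 goes further still, sampling 32 reasoning chains per question and
    taking a majority vote over the final answers.
    """
    return f"Q: {question}\nLet's think step by step, then state the final answer."
```

The same model can post noticeably different accuracy depending on which of these prompts the evaluator chooses, which is why comparing a CoT@32 number against a 5-shot number drew criticism.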
According to Meier, AI benchmarks can also be biased and limited. “Benchmarks often have inherent biases and limitations that can skew the results. For instance, multiple-choice tests can be brittle. Even small changes like reordering the answer choices can significantly impact an LLM’s score.”
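One way to probe the brittleness Meier describes is to re-score the same questions with the answer choices shuffled and see how far accuracy moves. The sketch below reuses the hypothetical ask_model() helper from the earlier example and is only an illustration of the idea, not a standard evaluation procedure.

```python
import random


def shuffled_accuracy(items: list[dict], seed: int = 0) -> float:
    """Re-score a multiple-choice set with the options reordered.

    A robust model's accuracy should barely move relative to the original
    ordering; a brittle one can swing noticeably.
    """
    rng = random.Random(seed)
    correct = 0
    for item in items:
        options = item["options"][:]                         # e.g., ["Paris", "Rome", ...]
        answer_text = options[ord(item["answer"]) - ord("A")]  # original gold option
        rng.shuffle(options)                                  # reorder the choices
        new_answer = "ABCD"[options.index(answer_text)]       # gold letter after shuffling
        prediction = ask_model(item["question"], options)
        if prediction == new_answer:
            correct += 1
    return correct / len(items)
```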
Meanwhile, an LLM is only as good as the data it was trained on. Test data often differs from real-world data, which can become a big issue after the models launch. “Many LLMs are trained on vast datasets that may overlap with the data used in benchmarks. This can lead to models simply memorizing and regurgitating test examples rather than demonstrating a true understanding of the underlying task. Consequently, high benchmark scores may not accurately reflect real-world performance.”
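Contamination of this kind is often estimated by searching for long n-gram overlaps between benchmark items and the training corpus. The following is a simplified sketch of that idea, my illustration rather than a method attributed to Meier or any particular lab; the 13-gram window is a common convention, not a fixed rule.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Collect all word-level n-grams from a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def looks_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in the training data."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)
```

At the scale of real training corpora this check is done with indexed lookups rather than a linear scan, but the principle is the same: verbatim overlap suggests the model may have memorized the test item rather than learned the task.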
Yadav concurred, citing the lack of training data about real-world use cases as a significant factor in why AI benchmarks can fail in evaluating LLMs. “We only discover shortcomings when a model doesn’t perform in the real world (after launching). There is plenty of press around when models by large corporations don’t perform, but not much exists to help inform training pre-launch,” Yadav said.
5. Benchmarks are narrow in scope
AI benchmarks can be too narrow in scope when used individually, as they are typically designed to assess specific tasks. Both Yadav and Meier cited benchmarks’ inability to gauge the overall capabilities of LLMs as a limiting factor.
“Many current benchmarks focus narrowly on specific tasks like image classification or natural language processing, failing to capture AI systems’ broader capabilities and real-world applications. They often overlook crucial aspects such as handling uncertainty, ambiguity, adversarial inputs, and interactions with humans and other systems,” Meier opined.
He added that AI benchmarks should go beyond measuring inference performance and evaluate the end-to-end performance of AI systems in real-world applications. Otherwise, our ability to weigh various LLMs’ applicability across multiple use cases is severely limited.
“Depending on the model and use case, there might not be accurate benchmarks to evaluate its efficacy,” Yadav said. “Unless actual issues or challenges arise, it’s impossible to evaluate whether or not a specific model performs well enough for a new use case.”
The Future of AI Benchmarks
AI benchmarks are needed to gauge the ability of LLMs to handle open-ended queries and generate logical responses that don’t stray from the context of the interaction. AI benchmarking also needs to evolve to assess how well models engage with humans in a natural manner. This includes multimodal assessments.
“As AI systems are increasingly expected to process and integrate information from multiple modalities, such as text, images, audio, and video, benchmarks should assess the ability of AI models to perform cross-modal reasoning, understanding, and generation tasks. These tasks are crucial for applications like virtual assistants, content creation, and multimedia analysis. Such capabilities are very important for seamless, multimodal AI/Human interaction,” Meier said.
To achieve this, AI benchmarking needs to be developed independently, in consultation with industry experts. “In my opinion, the bigger focus is that AI benchmarking tools require dedicated organizations focused on fueling innovation to establish and maintain benchmarks for specific use cases,” Yadav continued.
“This is easier said than done since it requires a combination of folks who are technically proficient in large-scale data processing (that’s a small list), have a deep understanding of machine learning (also a very small population), and are also experts in the domain that the model is being applied to.”
“There are some ideas that can address this, like continuing to foster active learning by keeping a running feedback loop from domain experts and incorporating that knowledge into the next training run. Using a combination of other machine learning techniques (e.g., reinforcement learning) to test out conditions while building LLMs would also go a long way.”
What problems do you see with current AI benchmarks? Share with us on LinkedIn, X, or Facebook. We’d love to hear from you!