
Are AI Benchmarks Reliable?


  • AI companies have been using benchmarks to market their products and services as the best in the business, claiming to have one-upped their competitors.
  • While AI benchmarks offer a measure of large language models’ technical prowess, are they reliable differentiators of the models that form the basis of generative AI tools?

The advent of the age of generative AI raises a pertinent question: Which large language model (LLM) is the best of the bunch? More importantly, how do we measure it?

AI benchmarks can be tricky for LLM tools, considering they need to be tested for accuracy, truthfulness, relevance, context, and other subjective parameters, as opposed to hardware, where compute speed is a defining criterion.

Over the years, several AI benchmarks have been created as technical tests designed to assess specific functions, such as question-answering, reasoning, coding, text generation, image generation, etc.

AI benchmarks are also meant to enable objective comparison, assess capabilities such as summarization and inference, test generalization and robustness in handling complex language constructs, and track progress over time.

AI companies have been using these tests to market their products and services as the best in the business, claiming to have one-upped their competitors. LLMs released recently have already surpassed humans on several benchmarks. On others, they have yet to match up to us.

For instance, Gemini Ultra topped the massive multitask language understanding (MMLU) benchmark with a score of 90%, followed by Claude 3 Opus (88.2%), Leeroo (86.64%), and GPT-4 (86.4%). MMLU is a knowledge test covering 57 subjects, including elementary mathematics, U.S. history, computer science, and law.

Meanwhile, Claude 3 Opus scored just over 50% on scientific reasoning under the graduate-level GPQA benchmark. GPT-4 Turbo (with a knowledge cutoff of April 2024) scored 46.5%, and GPT-4 Turbo (January 2024) scored over 43%.
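
For readers unfamiliar with how such headline numbers are produced, the sketch below shows, in rough terms, how a multiple-choice benchmark like MMLU reduces a model’s answers to a single accuracy figure. The `ask_model` stub and the sample question are hypothetical stand-ins, not part of any official evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# `ask_model` is a hypothetical stand-in for a call to any LLM API.

QUESTIONS = [
    {
        "subject": "elementary_mathematics",
        "question": "What is 7 * 8?",
        "choices": {"A": "54", "B": "56", "C": "64", "D": "72"},
        "answer": "B",
    },
    # ...one entry per question, drawn from all of the benchmark's subjects
]

def ask_model(question: str, choices: dict[str, str]) -> str:
    """Hypothetical model call: format the question and return a letter A-D."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in choices.items())
    # In practice this would send `prompt` to an LLM and parse the predicted letter.
    return "B"

def score(questions: list[dict]) -> float:
    """Accuracy = fraction of questions where the predicted letter matches the key."""
    correct = sum(
        ask_model(q["question"], q["choices"]) == q["answer"] for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    print(f"MMLU-style accuracy: {score(QUESTIONS):.1%}")
```

A “90% on MMLU” claim is essentially this kind of accuracy averaged over thousands of such questions, which is why the prompting setup used to elicit each answer matters so much, a point that resurfaces later in this article.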

So, there is some truth to the claim that AI tools are on par with what we envisioned. However, because AI benchmarks offer task-specific evaluations, they remain a poor yardstick for domain-agnostic, general-purpose applications. Can they really tell us whether the technology is where we expected it to be?

Spiceworks News & Insights examines why AI benchmarks are inconsistent in their assessments and inappropriate for comparison.

Limitations of AI Benchmarks

AI benchmarks have multiple challenges associated with providing a general comparison of LLMs. These include:

1. Lack of standardization

Ralph Meier, manager of engines and algorithms at Hyland, told Spiceworks that AI benchmarks lack appropriate standardization because of the diversity of applications and requirements, the lack of consensus on evaluation criteria, especially for responsible AI capabilities (transparency, explainability, and data privacy), and resource constraints.

“AI systems are being applied to a wide range of domains and tasks, each with their own specific requirements and nuances. Developing standardized benchmarks that can accurately capture the performance and limitations of AI models across all these diverse applications is a significant challenge,” Meier said.

“Evaluating cutting-edge AI models can be prohibitively expensive and time-consuming, especially for independent researchers or smaller organizations. There is a tendency for broader adoption of open-source benchmarks (e.g., the ones listed above), but this comes with the added risk of training data set contamination with information used or associated with certain benchmarks.”
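
Meier’s warning about contamination can be made concrete: if benchmark questions leak into a model’s training data, its score measures memorization rather than capability. The snippet below is a deliberately naive sketch of one way to probe for such overlap, using word n-gram matching on hypothetical in-memory strings; real contamination audits are considerably more sophisticated.

```python
# Naive n-gram overlap check between a benchmark item and training text.
# Illustrative only: real contamination analyses work at corpus scale with
# deduplication, normalization, and fuzzy matching.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str, n: int = 8) -> bool:
    """Flag an item if any of its n-grams appears verbatim in the training text."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_text, n))

# Hypothetical example: the benchmark question appears verbatim in training data.
item = "Which amendment to the U.S. Constitution abolished slavery in 1865?"
corpus = ("trivia night notes: Which amendment to the U.S. Constitution "
          "abolished slavery in 1865? The 13th, adopted after the Civil War.")
print(looks_contaminated(item, corpus))  # True
```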

Rakesh Yadav, founder and CEO of Aidaptive, expects AI benchmarking standardization in some domains. “In the coming years, I expect AI benchmarks to be established for at least a narrow set of use cases and that, eventually, there will be a standard process for adapting benchmarks continuously with innovation.”


2. Most AI benchmarks are outdated

The breakneck speed of LLM development over the past few years has made “it difficult for benchmarks to keep up with the latest advancements and capabilities. By the time a benchmark is developed and adopted, newer models may have already surpassed its scope, leading to inconsistencies in evaluation,” Meier added.

For instance, a report co-authored by the state-run Institute of Scientific and Technical Information of China noted that U.S. organizations released 11 LLMs in 2020, 30 in 2021, and 37 in 2022. Over the same period, Chinese companies released 2, 30, and 28 LLMs, respectively.

By May 2023, U.S. companies had rolled out 18 LLMs, while Chinese companies had launched 18.

“There is a need for updated benchmarks that can assess the end-to-end performance of AI systems in real-world applications, including pre-processing, post-processing, and interactions with other systems and humans. This will help bridge the gap between narrow task-specific benchmarks and the broader requirements of deploying AI solutions in complex, dynamic environments,” Meier said.

“Overall, while existing benchmarks have played a crucial role in advancing AI research and development, the rapid progress in the field, particularly in generative AI, necessitates the creation of new, more comprehensive, and transparent benchmarks that can better evaluate the capabilities and limitations of the latest AI models.”

3. Vested interests

Yadav repeatedly highlighted that current AI benchmarks are created by organizations with a specific profit-making agenda. Most of the more prominent technology companies have invested billions into AI research or in companies that build AI tools and services. “Currently, these benchmarks are being built by corporations that have profit-based motivations and are inherently biased by their own business needs (rightfully so),” Yadav said.

“Ideally, there would be government-funded benchmarks or standards established by a consortium of large corporations, without any biases, that are under constant research to ensure these standards are updated alongside new developments. That said, this is a developing field under heavy innovation.”

4. Benchmark-specific issues

The picture AI benchmarks paint is often skewed because specific prompt engineering techniques can manipulate the results. An LLM’s response, which is what a benchmark ultimately measures, is contingent on how the prompt is formulated.

Google was criticized for claiming Gemini Ultra outperformed OpenAI’s GPT-4. The criticism (and ridicule from some) stemmed from the company’s use of the chain-of-thought CoT@32 prompt engineering technique to obtain a higher MMLU benchmark score, instead of the standard 5-shot setting.
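
To see why scores obtained under different prompting setups are hard to compare, consider the two prompt formats sketched below for the same hypothetical question: a conventional few-shot prompt and a chain-of-thought-style prompt that asks the model to reason before answering (CoT@32 goes further, sampling multiple reasoning chains and aggregating them). The questions and formatting are illustrative only and do not reproduce either company’s actual evaluation setup.

```python
# Illustrative only: two prompt formats for the same benchmark-style question.
# Different formats can elicit different answers from the same model, so scores
# are only comparable when the prompting protocol is held fixed.

EXAMPLES = [
    ("What is 12 * 12?", "144"),
    ("What is the capital of France?", "Paris"),
]
QUESTION = "Which planet is known as the Red Planet?"

def few_shot_prompt(question: str) -> str:
    """Plain k-shot prompt: worked examples followed by the test question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\nQ: {question}\nA:"

def chain_of_thought_prompt(question: str) -> str:
    """CoT-style prompt: ask the model to reason step by step before answering."""
    return (
        f"Q: {question}\n"
        "Think step by step, then give the final answer on the last line."
    )

print(few_shot_prompt(QUESTION))
print("---")
print(chain_of_thought_prompt(QUESTION))
```

Because the two formats can elicit different answers from the same model, a score reported under one setup is not directly comparable to a score reported under the other.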