Is LLM Training Data Running Out?
- AI model training is a rapidly growing and highly capital-, compute-, power-, and data-intensive process.
- Funding can be procured, computing is advancing at full speed, power can be sourced from cleaner generation methods, and AI talent is emerging rapidly.
- However, sourcing publicly available data will be a major problem for AI companies at the turn of the decade.
- How can companies ensure AI development doesn’t stall?
According to Stanford University’s AI Index Report 2024, the United States produced 61 noteworthy machine learning models in 2023, followed by China’s 15, France’s eight, Germany’s five, and Canada’s four. In the same year, 149 foundation models were released globally.
The rapid pace of development has spilled into 2024, with the industry setting unprecedented expectations and delivering innovation, growth, and higher integration.
Financially, training a single large language model (LLM) costs tens of millions of dollars. For instance, OpenAI CEO Sam Altman said the company spent over $100 million to train GPT-4. Google spent an estimated $191 million on computing to train Gemini Ultra. Having deep pockets and a propensity to splurge helps.
The demand for AI computing power is doubling every 100 days, according to the research paper Intelligent Computing: The Latest Advances, Challenges, and Future, and is projected to increase over a million times over the next five years. “With the slowing down of Moore’s law, it becomes challenging to keep up with such a rapid increase in computational capacity requirements,” the authors noted. NVIDIA and other chipmakers are on it.
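To put a fixed doubling time in perspective, here is a quick back-of-the-envelope calculation in plain Python (the paper's own projection may rest on slightly different assumptions):

```python
import math

# Compound growth implied by a fixed doubling time.
# Assumption for this sketch: demand doubles every 100 days, as cited above.
doubling_days = 100
horizon_days = 5 * 365  # five years

growth = 2 ** (horizon_days / doubling_days)
print(f"Growth over five years: ~{growth:,.0f}x")

# At this pace, how long until demand is a million times today's level?
years_to_million = doubling_days * math.log2(1e6) / 365
print(f"Time to 1,000,000x: ~{years_to_million:.1f} years")
```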
Further, the energy demands of LLM training may have alternative solutions. One example is Microsoft, which is planning to build small-scale nuclear reactors to replace fossil fuels for its data center and computing needs, has hired a director of nuclear technology to oversee those plans, and has signed an agreement to source power from Helion Energy and its nuclear fusion tech.
Meanwhile, the demand for AI talent in the U.S. almost doubled, from 8,611 job postings in May 2023 to over 16,000 in May 2024, according to UMD-LinkUp AI Maps. Fortunately, that demand is being met, as LinkedIn’s Global Talent Trends survey discovered.
The survey found that job posts mentioning artificial intelligence or generative AI saw 17% higher application growth over the past two years than posts without such mentions. Additionally, 57% of professionals responded positively toward learning more about AI.
What is unclear is how AI companies plan to keep up with the rising demand for data, the building blocks of an LLM’s consciousness, so to speak.
Data Requirements for LLM Training
Let’s look at some numbers:
| LLM | Training Tokens | Release Date |
|---|---|---|
| GPT-2 | ~10 billion | Feb 2019 |
| GPT-3 | ~300 billion | May 2020 |
| Claude | ~400 billion | Dec 2021 |
| Gopher | ~300 billion | Dec 2021 |
| LaMDA | 168 billion | Jan 2022 |
| PaLM | ~780 billion | April 2022 |
| Llama | 1.4 trillion | Feb 2023 |
| GPT-4 | ~13 trillion | Mar 2023 |
| PaLM 2 | 3.6 trillion | May 2023 |
| Llama 2 | 2 trillion | Jul 2023 |
| Claude 2 | NA | Jul 2023 |
| Grok-1 | NA | Nov 2023 |
| Gemini 1.0 | NA | Dec 2023 |
| Claude 3 | NA | Mar 2024 |
| Llama 3 | NA | April 2024 |
| Gemini 1.5 Pro | NA | May 2024 |
| GPT-4o | NA | May 2024 |
Since companies have started withholding training data sizes, consider the first ten models on the list above. Here’s how steep the rise in training data usage has been:
[Chart: Tokens Used for LLM Training Over the Years]
So far, exponentially higher data ingestion has been the primary vector of progress in LLM training. Experts warn this is not sustainable over the long term.
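For a sense of how steep that curve is, a quick calculation from the table’s endpoints (the approximate figures for GPT-2 and GPT-4):

```python
# Implied annual growth in training tokens, using the table's
# approximate endpoints: GPT-2 (~10 billion, 2019) and GPT-4 (~13 trillion, 2023).
gpt2_tokens = 10e9
gpt4_tokens = 13e12
years = 4

total_growth = gpt4_tokens / gpt2_tokens       # ~1,300x overall
annual_growth = total_growth ** (1 / years)    # compound annual growth

print(f"Total growth: ~{total_growth:,.0f}x")
print(f"Implied annual growth: ~{annual_growth:.1f}x per year")  # ~6x
```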
Are Companies Running Out of Training Data?
Research suggests that data scarcity is indeed a possibility for LLM training in the near future. According to trends analyzed by Epoch AI, tech companies will exhaust publicly available data for LLM training between 2026 and 2032.
“The exact point in time at which this data would be fully utilized depends on how models are scaled. If models are trained compute-optimally, there is enough data to train a model with 5e28 floating-point operations (FLOP), a level we expect to be reached in 2028. But recent models, like Llama 3, are often ‘overtrained’ with fewer parameters and more data to make them more compute-efficient during inference,” Epoch AI researchers noted.
Source: “Will we run out of data? Limits of LLM scaling based on human-generated data,” a study by Epoch AI
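Epoch AI’s methodology is more involved, but the widely cited Chinchilla rule of thumb (training compute C ≈ 6ND, with compute-optimal data D ≈ 20N, where N is parameters and D is tokens) gives a rough sense of the data appetite of a 5e28-FLOP compute-optimal run. A minimal sketch:

```python
# Rough compute-optimal sizing via the Chinchilla rule of thumb.
# Assumptions (not Epoch AI's exact method): C ~ 6 * N * D and D ~ 20 * N,
# where C = training FLOP, N = parameters, D = training tokens.
C = 5e28  # the FLOP level Epoch AI expects to be reached in 2028

N = (C / (6 * 20)) ** 0.5  # solve 6 * N * (20 * N) = C
D = 20 * N

print(f"Compute-optimal parameters: ~{N:.1e}")  # ~2.0e+13
print(f"Compute-optimal tokens:     ~{D:.1e}")  # ~4.1e+14, hundreds of trillions
```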
The crucial thing to note here is that within roughly a decade of generative AI’s existence, companies will have depleted the freely available stock of human-generated information: articles, blogs, social media discussions, papers, and so on. In other words, LLM training ingests data considerably faster than humans produce it.
Moreover, per the research, drawing tokens from multiple modalities (text, image, audio, video) does not resolve the problem either: current image and video stocks are not large enough to prevent a data bottleneck. Here’s how many tokens each data stock corresponds to (a quick comparison sketch follows the list):
- Common Crawl: 130 trillion
- Indexed web: 510 trillion
- The whole web: 3100 trillion
- Images: 300 trillion
- Video: 1350 trillion
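Putting those stocks next to the rough compute-optimal demand estimated above makes the bottleneck concrete. A sketch using the study’s stock figures and the Chinchilla-style ~4e14-token demand (an assumption, not Epoch AI’s exact number):

```python
# Compare the study's estimated token stocks against the rough
# ~4e14-token demand for a 5e28-FLOP compute-optimal run (estimated above).
stocks = {
    "Common Crawl": 130e12,
    "Indexed web": 510e12,
    "Whole web": 3100e12,
    "Images": 300e12,
    "Video": 1350e12,
}
demand = 4.1e14  # tokens; Chinchilla-style estimate, not Epoch AI's figure

for name, tokens in stocks.items():
    print(f"{name:>12}: {tokens:.1e} tokens = {tokens / demand:.1f}x the demand")
```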
Publicly available data scraped from the web forms the bedrock of LLM training. So, is the situation dire for AI companies?
How Can AI Companies Scale LLMs Without Public Data?
Experts have suggested multiple ways to overcome the data scarcity problem, involving offline information, synthetic data, and LLM efficiency improvements.
1. Cut deals with publishers for non-public data
Paywalled and non-indexed data can be leveraged to train LLMs, provided copyright holders are appropriately compensated.
Content licensing is already a multimillion-dollar reality: Google struck a reported $60 million deal for Reddit’s data, and OpenAI has signed agreements with the Associated Press, Axel Springer, Le Monde, Prisa Media, and the Financial Times.
Offline information, such as books, manuscripts, and magazines, can be digitized and licensed for a fee.
Moreover, research data, such as genomics, financial, and scientific databases, can serve as high-quality training data in the right context.
Finally, non-indexed deep web data from social media (Facebook, Instagram, Twitter) and instant messengers remains untapped. Unfortunately, the former can be of lower quality than web data, while using the latter would violate user privacy.
2. LLM advancements
Refining LLM architectures to consume less data for the same result can help contain unchecked data ingestion. Techniques such as reinforcement learning have helped attain sample-efficiency gains.
Additionally, data enrichment and filtering for high-quality samples optimize the Pareto frontier between LLM performance and training efficiency, according to findings in the How to Train Data-Efficient LLMs study. Beyond quality, data coverage and diversity also play an important role in making LLMs efficient.
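As an illustration of the idea, not the paper’s actual method, quality-based sample filtering can be as simple as scoring each document and keeping the top fraction. A minimal sketch, with a toy length-and-vocabulary heuristic standing in for a learned quality scorer:

```python
# Toy quality filter: score documents and keep the top fraction.
# The scoring heuristic is a crude stand-in for a learned quality model.

def quality_score(doc: str) -> float:
    """Crude proxy: favor longer documents with a richer vocabulary."""
    words = doc.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)
    return len(words) * unique_ratio

def filter_corpus(docs: list[str], keep_fraction: float = 0.5) -> list[str]:
    """Keep the highest-scoring fraction of the corpus."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

corpus = [
    "the cat sat on the mat",
    "transformer models scale with data and compute budgets",
    "spam spam spam spam spam",
]
print(filter_corpus(corpus, keep_fraction=0.5))
```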
“It’s important to note that the relationship between data quantity and language model performance is not always linear. In some cases, doubling the training data may yield diminishing returns in metrics like perplexity or downstream task accuracy,” noted Sunil Ramlochan, enterprise AI strategist and founder of PromptEngineering.org.
“Determining how much data is needed to train a language model is an empirical question best answered through systematic experimentation at different orders of magnitude. By measuring model performance across varying data scales and considering factors like model architecture, task complexity, and data quality, NLP practitioners can make informed decisions about resource allocation and continuously optimize their language models over time.”
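In the spirit of that advice, a data-scaling experiment can be sketched as follows: measure loss at several data scales and fit a power law in log-log space to see where returns diminish (the loss numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical sweep: loss measured at several training-data scales.
# In practice these values would come from training runs at each scale.
tokens = np.array([1e6, 1e7, 1e8, 1e9])
loss = np.array([5.2, 4.1, 3.4, 2.9])

# Fit loss ~ a * tokens^b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(tokens), np.log(loss), 1)
print(f"Fitted exponent: {b:.3f}")  # negative slope => diminishing returns

# Extrapolate (cautiously!) to the next order of magnitude.
pred = np.exp(log_a) * (1e10 ** b)
print(f"Predicted loss at 1e10 tokens: {pred:.2f}")
```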
Transfer learning, i.e., pre-training a model on a data-rich task before fine-tuning it on a downstream task, is another viable approach for AI training. One of the conclusions researchers drew in the Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer study is as follows:
“One beneficial use of transfer learning is the possibility of attaining good performance on low-resource tasks. Low-resource tasks often occur (by definition) in settings where one lacks the assets to label more data.”
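In code, the low-resource pattern looks like this sketch (PyTorch; the randomly initialized encoder stands in for a real pretrained checkpoint, and the tiny dataset is hypothetical, so this is not the study’s actual setup):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder; in practice you would load real
# pretrained weights rather than initialize randomly.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

# Freeze the pretrained body so the low-resource task only trains the head.
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(256, 2)  # small task-specific classifier head
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Tiny synthetic "downstream" dataset (hypothetical, for illustration).
x = torch.randn(64, 128)
y = torch.randint(0, 2, (64,))

for step in range(100):
    logits = head(encoder(x))
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```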
3. Synthetic data
When obtaining real-world data becomes problematic (such as websites banning web crawlers), expensive, or downright impossible after the well runs dry, synthetic data will come to the rescue. Gartner predicted that by 2026, 75% of businesses will use generative AI to create synthetic customer data.
Synthetic data offers the mathematical patterns of the original data it is derived from without carrying over the original information. It is generated algorithmically, via models or simulations that can produce new scenarios for an LLM to gorge on.
Synthetic data can prove highly beneficial in limiting organizations’ dependence on internet data for LLM training. It bears the same correlations and statistical properties as real-world data, and it can go further, artificially introducing situations that enrich training.
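A minimal sketch of that statistical idea, fitting a distribution to “real” records and sampling fresh ones (production systems use far more sophisticated generative models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "real" customer data: two correlated columns (e.g., age, spend).
real = rng.multivariate_normal(mean=[40, 100], cov=[[90, 60], [60, 400]], size=1000)

# Fit the empirical mean and covariance, then sample synthetic records
# that share the statistics but contain none of the original rows.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean=mu, cov=cov, size=1000)

print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(2))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```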
Colossal amounts of synthetic data can be created relatively quickly with the help of existing LLMs, enabling faster development and project turnaround times. To that end, methods like DeepMind’s Reinforced Self-Training (ReST) for Language Modeling can help.
However, synthetic data has limitations, including biased responses, inaccuracies, hallucinations, and security and privacy risks. It can also be quite simplistic and thus fail to capture the nuances of real-world scenarios.
Machine-generated synthetic data may also cause LLMs to become an echo chamber of poor outputs, leading to what researchers call Model Autophagy Disorder (MAD). In Self-Consuming Generative Models Go MAD, researchers concluded, “Without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease.”
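The collapse is easy to demonstrate in miniature. In this sketch, a toy “generative model” (a fitted Gaussian) is repeatedly retrained on its own samples with no fresh real data, and its diversity (standard deviation) steadily decays:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "real" data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

# Autophagous loop: refit the toy model on its own samples each generation,
# with no fresh real data mixed in.
for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()      # "train" the toy model
    data = rng.normal(mu, sigma, size=200)   # next generation's training set
    if gen % 100 == 0:
        print(f"generation {gen:3d}: std = {data.std():.3f}")
# Diversity (std) drifts toward zero: quality/diversity collapse in miniature.
```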