The data that powers AI is disappearing fast


For years, people building powerful artificial intelligence systems have used massive amounts of text, images, and videos scraped from the internet to train their models.

Today, this data is running out.

Over the past year, many of the largest web sources used to train AI models have restricted the use of their data, according to a study released this week by the Data Provenance Initiative, a research group led by MIT.

The study, which examined 14,000 web domains included in three commonly used AI training datasets, found an “emerging consent crisis” as publishers and online platforms took steps to prevent their data from being collected.

The researchers estimate that across the three datasets, called C4, RefinedWeb, and Dolma, 5% of all data and 25% of data from the highest-quality sources were restricted. These restrictions are implemented through the Robots Exclusion Protocol, a decades-old method that allows website owners to prevent automated bots from crawling their pages using a file called robots.txt.
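To make the mechanism concrete, here is a minimal, illustrative robots.txt file of the kind the study measures. The crawler names are tokens that OpenAI, Common Crawl, and Google have documented for their bots; the specific rules are a sketch rather than any real site's policy.

    # Illustrative robots.txt: opt specific AI crawlers out of the whole site
    # while leaving it open to everything else.

    # OpenAI's documented crawler token
    User-agent: GPTBot
    Disallow: /

    # Common Crawl's crawler token
    User-agent: CCBot
    Disallow: /

    # Google's token for opting out of AI training
    User-agent: Google-Extended
    Disallow: /

    # All other crawlers: no restrictions
    User-agent: *
    Disallow: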

The study also found that up to 45% of data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to data use on the web, which will have implications not only for AI companies, but also for researchers, academics and non-commercial entities,” Shayne Longpre, lead author of the study, said in an interview.

Data is the main ingredient of today’s generative AI systems, fueled by billions of examples of text, images, and videos. Much of this data is scraped from public websites by researchers and compiled into large datasets, which can be downloaded and used freely, or supplemented with data from other sources.

It’s by learning from this data that generative AI tools like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude can write, code, and generate images and videos. The more high-quality data these models are fed, the better their results tend to be.

For years, AI developers have been able to collect data fairly easily. But the rise of generative AI in recent years has created tensions with the owners of that data, many of whom are uneasy about their content being used as AI training material, or at least want to be paid for it.

Faced with mounting criticism, some publishers have put up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked automated web crawlers used by companies like OpenAI, Anthropic and Google.

Sites like Reddit and Stack Overflow have begun charging AI companies for access to their data, and a few publishers have filed lawsuits, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies had used its news articles to train their models without permission.

In recent years, companies like OpenAI, Google, and Meta have gone to great lengths to collect more data to improve their systems, including transcribing YouTube videos and changing their own data policies.

More recently, some AI companies have struck licensing deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, that give the AI companies ongoing access to those publishers' content.

But widespread data restrictions can pose a threat to AI companies, which need a constant supply of high-quality data to keep their models up to date.

The restrictions could also pose a problem for smaller AI companies and academic researchers who rely on public datasets and can't afford to license data directly from publishers. Common Crawl, a dataset of billions of pages of web content maintained by a nonprofit organization, has been cited in more than 10,000 academic studies, Longpre said.

It’s unclear exactly which popular AI products were trained on these sources, as few developers disclose the full list of data they use. But datasets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies including Google and OpenAI to train previous versions of their models. Spokespeople for Google and OpenAI declined to comment.

Yacine Jernite, a machine learning researcher at Hugging Face, a company that provides tools and data to AI developers, characterized the consent crisis as a natural response to the AI industry’s aggressive data collection practices.

“Unsurprisingly, we are seeing a backlash from data creators after the texts, images and videos they share online are used to develop commercial schemes that sometimes directly threaten their livelihoods,” he said.

He warned, however, that if all AI training data were to be obtained through licensing agreements, it would exclude “researchers and civil society from participation in the governance of the technology.”

Stella Biderman, executive director of EleutherAI, a nonprofit AI research organization, echoed those fears.

“Big tech companies already have all the data,” she said. “The data licensing change doesn’t retroactively revoke that permission, and the main impact is on the players who come in later, who are typically either small startups or researchers.”

AI companies have argued that their use of public data from the web is legally protected by fair use laws. But collecting new data has become trickier. Some AI executives I spoke with worry about reaching the “data wall”—the point where all the training data on the public internet has been exhausted and the rest has been hidden behind paywalls, blocked by robots.txt, or locked into proprietary agreements.

Some companies believe they can close the gap by using synthetic data (data generated by AI systems themselves) to train their models. But many researchers doubt that today’s AI systems can generate enough high-quality synthetic data to replace the human-generated data they are losing.

Another challenge is that while publishers can try to prevent AI companies from scraping their data by placing restrictions in their robots.txt files, these requests are not legally binding, and compliance with these restrictions is voluntary. (Think of it as a “do not access” sign for data, but without the force of law.)
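For a sense of what voluntary compliance looks like in practice, here is a minimal Python sketch of how a well-behaved crawler might consult a site's robots.txt before fetching a page, using the standard library's robotparser module. The crawler name and URLs are hypothetical placeholders, not any particular company's bot.

    # Minimal sketch of a polite crawler checking robots.txt before fetching.
    # The user agent and URLs below are illustrative placeholders.
    from urllib import robotparser

    USER_AGENT = "ExampleAIBot"  # hypothetical crawler name
    TARGET_URL = "https://example.com/articles/some-story.html"

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # download and parse the site's robots.txt

    if parser.can_fetch(USER_AGENT, TARGET_URL):
        print("robots.txt allows this fetch for", USER_AGENT)
        # ... proceed to download the page ...
    else:
        print("robots.txt disallows this fetch; a compliant crawler skips the page")

Nothing in the protocol itself enforces that check; a crawler that skips it faces no technical barrier, which is why compliance ultimately depends on the crawler operator's policy.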

Major search engines honor these opt-out requests, and several large AI companies, including OpenAI and Anthropic, have publicly stated that they do so as well. But other companies, including the AI-powered search engine Perplexity, have been accused of ignoring them. Perplexity CEO Aravind Srinivas told me that the company respects publishers’ data restrictions. He added that while the company has worked with third-party web crawlers in the past that didn’t always follow the robots exclusion protocol, it has “made adjustments with our vendors to ensure that they are respecting the robots.txt file when crawling on behalf of Perplexity.”

Longpre said one of the study’s key findings is that we need new tools to give website owners more granular ways to control how their data is used. Some sites might object to AI giants using their data to train chatbots for profit, but might be willing to let a nonprofit or educational institution use the same data, he said. Right now, there’s no effective way for site owners to distinguish between those uses or to block one while allowing the other.

But there’s also a lesson here for big AI companies, which have treated the internet as an all-you-can-eat data buffet for years without offering the owners of that data much of value in return. If you keep exploiting the web, eventually the web will start closing its doors.


