AI training data carries a price tag that only Big Tech can afford.

Data is at the heart of today's most advanced AI systems, but it's costing more and more — putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of creative AI models and the datasets they are trained on. In it, Betker contends that training data — not the model's design, architecture or any other feature — is the key to increasingly sophisticated, capable AI systems.

“Trained on the same dataset for long enough, pretty much every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether that's answering a question, drawing a human hand, or generating a realistic cityscape?

It's certainly plausible.

Statistical machines

Generative AI systems are essentially probabilistic models: a huge pile of statistics. Based on vast numbers of examples, they guess which data makes the most “sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to go on, the better the performance of models trained on those examples.
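To make that intuition concrete, here's a minimal, hypothetical sketch in Python: a bigram model that simply counts which word follows which in its training text, then predicts the most frequent continuation. Real generative models learn these statistics with neural networks over billions of examples; this toy only illustrates the underlying principle.

```python
from collections import Counter, defaultdict

# Toy training corpus; real models ingest trillions of tokens.
corpus = "i go to the market . i go to the park . you go to the market .".split()

# Count, for each word, which word follows it and how often.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def most_likely_next(word: str) -> str:
    """Return the continuation seen most often after `word` in training."""
    return follow_counts[word].most_common(1)[0][0]

print(most_likely_next("go"))   # 'to' -- always followed "go" in the corpus
print(most_likely_next("the"))  # 'market' -- seen twice, vs. 'park' once
```

More data means more counts, which is exactly why, all else being equal, more examples tend to mean better predictions.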

“It does seem like the performance gains are coming from the data,” Kyle Lo, a senior applied research scientist at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch, “at least once you have a stable training setup.”

Lo cited the example of Meta's Llama 3, a text-generating model released earlier this year, which outperforms AI2's own OLMo model despite the two being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority on many popular AI benchmarks.

(I'll note here that the benchmarks in wide use in the AI industry today aren't necessarily the best gauge of a model's performance, but outside of qualitative tests like our own, they're one of the few measures we have to go on.)

This doesn't mean that training on ever-larger datasets is a sure path to ever-better models. Models operate on a “garbage in, garbage out” paradigm, Lo notes, so data curation and quality matter a great deal, perhaps more than sheer quantity.

“It is possible for a small model with carefully designed data to outperform a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd on the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”
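What might “carefully designed data” mean in practice? As a rough, hypothetical illustration, here is a minimal Python sketch of two common curation steps, deduplication and quality filtering. Production pipelines use far more elaborate heuristics and model-based filters; this only shows the shape of the idea.

```python
import hashlib

def curate(docs: list[str], min_words: int = 20) -> list[str]:
    """Keep only documents that are non-duplicate and long enough."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        fingerprint = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if fingerprint in seen:
            continue  # exact duplicate: skip
        if len(doc.split()) < min_words:
            continue  # too short to carry signal: skip
        seen.add(fingerprint)
        kept.append(doc)
    return kept

docs = ["Buy now!!!", "A long, informative article ... " * 10, "Buy now!!!"]
print(len(curate(docs)))  # 1 -- the duplicate and the junk line are dropped
```

The point of the sketch: a smaller pile of deduplicated, filtered text can be worth more than a larger pile of raw scrapes.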

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed greatly to the improved image quality in DALL-E 3, OpenAI's text-to-image model, over its predecessor DALL-E 2. “I think this is the main source of the improvements,” he said. “The text annotations are a lot better than they were [with DALL-E 2]; it's not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained on data labeled by human annotators so that a model can learn to associate those labels with other observed characteristics of that data. For example, a model fed many pictures of cats, with annotations for each breed, will eventually “learn” to associate terms like bobtail and shorthair with their distinctive visual traits.
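As a loose illustration of that label-association process, here's a hypothetical Python sketch using scikit-learn: a classifier fit on hand-made numeric features standing in for what a vision model would extract from pixels. The feature values and breed boundaries are invented for the example.

```python
from sklearn.linear_model import LogisticRegression

# Invented stand-in features for cat photos: [tail_length_cm, fur_length_cm].
# A real vision model learns its features from raw pixels; the annotations
# are what let it tie a term like "bobtail" to the traits it observes.
features = [
    [3.0, 2.5], [4.0, 2.0],    # photos annotated "bobtail" (short tails)
    [28.0, 2.0], [30.0, 2.5],  # photos annotated "shorthair" (longer tails)
]
labels = ["bobtail", "bobtail", "shorthair", "shorthair"]

model = LogisticRegression().fit(features, labels)
print(model.predict([[3.5, 2.2]]))   # -> ['bobtail']
print(model.predict([[29.0, 2.1]]))  # -> ['shorthair']
```

Better annotations mean cleaner associations, which is why Goh credits them for much of DALL-E 3's improvement.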

Bad behavior

Experts like Lo worry that the growing emphasis on large, high-quality training datasets will centralize AI development among the few players with billion-dollar budgets that can afford to acquire those sets. A major breakthrough in synthetic data or fundamental architecture could disrupt the status quo, but neither appears imminent.

“Overall, organizations that control content that's potentially useful for AI development are incentivized to lock down their content,” Lo said. “And as access to data closes off, we're essentially blessing a few early movers on data acquisition and pulling up the ladder so that no one else can get access to the data to catch up.”

Indeed, where the race to gather more training data hasn't led to unethical (and perhaps even illegal) behavior, like secretly aggregating copyrighted content, it has rewarded the tech giants with deep pockets to spend on data licensing.

Generative AI models, such as OpenAI's, are trained mostly on images, text, audio, video and other data, some of it copyrighted, sourced from public web pages (including, problematically, AI-generated ones). The OpenAIs of the world maintain that fair use shields them from legal reprisal. Many rights holders disagree, but, at least for now, there's not much they can do to stop the practice.

There's no shortage of examples of generative AI vendors acquiring massive datasets through questionable means to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube's blessing, or the creators', to feed its flagship model GPT-4. Google recently broadened its terms of service, in part to allow it to tap public Google Docs, restaurant reviews on Google Maps and other online content for its AI products. And Meta is reported to have considered risking lawsuits in order to train its models on IP-protected content.

Meanwhile, companies large and small are relying on workers in third-world countries paid only a few dollars per hour to create annotations for training sets. Some of these annotators, employed by mammoth startups like Scale AI, work literal days on end to complete tasks that expose them to graphic depictions of violence and bloodshed, without any benefits or guarantees of future gigs.

Increased cost

In other words, even aboveboard data deals aren't exactly fostering an open and equitable AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries and more to train its AI models, a budget far beyond that of most academic research groups, nonprofits and startups. Meta has gone so far as to weigh acquiring the publisher Simon & Schuster for the rights to e-book excerpts (ultimately, Simon & Schuster was sold to private equity firm KKR for $1.62 billion in 2023).

With the market for AI training data expected to grow from roughly $2.5 billion now to close to $30 billion within a decade, data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their own user bases.

Stock media library Shutterstock has inked deals with AI vendors ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations like Google and OpenAI. Few platforms with abundant data accumulated organically over the years haven't signed agreements with generative AI developers, it seems, from Photobucket to Tumblr to Q&A site Stack Overflow.

It's the platforms' data to sell, at least depending on which legal arguments you believe. But in most cases, users aren't seeing a dime of the profits. And it's harming the broader AI research community.

“Smaller players won't be able to afford these data licenses, and therefore won't be able to develop or study AI models,” Lo said. “I worry this could lead to a lack of independent scrutiny of AI development practices.”

Independent efforts

If there's a ray of sunshine in the gloom, it's the few independent, nonprofit efforts to create massive datasets anyone can use to train generative AI models.

EleutherAI, a grassroots nonprofit research group that began in 2020 as a loose-knit Discord collective, is working with the University of Toronto, AI2 and independent researchers to create The Pile v2, a set of billions of text passages primarily sourced from the public domain.

In April, AI startup Hugging Face released FineWeb, a filtered version of Common Crawl, the eponymous dataset maintained by the nonprofit Common Crawl and composed of billions upon billions of web pages, which Hugging Face claims improves model performance on many benchmarks.

A few efforts to release open training datasets, like the group LAION's image sets, have run up against copyright, data privacy and other, equally serious ethical and legal challenges. But some of the more dedicated data curators have pledged to do better. The Pile v2, for example, removes problematic copyrighted material found in its progenitor dataset, The Pile.

The question is whether any of these open efforts can hope to keep pace with Big Tech. As long as data collection and curation remain a matter of resources, the answer is likely no, at least not until some research breakthrough levels the playing field.
