Large-scale AI training datasets, or corpora, have been called “the backbone of large language models.” But EleutherAI, the organization that created one of the world’s largest of these datasets, an 825 GB open-source diverse text corpus called the Pile, became a target in 2023 amid a growing uproar focused on the legal and ethical implications of the datasets that trained the most popular LLMs, from OpenAI’s GPT-4 to Meta’s Llama.
EleutherAI, a grassroots nonprofit research group that began in 2020 as a loose-knit Discord collective seeking to understand how OpenAI’s new GPT-3 worked, was named in one of several generative AI lawsuits last year. Former Arkansas Gov. Mike Huckabee and other authors filed a lawsuit in October alleging that their books were taken without consent and included in Books3, a controversial dataset containing more than 180,000 books that was included as part of the Pile. (Books3 was uploaded by Shawn Presser in 2020 and removed from the internet in August 2023 after a legal notice from a Danish anti-piracy group.)
But far from halting work on its datasets, EleutherAI is now building an updated version of the Pile in collaboration with a number of organizations, including the University of Toronto and the Allen Institute for AI, as well as independent researchers. In a joint interview with VentureBeat, Stella Biderman, a principal scientist and mathematician at Booz Allen Hamilton who is also executive director at EleutherAI, and Aviya Skowron, EleutherAI’s head of policy and ethics, said the updated Pile dataset is a few months away from being finalized.
The new Pile is expected to be bigger and ‘significantly better’
Biderman said the new LLM training dataset will be even larger and is expected to be “significantly better” than the old dataset.
“There’s going to be a lot of new data,” Biderman said. Some of it, she said, will be data that has never been seen anywhere before, material “that we’re working on kind of excavating, which is going to be really exciting.”
Pile v2 will include more recent data than the original dataset, which was released in December 2020 and was used to build language models including the Pythia suite and Stability AI’s StableLM suite. It will also feature better preprocessing: “When we built the Pile, we had never trained an LLM before,” Biderman explained. “We’ve trained about a dozen now, and know a lot about how to clean data in ways that make it suitable for LLMs.”
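To give a sense of what that kind of cleaning can look like in practice, here is a minimal Python sketch of common corpus-filtering heuristics: dropping very short documents, dropping markup-heavy documents, and exact deduplication. It is an illustration of the general technique only, not EleutherAI’s actual pipeline; every threshold and name is an assumption.

```python
import hashlib

def clean_corpus(docs, min_chars=200, min_alpha_ratio=0.6):
    """Toy quality filter and exact dedup for a stream of text documents.

    Illustrative only; real LLM pipelines add language ID, fuzzy
    deduplication, PII scrubbing and much more.
    """
    seen_hashes = set()
    for text in docs:
        text = text.strip()
        # Drop very short documents, which are often boilerplate.
        if len(text) < min_chars:
            continue
        # Drop documents that are mostly markup, tables, or noise.
        alpha_ratio = sum(c.isalpha() for c in text) / len(text)
        if alpha_ratio < min_alpha_ratio:
            continue
        # Exact deduplication via content hashing.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield text

# Usage: cleaned = list(clean_corpus(raw_documents))
```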
The updated dataset will also feature better-quality and more diverse data. “We’ll have more books than the original Pile, for example, and a more diverse representation of non-academic nonfiction domains,” she said.
The original Pile consists of 22 sub-datasets, including Books3 but also PubMed Central, arXiv, Stack Exchange, Wikipedia, YouTube subtitles and, oddly enough, Enron emails. Biderman pointed out that the Pile remains the LLM training dataset most thoroughly documented by its creator in the world. The goal in developing the Pile was to construct a vast new dataset, containing billions of text passages, to match the scale that OpenAI used to train GPT-3.
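The Pile’s documentation assigns each of those sub-datasets a sampling weight, so some sources are seen more often during training than others. The sketch below shows the general idea of weighted mixture sampling; the component names echo the list above, but the weights and function are illustrative assumptions, not the Pile’s published values.

```python
import random

# Hypothetical component weights; illustrative only, not the real
# mixture from the Pile paper.
components = {
    "pubmed_central": 0.14,
    "arxiv": 0.09,
    "stackexchange": 0.05,
    "wikipedia": 0.09,
    "enron_emails": 0.01,
}

def sample_component(weights):
    """Pick a sub-dataset with probability proportional to its weight."""
    names = list(weights)
    probs = list(weights.values())
    return random.choices(names, weights=probs, k=1)[0]

# Each training example is drawn from a component chosen by weight, so
# high-weight sources dominate the mix without excluding smaller ones.
batch_sources = [sample_component(components) for _ in range(8)]
print(batch_sources)
```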
The Pile was a unique AI training dataset when it was released.
“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman said. At the time, she explained, there was one large publicly available text corpus, C4, which Google had created to train a variety of language models.
“But C4 is not nearly as big as the Pile and it’s also much less diverse,” she said. “It’s a really high-quality Common Crawl scrape.” (The Washington Post analyzed C4 in an April 2023 investigation that “set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.”)
Instead, EleutherAI tried to be more discerning and identified categories of information and topics that it wanted the model to know about.
“It wasn’t really anything anyone had ever done before,” she explained. “More than 75% of the Pile was chosen from specific topics or domains where we wanted the model to know things: let’s give it as much meaningful information as possible about the world, about the things we care about.”
Skowron explained that EleutherAI’s “general position is that model training on copyrighted data is fair use.” But they pointed out that “there is currently no large language model on the market that is not trained on copyrighted data,” and that one goal of the Pile v2 project is to try to address some of the issues surrounding copyright and data licensing.
They described the composition of the new Pile dataset to reflect that effort: it includes public domain data, both older works that have entered the public domain in the United States and text that was never within the scope of copyright in the first place, such as documents produced by the government or legal filings (like Supreme Court opinions); text licensed under Creative Commons; code under open source licenses; text with licenses that explicitly permit redistribution and reuse (some open access scientific articles fall into this category); and a miscellaneous category for smaller datasets for which researchers have the explicit permission of the rights holders.
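As a rough illustration of how such a licensing taxonomy might be applied mechanically, the sketch below maps normalized license identifiers to categories like those described above and drops anything not explicitly permitted. The identifiers, field names and mapping are hypothetical, not EleutherAI’s actual scheme.

```python
# Hypothetical license allowlist, loosely mirroring the categories
# described above; identifiers and mapping are illustrative only.
LICENSE_CATEGORIES = {
    "public-domain": "public_domain",
    "cc-by-4.0": "creative_commons",
    "cc-by-sa-4.0": "creative_commons",
    "mit": "open_source_code",
    "apache-2.0": "open_source_code",
}

def categorize(doc):
    """Return the dataset category for a document, or None to exclude it.

    Assumes each document carries a normalized `license` field;
    anything without an explicitly permissive license is dropped.
    """
    return LICENSE_CATEGORIES.get(doc.get("license", "").lower())

docs = [
    {"id": 1, "license": "CC-BY-4.0"},
    {"id": 2, "license": "all-rights-reserved"},
]
kept = [(d["id"], categorize(d)) for d in docs if categorize(d)]
print(kept)  # [(1, 'creative_commons')]
```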
Criticism of AI training datasets became mainstream after ChatGPT.
Concerns over the impact of AI training datasets are not new. In 2018, for example, AI researchers Joy Buolamwini and Timnit Gebru co-authored a paper showing that large image datasets led to racial bias within AI systems. And legal battles over large image training datasets began in mid-2022, shortly after the public began to realize that popular text-to-image generators such as Midjourney and Stable Diffusion were trained on massive image datasets mostly scraped from the internet.
However, since the release of OpenAI’s ChatGPT in November 2022, criticism of the datasets used to train LLMs and image generators has grown considerably, particularly around copyright concerns. A flurry of generative AI lawsuits brought by artists, writers and publishers followed, leading up to the New York Times lawsuit against OpenAI and Microsoft filed last month, which many believe may end up before the Supreme Court.
But more serious, troubling allegations have emerged recently, including the ease of creating deepfake revenge porn thanks to the large image corpora that trained text-to-image models, as well as the discovery of thousands of child sexual abuse images in the LAION-5B image dataset, which led to its removal last month.
The debate around AI training data is highly complex and nuanced.
Biderman and Skowron argue that the debate surrounding AI training data is far more complex and nuanced than the media and critics of AI let on, even when it comes to issues that are clearly horrific and wrong, such as the child sexual abuse images found in LAION-5B.
For example, Biderman said that the methodology used by the people who flagged the LAION content is not legally available to the LAION organization, which she said makes safely removing the images difficult. And the resources to screen datasets for these kinds of images in advance may not be available.
“There seems to be a huge disconnect between the way organizations try to combat this content and what would make their resources useful for people who want to screen datasets,” she said.
When it comes to other concerns, such as the impact on creative workers whose work was used to train AI models, “a lot of them are upset and hurt,” Biderman said. “I totally understand where they’re coming from, from that point of view.” But she pointed out that some creators uploaded works to the internet under permissive licenses without realizing that, years later, AI training datasets, including Common Crawl, could use the works under those licenses.
“I think a lot of people in the 2010s, if they’d had a magic eight ball, would have made different licensing decisions,” she said.
Still, EleutherAI didn’t have a magic eight ball either, and Biderman and Skowron agree that when the Pile was built, AI training datasets were primarily used for research, where there are broad exemptions when it comes to licensing and copyright.
“AI technologies have only recently made the leap from something that would be primarily thought of as a research product and a scientific artifact to something whose primary purpose is commercial,” Biderman said. Google had put some of these models into commercial use in the past, she explained, but “training on very large, mostly web-scraped datasets became a question more recently.”
To be fair, Skowron said, legal scholars like Ben Sobel had been thinking about AI and the legal question of fair use for years. But even many people at OpenAI, “who you’d think would be in the know about the product pipeline,” they explained, didn’t realize the public, commercial impact that ChatGPT would have.
EleutherAI says open datasets are safer to use.
While it may seem counterintuitive to some, Biderman and Skowron also maintain that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what helps the resulting AI models be used safely and ethically in a variety of contexts.
“Achieving many of the policy goals or ethical ideals that people want requires a lot of visibility,” Skowron said, including thorough documentation of training. “And for many research questions you need actual access to the datasets, including those that are of great interest to copyright holders, such as memorization.”
For now, Biderman, Skowron and their colleagues at EleutherAI are continuing their work on the updated version of the Pile.
“It’s been in the works for about a year and a half, with about two months of meaningful work to go; I’m optimistic that we’ll train and release models this year,” Biderman said. “I’m curious to see how big a difference it makes. If I had to guess… it will make a small but meaningful one.”