OpenAI destroyed AI training data. The employees who collected it are gone.

Newly unsealed documents in a class-action lawsuit brought by the Authors Guild against OpenAI show that the startup deleted two large datasets, named “books1” and “books2,” which were used to train its GPT-3 AI models.

Lawyers for the Authors Guild said in court filings that the datasets potentially include “more than 100,000 published books” and are central to its allegations that OpenAI used copyrighted material to train AI models.

For several months, the guild has been seeking information from OpenAI about the datasets. The company initially resisted, citing privacy concerns, before eventually revealing that it had deleted all copies of the data, according to legal filings reviewed by Business Insider.

High-quality training data is a key part of the powerful AI models that are taking the tech world by storm. OpenAI and other companies used data from the Internet, including many books, to build these models. Many of those who created this content want to be paid for providing the intelligence behind these new AI products. Tech companies don't want to be forced to pay. The dispute is now being fought out in several court cases.

In a 2020 white paper, OpenAI described the “books1” and “books2” datasets as “a corpora of Internet-based books” and said they made up 16% of the training data that went into building GPT-3. The white paper also states that “books1” and “books2” together contain 67 billion tokens of data, or the equivalent of about 50 billion words. For comparison, the King James Bible contains 783,137 words.

The unsealed letter from OpenAI's attorneys, labeled “Top Secret – Attorneys View Only,” states that the use of “books1” and “books2” for model training was discontinued in late 2021 and that the datasets were deleted in mid-2022 due to non-use. The letter also states that none of the other data used to train GPT-3 has been deleted and that the Authors Guild has been offered access to these other datasets.

The unsealed documents also reveal that the two researchers who produced “books1” and “books2” are no longer employed by OpenAI. OpenAI initially declined to reveal the identities of the two employees.

The startup has since identified the employees to the Authors Guild's lawyers but has not publicly released their names. OpenAI has asked the court to keep the names of the two employees, as well as information about the datasets, under seal. The Authors Guild has opposed this, citing the public's right to know. The dispute continues.

“The models powering ChatGPT and our API today were not developed using these datasets,” OpenAI said in a statement on Tuesday. “These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and were deleted due to non-use in 2022.”
