OpenAI, Google and other tech companies train their chatbots with large amounts of data from books, Wikipedia articles, news stories and other sources on the Internet. But in the future, they hope to use something called synthetic data.
That’s because tech companies may be running out of high-quality text on the Internet to develop artificial intelligence. They also face copyright lawsuits from authors, news organizations and computer programmers for using their creations without permission. (In one such case, The New York Times sued OpenAI and Microsoft.)
Tech companies believe synthetic data will help reduce copyright issues and increase the supply of training material needed for AI. Here’s what to know about it.
What is synthetic data?
It is data generated by artificial intelligence.
Does this mean that tech companies want AI to be trained by AI?
Yes. Instead of training AI models with text written by people, tech companies like Google, OpenAI and Anthropic hope to train their technology with data generated by other AI models.
Does synthetic data work?
Not always. AI models get things wrong and make things up. They have also been shown to pick up biases that appear in the Internet data on which they are trained. So if companies use AI to train AI, those flaws can be amplified.
Is synthetic data widely used by tech companies right now?
A number of tech companies are experimenting with this. But because of the potential shortcomings of artificial data, it is not a major part of building AI systems today.
So why do tech companies say artificial data is the future?
Companies think they can improve the way they create synthetic data. OpenAI and others have explored a technique where two different AI models combine to produce synthetic data that is more useful and reliable.
One AI model generates data. Then a second model judges that data, much as a human reviewer would, deciding whether it is good or bad, accurate or not. AI models tend to be better at judging text than at writing it.
“If you give a technology two things, it’s very good at choosing which one looks good,” said Nathan Lyle, chief executive of AI startup SynthLabs.
The idea is that this will provide the high-quality data needed to train an even better chatbot.
Does this technique work?
In a sense. It all comes down to the second AI model: how good is it at judging text?
Anthropic has been the most vocal about its efforts to make this work. It fine-tunes other AI models using a “constitution” developed by the company’s researchers. The constitution teaches the model to choose texts that support certain principles, such as freedom, equality and a sense of fraternity, or life, liberty and personal security. Anthropic’s approach is known as “constitutional AI.”
Here’s how the two AI models can work together to generate synthetic data using a process like Anthropic’s: one model writes candidate text, the other evaluates it, and only the text the judge approves is kept for training.
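The generator-and-judge loop described above can be sketched in a few lines of Python. This is an illustrative toy only: in real systems both roles are played by large language models, whereas here tiny stand-in functions (hypothetical names, crude scoring heuristics) just mimic the pattern of producing several candidates and keeping the one the judge rates highest.

```python
def generate_candidates(prompt, n=4):
    """Model 1 (stand-in): produce several candidate texts for a prompt.

    A real generator would be a large language model; here we fake
    variation with fixed templates, including a deliberately bad one.
    """
    templates = [
        f"{prompt} is a method of creating training data.",
        f"{prompt} data data data",  # low-quality, repetitive candidate
        f"{prompt} lets one model produce text for another to learn from.",
        f"{prompt}???",
    ]
    return templates[:n]


def judge(text):
    """Model 2 (stand-in): score a candidate, as an AI judge might.

    Crude heuristic: reward varied vocabulary and complete sentences,
    penalize repeated words.
    """
    words = text.split()
    repetition_penalty = len(words) - len(set(words))
    completeness_bonus = 2 if text.endswith(".") else 0
    return len(set(words)) + completeness_bonus - 2 * repetition_penalty


def make_synthetic_example(prompt):
    """Keep only the candidate the judge rates highest."""
    candidates = generate_candidates(prompt)
    return max(candidates, key=judge)


print(make_synthetic_example("Synthetic data"))
```

The design point the article makes is visible even in this toy: the judge never has to write anything, it only has to rank, which is the easier task.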
Even so, humans need to make sure the second AI model stays on track. That limits how much synthetic data the process can generate. And researchers disagree on whether an approach like Anthropic’s will continue to improve AI systems.
Can synthetic data help companies circumvent the use of copyrighted information?
AI models generating synthetic data were themselves trained on human-generated data, much of which was copyrighted. So copyright holders can still argue that companies like OpenAI and Anthropic use copyrighted text, images and video without permission.
Jeff Clune, a computer science professor at the University of British Columbia who previously worked as a researcher at OpenAI, said AI models could eventually become more powerful than the human brain in some ways. But they will do so because they have learned from the human mind.
“To borrow from Newton: AI stands on the shoulders of large human data sets and sees further,” he said.