Using Generative AI to Improve Software Testing | MIT News

Generative AI is getting a lot of attention for its ability to generate text and images. But that media represents only a fraction of the data that permeates our society today. Data is generated every time a patient passes through a medical system, a storm affects a flight, or a person interacts with a software application.

Using generative AI to create realistic synthetic data for these scenarios can help organizations treat patients more efficiently, reroute planes, or improve software platforms, especially in situations where real-world data is limited or sensitive.

For the past three years, MIT spinout DataCebo has offered a generative software system called the Synthetic Data Vault to help organizations create synthetic data for tasks like testing software applications and machine learning models.

The Synthetic Data Vault, or SDV, has been downloaded more than 1 million times, with more than 10,000 data scientists using the open-source library to create synthetic tabular data. The founders, Principal Research Scientist Kalyan Veeramachaneni and alumna Neha Patki ’15, SM ’16, believe the company’s success is due to SDV’s ability to revolutionize software testing.

SDV goes viral

In 2016, Veeramachaneni’s group in the Data to AI Lab unveiled a suite of open-source generative AI tools to help organizations create synthetic data that matches the statistical properties of real data.

Companies can use synthetic data in place of sensitive information in their programs while preserving the statistical relationships between data points. They can also use synthetic data to run new software through simulations and see how it performs before releasing it to the public.

Veeramachaneni’s group first faced this problem while working with companies that wanted to share their data for research.

“MIT helps you see all these different use cases,” explains Patki. “You work with finance companies and healthcare companies, and all those projects are useful for developing solutions across industries.”

In 2020, the researchers founded DataCebo to build more SDV features for larger organizations. Since then, the use cases have been as impressive as they have been varied.

With DataCebo’s new flight simulator, for example, airlines can plan for rare weather events in a way that would be impossible using only historical data. In another application, SDV users synthesized medical records to predict health outcomes for patients with cystic fibrosis. A Norwegian team recently generated simulated student data using SDV to evaluate whether different admissions policies were meritocratic and bias-free.

In 2021, the data science platform Kaggle hosted a competition in which data scientists used SDV to create synthetic datasets, avoiding the use of proprietary data. Roughly 30,000 data scientists participated, building solutions and predicting outcomes based on the company’s realistic data.

And as DataCebo has grown, it has stayed true to its MIT roots: All of the company’s current employees are MIT alumni.

Supercharging software testing

While its open-source tools are being used for a variety of use cases, the company is focused on growing its traction in software testing.

“You need data to test these software applications,” Veeramachaneni says. “Traditionally, developers manually write scripts to create synthetic data. With generative models created using SDV, you can learn from a sample of collected data and then sample a large volume of synthetic data that has the same properties as the real data, or create specific scenarios and edge cases, and use the data to test your application.”
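The learn-then-sample workflow Veeramachaneni describes can be illustrated with a toy sketch. This is not SDV's actual implementation or API; it simply fits the mean and spread of one column from a small "real" sample and then draws as many synthetic rows as needed from that fitted model. All data values here are invented for illustration.

```python
import random
import statistics

# Toy "real" data: account balances collected from a live system.
real_balances = [120.0, 250.5, 90.0, 310.2, 175.8, 60.3, 480.0, 220.1]

# "Learn": capture simple statistical properties of the real sample.
mu = statistics.mean(real_balances)
sigma = statistics.stdev(real_balances)

# "Sample": draw far more synthetic rows than we had real ones.
random.seed(0)
synthetic_balances = [random.gauss(mu, sigma) for _ in range(1000)]

# The synthetic data mirrors the real data's statistics without copying any row.
print(round(statistics.mean(synthetic_balances), 1))
```

Real tabular synthesizers model many columns and the correlations between them, but the principle is the same: fit once, then sample any volume of test data.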

For example, if a bank wanted to test a program designed to reject transfers from accounts with no funds, it would have to simulate many accounts making simultaneous transactions. Doing that with manually created data would take a long time. With DataCebo’s generative models, customers can create any edge case they want to test.
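The bank scenario above can be sketched as a small edge-case generator. The `Transfer` record and its field names are hypothetical, invented for this illustration rather than taken from SDV or any bank's schema:

```python
import random
from dataclasses import dataclass

@dataclass
class Transfer:
    account_id: int
    balance: float
    amount: float

# Edge case to test: thousands of zero-balance accounts all attempting
# transfers at once, which the system under test should reject.
random.seed(1)
edge_case = [
    Transfer(account_id=i, balance=0.0, amount=round(random.uniform(10, 500), 2))
    for i in range(10_000)
]

# Every generated transfer exceeds the account balance by construction.
rejected = [t for t in edge_case if t.amount > t.balance]
print(len(rejected))  # 10000
```

Writing these 10,000 rows by hand would be tedious; generating them from a parameterized scenario takes a few lines, which is the point Veeramachaneni is making.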

“It’s common for industries to have data that is somewhat sensitive,” says Patki. “Often when you’re working in a domain with sensitive data, you’re dealing with regulations, and even when there aren’t legal regulations, it’s in companies’ best interest to be diligent about who has access to what at what time. So synthetic data is always better from a privacy perspective.”

Measuring synthetic data

Veeramachaneni believes DataCebo is advancing the field of what he calls synthetic enterprise data, or data generated from user behavior on large companies’ software applications.

“This type of enterprise data is complex, and unlike language data, there is no universal availability of it,” says Veeramachaneni. “When people use our publicly available software and report back whether it works on certain patterns, we learn a lot from these unique patterns, and that allows us to improve our algorithms. In a way, we’re building a corpus of these complex patterns, something that is already readily available for language and images.”

DataCebo has also recently released features to improve the utility of SDV, including tools to evaluate the “realism” of generated data, called the SDMetrics library, as well as SDGym, a way to compare the performance of different models.

“It’s about making sure organizations trust this new data,” Veeramachaneni says. “[Our tools offer] programmable synthetic data, which means we allow enterprises to insert their own specific insight to create more transparent models.”

As companies in every industry rush to adopt AI and other data science tools, DataCebo is ultimately helping them do so in a way that is more transparent and accountable.

“In the next few years, synthetic data from generative models will transform all data work,” says Veeramachaneni. “We believe that 90 percent of enterprise operations can be done with synthetic data.”
