The 20th Century’s Industrial Age ran on oil, coal, and gas. As these cheap, abundant fossil fuels were put to work, they transformed the world economy, augmenting the utility of labor a thousandfold. In the early decades of the industrial revolution, booming progress concealed the dangers hidden in fossil fuels. As the decades advanced, however, it became clear that fossil fuels caused substantial harm to the environment – and that they were not abundant enough to sustain the entire globe’s development. The scarcity of usable fuel, along with the risks of its misuse, began to hamper economic growth.
Today’s Information Age runs on data. Cheap, abundant data has transformed the world economy, augmenting the utility of AI algorithms a thousandfold – perhaps more. But just as Big Oil brought big problems, Big Data brings big dilemmas, too. Training AI systems requires massive volumes of data. Sometimes the necessary data simply is not available – in risky sectors, there may not be enough long-tail or black-swan events to train the algorithm safely. Other times, the data is available but problematic – government classification, strict privacy regulations, and restrictive data-sharing laws can make it impossible to use. As much as 60% of the world’s data may be “unextractable” (to continue the fuel analogy) because of such concerns. Scarcity of useful data, along with the risks of misuse, has begun to hamper economic growth.
When it comes to fossil fuels, the problems of pollution and scarcity have forced industry to look for alternatives such as solar, wind, and nuclear power. But imagine a technology existed whereby a company could take a barrel of oil and make a synthetic duplicate of it. Imagine, further, that the synthetic barrel of oil it created had more extractable energy per gallon than the original oil, but none of the pollutants. The engine of industry could run on cheap, abundant, high-energy, pollutant-free synthetic oil forever. Such a technology would be world-changing, almost magical.
Such a technology does exist – but for data, not oil. It’s called synthetic data and it’s every bit as magical and world-changing as you imagine.
Synthetic data, broadly defined, consists of information generated artificially by a computer using a simulation or algorithm. It stands in contrast to real data, which is extracted organically from real behavior in the real world. Now, remember how the synthetic oil in our analogy above wasn’t just a duplicate of the original oil, but was actually cleansed of pollutants?
Well, the same is true of synthetic data.
The analogical equivalent of having pollutants in fuel is having protected information in data. The information might be protected because it can be used to personally identify individuals, because it violates privacy regulations, because it exposes classified information or trade secrets, or for many other reasons. Whatever the reason, the useful information in the data ends up “polluted” by the protected information.
Synthetic data removes the pollutant! It creates a verifiable synthetic “twin” dataset with the same statistical properties as the original data, but without any confidential or personal information. The result is a dataset rich in value and free from risk. The new data will look, act, and feel realistic for the purposes of data modeling and analysis, but won’t contain any of the protected information.
Here’s an example. Suppose a healthcare company wants to train an AI to understand a rare disease such as multiple acyl-CoA dehydrogenase deficiency (MADD) based on the biomarkers and symptomology of patients with the disease. However, the real-world patient data includes protected health information (PHI) under HIPAA, meaning it cannot be used without securing every patient’s express written consent. That’s quite a problem.
By deploying a synthetic data solution, the company could generate “synthetic patients” with MADD, with biomarkers, demographics, and symptomology that statistically resemble those of the original patients, but which exclude all of the PHI from the original data set. Problem solved!
The extent to which the synthetic data resembles the real data is referred to as its accuracy, while the extent to which it excludes protected information is called its privacy. Accuracy makes synthetic data useful; privacy makes it usable. Ideally, both accuracy and privacy are as high as possible. However, different types of synthetic data can vary greatly in accuracy and privacy. Not all synthetic data is created equal!
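For the technically curious, here is a minimal sketch of what an accuracy check might look like in practice: compare the summary statistics of a real table against its synthetic twin. The column names and the single-number metric are illustrative assumptions for this post, not an industry-standard benchmark, and real evaluations use far richer fidelity and privacy tests.

```python
import numpy as np
import pandas as pd

def statistical_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between per-column means and pairwise
    correlations of the two datasets (lower means closer)."""
    mean_gap = (real.mean() - synthetic.mean()).abs().mean()
    corr_gap = (real.corr() - synthetic.corr()).abs().values.mean()
    return float(mean_gap + corr_gap)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = pd.DataFrame({
        "age": rng.normal(45, 12, 1000),
        "biomarker": rng.normal(3.2, 0.8, 1000),
    })
    # A crude "synthetic twin": fresh draws from the same distributions.
    synthetic = pd.DataFrame({
        "age": rng.normal(45, 12, 1000),
        "biomarker": rng.normal(3.2, 0.8, 1000),
    })
    print(f"statistical gap: {statistical_gap(real, synthetic):.3f}")
```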
The oldest type of synthetic data, simulated synthetic data, is generated indirectly, by running simulations of physical systems that produce data. The activity that takes place in the simulation creates data that parallels the real data one would get from real activity in the real world. Simulated synthetic data is private, but not necessarily accurate. It will have the same statistical properties as real data only to the extent that the simulation that generated it has the same physical properties as the system it simulates. Any videogame developer will tell you that is no easy feat. This limits simulated synthetic data’s utility to systems that can be easily simulated. (One use case is training the AI for driverless cars.)
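To make that concrete, here is a minimal sketch of simulated synthetic data in Python: a toy braking model is run many times and its outputs are recorded as a dataset in place of real road measurements. The physics and noise levels are illustrative assumptions, and the synthetic records are only as faithful as the model behind them, which is exactly the accuracy limitation just described.

```python
import numpy as np
import pandas as pd

def simulate_braking(n_runs: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Run a toy braking simulation and return its outputs as a dataset."""
    rng = np.random.default_rng(seed)
    speed = rng.uniform(10, 35, n_runs)        # initial speed in m/s
    friction = rng.normal(0.7, 0.05, n_runs)   # tyre-road friction coefficient
    # Stopping distance from basic kinematics (v^2 / (2 * mu * g)),
    # plus measurement noise so the records resemble sensor data.
    distance = speed ** 2 / (2 * friction * 9.81)
    distance += rng.normal(0, 0.5, n_runs)
    return pd.DataFrame({
        "speed_mps": speed,
        "friction": friction,
        "stop_distance_m": distance,
    })

if __name__ == "__main__":
    print(simulate_braking().head())
```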
A newer type of synthetic data is model-generated synthetic data. It is created by an algorithm that produces new information with the same statistical properties as the original information, eliminating the need to build a simulation of the system that created the real data. The first model-generated synthetic data was developed by Donald Rubin, professor of statistics at Harvard University, in 1993 for use in the US Census. A number of refinements to Rubin’s approach have been developed since then, and companies now deploy different techniques for synthesizing data:
- Parametric techniques use statistical tools such as marginal distributions and copulas to synthesize the data (see the sketch after this list). Parametric techniques can achieve good privacy but typically have poor accuracy or require significant manual tuning.
- Machine learning techniques use GANs (generative adversarial networks), autoencoders (another neural network architecture), or decision trees to synthesize the data. Machine learning techniques offer good accuracy, especially if the generation technique has been built for a specific purpose. However, the privacy they provide is questionable at best.
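As promised above, here is a minimal sketch of the parametric approach using a Gaussian copula: each column’s marginal distribution is modeled empirically, and the dependence between columns is captured with a correlation matrix. This is a simplified teaching example, not any vendor’s production method, and the column names in the usage snippet are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

def gaussian_copula_synthesize(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Synthesize n rows that mimic the marginals and correlations of `real`."""
    rng = np.random.default_rng(seed)
    # 1. Rank-transform each column to (0, 1), then map to standard normal.
    ranks = real.rank(method="average") / (len(real) + 1)
    z = stats.norm.ppf(ranks.values)
    # 2. Capture the dependence between columns as a correlation matrix.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample new points in Gaussian space with the same correlation.
    samples = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n)
    # 4. Map back to the original scale through each column's empirical quantiles.
    u = stats.norm.cdf(samples)
    return pd.DataFrame(
        {col: np.quantile(real[col], u[:, i]) for i, col in enumerate(real.columns)}
    )

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    real = pd.DataFrame({
        "age": rng.normal(40, 10, 500),
        "income": rng.lognormal(10, 0.5, 500),
    })
    synthetic = gaussian_copula_synthesize(real, n=500)
    print(synthetic.describe())
```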
The newest type of synthetic data is clean synthetic data. Clean synthetic data is generated by a combination of machine-learning techniques and differentially private mechanisms. It offers strong accuracy for virtually any use case while simultaneously providing good privacy, making it superior to both simulated and model-generated synthetic data. Diveplane’s GEMINAI™ synthetic data platform is the leading platform for the generation of clean synthetic data.
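To show what a differentially private mechanism looks like in its simplest form, here is a minimal sketch that adds Laplace noise to the histogram counts of a sensitive column and then samples synthetic values from the noisy histogram. It illustrates only the privacy building block; clean synthetic data platforms combine such mechanisms with much richer generative models, and the ages used here are randomly generated for demonstration only.

```python
import numpy as np

def dp_histogram_synthesize(values, bins, epsilon, n, seed=0):
    """Sample n synthetic values from a Laplace-noised histogram of `values`."""
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(values, bins=bins)
    # Laplace mechanism: one person changes one bin count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    # Pick bins according to the noisy proportions, then draw values
    # uniformly within each chosen bin.
    chosen = rng.choice(len(probs), size=n, p=probs)
    return rng.uniform(edges[chosen], edges[chosen + 1])

if __name__ == "__main__":
    real_ages = np.random.default_rng(1).normal(45, 12, 2000)
    synthetic_ages = dp_histogram_synthesize(real_ages, bins=20, epsilon=1.0, n=2000)
    print(np.round(synthetic_ages[:5], 1))
```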
Let’s return to our example. The healthcare company needs to decide what type of synthetic data to generate and use. To create simulated synthetic data, the company would need an incredibly complex simulation of the human body and the disease process, then run that simulation to produce the simulated data. No such simulation exists, and creating one would be a decades-long effort. So simulated synthetic data is out. It’s simply impossible to create for this (and most other) use cases.
To create model-generated or clean synthetic data, however, the company would just need to feed data extracted from a community of real-world patients with the disease into the appropriate platform. In a matter of hours or days, they’d have synthesized the data. But could they use it?
If the company chose a model-generated synthetic data platform that relies on parametric techniques, the data’s accuracy might not be good enough to provide useful results. If they chose a platform that relies solely on machine learning techniques, the data’s privacy might not be good enough to be usable without violating HIPAA. Only if the company generates clean synthetic data using a platform like Diveplane’s GEMINAI can they rely on the data being both accurate and private.
GEMINAI’s clean synthetic data has an additional advantage. Remember that our imaginary synthetic oil was not just pollutant-free; it also had more extractable energy than the original oil. The analogical equivalent of energy in oil is value in data. Synthetic data created using GEMINAI can actually be more valuable than the real data it is derived from.
How is that possible? Imagine that our hypothetical healthcare company specifically wants to understand late-onset MADD, so it only wants patient data from adults who develop the disease in middle age. However, the vast majority of MADD patient data comes from neonatal patients, who are born with the disease. As such, even if the real-world data could be used, it wouldn’t work – there isn’t enough real adult data for the AI to learn from, and training it on neonatal patients would yield misleading results.
But by deploying GEMINAI, the company could use the data from the small number of actual patients of the appropriate age to create data for a large number of virtual patients with the same range of ages and symptoms. So the synthetic data doesn’t just copy over the value of the old data – it creates new value!
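Here is a minimal sketch of what that amplification could look like: fit a simple model to the handful of adult-onset records and sample a large virtual cohort from it. The columns and the multivariate-normal model are illustrative assumptions; a real platform would use a far more expressive generator with explicit privacy controls.

```python
import numpy as np
import pandas as pd

def amplify_subgroup(subgroup: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Fit a multivariate normal to a small subgroup and sample n synthetic rows."""
    rng = np.random.default_rng(seed)
    mean = subgroup.mean().values
    cov = np.cov(subgroup.values, rowvar=False)
    samples = rng.multivariate_normal(mean, cov, size=n)
    return pd.DataFrame(samples, columns=subgroup.columns)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Suppose only 40 real adult-onset records are available.
    adults = pd.DataFrame({
        "age_at_onset": rng.normal(48, 6, 40),
        "biomarker": rng.normal(2.9, 0.7, 40),
    })
    virtual_cohort = amplify_subgroup(adults, n=5000)
    print(virtual_cohort.describe())
```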
Synthetic data is a truly revolutionary technology, and, when cleanly generated using the latest techniques, it can unlock incredible value while protecting the privacy and confidentiality of personal or classified information. If your enterprise is not using synthetic data yet, it will be soon. Synthetic data is the future of the Information Age, and that future is clean, private, and accurate.