Usefulness of accurate synthesized data for enterprise data science and business intelligence.
We use the term synthetic data throughout this post. The term has been used loosely in the market recently; here it refers to accurate data synthesized by a precise generative model of an original data set. The purposes of synthetic data range from “anonymising sensitive data sets for sharing with partners” and “testing compatibility issues of an enterprise system” to more sophisticated applications in data science tasks. It turns out that synthetic data has to be extremely accurate to be useful for advanced data science and business intelligence tasks.
At Synthesized we have seen increased demand for our offering from regulated companies in the last 6 months. Globally, over $3.5bn is already spent on data security solutions according to Gartner, and that budget is only expected to increase over the next 4 years according to International Data Corporation.
Figure 1: Techniques to provision data for machine learning, data science and business intelligence.
Techniques to provision data for machine learning, data science and business intelligence.
We have identified four main ways of provisioning tabular data from data sources to internal data science and BI teams in the enterprise market at the moment:
- Provisioning raw original data.
Given that about 92% of executives are concerned about compliance regulations around data provisioning and sharing, it may come as a surprise that many corporates (about 40% according to our recent analysis) still opt to carry out data science projects with original sensitive data, at the risk of a (sometimes accidental) data breach.
- Provisioning (pseudo-) anonymised data.
Various anonymisation methods have become an important feature of many enterprise products, such as DataRobot and H2O. In fact, most widely used databases offer techniques such as differential privacy and k-anonymity by default.
- Provisioning encrypted data (incl. asymmetric and homomorphic encryption).
Fully homomorphic encryption enables computation on encrypted data without leaking any information about the underlying data. In a nutshell, one party encrypts some input data, while another party, which does not have access to the decryption key, performs some computation on this encrypted input. The final result is also encrypted and can be recovered only by the party that possesses the secret key. It is natural to classify both (pseudo-) anonymisation and encryption as 1-to-1 mapping methods. The resulting data can typically be used for sharing and static testing but has limitations when used in data science and machine learning tasks (source).
- Provisioning accurate synthesized data.
Synthesized data is the output of a precise generative model of the original data. The model ensures that the statistical properties of the data are preserved, so that synthesized and original data cannot be told apart with “the naked eye” or with advanced statistical tools, as demonstrated in the next section.
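To make the encrypted-data option in the list above more concrete, here is a minimal sketch of computing on encrypted values. It uses a toy Paillier scheme, which is only additively homomorphic (not fully homomorphic) and is swapped in here because it fits in a few lines. The primes, constants and function names are our own illustrative assumptions, and the key size is far too small for any real use.

```python
import math
import secrets

# Toy Paillier cryptosystem (additively homomorphic only).
# ASSUMPTION: tiny demo primes chosen for readability, never for security.
P, Q = 1789, 1861
N = P * Q                                         # public modulus
N2 = N * N
G = N + 1                                         # common simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                              # private: inverse of LAM mod N (Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N using only the public values (N, G)."""
    while True:
        r = secrets.randbelow(N - 1) + 1          # random blinding factor in [1, N-1]
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Decrypt using the private values (LAM, MU)."""
    x = pow(c, LAM, N2)                           # x is congruent to 1 mod N
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Run by a party WITHOUT the private key: Enc(a) * Enc(b) decrypts to a + b."""
    return (c1 * c2) % N2


# The data owner encrypts, a second party adds blindly, the owner decrypts.
encrypted_sum = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(encrypted_sum))
```

The second party only ever sees ciphertexts, which is the property described above. It also hints at the limitation: each supported operation has to be expressed as arithmetic on ciphertexts, which makes general data science workflows hard to run this way.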
Accurate synthetic data for privacy-preserved data science and business intelligence.
As new artificial intelligence methods continue to disrupt our daily lives, the main obstacle is shifting from developing new methods to acquiring data for validating their performance. Unfortunately, when using actual data, businesses are confronted with legal hurdles as privacy laws become increasingly strict all over the planet. Accurate synthesized data is a powerful alternative.
Figure 2: Illustration of the process of data synthesis.
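As a rough illustration of the fit-then-sample idea behind data synthesis, the sketch below fits a very simple generative model (a bivariate Gaussian) to two numeric columns and draws new rows from it. This is not the Synthesized software or its method; the model choice, the function names and the toy "age" and "income" columns are all assumptions made purely for illustration.

```python
import math
import random
import statistics


def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    rho = model["rho"]
    rest = math.sqrt(max(0.0, 1.0 - rho * rho))   # guard against rounding error
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + rest * z2)
        rows.append((x, y))
    return rows


# ASSUMPTION: made-up "original" data, generated here only to have something to fit.
rng = random.Random(0)
ages = [rng.gauss(40, 10) for _ in range(2000)]
incomes = [1000 * a + rng.gauss(0, 5000) for a in ages]

model = fit_gaussian(ages, incomes)
synthetic = sample(model, 5000, rng)

# Refit on the synthetic rows to check that the summary statistics carry over.
refit = fit_gaussian([r[0] for r in synthetic], [r[1] for r in synthetic])
print(round(model["rho"], 2), round(refit["rho"], 2))
```

No synthetic row is a copy of an original row, yet the means, spreads and correlation are reproduced. Real tabular data needs far richer models than a Gaussian, which is where accuracy becomes the hard part.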
To quantify the similarity of the synthesized data to the original data and its usefulness for enterprise data science and business intelligence, the Synthesized software provides a comprehensive statistical assessment. Here we illustrate two example studies of tabular datasets provided by our partners in Europe.
- For insurance credit data provided by one of our partners in Europe, the software showed not only that the distance between key high-dimensional properties was small, as can be seen below, but also that advanced conditional properties were similar.
Figure 3: Comparison of Average Age grouped by delinquency in both original and synthesized data sets.
- The task of predicting whether a client will repay a loan is a critical business need for lending businesses. For the data provided by a financial service company dedicated to providing loans to the unbanked population, the software showed that advanced conditional properties, such as the frequency of employment days, were similar.
Figure 4: Comparison of the frequency of employment days in both original and synthesized data sets.
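The two comparisons above can be approximated with a couple of simple checks: a grouped average (for example, average age by delinquency) and a distance between the distributions of a single column. The sketch below is only our illustration of that kind of assessment, not the statistical assessment the Synthesized software performs; the function names, column names and all numbers are made up.

```python
import bisect
from collections import defaultdict
from statistics import fmean


def grouped_mean(rows, group_key, value_key):
    """Mean of value_key within each value of group_key, e.g. average age by delinquency."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: fmean(values) for group, values in groups.items()}


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:                               # the gap can only change at sample points
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# ASSUMPTION: tiny made-up tables standing in for original and synthesized data.
original = [
    {"delinquent": 0, "age": 52}, {"delinquent": 0, "age": 48},
    {"delinquent": 1, "age": 31}, {"delinquent": 1, "age": 35},
]
synthesized = [
    {"delinquent": 0, "age": 50}, {"delinquent": 0, "age": 51},
    {"delinquent": 1, "age": 34}, {"delinquent": 1, "age": 33},
]

print(grouped_mean(original, "delinquent", "age"))     # {0: 50.0, 1: 33.0}
print(grouped_mean(synthesized, "delinquent", "age"))  # {0: 50.5, 1: 33.5}
print(ks_statistic([r["age"] for r in original],
                   [r["age"] for r in synthesized]))   # 0.25
```

A statistic of 0 means the two empirical distributions coincide and 1 means they do not overlap at all, so smaller values indicate closer agreement on that column.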
This is just the beginning…
In the next blog post, we will show how to make accurate synthetic data privacy-preserving as well. In fact, we argue that the following relative comparison holds.
Figure 5: Optimal trade-off between privacy and accuracy of accurate synthetic data.
Reach out to us for early access to the blog post. Our product demo is available and can be requested here as well. We’re hiring for both engineering and commercial roles! Send a resume/request to email@example.com for more details.
This post is prepared by the Synthesized Core Team.