Misconception 1: Synthetic data refers to anonymised data.
While there has been a lot of interest in synthetic data recently, we encountered different views on what type of data “synthetic data” actually refers to. In the context of anonymisation, “synthetic” is frequently used interchangeably with “modified”, that is, original data altered in some systematic way to make it more difficult to identify original data points. We distinguish between three approaches to synthetic data currently available on the market:
- Anonymised data, produced by a 1-to-1 transformation from original data. Examples include noise obfuscation, masking, or encryption. Crucially, each anonymised data point corresponds to an original point in some form, which means this approach closely preserves information about the source data.
- Artificial data, produced by an explicit probabilistic model via sampling. An arbitrary number of artificial data points can be sampled, with no direct connection to original data apart from what informed model building. This approach requires, and its quality is limited by, expert domain knowledge.
- Synthetic data, produced by a learned generative model of original data. While not in a 1-to-1 correspondence with original data, its general tendencies, such as global statistical patterns, are nonetheless informed by the original. Deep learning techniques can be leveraged to improve the faithfulness of generated data.
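The three approaches can be contrasted on toy data. The following is a minimal sketch, not part of the original post; the dataset, parameters, and the choice of a single Gaussian as the "learned" generative model are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data: a toy 2-D dataset (say, age and income), with some correlation.
original = rng.multivariate_normal(
    [40, 50_000], [[100, 30_000], [30_000, 4e8]], size=1000
)

# 1) Anonymised data: a 1-to-1 transformation (here, noise obfuscation).
#    Each row still corresponds to exactly one original record.
anonymised = original + rng.normal(0, [2, 5_000], size=original.shape)

# 2) Artificial data: sampled from an explicit, hand-built probabilistic model.
#    Its parameters come from domain knowledge, not from the data itself.
artificial = rng.multivariate_normal([38, 45_000], [[80, 0], [0, 3e8]], size=1000)

# 3) Synthetic data: sampled from a generative model *learned* from the data.
#    Here the "learned model" is simply a Gaussian fitted to the original data.
mu, cov = original.mean(axis=0), np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=1000)
```

Note that only the anonymised rows can be linked back to individual records; the artificial and synthetic samples share, at most, aggregate statistics with the original data.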
Figure 1: Three approaches to “synthetic” data.
In the scientific community, "simulated data" sometimes refers to either of the latter two, that is, the output of a generative model of some sort. We call data originating from a generative model with an explicit probabilistic structure (often hand-built) artificial, and data from a generative model learned from data via unsupervised machine learning synthetic. Real-world data is often very high-dimensional, exhibits complex correlations, and follows non-standard distributions, which conflicts with the limiting assumptions underlying many explicit probabilistic models. Recent breakthroughs in machine learning, however, make it possible to learn such complex distributions purely from data.
Misconception 2: Anonymised data can safely be shared.
As we have seen, anonymised data is obtained by a 1-to-1 mapping from original data, which by its nature has fundamental weaknesses. At its heart lies the trade-off between preserving as much useful information as possible while removing as many revealing details as necessary. This means that anonymisation techniques inherently degrade data quality in exchange for resistance to de-anonymisation attacks. Other approaches do not face the same problem, since generative models by definition abstract away from individual data points and instead learn aggregate tendencies.
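This trade-off can be made concrete with a toy experiment. The sketch below is illustrative only: the "re-identification rate" is a crude nearest-neighbour linkage proxy, not a real attack, and the data and noise scales are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
original = rng.normal(100, 15, size=(500, 1))  # 500 toy 1-D records

def utility_loss(noisy):
    # Error in a simple aggregate statistic (the mean).
    return abs(noisy.mean() - original.mean())

def reidentification_rate(noisy):
    # Fraction of noisy records whose nearest original record is
    # the one they came from (a crude linkage-attack proxy).
    d = np.abs(noisy - original.T)  # (500, 500) pairwise distances
    return (d.argmin(axis=1) == np.arange(len(original))).mean()

for scale in [0.1, 1.0, 10.0]:
    noisy = original + rng.normal(0, scale, size=original.shape)
    print(scale, reidentification_rate(noisy), utility_loss(noisy))
```

More noise lowers the linkage rate but increases the error in the aggregate statistic; there is no setting that optimises both at once.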
It is thus not surprising that there have been a number of data breaches in the past despite the use of anonymised data, such as the attack on the anonymised Netflix Prize Dataset. Such breaches can have severe implications for a company's reputation, as illustrated by the Anthem, Inc. and JPMorgan Chase cases. Researchers have identified various methods of de-anonymisation attack on anonymised data. Some recent examples include:
- An attack on a novel noise obfuscation method
- Limitations of differential privacy
- Attacks on k-anonymity
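To make the k-anonymity notion mentioned above concrete: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k records. A minimal sketch, with a hypothetical toy table (the function name and records are our own illustration):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check k-anonymity: every combination of quasi-identifier
    values must be shared by at least k records."""
    counts = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(counts.values()) >= k

# Toy table: generalised ZIP code and age bracket are quasi-identifiers.
table = [
    {"zip": "101**", "age": "20-30", "diagnosis": "flu"},
    {"zip": "101**", "age": "20-30", "diagnosis": "cold"},
    {"zip": "102**", "age": "30-40", "diagnosis": "flu"},
]

print(is_k_anonymous(table, ["zip", "age"], k=2))  # False: the 102** group has size 1
```

The third record sits alone in its quasi-identifier group, so anyone who knows that one person in ZIP 102** aged 30-40 is in the table learns their diagnosis; this is exactly the kind of structural weakness the attacks above exploit.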
Figure 2: Netflix data de-anonymisation.
What unites these incidents is that anonymisation procedures were used carelessly as a panacea for any privacy-related concern about data sharing, without an understanding of their fundamental limitations. However, with the advent of new regulations such as the GDPR and increasing public attention to privacy issues, the costs of such accidental slips have risen dramatically, putting additional pressure on data-sharing activities within and between companies.
Misconception 3: A reliable anonymisation method is expensive.
There are a number of open-source data anonymisation frameworks available, most notably ARX. ARX offers various obfuscating data transformations and analysis methods such as k-anonymisation. Note that perfect anonymisation in itself is not difficult: it can be achieved easily via heavy scrambling and noising of the original data. The challenge is consequently: how do we make anonymised or simulated data useful? As we have argued above, the usefulness of anonymised data is directly related to how much information the 1-to-1 mapping can preserve under the given privacy requirements. In the case of simulated data, however, even the definition of usefulness is far from obvious; this will be the focus of our next blogpost.
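The point that heavy scrambling is easy but destroys usefulness can be seen in a few lines. This sketch (a toy example of our own, not an ARX feature) shuffles each column independently, which severs record-level links while exactly preserving marginal statistics:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data with a strong correlation between the two columns.
x = rng.normal(0, 1, 1000)
data = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 1000)])

# "Perfect" anonymisation by scrambling: shuffle each column independently.
# Record-level links are destroyed -- and so is the correlation structure.
scrambled = data.copy()
for col in range(scrambled.shape[1]):
    rng.shuffle(scrambled[:, col])

corr_before = np.corrcoef(data, rowvar=False)[0, 1]
corr_after = np.corrcoef(scrambled, rowvar=False)[0, 1]
# Marginal means survive the shuffle; the correlation does not.
```

No scrambled row can be linked to an original record, yet any analysis that depends on the relationship between the columns is now worthless; this is the privacy/usefulness trade-off in its most extreme form.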
Next research post: How do we Define Usefulness of Simulated Data?