Part of Exploring how to create mock patient data (synthetic data) from real patient data

The challenge

This project aimed to provide others with a simple, re-usable way of generating safe and effective synthetic data to be used in technologies that improve health and social care.

Using real patient data for research and development carries with it safety and privacy concerns about the anonymity of the people behind the information. Various anonymisation techniques can be used to turn data into a form that does not directly identify individuals and where re-identification is not likely to take place. However, it is very difficult to entirely remove the chance of re-identification. Wide release of anonymised data will always carry some risks. Synthetic data aims to remove the need for such concerns because there is no “real patient” connected with the data.

There are many ways to generate synthetic data. One common challenge with synthetic data approaches is that they are usually configured specifically for a dataset. This is a problem because it means a significant amount of work is needed to update them for use with a different data source.

Additionally, once data has been produced, it can be difficult to know whether it is actually useful.

In a partnership project, the NHS Transformation Directorate’s Analytics Unit and the NHS AI Lab Skunkworks team sought to further improve an existing synthetic data generation model (called SynthVAE) and develop a framework for generating synthetic data that could be shared for others to re-use.

We have tried to explain the technical terms used in this case study. If you would like to explore the language of AI and data science, please visit our AI Dictionary.

Last edited: 24 April 2025 4:18 pm

The challenge

Chapters