What we did

The teams explored how SynthVAE could be used to generate synthetic data, how that data would be evaluated and how the whole process could be documented for others to re-use.

If you would like greater technical detail about this project, please read the version on the Skunkworks GitHub website.

They sought to:

increase the range of synthetic data types that SynthVAE can generate
create a standard series of checks that can be carried out on the data produced, so that people can better understand its characteristics
implement a structure to allow users to run the full functionality with a single piece of code.

To broaden the range of data SynthVAE could produce, the teams needed an input dataset containing several different data types.

The teams chose to work from a starting dataset that was already in the public domain. This meant people wishing to use the code after release could access and use the same dataset with which the project was developed. MIMIC-III was selected because the size and variety of its data would enable them to produce an input file that would closely match the broad range of typical hospital data.

From the raw MIMIC-III files, they produced a single dataset describing treatment provided to a hypothetical set of patients. It looked similar to datasets that might be encountered in a real hospital setting, helping to keep this project as relevant as possible to anyone wishing to explore the use of synthetic data for health and care.

1. Adapting SynthVAE

SynthVAE was originally written primarily to generate synthetic data from both continuous data (numerical data that can take any value within a range) and categorical data (data that can be divided into groups). The inclusion of other data types (like dates) in the new input dataset meant SynthVAE needed to be adapted to handle the new set of variables.
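To illustrate why dates need special handling, here is a minimal sketch of one common approach: converting each date to a number of days since a fixed reference date, so it can be treated like a continuous variable, and converting model output back afterwards. The source does not describe how SynthVAE itself handles dates, so the function names and the reference date below are assumptions made purely for illustration.

```python
from datetime import date

# Hypothetical reference date (an assumption): any fixed date works,
# as long as encoding and decoding use the same one.
REFERENCE_DATE = date(2000, 1, 1)


def encode_date(value: date) -> int:
    """Turn a date into a number of days since the reference date."""
    return (value - REFERENCE_DATE).days


def decode_date(days: float) -> date:
    """Turn a (possibly fractional) numeric value back into a calendar date."""
    return date.fromordinal(REFERENCE_DATE.toordinal() + round(days))


admission = date(2020, 3, 15)
encoded = encode_date(admission)

# A generative model would typically output a float rather than a whole number,
# so decoding rounds to the nearest day.
restored = decode_date(encoded + 0.2)
```

Encoding dates this way keeps their ordering and spacing, which a plain categorical encoding would lose.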

2. Producing synthetic data

With suitable data sourced and a useful input file created, the teams used that file to train a SynthVAE model that could generate synthetic data. The model was used to generate a synthetic dataset containing several million entries, a substantial increase on volumes previously produced using SynthVAE.

This wasn’t without challenges, as SynthVAE hadn’t been substantially tested using dates or large volumes of data. However, SynthVAE was successfully adapted to produce a synthetic version of the input data from MIMIC-III.

3. Creating a checking process

To evaluate the privacy, quality and utility of the synthetic data produced, a set of checks was needed. There is currently no industry standard, so the teams chose a range of evaluation approaches designed to provide the broadest possible assessment of the data.

The process aimed to check whether the synthetic data could stand in for the real data without causing a change in performance (a property known as utility). Additional checks aimed to make the evaluation more robust, for example by confirming that no records are identical in the synthetic and real datasets, and to provide visual aids so that users can see what differences are present in the data.
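The identical-records check mentioned above can be sketched in a few lines. This is an illustrative example only, not the project's actual implementation: the function name and the toy rows are assumptions, and each record is represented as a simple tuple of values.

```python
def count_exact_matches(real_rows, synthetic_rows):
    """Count synthetic rows that are identical to at least one real row."""
    # A set gives fast membership tests, which matters for large datasets.
    real_set = {tuple(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)


# Toy data invented for this example: (sex, age, admission date).
real = [("F", 54, "2020-03-15"), ("M", 71, "2020-04-02")]
synthetic = [("F", 55, "2020-03-16"), ("M", 71, "2020-04-02")]

matches = count_exact_matches(real, synthetic)
```

A non-zero count would flag synthetic records that reproduce a real record exactly, which is the kind of result a report could surface to the user.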

These checks were combined and their results collected in a web-based report, to allow results to be packaged and shared with any data produced.

4. Creating a pipeline

Finally, the teams pulled these steps into a single workflow process for others to follow.

The input data generation, SynthVAE training, synthetic data production and output checking processes were chained together, creating a single flow to train a model, produce synthetic data and then evaluate the final output.

To make the end-to-end process as user-friendly as possible, the teams used Kedro, a pipelining library developed by QuantumBlack. This allowed each step in the workflow to be linked to the next, meaning users can run all parts of the process with a single command. It also gives users the ability to control the definitions within the pipeline and change it according to their needs.
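The idea of linking each step to the next can be shown with a small plain-Python sketch. This is not Kedro's API and not the project's code: the runner, the step functions and the data names are all hypothetical stand-ins, chosen only to show how named outputs from one step become the inputs of the next.

```python
def run_pipeline(steps, catalog):
    """Run each step in order, reading named inputs from the catalog
    and storing each result back under its output name."""
    for func, input_names, output_name in steps:
        args = [catalog[name] for name in input_names]
        catalog[output_name] = func(*args)
    return catalog


# Placeholder step functions standing in for the real stages.
def prepare_input(raw):
    return [value for value in raw if value is not None]


def train_model(data):
    return {"mean": sum(data) / len(data)}


def generate_synthetic(model, n_rows):
    return [model["mean"]] * n_rows


def evaluate(real, synthetic):
    return {"same_length": len(real) == len(synthetic)}


# Each step: (function, names of its inputs, name of its output).
steps = [
    (prepare_input, ["raw"], "input_data"),
    (train_model, ["input_data"], "model"),
    (generate_synthetic, ["model", "n_rows"], "synthetic_data"),
    (evaluate, ["input_data", "synthetic_data"], "report"),
]

result = run_pipeline(steps, {"raw": [1.0, None, 3.0], "n_rows": 2})
```

Because the steps are declared as data rather than hard-coded calls, a user could swap a step or change a parameter such as `n_rows` without touching the rest of the workflow, which is the kind of control the paragraph above describes.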


Last edited: 24 April 2025 4:19 pm
