Part of Exploring how to create mock patient data (synthetic data) from real patient data

Outcomes and lessons learned

The resulting code enables users to see how:

an input dataset can be constructed from an open-source dataset, MIMIC-III
SynthVAE can be adapted to be trained on a new input dataset with mixed data-types
SynthVAE can be used to produce synthetic data
synthetic data can be evaluated to assess its privacy, quality and utility
a pipeline can be used to tie together steps in a process for a simpler user experience

The evaluation techniques produce a variety of metrics as part of the report, so concerns about the quality of the synthetic data can be measured and addressed directly.

The approach outlined here is not intended to demonstrate a perfectly performing synthetic data generation model, but instead to outline a pipeline that enables the generation and evaluation of synthetic data. Issues such as overfitting to the training data and the potential for bias will be highlighted by the evaluation metrics, but the pipeline does not remedy them.

It’s important to emphasise that concerns around re-identification are reduced by using synthetic data but not completely removed. Looking at privacy metrics for the synthetic dataset will help the user to understand how well privacy has been preserved, but re-identification may still be possible.
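As an illustration of what a privacy metric can look like, the sketch below computes the distance from each synthetic record to its closest real record. This is not necessarily one of the metrics produced by the project's report; the function names, the toy data and the assumption that all features are numeric and already scaled are hypothetical choices made for the example.

```python
import numpy as np


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Assumes both inputs are 2D, purely numeric and already scaled (an assumption
    of this sketch, not a description of the project's own evaluation code).
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences with shape (n_synthetic, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    return distances.min(axis=1)


def exact_copy_rate(synthetic, real, tol=1e-9):
    """Fraction of synthetic rows that sit within `tol` of some real row."""
    return float((distance_to_closest_record(synthetic, real) <= tol).mean())


# Toy data: the first synthetic row duplicates a real row, the second does not.
real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
synthetic = np.array([[1.0, 1.0], [0.0, 3.0]])

dcr = distance_to_closest_record(synthetic, real)
copy_rate = exact_copy_rate(synthetic, real)
print(dcr, copy_rate)
```

Distances at or near zero suggest that the model may have memorised training records, which is one way overfitting can turn into a privacy risk. Larger distances are reassuring but do not prove that re-identification is impossible.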

Last edited: 24 April 2025 4:18 pm
