Part of Exploring how to create mock patient data (synthetic data) from real patient data
Outcomes and lessons learned
The resulting code enables users to see how:
- an input dataset can be constructed from an open-source dataset, MIMIC-III
- SynthVAE can be adapted to be trained on a new input dataset with mixed data-types
- SynthVAE can be used to produce synthetic data
- synthetic data can be evaluated to assess its privacy, quality and utility
- a pipeline can be used to tie together steps in a process for a simpler user experience
The evaluation techniques produce a variety of metrics as part of the report, so concerns around the quality of the synthetic data can be directly addressed and measured.
The approach outlined here is not intended to demonstrate a perfectly performing synthetic data generation model, but to outline a pipeline that enables the generation and evaluation of synthetic data. Issues such as overfitting to the training data and the potential for bias will be highlighted by the evaluation metrics, but will not be remedied.
It’s important to emphasise that concerns around re-identification are reduced by using synthetic data but not completely removed. Looking at privacy metrics for the synthetic dataset will help the user to understand how well privacy has been preserved, but re-identification may still be possible.
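As an illustration of what a privacy metric can look like, the minimal sketch below computes a distance-to-closest-record check: for each synthetic row, how far away is the nearest real row? This is not necessarily one of the metrics produced by the project's report. The function name `distance_to_closest_record` and the toy data are hypothetical, and the sketch assumes all features are numeric and already on comparable scales.

```python
import math

def distance_to_closest_record(synthetic_rows, real_rows):
    """Return, for each synthetic row, the Euclidean distance to its nearest real row.

    Hypothetical helper for illustration; assumes numeric, comparably scaled features.
    """
    return [
        min(math.dist(synthetic, real) for real in real_rows)
        for synthetic in synthetic_rows
    ]

# Toy data (made up for this sketch, not patient data).
real = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0)]

dcr = distance_to_closest_record(synthetic, real)

# A distance of zero means a synthetic row exactly reproduces a real row.
exact_copies = sum(1 for d in dcr if d == 0.0)
```

Very small distances suggest the model may have memorised training records, which is one way overfitting can weaken privacy. Larger distances are reassuring but do not by themselves rule out re-identification.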
Last edited: 24 April 2025 4:18 pm