Artificial data pilot
Artificial data sets provide users with large volumes of data that share some of the characteristics of real data while protecting patient confidentiality. They are designed to model the structure of real data but are completely artificial – they do not contain any actual patient records. We are piloting this new service with a limited number of artificial data sets.
About the pilot
We hold over 200 healthcare data sets. These data sets are fundamental to analysis and research that accelerates the discovery of new treatments and helps the NHS to plan better services.
Since patient data is sensitive, only approved users, with approved projects, are allowed to access and analyse data. As such, applying for access can be a lengthy process. Prior to receiving access, it can be difficult for prospective data users to know what:
- fields are present
- the data looks like
- data sets are available
We are piloting this new service to help users understand the columns, data types and approximate value ranges that would appear in real data. Whilst our artificial data sets can be used to improve understanding of data sets and data platforms, and for building and testing data pipelines, they cannot be used for analysis, because results drawn from them would not carry over to real data.
Artificial data sets can be used in advance of real data or, for certain use cases, instead of it. Real data is only released to requestors where there is an appropriate legal basis. Using artificial data reduces bottlenecks for new users of data and data platforms and minimises the amount of personal data being processed for specific tasks.
This artificial data represents the formatting, structure and volume of the original data set. It does not preserve relationships between fields and it is not possible to use artificial data to reidentify individuals.
During this pilot we hope to understand how artificial data can be valuable to our users, identify other data sets for which it would be useful to provide artificial extracts, and gather further feedback on how to enhance the service.
Anticipated benefits
- Users can collaborate and share code across different environments where real data is not accessible, minimising the risk to patient privacy
- Using artificial data minimises the use of real data during development and testing work
- Users can develop and test code prior to accessing real data
- Artificial data can be used for teaching purposes to train the next generation of data engineers, scientists and clinicians
Access artificial data sets
There are currently 3 artificial data sets available from Hospital Episode Statistics (HES). All artificial data is generated entirely from anonymous, non-identifiable aggregated data.
What is the difference between a sample and a full data set?
A sample data set contains 10,000 rows, whereas a full data set contains 1 million rows.
Artificial data generator
Explore the code used to generate this artificial data. This codebase could be adapted to generate your own artificial data.
Artificial data generated by this method imitates some of the statistical properties of real data, but never contains any real patient data.
The artificial data generator uses aggregate, anonymised data to randomly create artificial records. Statistical relationships between columns are not preserved, meaning that individual records are not an accurate representation of individual real records. Any attempt to reverse engineer the artificial data generator would only yield the aggregate statistics on which the artificial data is based.
For example, in a data set whose records describe genders, ages, ICD-10 diagnoses, departments visited, and appointment dates and times, each artificial record would be generated by randomly picking a value for each field independently and then putting these together to form a ‘complete record’. Most records generated in such a way would be unrealistic. For example, an artificial patient record may have ‘geriatrics’ as the department being visited and an age of 5 years old.
Only by coincidence would some artificial records resemble real records.
How the data is generated
The artificial data generator randomly generates artificial data by sampling from anonymous univariate frequency distributions derived from real data.
Diagram: Overview of process to generate artificial data
Aggregation and anonymisation
The original data is aggregated on a column-by-column basis, with each column treated independently. The outputs of this stage are frequency tables of the unique values in each column. At this stage, key identifiers (such as patient ID) are removed and small numbers are suppressed to prevent reidentification at a later stage.
The aggregated data is anonymous, with each column aggregated independently of the others. For example, the aggregates could be used to approximately determine the number of female patients or the number of 49-year-old patients in the real data, but not the number of 49-year-old females.
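The aggregation stage described above can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the function name aggregate_columns, the field names, and the suppression threshold of 5 are all assumptions made for the example, and the real pipeline may suppress small numbers differently (for instance by rounding rather than dropping values).

```python
from collections import Counter

# Hypothetical threshold: counts below this are suppressed. The value 5 is an
# assumption for illustration, not a figure stated for the real pipeline.
SUPPRESSION_THRESHOLD = 5


def aggregate_columns(records, identifier_fields=("patient_id",),
                      threshold=SUPPRESSION_THRESHOLD):
    """Build one frequency table per column, treating each column independently."""
    tables = {}
    for record in records:
        for field, value in record.items():
            if field in identifier_fields:
                continue  # key identifiers are removed entirely
            tables.setdefault(field, Counter())[value] += 1
    # Drop values with small counts so rare values cannot aid reidentification.
    return {
        field: {value: count for value, count in counts.items() if count >= threshold}
        for field, counts in tables.items()
    }


# Toy input records (invented for this example).
records = (
    [{"patient_id": i, "sex": "F", "age": 49} for i in range(6)]
    + [{"patient_id": 100 + i, "sex": "M", "age": 30} for i in range(7)]
    + [{"patient_id": 200, "sex": "F", "age": 101}]  # rare age, suppressed below
)

tables = aggregate_columns(records)
print(tables)
```

The resulting tables give the number of female patients and the number of 49-year-old patients separately, but hold no count of 49-year-old females, because no cross-column information is retained.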
Random data generation
The aggregated data is used to randomly generate artificial data. Each column, for each record, is generated at random. For example, artificial records assigned a relatively low age may have a geriatric diagnosis code. This lack of record-level realism ensures the artificial data is anonymous, so that it cannot be used to reidentify individuals. It also means that it would not be possible to gain insights or build statistical models that would transfer onto real data.
Since the artificial data is generated using anonymised data, the artificial data itself is anonymous – it only contains as much information as the anonymous aggregate statistics from which it is derived.
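As a rough sketch of this step, the example below draws each column independently from a per-column frequency table using Python's standard random module. The function name generate_records and the toy frequency tables are hypothetical, chosen only to illustrate the idea; the published generator may be organised quite differently.

```python
import random


def generate_records(frequency_tables, n_records, seed=None):
    """Generate records by sampling every column independently of the others."""
    rng = random.Random(seed)
    columns = {}
    for field, counts in frequency_tables.items():
        values = list(counts)
        weights = list(counts.values())
        # Weighted sampling with replacement from this column's frequencies only.
        columns[field] = rng.choices(values, weights=weights, k=n_records)
    fields = list(frequency_tables)
    return [{field: columns[field][i] for field in fields} for i in range(n_records)]


# Toy aggregate tables (invented for this example).
frequency_tables = {
    "age": {5: 10, 49: 30, 82: 20},
    "department": {"paediatrics": 15, "geriatrics": 25},
}

artificial = generate_records(frequency_tables, n_records=1000, seed=0)
print(artificial[:3])
```

Because age and department are drawn separately, some generated records pair an age of 5 with ‘geriatrics’. Each column roughly follows its own frequency table, but the combinations carry no information about real patients.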
Post processing
Once artificial data has been generated, some basic checks and rules are applied to make the data appear more realistic.
For example, randomly generated birth and death dates may be swapped to ensure sensible ordering.
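A post-processing rule of this kind might look like the sketch below. The function name fix_date_order and the field names birth_date and death_date are assumptions for illustration, not names taken from the generator's codebase.

```python
from datetime import date


def fix_date_order(record, birth_field="birth_date", death_field="death_date"):
    """Return a copy of the record with birth and death dates in a sensible order."""
    birth = record.get(birth_field)
    death = record.get(death_field)
    # Swap only when both dates are present and the death date precedes the birth date.
    if birth is not None and death is not None and death < birth:
        return {**record, birth_field: death, death_field: birth}
    return record


# Toy record with the two dates in the wrong order (invented for this example).
raw = {"birth_date": date(2001, 5, 17), "death_date": date(1950, 1, 2)}
fixed = fix_date_order(raw)
print(fixed)
```

The rule only changes how plausible a record looks. It does not make the record correspond to any real patient, since both dates were drawn at random in the first place.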
Finally, randomly generated dummy values for identifying fields (such as ‘patient ID’) are added. These fields are generated based on template patterns structured to ensure unrealistic values with the correct properties (for example, data type and length).
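One way to picture template-based dummy identifiers is sketched below. The template syntax (‘A’ for a random uppercase letter, ‘9’ for a random digit, anything else kept literally) and the ‘ZZ’ prefix are entirely hypothetical; the real templates used by the generator are not described here.

```python
import random
import string


def dummy_identifier(template, rng):
    """Fill a template: 'A' -> random uppercase letter, '9' -> random digit.

    Any other character is copied through unchanged, so a fixed prefix can
    mark the value as obviously artificial.
    """
    out = []
    for ch in template:
        if ch == "A":
            out.append(rng.choice(string.ascii_uppercase))
        elif ch == "9":
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(42)
# Hypothetical template: a literal 'ZZ' prefix followed by eight random digits.
patient_id = dummy_identifier("ZZ99999999", rng)
print(patient_id)
```

The generated value has the right data type and length for the field, while the fixed prefix keeps it outside the range of values a real identifier would take under this assumed scheme.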
Data Access Request Service
To apply for access to real healthcare data please visit our Data Access Request Service (DARS).
Feedback
We welcome your feedback. If you want to share your comments and experience of using our artificial data sets, or would like to suggest ways to improve this content, please use our artificial data feedback form.
Register for updates
Be the first to hear about new artificial data sets when they are made available.
Further information
Open data is data that can be used and shared by anyone, for any purpose. We make this data publicly available to improve transparency in health and care.
Hospital Episode Statistics (HES) is a curated data product containing details about admissions, outpatient appointments and historical accident and emergency attendances at NHS hospitals in England.
The Hospital Episode Statistics (HES) Data Dictionary is intended for use by all users of HES data. An NHS data dictionary works in the same way as a normal dictionary, but contains information about data items.
Last edited: 24 January 2025 3:58 pm