Artificial data pilot

Artificial data sets provide users with large volumes of data that share some of the characteristics of real data while protecting patient confidentiality. They are designed to model the structure of real data but are completely artificial – they do not contain any actual patient records. We are piloting this new service with a limited number of artificial data sets.

About the pilot

We hold over 200 healthcare data sets. These data sets are fundamental to the analysis and research that accelerate the discovery of new treatments and help the NHS to plan better services.

Since patient data is sensitive, only approved users, with approved projects, are allowed to access and analyse data. As such, applying for access can be a lengthy process. Prior to receiving access, it can be difficult for prospective data users to know what:

fields are present
the data looks like
data sets are available

We are piloting this new service to help users understand the columns, data types and approximate value ranges that would appear in real data. Whilst our artificial data sets can be used to improve understanding of data sets and data platforms, and for building and testing data pipelines, they cannot be used for analysis, because insights and statistical models derived from them would not transfer to real data.

They can be used in advance of or, for certain use cases, instead of real data, which is only released to requestors where there is an appropriate legal basis. This reduces bottlenecks for new users of data and data platforms and minimises the amount of personal data being processed for specific tasks.

This artificial data represents the formatting, structure and volume of the original data set. It does not preserve relationships between fields and it is not possible to use artificial data to reidentify individuals.

During this pilot we hope to understand how artificial data can be valuable to our users, identify other datasets for which it would be useful to provide artificial extracts, and gather further feedback on how to enhance the service.

Anticipated benefits

Users can collaborate and share code across different environments where real data is not accessible, minimising the risk to patient privacy
Using artificial data minimises the use of real data during development and testing work
Users can develop and test code prior to accessing real data
Artificial data can be used for teaching purposes to train the next generation of data engineers, scientists and clinicians

Access artificial data sets

There are currently 3 artificial data sets available from Hospital Episode Statistics (HES). All artificial data is generated entirely from anonymous, non-identifiable aggregated data.

What is the difference between a sample and a full data set?

A sample data set contains 10,000 rows and a full data set contains a million rows.

Artificial data - HES user notice (txt, 496 bytes)
Artificial HES A&E sample (txt, 24.3MB)
Artificial HES A&E full (txt, 2.4GB)
Artificial HES Admitted Patient Care sample (txt, 43.9MB)
Artificial HES Admitted Patient Care full (txt, 4.3GB)
Artificial HES Outpatient sample (txt, 28.5MB)
Artificial HES Outpatient full (txt, 2.8GB)
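
Once a sample file has been downloaded, a first step might be to check which columns it contains and what their approximate value ranges are. The sketch below is illustrative only: it assumes the file is delimited text with a header row, and the delimiter, the column names and the `profile_columns` helper are all hypothetical rather than taken from the HES files themselves.

```python
import csv
import io
from collections import defaultdict


def profile_columns(lines, delimiter=","):
    """Summarise each column of a delimited text stream that has a header row."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    stats = defaultdict(
        lambda: {"non_empty": 0, "numeric": 0, "min": None, "max": None}
    )
    for row in reader:
        for name, value in row.items():
            # Skip overflow fields (keyed under None) and empty or missing values.
            if name is None or not value:
                continue
            col = stats[name]
            col["non_empty"] += 1
            try:
                number = float(value)
            except ValueError:
                continue  # not numeric, so no range to track
            col["numeric"] += 1
            col["min"] = number if col["min"] is None else min(col["min"], number)
            col["max"] = number if col["max"] is None else max(col["max"], number)
    return dict(stats)


# Hypothetical three-row extract; the column names are illustrative only.
example = io.StringIO("AGE,SEX,DEPT\n5,F,geriatrics\n49,M,cardiology\n,F,cardiology\n")
summary = profile_columns(example)
for name, col in summary.items():
    print(name, col)
```

For a file on disk, the same function could be given an open file handle, for example `open(path, newline="")`, so that a full data set is read one row at a time rather than loaded into memory.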

Artificial data generator

Explore the code used to generate this artificial data. This codebase could be adapted to generate your own artificial data.

Artificial data generated by this method imitates some of the statistical properties of real data, but never contains any real patient data.

Image description

This diagram shows how data generation is completely isolated from the aggregation, with a review and sign-off process separating the 2 steps.

Boxes on the left with a blue dashed border represent different scopes where real data is accessed. Within each scope, only a single data set is accessible and is completely isolated from other data sets.
Data flows from left to right through the metadata scraper to produce anonymous aggregates for each data set.
The red box to the right of centre represents the disclosure control process which has been agreed with our chief statistician and Statistical Disclosure Control Panel. Aggregate data is blocked from exiting the scope where it was created if the checks are not passed.
Once all checks are passed, the data flows to the circle on the right of the image which represents the data generator.
Finally, anonymous artificial data sets flow out of the data generator to be safely shared with users.

The artificial data generator uses aggregate, anonymised data to randomly create artificial records. Statistical relationships between columns are not preserved, meaning that individual records are not an accurate representation of individual real records. Any attempt to reverse engineer the artificial data generator would only yield the aggregate statistics on which the artificial data is based.

For example, consider a data set whose records describe genders, ages, ICD-10 diagnoses, and appointment dates and times. Each artificial record would be generated by randomly picking a value for each field independently and then putting these together to form a ‘complete record’. Most records generated in such a way would be unrealistic. For example, an artificial patient record may have ‘geriatrics’ as the department being visited, and an age of 5 years old.

Only by coincidence would some artificial records resemble real records.

How the data is generated

The artificial data generator randomly generates artificial data by sampling from anonymous univariate frequency distributions derived from real data.

Diagram: Overview of process to generate artificial data

Image description

The diagram illustrates the main steps and outputs in creating the aggregates of real data and using them to generate artificial data.

The diagram flows from top to bottom. The first box represents the original data set.
Each column is aggregated independently by the metadata scraper in the second box (yellow).
The third box shows the univariate frequency distributions created by the metadata scraper. Each field is completely independent, and no relationships between fields are preserved in this aggregated data.
The fourth box (red) represents the disclosure control process, which ensures that the aggregate data is non-identifiable.
The fifth box (green) represents the step which generates the data. Anonymous aggregates flow into this process to independently generate fields in the artificial data. Post processing is applied to make the artificial records appear more realistic.
Finally, the generated fields are combined to form ‘complete records’.

Aggregation and anonymisation

The original data is aggregated on a column-by-column basis, with each column treated independently. The outputs of this stage are frequency tables of unique values in each column. At this stage key identifiers (such as patient ID) are removed and small numbers are suppressed to prevent reidentification at a later stage.

The aggregated data is anonymous, with each column aggregated independently of the others. For example, the aggregates could be used to approximately determine the number of female patients or the number of 49-year-old patients in the real data, but not the number of 49-year-old females.
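
The aggregation step described above can be sketched roughly as follows. This is not the code used by the service: the field names, the `aggregate_columns` helper and the suppression threshold of 5 are assumptions made for illustration, and dropping small counts entirely is only one possible way of suppressing them.

```python
from collections import Counter

# Hypothetical threshold: counts below this are suppressed. The actual
# disclosure control rules are not described in this document.
SUPPRESSION_THRESHOLD = 5


def aggregate_columns(records, identifier_fields=("PATIENT_ID",),
                      threshold=SUPPRESSION_THRESHOLD):
    """Build one frequency table per column, treating every column independently."""
    tables = {}
    for record in records:
        for field, value in record.items():
            if field in identifier_fields:
                continue  # key identifiers are removed, not aggregated
            tables.setdefault(field, Counter())[value] += 1
    # Suppress small counts so that rare values cannot single anyone out.
    return {
        field: {value: n for value, n in counts.items() if n >= threshold}
        for field, counts in tables.items()
    }


# Hypothetical records: 6 females aged 49, 6 males aged 30, 1 male aged 101.
records = (
    [{"PATIENT_ID": i, "SEX": "F", "AGE": 49} for i in range(6)]
    + [{"PATIENT_ID": 100 + i, "SEX": "M", "AGE": 30} for i in range(6)]
    + [{"PATIENT_ID": 999, "SEX": "M", "AGE": 101}]
)
aggregates = aggregate_columns(records)
print(aggregates)
```

The result holds a count of females and a count of 49-year-olds in separate tables, but nothing that links the two, and the single 101-year-old does not appear at all.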

Random data generation

The aggregated data is used to randomly generate artificial data. Each value in each record is generated at random, independently of the values in the other columns. For example, artificial records assigned a relatively low age may have a geriatric diagnosis code. This lack of record-level realism ensures the artificial data is anonymous so that it cannot be used to reidentify individuals. It also means that it would not be possible to gain insights or build statistical models that would transfer onto real data.

Since the artificial data is generated using anonymised data, the artificial data itself is anonymous – it only contains as much information as the anonymous aggregate statistics from which it is derived.
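
A minimal sketch of this kind of generation, assuming the aggregates are available as simple value-to-count tables, might look like the following. The `generate_records` function, the field names and the counts are all hypothetical and are not taken from the published generator.

```python
import random


def generate_records(frequency_tables, n_records, seed=None):
    """Sample every field independently from its own frequency table."""
    rng = random.Random(seed)
    columns = {}
    for field, table in frequency_tables.items():
        values = list(table)
        weights = list(table.values())
        # Each column is drawn on its own, weighted by the aggregate counts.
        columns[field] = rng.choices(values, weights=weights, k=n_records)
    # Combine the independently generated columns into 'complete records'.
    return [
        {field: columns[field][i] for field in columns}
        for i in range(n_records)
    ]


# Hypothetical anonymous aggregates (value -> count) for two fields.
frequency_tables = {
    "AGE": {5: 20, 49: 50, 80: 30},
    "DEPT": {"paediatrics": 20, "cardiology": 50, "geriatrics": 30},
}
artificial = generate_records(frequency_tables, n_records=1000, seed=0)
print(artificial[:3])
```

Because the columns are sampled separately, each column on its own roughly follows its frequency table, while combinations such as a 5-year-old visiting geriatrics occur simply by chance.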

Post processing

Once artificial data has been generated, some basic checks and rules are applied to make the data appear more realistic.

For example, randomly generated birth and death dates may be swapped to ensure sensible ordering.

Finally, randomly generated dummy values for identifying fields (such as ‘patient ID’) are added. These fields are generated based on template patterns structured to ensure unrealistic values with the correct properties (for example, data type and length).
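
The two post processing rules mentioned above could be sketched as below. The field names, the template syntax (where `9` stands for a random digit and `A` for a random uppercase letter) and the `ZZ-` prefix are assumptions chosen for illustration, not the patterns used by the service.

```python
import random
import string
from datetime import date


def order_dates(record, earlier="BIRTH_DATE", later="DEATH_DATE"):
    """Return a copy of the record with the two dates swapped if out of order."""
    fixed = dict(record)
    first, second = fixed.get(earlier), fixed.get(later)
    if first is not None and second is not None and first > second:
        fixed[earlier], fixed[later] = second, first
    return fixed


def dummy_identifier(template, rng):
    """Fill a template: 'A' -> random uppercase letter, '9' -> random digit.

    Any other character is copied through unchanged.
    """
    out = []
    for ch in template:
        if ch == "A":
            out.append(rng.choice(string.ascii_uppercase))
        elif ch == "9":
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)
# Hypothetical generated record whose dates came out in the wrong order.
record = {"BIRTH_DATE": date(2001, 5, 17), "DEATH_DATE": date(1950, 1, 2)}
record = order_dates(record)
record["PATIENT_ID"] = dummy_identifier("ZZ-999999", rng)
print(record)
```

A fixed, obviously artificial prefix is one way a template can keep the right data type and length while making sure the value could not be mistaken for a real identifier.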

Data Access Request Service

To apply for access to real healthcare data please visit our Data Access Request Service (DARS).

Feedback

We welcome your feedback. If you want to share your comments and experience of using our artificial data sets, or would like to suggest ways to improve this content, please use our artificial data feedback form.

Register for updates

Be the first to hear about new artificial data sets when they are made available.

Subscribe for updates

Further information

Supporting open data and transparency

Open data is data that can be used and shared by anyone, for any purpose. We make this data publicly available to improve transparency in health and care.

Hospital Episode Statistics (HES)

Hospital Episode Statistics (HES) is a curated data product containing details about admissions, outpatient appointments and historical accident and emergency attendances at NHS hospitals in England.

Hospital Episode Statistics Data Dictionary

The Hospital Episode Statistics (HES) Data Dictionary is intended for use by all users of HES data. An NHS data dictionary works in the same way as a normal dictionary, but contains information about data items.

Last edited: 24 January 2025 3:58 pm