The Simulacrum

The Simulacrum is a synthetic cancer data set that contains artificial patient-like cancer data to help researchers gain insights into cancer in England.

Introduction

The Simulacrum is synthetic cancer data that imitates some of the data collected and curated by the National Disease Registration Service (NDRS). The Simulacrum looks and feels like the real cancer data, but contains no private patient information and is entirely made up of artificial patient records. It is free to use and gives anyone access to realistic record-level cancer data, with no danger of breaching patient confidentiality.

The Simulacrum preserves the data structure and some statistical properties of the real NDRS data with a high degree of accuracy, meaning it can be used to support research on NDRS data. However, since the Simulacrum only approximates the original data, analysis results derived from the Simulacrum should not be used for clinical decision-making.

The Simulacrum was developed and built by Health Data Insight CIC (HDI). There are two versions of Simulacrum; the most recent version, Simulacrum v2, was released in April 2023 and includes synthetic versions of subsets of:

  • the National Cancer Registration Data set (NCRD)
  • the Radiotherapy Data set (RTDS)
  • the Systemic Anti-Cancer Therapy Data set (SACT)
  • the molecular diagnostic testing data set

Using the Simulacrum

The Simulacrum can be used to:

  1. Learn about NDRS data by exploring the Simulacrum to understand NDRS table structures, table linkages and the information included in NDRS data variables
  2. Run preliminary analyses on the Simulacrum and get an approximate idea of what analysis results could look like if run on NDRS data
  3. Write and test analysis code on the Simulacrum that can then be run on the real NDRS data

This can be helpful if you are considering applying for access to NDRS data. For example, it enables you to:

  • understand whether NDRS data is a suitable source for your research
  • gain knowledge about NDRS data that can help you with writing and submitting a data access request application
  • make use of the time spent waiting for access by becoming familiar with NDRS data and starting to write analysis code

It is also possible to request that code written on the Simulacrum is run on the real data to produce anonymous outputs. Anonymous outputs can then be safely released without risk to patient privacy, enabling research using NDRS data without anyone needing direct access to the original data records.
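As a rough illustration of this workflow, the sketch below shows analysis code that links two tables and returns only aggregate counts rather than record-level rows. It is a minimal example under stated assumptions, not the Simulacrum's actual schema: the table names (`patient`, `tumour`), the column names (`patient_id`, `tumour_id`, `gender`, `site_code`) and the toy rows are all hypothetical and chosen purely for illustration. Refer to the Simulacrum documentation for the real table and variable names.

```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with made-up rows.

    The schema here is hypothetical; it only mimics the idea of a
    patient table linked to a tumour table by a patient identifier.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (
            patient_id INTEGER PRIMARY KEY,
            gender     TEXT
        );
        CREATE TABLE tumour (
            tumour_id  INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES patient(patient_id),
            site_code  TEXT
        );
        INSERT INTO patient VALUES (1, 'F'), (2, 'M'), (3, 'F');
        INSERT INTO tumour VALUES
            (10, 1, 'C50'), (11, 2, 'C34'), (12, 3, 'C50'), (13, 3, 'C18');
        """
    )
    return conn


def count_tumours_by_site(conn):
    """Return (site_code, n_tumours, n_patients) tuples, one per site.

    The query joins the two tables and aggregates, so the result
    contains counts only and no individual patient rows.
    """
    query = """
        SELECT t.site_code,
               COUNT(*)                     AS n_tumours,
               COUNT(DISTINCT p.patient_id) AS n_patients
        FROM tumour AS t
        JOIN patient AS p ON p.patient_id = t.patient_id
        GROUP BY t.site_code
        ORDER BY t.site_code
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_toy_database()
    for site, n_tumours, n_patients in count_tumours_by_site(conn):
        print(site, n_tumours, n_patients)
```

Because the function takes a database connection as its only input, the same code could in principle be developed against synthetic data and later pointed at a different data source with the same structure. Whether a particular output is suitable for release is decided as part of the code submission process, not by the code itself.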

Click on the box below to learn more about how the Simulacrum can be used to support research on NDRS data and the process for submitting code written using Simulacrum data.


Find out more

To learn more about the Simulacrum, download it and start analysing it, please go to the Simulacrum website.

You can also find out more by watching our Simulacrum bitesize video. There you can learn about the data Simulacrum includes, how it was generated, its data quality and general use cases.

To learn more about how to formulate queries using the Simulacrum, please refer to HDI’s Simulacrum User Guide or NDRS’s guide to getting started with SQL and extracting data.

Last edited: 31 January 2025 11:39 am