Guide to using the Simulacrum to support NDRS data requests


Background

The Simulacrum is synthetic cancer data which imitates some of the data held securely by the National Disease Registration Service (NDRS) within the National Health Service (NHS) England. The data collected by NDRS was held in a database called the Cancer Analysis System (CAS). The Simulacrum looks and feels like the real cancer data held within the CAS but does not contain any real patient information. Anyone can use it to learn more about cancer in England without compromising patient privacy. Also, because the Simulacrum data schema is the same as the real one in the CAS, the Simulacrum can be used to write and test code to run queries that, with the right permissions and ethical approval, can be run on the real data.

 

The Simulacrum maintains many of the statistical properties of the original data with a high degree of accuracy, for example, the distributions of individual variables and the correlations between variables in the data. This means that one can run queries on the Simulacrum and get a preliminary idea of what the results would look like if the same queries were run on the real data. However, there are limitations: the more complex the data query, the more approximate the results. For this reason, results from the Simulacrum should not be used for clinical decision-making.

 

Instead, the Simulacrum can be used by researchers to learn about NDRS data and assess whether it is a suitable data source for their research before requesting data access. Researchers can also plan their analysis and write code using the Simulacrum to produce analysis outputs from the data, before making a request for a data release. Because the Simulacrum has the same data schema and structure as NDRS data, this code can be run on the real data to produce real data outputs. Such requests can be made directly to the NDRS analytical team or through NHS England’s Data Access Request Service (DARS).

 

The Simulacrum was developed by Health Data Insight (HDI) CIC in partnership with NDRS. There have been multiple releases of the Simulacrum based on different subsets of NDRS data, which are freely available for download from the Simulacrum website. HDI also provide a data request service for bespoke analysis.

 

The purpose of this document is to provide external researchers, including charities, academics, NHS organisations and industry partners, with a clear understanding of the Simulacrum data and guidance on how to write code on Simulacrum data and submit it alongside a request for data release. It is an updated version of the previous NCRAS Guide to using the Simulacrum and submitting code.

 

For more detailed guidance on how to formulate queries on the Simulacrum data, please refer to the Simulacrum User Guide.

 

 

Last edited: 19 February 2025 11:06 am