Part of Guide to using the Simulacrum to support NDRS data requests

Characteristics of the Simulacrum and how it can be used

The Simulacrum contains data about synthetic patients and their tumour diagnoses, treatments and molecular diagnostic tests. There are currently two available versions of the Simulacrum. The most recent version, Simulacrum v2.1.0 (released April 2023), contains data on synthetic patients diagnosed between 2016 and 2019. It includes patient characteristics, such as age and gender, and data about their synthetic tumours, such as staging and pathology information (simulated from the National Cancer Registration Dataset). As in real life, synthetic patients can have multiple tumours. The vital status of each synthetic patient has also been simulated, so researchers can analyse survival using the Simulacrum data.

Synthetic records for the patients’ treatments and somatic (tumour genetic) tests have also been simulated. These include details about the Systemic Anti-Cancer Therapy (SACT) treatments (most commonly chemotherapy), radiotherapy treatments and somatic genomic tests received. With this data, researchers can analyse the treatments following diagnosis and the types of genetic mutations and aberrations seen in tumours.

The Simulacrum preserves structural properties of the real data, such as the data schema, table structures and linkages between tables. It also preserves many of the statistical properties of the real data with a high degree of accuracy, for example the statistical distribution of values within each data variable and strong correlations in the data. For instance, the Simulacrum largely captures the correlation between cancer site and gender: breast cancer patients are typically female, while lung cancer patients are roughly evenly split between male and female, as seen in the real data. However, as the statistical properties are only approximately preserved, the Simulacrum will not always reflect the real data. For example, synthetic male ovarian cancer patients appear in the Simulacrum in larger numbers than in the real data.

Therefore, Simulacrum data should not be used to answer epidemiological questions or make clinical decisions as answers will only be approximate and may not be reflective of answers derived from the real data. Instead, it can be used to support the preparation of hypotheses and the structuring of questions, so the questions can later be asked using real patient data on the CAS. We outline three main use cases for researchers:

Learn about NDRS data through data exploration.
Run preliminary analysis and feasibility tests.
Write and test code for analysis.

Learn about NDRS data

Researchers can explore Simulacrum data to learn more about NDRS data, for example, its table structures, table linkages and the information included in NDRS data variables. This provides a more detailed view than looking at the available data dictionaries online.
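As a minimal sketch of this kind of exploration, the Python below checks how a patient table and a tumour table link together and how many tumours each patient has. The tiny in-memory tables and the column names (PATIENTID, TUMOURID, GENDER, SITE) are assumptions made for illustration only; check the actual table and column names against the Simulacrum release you download.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for two linked tables. The column names are
# assumptions for illustration, not a statement of the real schema.
PATIENT_CSV = """PATIENTID,GENDER
1,2
2,1
3,2
"""
TUMOUR_CSV = """TUMOURID,PATIENTID,SITE
10,1,C50
11,1,C34
12,2,C34
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


patients = read_rows(PATIENT_CSV)
tumours = read_rows(TUMOUR_CSV)

# Count tumour records per patient through the shared PATIENTID key.
tumours_per_patient = Counter(row["PATIENTID"] for row in tumours)

# Check the linkage in both directions: patients with no tumour row,
# and tumour rows that point at no patient.
patient_ids = {row["PATIENTID"] for row in patients}
patients_without_tumour = sorted(patient_ids - set(tumours_per_patient))
orphan_tumours = [
    row["TUMOURID"] for row in tumours if row["PATIENTID"] not in patient_ids
]

print(dict(tumours_per_patient))
print(patients_without_tumour, orphan_tumours)
```

With real files, `read_rows` would be replaced by opening the downloaded CSV files; the linkage checks themselves would stay the same.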

It’s worth noting that, since each version of the Simulacrum is based on a snapshot of the CAS, its data structure can differ from the newest snapshot of the CAS due to changes over time. These may include changes to table structures, data variable names and values. The differences are likely to be small and can be easily adjusted for when making data requests. For information on where current differences exist, please get in touch with HDI at [email protected].

Preliminary analysis and feasibility tests

Researchers can also run feasibility tests on the Simulacrum to see whether NDRS data is a suitable resource for their research. Examples of feasibility questions that the Simulacrum can help to answer include:

what is the completeness of a data variable of interest?
is a particular SACT drug or somatic genomic test recorded in CAS?
what are the morphology codes recorded in CAS for specific cancers?
can cancer cohorts not routinely reported on be defined in CAS?

Such information can be useful for researchers in advance of applying for access to NDRS data through DARS. It can help them to decide if NDRS data is suitable for their research and then to plan their DARS application. By answering such questions, it is possible to decide which data items are needed and define the patient cohort of interest for analysis.
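The first feasibility question above, the completeness of a data variable, can be sketched as follows. This is an illustrative example rather than a prescribed method: the column name STAGE_BEST and the set of values treated as missing are assumptions, so adjust both to the variable you are interested in.

```python
import csv
import io

# Hypothetical extract of a tumour table. STAGE_BEST is an assumed column
# name, and the values counted as "missing" are an assumption too.
TUMOUR_CSV = """TUMOURID,STAGE_BEST
10,2
11,
12,?
13,4
"""
MISSING_VALUES = {"", "?"}


def completeness(rows, column, missing=MISSING_VALUES):
    """Return the fraction of rows whose value in `column` is not a missing marker."""
    rows = list(rows)
    if not rows:
        return 0.0
    known = sum(1 for row in rows if (row[column] or "").strip() not in missing)
    return known / len(rows)


rows = list(csv.DictReader(io.StringIO(TUMOUR_CSV)))
share = completeness(rows, "STAGE_BEST")
print(f"STAGE_BEST completeness: {share:.0%}")
```

Because the Simulacrum only approximates the real data, a figure produced this way is an indication of likely completeness in NDRS data, not an exact value.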

Researchers can also run preliminary analyses to get an approximate idea of what analysis results might look like if run on NDRS data. For “simple” analyses, results are quite accurate when compared to the real data. However, the more “complex” the analysis, the more approximate the results are. For this reason, analysis on the Simulacrum should never be used for clinical decision-making.

Write and test code for analysis

Researchers can also use the Simulacrum to write and test analysis code. Because the Simulacrum has a similar data structure to NDRS data, once researchers have written their code and refined their queries, they can request that the code is run on the CAS data to produce real results. Where there are minor differences between the Simulacrum and the newest CAS data snapshot, the code can be adjusted by the analyst processing the request. Provided the relevant ethical and legal requirements are met, these results can then be released.
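One way to make such adjustments easier is to keep table and column names in a single place rather than scattering them through the analysis. The sketch below is a suggestion only, not an NDRS submission requirement, and every column name in it is a hypothetical placeholder.

```python
from collections import Counter

# Hypothetical column names, gathered in one mapping so that a rename in a
# newer data snapshot needs a change in only one place.
COLUMNS = {"site": "SITE", "gender": "GENDER"}


def count_by_site_and_gender(rows, columns=COLUMNS):
    """Count records for each (cancer site, gender) pair."""
    site, gender = columns["site"], columns["gender"]
    return Counter((row[site], row[gender]) for row in rows)


# Illustrative rows using the default (assumed) column names.
sample = [
    {"SITE": "C50", "GENDER": "2"},
    {"SITE": "C50", "GENDER": "2"},
    {"SITE": "C34", "GENDER": "1"},
]
counts = count_by_site_and_gender(sample)

# The same analysis on data whose columns have been renamed: only the
# mapping changes, not the analysis function.
renamed = [{"SITE_CODE": "C50", "SEX": "2"}]
counts_renamed = count_by_site_and_gender(
    renamed, {"site": "SITE_CODE", "gender": "SEX"}
)

print(counts)
print(counts_renamed)
```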

The “Submitting code to NDRS for a data release” section outlines the process for submitting code, while the “Developing code using Simulacrum for a data release request” section provides advice on writing code that fulfils submission requirements. For guidance on formulating analysis to produce optimal outputs from CAS data, please refer to the Simulacrum User Guide.

Last edited: 15 April 2024 11:26 am