Part of Guide to using the Simulacrum to support NDRS data requests
Developing code using Simulacrum for a data release request
This section provides guidance on how best to develop code using Simulacrum to supplement requests to the NDRS analytical team for an anonymous data release. It also describes alternative routes for data release where code does not produce anonymous data outputs. These routes are also outlined in the section on the decision tree for using the Simulacrum and submitting code to NDRS.
The NDRS analytical team are unfortunately unable to help researchers understand the Simulacrum data or to provide detailed technical advice. For help with formulating an analysis on the Simulacrum, please refer to the Simulacrum User Guide or contact HDI at [email protected]. The Simulacrum User Guide provides examples of data queries, advice on how to link tables and some considerations to bear in mind when querying the data, including some data quality aspects. For specific examples of SQL queries on the CAS data, please refer to the NCRAS SQL query guide.
When processing a request supplemented by Simulacrum code, an NDRS analyst will need to interpret the code, run it on the real data and then quality assure the outputs for validity and privacy. Therefore, the researcher should ensure that the code is easily usable by the NDRS analyst, i.e., the code should:
- be written in an appropriate analytical programming language,
- be clearly structured and interpretable,
- produce anonymous data outputs.
Programming language
While researchers can analyse the Simulacrum using their preferred analytical package, if they want to request that NDRS run their code on the real NDRS data, the code must be written in an appropriate programming language that is currently used by the NDRS analytical team.
Since the NDRS data is held in an Oracle SQL database called the Cancer Analysis System (CAS), code for data extraction from the CAS must be written in PL/SQL. For analytical modelling of data that has already been extracted from the CAS, code must be written in R or Stata, which are the standard languages that NDRS analysts use.
Structured and interpretable code
To minimise the time taken for the overall process, the researcher should prepare code that is error-free, clearly set out and interpretable, e.g., by commenting code and providing an analytical plan. This will allow NDRS analysts to easily understand and adapt the code to extract data from the database and run data analysis.
Anonymous outputs
The researcher submitting code should ensure, to the best of their ability, that the data outputs produced will be anonymous and therefore releasable by the NDRS analytical team.
Anonymity assessments should be done according to the ISB Anonymisation Standard, which describes the standard anonymisation processes for health and social care data and how to assess the risk of extra information being used to try to reveal the identity of individuals. It includes a set of standard anonymisation plans that can be used to reduce this risk and to ensure the release of non-identifying data. The plan used depends on the type of data outputs being released, which fall into two categories: aggregate data or individual level data. For example, for individual level data, a common anonymisation plan would follow “Plan 3”, whereby the data are derived to “weak” k-anonymity by reducing the detail in indirect identifiers. If the code produces NDRS data outputs that do not pass the standard for anonymity, then the request will either be rejected or require statistical disclosure control to be applied before release, and it may be re-directed to DARS.
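As an illustration of what reducing the detail in indirect identifiers can mean in practice, the following is a minimal sketch of a generic k-anonymity check, written in Python purely for the researcher's own assessment of outputs. The column names, the example rows, the 10-year age bands and the value k = 3 are all hypothetical assumptions; the ISB Anonymisation Standard, not this sketch, defines what is required for a given release.

```python
from collections import Counter


def is_k_anonymous(rows, indirect_identifiers, k):
    """Return True if every combination of indirect-identifier values
    is shared by at least k rows."""
    groups = Counter(
        tuple(row[col] for col in indirect_identifiers) for row in rows
    )
    return all(size >= k for size in groups.values())


def age_band(age):
    """Reduce detail in age by mapping it to a 10-year band, e.g. 64 -> '60-69'."""
    lower = age // 10 * 10
    return f"{lower}-{lower + 9}"


# Hypothetical individual level rows (not real or Simulacrum data).
rows = [
    {"age": 61, "sex": "F", "site": "C50"},
    {"age": 64, "sex": "F", "site": "C50"},
    {"age": 67, "sex": "F", "site": "C50"},
    {"age": 72, "sex": "M", "site": "C61"},
    {"age": 75, "sex": "M", "site": "C61"},
    {"age": 78, "sex": "M", "site": "C61"},
]

# The same rows with exact age replaced by an age band.
banded = [
    {"age_band": age_band(r["age"]), "sex": r["sex"], "site": r["site"]}
    for r in rows
]

K = 3  # assumed value, for illustration only

raw_ok = is_k_anonymous(rows, ["age", "sex", "site"], K)
banded_ok = is_k_anonymous(banded, ["age_band", "sex", "site"], K)
print(raw_ok, banded_ok)
```

With exact ages every row is unique, so the check fails; once ages are banded, each combination of values is shared by three rows and the check passes for the assumed k.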
It is possible that the exact nature of the data outputs is not known until the code has been run on the real data, so making the assessment can be difficult. However, the NDRS analytical team will only accept requests where the external researcher has sufficiently demonstrated that the data outputs are likely to meet the ISB Anonymisation Standard, e.g., by inspecting outputs produced when the code is run on Simulacrum data.
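For aggregate outputs, one way to inspect what the code produces on Simulacrum data is to list any cells with small counts, since these are the cells most likely to need statistical disclosure control. The sketch below is written in Python and rests on stated assumptions: the table layout, the labels and the threshold of 5 are hypothetical, and treating only non-zero counts below the threshold as a concern is a simplification. The applicable rule comes from the anonymisation plan selected under the ISB Anonymisation Standard.

```python
def small_cells(table, threshold):
    """Return (row_label, column_label, count) for every cell whose count
    is greater than zero but below the threshold."""
    flagged = []
    for row_label, columns in table.items():
        for column_label, count in columns.items():
            if 0 < count < threshold:
                flagged.append((row_label, column_label, count))
    return flagged


# Hypothetical aggregate output: counts by tumour site and sex.
output_table = {
    "C50": {"F": 120, "M": 2},
    "C61": {"F": 0, "M": 95},
}

THRESHOLD = 5  # assumed value, for illustration only

flagged = small_cells(output_table, THRESHOLD)
for row_label, column_label, count in flagged:
    print(f"Small count: {row_label} / {column_label} = {count}")
```

Because the Simulacrum is synthetic, counts obtained from it only indicate where small cells are likely to appear; the real data may differ.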
The researcher should supply their draft anonymity assessment to NDRS along with relevant evidence and justification for the anonymisation plan selected. It will then be reviewed by an NDRS analyst and the NDRS Caldicott Guardian. If the researcher is not able to undertake such an assessment themselves to demonstrate that the requested data is anonymous, or the conclusion is challenged by the NDRS Caldicott Guardian, the researcher should instead either make a formal data release request to DARS or contact HDI at [email protected] to explore alternative routes.
If the external researcher requires data that has a higher risk of being identifiable and that does not meet the ISB Anonymisation Standard (whether aggregate or depersonalised row-level), they will need to make a formal request to DARS.
Last edited: 30 October 2023 9:57 am