
Part of Introduction to the Secure Data Environment

Data sets and Databricks



Overview

This chapter covers how to use data sets with Databricks in NHS England’s Secure Data Environment (SDE).


Data sets and Databricks

In the SDE, the main way to access the data is through software called Databricks. RStudio is also available, which is covered in the next chapter.

Other tools, such as Stata, may be made available in the environment where licensing fees apply. If you would like to learn more about optional tools, you can email the service team at [email protected].
 

Transcript for the data sets and Databricks video

Slide 1

In this video, I'm going to give you an overview of the data sets available in the SDE and how to use Databricks. Please be aware that the interface of Databricks may have changed slightly since this video was recorded.

Slide 2

The data held in the SDE is safe as it has been de-identified and is used in line with minimisation principles to protect patient confidentiality.

Slide 3

Databricks is where the data is stored and is the main way to access it within the SDE. This includes NHS data, which is read-only, and a collaboration database with read and write permissions for users to save tables.

Slide 4

To access the data, first open Databricks by selecting the Databricks icon on the virtual desktop. You'll then be asked to log in and complete two-factor authentication. The data is viewable in the main menu pane labelled ‘catalog’. The data sets available to you are based on your organisation’s Data Sharing Agreement (DSA) with NHS England. NHS data is held in the read-only ‘dars_nic’ database. There is one collaboration database in the environment, ‘dsa_collab’, which enables SDE users to save tables for all colleagues to see and access. Reference data used for NHS England publications, such as ethnicity and SNOMED code mappings, is stored in the ‘dss_corporate’ database. The code scripts section below this video demonstrates how to show the databases and tables, and how to save tables.

Slide 5

Slide 6

The workspaces section is where you store notebooks in Databricks.

A notebook is a coding environment made up of cells, similar to Jupyter or R notebooks. You can save code in these cells, view data and create visualisations. The further links section below this video contains guides on how to use notebooks within Databricks.

You have a personal workspace which only you can access by default. The shared workspace is shared with all users on your organisation's SDE Environment. You can place notebooks in the shared workspace and other users will be able to see the cell contents and outputs without needing to rerun the notebook.

Repos are how you interact with GitLab from Databricks. This is the recommended way of sharing code as it allows full version control. The repos section is where code cloned from GitLab is stored, although you first need to set up a connection.

Slide 7

Slide 8

The main menu pane ‘Workflow’ is where Databricks jobs can be created.

Workflows allow you to run code on a specified schedule. More information on Databricks workflows can be found in the further links section below.

Slide 9

Slide 10

The compute pane on the main menu gives you information on the available clusters.

To run code, you will need to ensure the cluster is running and then connect your notebooks to it. More information on Databricks compute can be found in the further links section below.

Slide 11

This brings us to the end of this video.


When does the data get updated? 

Different data sets and data pipelines have their own external update schedules, which might be daily, weekly, monthly or even quarterly. In the SDE, the data is refreshed at the end of every month if an update is available.
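If you need to check when a specific table was last refreshed, you can look at its change history. The example below is a minimal sketch: it assumes the table is stored in Delta format (as the tables created in the code scripts later in this chapter are) and uses the ‘dss_corporate’ reference table purely as an illustration.

%sql

-- Delta tables keep a change history; the most recent entry shows the latest update
DESCRIBE HISTORY dss_corporate.calendar_financial_year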


Analysing data in Databricks

Now that you know how to access data within the SDE, we will show you how to use Databricks to view and analyse this data.

Databricks is a notebook environment that contains cells, each of which can be used to write programming code, query the databases and visualise outputs.
You can view and analyse data within Databricks by using SQL and Python or you can connect to Databricks using RStudio, which is covered in the next chapter.

Within Databricks, you can also use PySpark, the Python interface to Apache Spark. This can utilise the distributed processing of Databricks to query large numbers of records. You are also able to use many native Python packages within Databricks notebooks themselves.
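As an illustration, the sketch below uses the ‘dss_corporate.calendar_financial_year’ reference table from the code scripts section. It runs a distributed PySpark query and aggregation, then converts the small result into a pandas DataFrame so native Python packages can be used on it. The column name used in the aggregation is illustrative, so check the table schema before running it.

%python

# Query the reference table with Spark - this work is distributed across the cluster
calendar_df = spark.sql("SELECT * FROM dss_corporate.calendar_financial_year")

# Aggregate with PySpark before bringing anything back to the driver
# (the column name below is illustrative - check the table schema first)
summary_df = calendar_df.groupBy("financial_year").count()

# Convert the small aggregated result to pandas so native Python packages can be used
summary_pd = summary_df.toPandas()
print(summary_pd)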


Code scripts for accessing data in Databricks

Available databases

%sql

SHOW DATABASES

 

%python

display(spark.sql("show databases"))

Available tables

%sql

SHOW TABLES IN dss_corporate

SELECT * FROM dss_corporate.calendar_financial_year

 

%python

display(spark.sql("show tables in dss_corporate"))

reference_table = spark.sql("select * from dss_corporate.calendar_financial_year")

display(reference_table)

Saving tables to a database

%sql

CREATE OR REPLACE TABLE demo_collab.demo_test (pseudo_id INT, pseudo_name STRING, department STRING) USING DELTA

 

%sql

INSERT INTO demo_collab.demo_test (pseudo_id, pseudo_name, department)

VALUES

(1, 'John Doe', 'A'),

(2, 'Jane Smith', 'B'),

(3, 'Michael Johnson', 'C')

 

%sql

SELECT * FROM demo_collab.demo_test

 

%python

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([

StructField("pseudo_id", IntegerType(), True),

StructField("pseudo_name", StringType(), True),

StructField("department", StringType(), True)

])

sampleData = [

(1, 'John Doe', 'A'),

(2, 'Jane Smith', 'B'),

(3, 'Michael Johnson', 'C')

]

test_df = spark.createDataFrame(sampleData, schema = schema)

test_df.write.format("delta").saveAsTable("demo_collab.test_pyspark_table")

display(test_df)
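You can also save the results of a query straight into the collaboration database without building a DataFrame by hand. The sketch below is illustrative: it reuses the ‘demo_collab’ database and the ‘dss_corporate’ reference table from the examples above, so adjust the database and table names to match your own DSA.

%python

# Run a query against the read-only reference data
result_df = spark.sql("SELECT * FROM dss_corporate.calendar_financial_year")

# Save the result as a Delta table in the collaboration database;
# mode("overwrite") replaces the table if it already exists
result_df.write.format("delta").mode("overwrite").saveAsTable("demo_collab.calendar_copy")

display(result_df)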


Further optional reading

