Part of Introduction to the Secure Data Environment
Data sets and Databricks
Overview
This chapter covers how to use data sets with Databricks in NHS England’s Secure Data Environment (SDE).
Data sets and Databricks
In the SDE, the main way to access the data is through software called Databricks. RStudio, which is covered in the next chapter, is also available.
Other tools, such as Stata, may be made available in the environment where licensing fees apply. If you would like to learn more about optional tools, you can email the service team at [email protected].
When does the data get updated?
Different data sets and data pipelines have their own external update schedules, which may be daily, weekly, monthly or quarterly. In the SDE, the data is refreshed at the end of each month if an update is available.
Analysing data in Databricks
Now that you know how to access data within the SDE, we will show you how to use Databricks to view and analyse this data.
Databricks is a notebook environment that contains cells, each of which can be used to write programming code, query the databases and visualise outputs.
You can view and analyse data within Databricks using SQL and Python, or you can connect to Databricks from RStudio, which is covered in the next chapter.
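As a rough illustration of combining SQL and Python in a notebook cell, the sketch below builds a simple row-count query and runs it through a Spark session. It assumes the `spark` session object that Databricks notebooks normally provide. The helper names and the table name `example_db.example_table` are hypothetical and are not taken from the SDE; substitute a table you have been granted access to.

```python
import re

# Accept only simple names such as "database.table" (letters, digits, underscores).
# This is a basic safeguard for the sketch, not a complete SQL validator.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_row_count_query(table_name: str) -> str:
    """Return a SQL statement that counts the rows in table_name."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"Unexpected table name: {table_name!r}")
    return f"SELECT COUNT(*) AS row_count FROM {table_name}"


def count_rows(spark, table_name: str) -> int:
    """Run the count query with a Spark session and return the result.

    Assumption: `spark` is the session a Databricks notebook normally provides.
    """
    query = build_row_count_query(table_name)
    first_row = spark.sql(query).first()  # the query returns a single row
    return first_row["row_count"]
```

In a notebook cell this might be called as `count_rows(spark, "example_db.example_table")`.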
Within Databricks, you can also use PySpark, the Python interface to Apache Spark. PySpark can use the distributed processing of Databricks to query large numbers of records. You can also use many Python packages within Databricks notebooks.
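To give a feel for the DataFrame style of querying, here is a minimal sketch that counts how often each value appears in a column and returns the most frequent values. It assumes the `spark` session that Databricks notebooks normally provide; the function name, table name and column name are hypothetical. The grouping and counting run on the cluster, and only the small summary is brought back to the notebook.

```python
def most_common_values(spark, table_name: str, column: str, limit: int = 10):
    """Return the most frequent values of `column` as (value, count) tuples.

    Assumption: `spark` is the session a Databricks notebook normally provides,
    and `table_name` is a table you are permitted to read.
    """
    summary = (
        spark.table(table_name)             # reference the table lazily
        .groupBy(column)                    # group rows by the chosen column
        .count()                            # adds a column named "count"
        .orderBy("count", ascending=False)  # most frequent values first
        .limit(limit)                       # keep only the top rows
    )
    # collect() triggers the computation and returns a list of Row objects.
    return [(row[column], row["count"]) for row in summary.collect()]
```

A call might look like `most_common_values(spark, "example_db.example_table", "region", limit=5)`. Keeping `limit` small avoids pulling a large result set into the notebook.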
Code scripts for accessing data in Databricks
Further optional reading
The links below may be useful.
Using Databricks in SDE
Notebook Basics
Databricks SQL
Databricks Repos
Databricks Workflow
Databricks Compute
Last edited: 15 November 2024 12:53 pm