Part of Using Databricks in the NHS England Secure Data Environment
Best practice and further information
Best practice
The following best practice will help you work efficiently in Databricks while remaining considerate of other SDE users.
Consider others when making changes to tables and notebooks
- do not ‘drop’, ‘truncate’ or ‘delete’ other users’ tables without first consulting them
- do not delete or edit other users’ notebooks without first consulting them
Handle unrecognised folders
If you notice a new folder that you and your colleagues do not recognise, raise a service request on 0300 303 5035 or via email at [email protected].
Use lowercase for table and database names
Avoid uppercase characters in table and database names, as queries referencing them will fail.
Use meaningful identifiers
Prefix your tables and notebooks with a meaningful identifier so that you can easily recognise them. We recommend using your initials or a project identifier code.
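The naming advice above (lowercase names plus a recognisable prefix) can be sketched as a small helper. The function name and the example prefix are illustrative, not part of the SDE guidance:

```python
import re

def make_table_name(prefix: str, description: str) -> str:
    """Build a lowercase, prefixed table name, e.g. 'abc_admissions_2024'.

    'prefix' would typically be your initials or a project identifier code.
    """
    name = f"{prefix}_{description}".lower()
    # Collapse any run of characters that are not lowercase letters or
    # digits into a single underscore, and trim stray underscores.
    return re.sub(r"[^0-9a-z]+", "_", name).strip("_")

print(make_table_name("ABC", "Admissions 2024"))  # abc_admissions_2024
```

Applying the same helper everywhere in a project keeps names consistent and avoids the uppercase characters warned about above.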
Test your code before running
Test your code to make sure it works before running it on Databricks. Use a testing framework such as ‘pytest’, ‘unittest’ or ‘doctest’ so that the code can easily be re-tested later.
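One way to follow this advice is to keep logic in plain functions so a framework such as pytest can check them before the notebook runs. The function and the rounding rule below are a hypothetical example:

```python
def percentage(numerator: int, denominator: int) -> float:
    """Return numerator as a percentage of denominator, to one decimal place."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(100 * numerator / denominator, 1)

# pytest discovers functions named test_*; run with: pytest this_file.py
def test_percentage():
    assert percentage(1, 3) == 33.3
    assert percentage(2, 4) == 50.0
```

Because the tests live alongside the function, they can be re-run cheaply whenever the code changes.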
Allow code to finish running when creating a table
Always allow code to finish running when a table is being created or altered, and do not cancel it part way through. If the run is cancelled, an error will occur and you will be prevented from creating a table with the same name.
Store code centrally for re-use
If you wish to re-use code, add it to a central function or to a notebook of its own. Use the dbutils.notebook.run function or the %run command to re-use the code in other analytical pipelines.
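As a sketch of this pattern, the shared function below would live in a notebook of its own (the notebook path and the function itself are hypothetical), and other notebooks would pull it in with one of the two commands mentioned above:

```python
# Assume this code lives in a shared notebook, e.g. /shared/abc_common_functions.
# Other notebooks can re-use it with:
#   %run /shared/abc_common_functions
# or run it as a pipeline step with:
#   dbutils.notebook.run("/shared/abc_common_functions", 600)

def standardise_nhs_number(raw: str) -> str:
    """Example shared function: strip spaces from a 10-digit NHS number."""
    cleaned = raw.replace(" ", "")
    if len(cleaned) != 10 or not cleaned.isdigit():
        raise ValueError(f"not a valid NHS number: {raw!r}")
    return cleaned
```

Keeping a function like this in one notebook means a fix made there is picked up by every pipeline that re-uses it.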
Delete temporary tables after use
Delete any temporary tables created as intermediate steps when a notebook is executed. Deleting them saves storage, especially if the notebook is scheduled to run daily.
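A clean-up step at the end of a notebook can handle this automatically. The table names below are illustrative; on Databricks each generated statement would be executed with spark.sql(...):

```python
# Hypothetical intermediate tables created earlier in the notebook.
TEMP_TABLES = ["abc_stage1_raw", "abc_stage2_dedup"]

def drop_statements(tables):
    """Build DROP TABLE statements; IF EXISTS keeps the clean-up re-runnable."""
    return [f"DROP TABLE IF EXISTS {t}" for t in tables]

for statement in drop_statements(TEMP_TABLES):
    print(statement)  # on Databricks: spark.sql(statement)
```

Using IF EXISTS means the clean-up cell does not fail if the notebook is re-run after the tables have already been removed.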
Last edited: 11 January 2024 1:57 pm