Part of Using Databricks in the NHS England Secure Data Environment
Improving code efficiency
Writing efficient code can reduce the time your code takes to run. It can also reduce the burden on the system, help optimise performance, and avoid memory issues.
We have listed below some common issues and quick wins that can make your code run more efficiently.
Avoid creating large chains of views
A view is a virtual table whose contents are defined by a query.
A large chain of views can take a long time to run, overburden the query optimiser, and potentially reduce service performance for other users. Views should therefore only be evaluated when needed, such as when writing to a table or pandas DataFrame, making a graph, or displaying results.
Your query will run faster if you convert key stages of the chain into tables. If the output of a view is going to be used multiple times, or joined to other items, you should consider making it a table.
Example
CREATE TABLE <database_name>.<table_name> AS
<select statement>;

ALTER TABLE <database_name>.<table_name> OWNER TO <database_name>;
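The reason materialising helps can be sketched in plain Python (an analogy only, not Databricks code; the name `expensive_step` is illustrative). A view behaves like a lazy computation that is re-run every time it is referenced, whereas a table is a materialised result that is computed once and reused:

```python
# Hypothetical plain-Python analogy: a view is re-evaluated per use,
# a table is computed once and stored.
calls = {"n": 0}

def expensive_step(rows):
    calls["n"] += 1          # count how many times the "view" is evaluated
    return [r * 2 for r in rows]

rows = [1, 2, 3]

# "View-like": the work is repeated for every consumer of the result.
first = expensive_step(rows)
second = expensive_step(rows)
assert calls["n"] == 2

# "Table-like": materialise once, then reuse the stored result.
calls["n"] = 0
materialised = expensive_step(rows)
first = materialised
second = materialised
assert calls["n"] == 1
```

The same trade-off applies to long view chains: each downstream reference repeats the upstream work unless a key stage is written to a table.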
Reduce data sample before performing complex operations
Complex operations, such as JOIN and DISTINCT, put a significant computational strain on the system and can reduce performance. We recommend users filter out unnecessary data before performing a complex operation.
When using a WHERE clause to reduce a data sample, the WHERE clause should come before the complex operation, if possible. This will optimise processing by taking up less memory in the cluster, allowing the query to run faster.
Less efficient coding
In this example, the data is reduced after performing the JOIN operation. As a result, it uses system resources less efficiently.
Example
SELECT *
FROM db.table_a
INNER JOIN db.table_b
ON table_a.column1 = table_b.column1
WHERE table_a.column1 IS NOT NULL
Efficient coding (recommended)
In this example, the data is reduced before performing the JOIN operation. As a result, it uses system resources more efficiently.
Example
SELECT *
FROM
(SELECT *
FROM db.table_a
WHERE column1 IS NOT NULL
) table_a
INNER JOIN db.table_b
ON table_a.column1 = table_b.column1
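The effect of filtering first can be sketched in plain Python (an analogy only, not Spark code; the table contents are made up for illustration). The join loop only has to touch the rows that survive the filter:

```python
# Hypothetical plain-Python sketch: filter before joining so the join
# works on a smaller sample.
table_a = [{"column1": v} for v in [1, None, 2, None, 3]]
table_b = [{"column1": v} for v in [2, 3, 4]]

# Filter first (the "WHERE before the JOIN" pattern)...
filtered_a = [row for row in table_a if row["column1"] is not None]

# ...then join the reduced sample.
joined = [
    (a, b)
    for a in filtered_a
    for b in table_b
    if a["column1"] == b["column1"]
]
```

In Spark the optimiser can often push filters down automatically, but writing the filter before the join makes the intent explicit and keeps the reduction reliable.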
Efficient use of coding for large comparisons
A query such as WHERE column_a NOT IN (SELECT column_a FROM …) will run into memory issues and take a long time to run if the SELECT clause is very large. The same would be true for a query which includes WHERE EXISTS (SELECT 1 FROM … WHERE…).
Query processing is distributed across Databricks’ compute resources, so using a JOIN or ANTI JOIN query will reduce the burden on the system, helping your code to run more efficiently.
You can use NOT IN, IN or EXISTS when the list of records is small (fewer than 1,000 records). For example, WHERE column NOT IN ('code_a', 'code_b').
Example for SQL
Less efficient coding
In this example, the SELECT clause on table_b must be evaluated for each row in table_a. If table_b is large, you are likely to run into memory issues.
SELECT *
FROM db.table_a
WHERE column1 NOT IN (SELECT DISTINCT column1 FROM db.table_b)
Efficient coding (recommended)
In this example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.
Example
SELECT *
FROM db.table_a
LEFT ANTI JOIN (SELECT DISTINCT column1
FROM db.table_b) table_b
ON table_a.column1 = table_b.column1
The same principles also apply for Pyspark.
Example for Pyspark
Less efficient coding
In this example, the list built from table_b must be checked for each row in table_a. If table_b is large, you are likely to run into memory issues.
Example
sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_massive_list = (sdf_b.select('column1').distinct()
.toPandas()['column1'].to_list())
sdf_filtered = sdf_a.where(~sdf_a['column1'].isin(sdf_b_massive_list))
Efficient coding (recommended)
In this example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.
Example
sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_cutdown = sdf_b.select('column1').distinct()
sdf_filtered = sdf_a.join(sdf_b_cutdown, sdf_a.column1 == sdf_b_cutdown.column1, 'left_anti')
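The semantics of a left anti join can be sketched in plain Python (an analogy only, not Spark code; the values are made up for illustration). Building a set gives constant-time membership checks, which is the in-memory analogue of letting Spark distribute the anti join instead of scanning a long NOT IN list once per row:

```python
# Hypothetical plain-Python model of a left anti join: keep rows from
# table_a whose column1 value has no match in table_b.
table_a = [1, 2, 3, 4, 5]
table_b = [2, 4, 6]

# A set makes each membership check O(1) instead of scanning a list.
b_values = set(table_b)
anti_joined = [v for v in table_a if v not in b_values]
# anti_joined is [1, 3, 5]
```

Rows 2 and 4 are dropped because they have matches in table_b, mirroring what the LEFT ANTI JOIN examples above return.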
Last edited: 11 January 2024 1:30 pm