Part of Using Databricks in the NHS England Secure Data Environment
Improving code efficiency
Writing efficient code can reduce the time your code takes to run. It can also reduce the burden on the system, help optimise performance, and avoid memory issues.
We have listed below some common issues and quick wins that can make your code run more efficiently.
Avoid creating large chains of views
A view is a virtual table whose contents are defined by a query.
A large chain of views can take a long time to run, overburden the query optimiser, and potentially reduce service performance for other users. Views should therefore only be evaluated when needed, such as when writing to a table or pandas DataFrame, making a graph, or displaying results.
Your query will run faster if you convert key stages of the chain into tables. If the output of a view is going to be used multiple times, or joined to other items, you should consider making it a table.
Example
CREATE TABLE <database_name>.<table_name> AS
<select statement>;

ALTER TABLE <database_name>.<table_name> OWNER TO <database_name>;
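The reason materialising helps can be sketched in plain Python (an analogy only, not Databricks code; the name `expensive_step` is illustrative). A view behaves like a lazy computation that is re-run every time it is referenced, whereas a table is a materialised result that is computed once and reused:

```python
# Hypothetical plain-Python analogy: a view is re-evaluated per use,
# a table is computed once and stored.
calls = {"n": 0}

def expensive_step(rows):
    calls["n"] += 1          # count how many times the "view" is evaluated
    return [r * 2 for r in rows]

rows = [1, 2, 3]

# "View-like": the work is repeated for every consumer of the result.
first = expensive_step(rows)
second = expensive_step(rows)
assert calls["n"] == 2

# "Table-like": materialise once, then reuse the stored result.
calls["n"] = 0
materialised = expensive_step(rows)
first = materialised
second = materialised
assert calls["n"] == 1
```

The same trade-off applies to long view chains: each downstream reference repeats the upstream work unless a key stage is written to a table.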
Reduce data sample before performing complex operations
Complex operations, such as JOIN and DISTINCT, put a significant computational strain on the system and can reduce performance. We recommend users filter out unnecessary data before performing a complex operation.
When using a WHERE clause to reduce a data sample, the WHERE clause should come before the complex operation, if possible. This will optimise processing by taking up less memory in the cluster, allowing the query to run faster.
Less efficient coding
In this example, the data is reduced after performing the JOIN operation. As a result, it uses system resources less efficiently.
Example
SELECT *
FROM db.table_a
INNER JOIN db.table_b
ON table_a.column1 = table_b.column1
WHERE table_a.column1 IS NOT NULL
Efficient coding (recommended)
In this example, the data is reduced before performing the JOIN operation. As a result, it uses system resources more efficiently.
Example
SELECT *
FROM
(SELECT *
FROM db.table_a
WHERE column1 IS NOT NULL
) table_a
INNER JOIN db.table_b
ON table_a.column1 = table_b.column1
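The effect of filtering first can be sketched in plain Python (an analogy only, not Spark code; the table contents are made up for illustration). The join loop only has to touch the rows that survive the filter:

```python
# Hypothetical plain-Python sketch: filter before joining so the join
# works on a smaller sample.
table_a = [{"column1": v} for v in [1, None, 2, None, 3]]
table_b = [{"column1": v} for v in [2, 3, 4]]

# Filter first (the "WHERE before the JOIN" pattern)...
filtered_a = [row for row in table_a if row["column1"] is not None]

# ...then join the reduced sample.
joined = [
    (a, b)
    for a in filtered_a
    for b in table_b
    if a["column1"] == b["column1"]
]
```

In Spark the optimiser can often push filters down automatically, but writing the filter before the join makes the intent explicit and keeps the reduction reliable.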
Efficient use of coding for large comparisons
A query such as WHERE column_a NOT IN (SELECT column_a FROM …) will run into memory issues and take a long time to run if the SELECT clause is very large. The same would be true for a query which includes WHERE EXISTS (SELECT 1 FROM … WHERE…).
Query processing is distributed across Databricks’ compute resources, so using a JOIN or ANTI JOIN query will reduce the burden on the system, helping your code to run more efficiently.
You can use NOT IN, IN or EXISTS when the list of records is small (fewer than 1,000 records). For example, WHERE column NOT IN ('code_a', 'code_b').
Example for SQL
Less efficient coding
In this example, the SELECT clause on table_b must be evaluated for each row in table_a. If table_b is large, you are likely to run into memory issues.
SELECT *
FROM db.table_a
WHERE column1 NOT IN (SELECT DISTINCT column1 FROM db.table_b)
Efficient coding (recommended)
In this example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.
Example
SELECT *
FROM db.table_a
LEFT ANTI JOIN (SELECT DISTINCT column1
FROM db.table_b) table_b
ON table_a.column1 = table_b.column1
The same principles also apply for Pyspark.
Example for Pyspark
Less efficient coding
In this example, the list built from table_b must be checked for each row in table_a. If table_b is large, you are likely to run into memory issues.
Example
sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_massive_list = (sdf_b.select('column1').distinct()
.toPandas()['column1'].to_list())
sdf_filtered = sdf_a.where(~sdf_a['column1'].isin(sdf_b_massive_list))
Efficient coding (recommended)
In this example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.
Example
sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_cutdown = sdf_b.select('column1').distinct()
sdf_filtered = sdf_a.join(sdf_b_cutdown, sdf_a.column1 == sdf_b_cutdown.column1, 'left_anti')
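The semantics of a left anti join can be sketched in plain Python (an analogy only, not Spark code; the values are made up for illustration). Building a set gives constant-time membership checks, which is the in-memory analogue of letting Spark distribute the anti join instead of scanning a long NOT IN list once per row:

```python
# Hypothetical plain-Python model of a left anti join: keep rows from
# table_a whose column1 value has no match in table_b.
table_a = [1, 2, 3, 4, 5]
table_b = [2, 4, 6]

# A set makes each membership check O(1) instead of scanning a list.
b_values = set(table_b)
anti_joined = [v for v in table_a if v not in b_values]
# anti_joined is [1, 3, 5]
```

Rows 2 and 4 are dropped because they have matches in table_b, mirroring what the LEFT ANTI JOIN examples above return.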
Last edited: 11 January 2024 1:30 pm