Part of SNOMED CT PaLM Mapping Best Practice
Pre-processing source and target terminology
To optimise the mapping process, initial pre-processing of source and target terminology data is required. This involves data cleansing and applying computable rules to enhance the data, which can be facilitated by parsing terminology into component building blocks.
The challenge
Underlying differences between the human-readable descriptions applied to source and target reportables can make it difficult for mapping tools to establish correct target maps. When representing a reportable for a lab’s local use-case, a single word describing the lab test’s component is often sufficient because users understand the additional context (that is, a test’s property, specimen, and technique). However, for a national use-case like SNOMED CT PaLM, a clear and unambiguous description is required, stating exactly what is being measured, how it will be measured, and where the sample was taken, so that codes can be processed by third parties who do not understand the local context.
The table below illustrates the significant differences between typical local reportables’ human-readable descriptions and those used in SNOMED CT PaLM.
Hospital A local description | SNOMED PaLM FSN | SNOMED PaLM preferred term |
---|---|---|
Creatinine | Substance concentration of creatinine in serum (observable entity) | Creatinine substance concentration in serum
Sodium | Substance concentration of sodium in serum (observable entity) | Sodium substance concentration in serum |
Potassium | Substance concentration of potassium in serum (observable entity) | Potassium substance concentration in serum |
Haemoglobin | Mass concentration of haemoglobin in blood (observable entity) | Haemoglobin mass concentration in blood |
Albumin | Mass concentration of albumin in serum (observable entity) | Albumin mass concentration in serum |
Platelets | Platelet count in blood (observable entity) | Platelet count in blood |
Consequently, pre-processing source data to generate a SNOMED CT PaLM-like string that can be loaded into a mapping tool is of great benefit: the closer the source data resembles a SNOMED CT PaLM description, the easier it is to map.
Data cleansing
Data cleansing involves specifying the structure of the data, defining translation rules, and enhancing the richness of the source terminology and, where required, the target terminology. See Techniques for source data cleansing.
SNOMED CT PaLM is an explicit, unambiguous target terminology, so no target terminology enhancements are required.
Parsing terminology into component building blocks
Parsing involves splitting source and target codes’ human-readable descriptions into their component parts, which are equivalent to the coded attributes or building blocks. This can be used to facilitate mapping by matching each of the parsed components in the source and target codes.
For example, the SNOMED CT PBCL concept 'Serum creatinine level' can be parsed into the components:
- Property=Level
- Component=Creatinine
- Specimen=Serum.
In the same way the SNOMED CT PaLM concept 'Substance concentration of creatinine in serum' can be parsed into the components:
- Property=Substance concentration
- Component=Creatinine
- Specimen=Serum.
The parsed components can be used to create a candidate map from the SNOMED CT PBCL concept 'Serum creatinine level' to SNOMED CT PaLM concept 'Substance concentration of creatinine in serum'.
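As an illustration, this kind of parsing can be sketched with simple string patterns. This is a minimal sketch only: the function names and regular expressions below are assumptions for this example, and a real parser would need to handle many more description shapes than the two shown here.

```python
import re

# Illustrative patterns only: one for PaLM-style descriptions such as
# 'Substance concentration of creatinine in serum', and one for PBCL-style
# descriptions such as 'Serum creatinine level'.
PALM_PATTERN = re.compile(r"^(?P<property>.+?) of (?P<component>.+?) in (?P<specimen>.+)$")
PBCL_PATTERN = re.compile(r"^(?P<specimen>\S+) (?P<component>.+?) (?P<property>level)$")

def parse_palm(description: str):
    """Split a SNOMED CT PaLM-style description into its building blocks."""
    m = PALM_PATTERN.match(description)
    return m.groupdict() if m else None

def parse_pbcl(description: str):
    """Split a SNOMED CT PBCL-style description into its building blocks."""
    m = PBCL_PATTERN.match(description)
    return m.groupdict() if m else None
```

A candidate map can then be proposed wherever the component and specimen fields of a parsed source description agree with those of a parsed target description.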
Value of source data cleansing
Many mapping tools use string-based matching algorithms such as Levenshtein distance to establish map targets. Levenshtein distance measures the minimum edits (insertions, deletions, substitutions) needed to transform one string into another (read the Advanced Analysis Components to Support SNOMED PaLM Mapping project paper for further details). Using this technique, the example below illustrates the value of source data cleansing to enhance a mapping tool’s ability to establish correct map targets.
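The distance itself is straightforward to compute with dynamic programming. The sketch below is an illustrative implementation for this document, not the tooling used in the project:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

target = "Substance concentration of creatinine in serum"
print(levenshtein("Creatinine", target))  # 37, the raw-form distance quoted below
```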
In this example, the local reportable 'Creatinine' is to be mapped to SNOMED CT PaLM. The lab test’s property and specimen have been established via other sources and the correct map target should be 'Substance concentration of creatinine in serum'.
In its raw form, the number of changes between the source description and the target description is 37. In a list of SNOMED CT PaLM reportables that contain the string ‘creatinine’, the desired target appears 5th, clearly demonstrating how the source description alone is not enough to maximise the chances of getting a correct match.
Rank | SNOMED PaLM description | Levenshtein distance |
---|---|---|
1 | Urine albumin:creatinine ratio | 21 |
2 | Urine C-peptide/creatinine ratio | 22
3 | Creatinine renal clearance in 24 hours | 28 |
4 | Clearance ratio of calcium to creatinine | 30 |
5 | Substance concentration of creatinine in serum | 37 |
6 | Substance concentration of creatinine in fluid | 37 |
7 | Substance concentration of creatinine in urine | 37 |
The table below shows that when property and specimen information about the lab test is processed into the string, the Levenshtein distance decreases to zero, thereby establishing the appropriate target.
Information | SNOMED CT PaLM description | Levenshtein distance |
---|---|---|
Component | Creatinine | 37 |
Component and specimen | Creatinine in serum | 28 |
Component and property | Substance concentration of creatinine | 9 |
Component, property and specimen | Substance concentration of creatinine in serum | 0 |
The table below shows how processing this additional information into source descriptions improved the overall rate of correct map targets when testing the mapping of the 'top 300' reportables.
Source information | Correct map target |
---|---|
Component | 41% |
Component and property | 58% |
Component and specimen | 68% |
Component, property and specimen | 70% |
Techniques for source data cleansing
Establishing source data via existing Read PBCL/SNOMED CT PBCL maps
Using the information in existing Read PBCL/SNOMED CT PBCL maps provides an alternative means to establish appropriate source data. See example below.
Hospital local code: CREA
Hospital local description: Creatinine
Hospital units of measure (UoM): umol/L
Read PBCL code: 44J3.
Read PBCL description: Serum creatinine
SNOMED PBCL concept ID: 1000731000000107
PBCL description: Serum creatinine level
SNOMED PaLM concept ID: 1107001000000108
SNOMED PaLM FSN: Substance concentration of creatinine in serum (observable entity)
SNOMED PaLM preferred term: Creatinine substance concentration in serum
In this example, the component and UoM are already available as source data, whilst the specimen type (serum) can be determined from the mapped Read PBCL/SNOMED CT PBCL descriptions. The same process can help establish a lab test’s property where a unit of measure is unavailable.
Using LLMs to specify the structure of the data, specify translation rules, and enhance the richness of the input terminology
A large language model (LLM) is an advanced artificial intelligence system trained on vast amounts of text data to understand and generate human-like language.
In the context of mapping to SNOMED CT PaLM, testing found LLMs useful as a means to cleanse labs’ source data and convert it into a format that facilitated mapping. This was achieved via the LLMs’ ability to expand acronyms and extract the key elements required for mapping (for example, identifying a lab test’s component from the source code description, the test’s property from associated UoM data, and the specimen from the existing Read PBCL/SNOMED PBCL map). In combination, these elements were used to instruct the LLM to create a SNOMED CT PaLM-like string for ingestion into a mapping tool as source data.
A prompt is an instruction given to an LLM to guide it to a desired response. In this context, a prompt was used in testing to cleanse source data and create a SNOMED CT PaLM-like string; it is available in the Appendix. To avoid LLM 'hallucination', the prompt needs to be engineered as per the instructions below.
Add context outlining what you are trying to achieve and the context for use
Explain the inputs, including file names and column headings
Explain the outputs, including column headings
Explain the processing steps required to get to the output, which typically include:
- Expanding any acronyms associated with a lab test
- Identifying the lab test component from the source description
- Identifying the lab test property from the associated UoM
- Identifying the lab test specimen by: extracting from the existing Read PBCL/SNOMED CT PBCL map, predicting the specimen when not otherwise available, and using a specimen mapping table
- Identifying the technique
- Combining the information into a form that can be processed by a terminology mapping tool
- Keeping expanded forms of any acronym
- Keeping any strings that do not meet the standard SNOMED CT PaLM pattern
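The steps above can be combined into a prompt skeleton along the following lines. This is an illustrative structure only, not the tested prompt (which is in the Appendix); the file and column names are placeholders.

```python
def build_cleansing_prompt(input_file: str, output_file: str) -> str:
    """Assemble a data-cleansing prompt following the engineering steps above.

    All file and column names are placeholders for illustration.
    """
    return "\n".join([
        "Context: you are cleansing lab source data so it can be mapped to SNOMED CT PaLM.",
        f"Inputs: {input_file} with columns LocalCode, LocalDescription, UoM, PBCLDescription.",
        f"Outputs: {output_file} with a column PaLMLikeString.",
        "Processing steps:",
        "1. Expand any acronyms associated with a lab test.",
        "2. Identify the lab test component from the source description.",
        "3. Identify the lab test property from the associated UoM.",
        "4. Identify the lab test specimen from the PBCL description, predicting it when not otherwise available.",
        "5. Identify the technique.",
        "6. Combine the information into a SNOMED CT PaLM-like string.",
        "7. Keep expanded forms of any acronym.",
        "8. Keep any strings that do not meet the standard SNOMED CT PaLM pattern.",
    ])
```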
Last edited: 22 May 2025 4:25 pm