Part of SNOMED CT PaLM Mapping Best Practice
Pre-processing source and target terminology
To optimise the mapping process, initial pre-processing of source and target terminology data is required. This involves data cleansing and applying computable rules to enhance the data, which can be facilitated by parsing terminology into component building blocks.
The challenge
Underlying differences between the human-readable descriptions applied to source and target reportables can make it difficult for mapping tools to establish correct target maps. When representing a reportable for a lab’s local use-case, a single word describing the lab test’s component is often sufficient because users understand the additional context (that is, a test’s property, specimen, and technique). However, for a national use-case like SNOMED CT PaLM, a clear and unambiguous description is required, stating exactly what is being measured, how it will be measured, and where the sample was taken, so that codes can be processed by third parties who do not understand the local context.
The table below illustrates the significant differences between typical local reportables’ human-readable descriptions and those used in SNOMED CT PaLM.
Hospital A local description | SNOMED PaLM FSN | SNOMED PaLM preferred term |
---|---|---|
Creatinine | Substance concentration of creatinine in serum (observable entity) | Creatinine substance concentration in serum
Sodium | Substance concentration of sodium in serum (observable entity) | Sodium substance concentration in serum |
Potassium | Substance concentration of potassium in serum (observable entity) | Potassium substance concentration in serum |
Haemoglobin | Mass concentration of haemoglobin in blood (observable entity) | Haemoglobin mass concentration in blood |
Albumin | Mass concentration of albumin in serum (observable entity) | Albumin mass concentration in serum |
Platelets | Platelet count in blood (observable entity) | Platelet count in blood |
Consequently, pre-processing source data to generate a SNOMED CT PaLM-like string that can be loaded into a mapping tool is of great benefit: the closer the source data resembles a SNOMED CT PaLM description, the easier it is to map.
Data cleansing
Data cleansing involves specifying the structure of the data, defining translation rules, and enhancing the richness of the source terminology and, where required, the target terminology. See Techniques for source data cleansing.
SNOMED CT PaLM is an explicit, unambiguous target terminology, so no target terminology enhancements are required.
Parsing terminology into component building blocks
Parsing involves splitting source and target codes’ human-readable descriptions into their component parts, which are equivalent to the coded attributes or building blocks. This can be used to facilitate mapping by matching each of the parsed components in the source and target codes.
For example, the SNOMED CT PBCL concept 'Serum creatinine level' can be parsed into the components:
- Property=Level
- Component=Creatinine
- Specimen=Serum.
In the same way the SNOMED CT PaLM concept 'Substance concentration of creatinine in serum' can be parsed into the components:
- Property=Substance concentration
- Component=Creatinine
- Specimen=Serum.
The parsed components can be used to create a candidate map from the SNOMED CT PBCL concept 'Serum creatinine level' to SNOMED CT PaLM concept 'Substance concentration of creatinine in serum'.
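As an illustration, this kind of parsing can be sketched with simple string patterns. This is a minimal sketch only: the function names and regular expressions below are assumptions for this example, and a real parser would need to handle many more description shapes than the two shown here.

```python
import re

# Illustrative patterns only: one for PaLM-style descriptions such as
# 'Substance concentration of creatinine in serum', and one for PBCL-style
# descriptions such as 'Serum creatinine level'.
PALM_PATTERN = re.compile(r"^(?P<property>.+?) of (?P<component>.+?) in (?P<specimen>.+)$")
PBCL_PATTERN = re.compile(r"^(?P<specimen>\S+) (?P<component>.+?) (?P<property>level)$")

def parse_palm(description: str):
    """Split a SNOMED CT PaLM-style description into its building blocks."""
    m = PALM_PATTERN.match(description)
    return m.groupdict() if m else None

def parse_pbcl(description: str):
    """Split a SNOMED CT PBCL-style description into its building blocks."""
    m = PBCL_PATTERN.match(description)
    return m.groupdict() if m else None
```

A candidate map can then be proposed wherever the component and specimen fields of a parsed source description agree with those of a parsed target description.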
Value of source data cleansing
Many mapping tools use string-based matching algorithms such as Levenshtein distance to establish map targets. Levenshtein distance measures the minimum edits (insertions, deletions, substitutions) needed to transform one string into another (read the Advanced Analysis Components to Support SNOMED PaLM Mapping project paper for further details). Using this technique, the example below illustrates the value of source data cleansing to enhance a mapping tool’s ability to establish correct map targets.
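The distance itself is straightforward to compute with dynamic programming. The sketch below is an illustrative implementation for this document, not the tooling used in the project:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

target = "Substance concentration of creatinine in serum"
print(levenshtein("Creatinine", target))  # 37, the raw-form distance quoted below
```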
In this example, the local reportable 'Creatinine' is to be mapped to SNOMED CT PaLM. The lab test’s property and specimen have been established via other sources and the correct map target should be 'Substance concentration of creatinine in serum'.
In its raw form, the number of changes between the source description and the target description is 37. In a list of SNOMED CT PaLM reportables that contain the string ‘creatinine’, the desired target appears 5th, clearly demonstrating how the source description alone is not enough to maximise the chances of getting a correct match.
Rank | SNOMED PaLM description | Levenshtein distance |
---|---|---|
1 | Urine albumin:creatinine ratio | 21 |
2 | Urine C-peptide/creatinine ratio | 22
3 | Creatinine renal clearance in 24 hours | 28 |
4 | Clearance ratio of calcium to creatinine | 30 |
5 | Substance concentration of creatinine in serum | 37 |
6 | Substance concentration of creatinine in fluid | 37 |
7 | Substance concentration of creatinine in urine | 37 |
The table below shows that when property and specimen information about the lab test is processed into the string, the Levenshtein distance decreases to zero, thereby establishing the appropriate target.
Information | SNOMED CT PaLM description | Levenshtein distance |
---|---|---|
Component | Creatinine | 37 |
Component and specimen | Creatinine in serum | 28 |
Component and property | Substance concentration of creatinine | 9 |
Component, property and specimen | Substance concentration of creatinine in serum | 0 |
The table below shows how processing this additional information into source descriptions improved the overall rate of correct map targets when testing the mapping of the 'top 300' reportables.
Source information | Correct map target |
---|---|
Component | 41% |
Component and property | 58% |
Component and specimen | 68% |
Component, property and specimen | 70% |
Techniques for source data cleansing
Establishing source data via existing Read PBCL/SNOMED CT PBCL maps
Using the information in existing Read PBCL/SNOMED CT PBCL maps provides an alternative means to establish appropriate source data. See example below.
Hospital local code: CREA
Hospital local description: Creatinine
Hospital units of measure (UoM): umol/L
Read PBCL code: 44J3.
Read PBCL description: Serum creatinine
SNOMED PBCL concept ID: 1000731000000107
PBCL description: Serum creatinine level
SNOMED PaLM concept ID: 1107001000000108
SNOMED PaLM FSN: Substance concentration of creatinine in serum (observable entity)
SNOMED PaLM preferred term: Creatinine substance concentration in serum
In this example, the component and UoM are already available as source data, whilst the specimen type (serum) can be determined from the mapped Read PBCL/SNOMED CT PBCL descriptions. The same process can help establish a lab test’s property where a unit of measure is unavailable.
Using LLMs to specify the structure of the data, specify translation rules, and enhance the richness of the input terminology
A large language model (LLM) is an advanced artificial intelligence system trained on vast amounts of text data to understand and generate human-like language.
In the context of mapping to SNOMED CT PaLM, testing found LLMs useful as a means to cleanse labs’ source data and convert it into a format that facilitated mapping. This was achieved via the LLMs’ ability to expand acronyms and extract the key elements required for mapping (for example, identifying a lab test’s component from the source code description, the test’s property from associated UoM data, and the specimen from the existing Read PBCL/SNOMED PBCL map). In combination, these elements were used to instruct the LLM to create a SNOMED CT PaLM-like string for ingestion into a mapping tool as source data.
A prompt is an instruction given to an LLM to guide it to a desired response. In this context, a prompt was used in testing to cleanse source data and create a SNOMED CT PaLM-like string; it is available in the Appendix. To avoid LLM 'hallucination', the prompt needs to be engineered as per the instructions below.
Add context outlining what you are trying to achieve and the context for use
Explain the inputs, including file names and column headings
Explain the outputs, including column headings
Explain the processing steps required to get to the output, which typically include:
- Expanding any acronyms associated with a lab test
- Identifying the lab test component from the source description
- Identifying the lab test property from the associated UoM
- Identifying the lab test specimen by: extracting from the existing Read PBCL/SNOMED CT PBCL map, predicting the specimen when not otherwise available, and using a specimen mapping table
- Identifying the technique
- Combining the information into a form that can be processed by a terminology mapping tool
- Keeping expanded forms of any acronym
- Keeping any strings that do not meet the standard SNOMED CT PaLM pattern
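The steps above can be combined into a prompt skeleton along the following lines. This is an illustrative structure only, not the tested prompt (which is in the Appendix); the file and column names are placeholders.

```python
def build_cleansing_prompt(input_file: str, output_file: str) -> str:
    """Assemble a data-cleansing prompt following the engineering steps above.

    All file and column names are placeholders for illustration.
    """
    return "\n".join([
        "Context: you are cleansing lab source data so it can be mapped to SNOMED CT PaLM.",
        f"Inputs: {input_file} with columns LocalCode, LocalDescription, UoM, PBCLDescription.",
        f"Outputs: {output_file} with a column PaLMLikeString.",
        "Processing steps:",
        "1. Expand any acronyms associated with a lab test.",
        "2. Identify the lab test component from the source description.",
        "3. Identify the lab test property from the associated UoM.",
        "4. Identify the lab test specimen from the PBCL description, predicting it when not otherwise available.",
        "5. Identify the technique.",
        "6. Combine the information into a SNOMED CT PaLM-like string.",
        "7. Keep expanded forms of any acronym.",
        "8. Keep any strings that do not meet the standard SNOMED CT PaLM pattern.",
    ])
```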
Last edited: 22 May 2025 4:25 pm