
Part of Advanced analysis components to support SNOMED PaLM mapping project

Extension beyond current requirements - parsing unstructured data to structured



While beyond the current scope, future phases may involve extracting relevant entities from full pathology reports before mapping to PaLM codes. Medical free text poses challenges due to domain-specific terminology, abbreviations, and informal or noisy writing. Additionally, these reports often include ambiguities and shorthand notations that, while intuitive for human clinicians, pose significant difficulties for traditional natural language processing (NLP) models.

Potential approaches include:

  • ontology lookups - similar to current methods but extended to raw text
  • named entity recognition (NER) - identifies spans of text referring to specific entities (such as diseases and measurements)
  • question answering (QA) pipelines using large language models (LLMs)  - treat extraction as a reading comprehension task, allowing more flexible, context-aware answers
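As a concrete illustration of the first approach, a dictionary-based ontology lookup over raw text can be sketched as follows. The terms and codes below are invented placeholders for illustration only, not real SNOMED-CT content:

```python
# Minimal sketch of a dictionary-based ontology lookup over raw text.
# The terms and codes are invented placeholders, not real SNOMED-CT content.
TOY_ONTOLOGY = {
    "chronic tubulointerstitial nephritis": "TOY-001",
    "nephritis": "TOY-002",
    "rejection": "TOY-003",
}

def ontology_lookup(text: str) -> list[tuple[str, str]]:
    """Greedy longest-match lookup: prefer longer terms over their substrings."""
    matches = []
    lowered = text.lower()
    covered = [False] * len(lowered)
    # Try longer terms first so 'chronic tubulointerstitial nephritis'
    # wins over the embedded 'nephritis'.
    for term in sorted(TOY_ONTOLOGY, key=len, reverse=True):
        start = lowered.find(term)
        while start != -1:
            end = start + len(term)
            if not any(covered[start:end]):
                matches.append((term, TOY_ONTOLOGY[term]))
                for i in range(start, end):
                    covered[i] = True
            start = lowered.find(term, end)
    return matches
```

On a sentence mentioning chronic tubulointerstitial nephritis, this returns the full term rather than the embedded 'nephritis'. Real pipelines add tokenisation, normalisation, and fuzzy matching on top of this longest-match core.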

Named entity recognition (NER)

The emergence of word embeddings marked a shift towards deep learning-based biomedical NER. scispaCy15, for instance, integrated UMLS linkages and offered pre-trained embeddings tailored to biomedical texts. Biomedical adaptations of the transformer architecture BERT (for example, BioBERT16 and ClinicalBERT17) significantly advanced the field by capturing contextual information and long-range dependencies. MedCAT18 built upon this by incorporating spelling correction, contextual disambiguation (such as differentiating 'hr' as 'hour' or 'heart rate'), and flexible entity linking to UMLS and SNOMED-CT, improving standardisation potential. These BERT-based models outperformed earlier systems on benchmark datasets but often require fine-tuning on large, annotated datasets (corpora), which limits their applicability in resource-constrained environments.

Recent NER models have focused on adaptability to unseen entities at test time, often using BERT backbones within their overall architectures. GLiNER19 approaches NER as a compatibility task between entity type prompts and text spans. However, GLiNER cannot process discontinuous spans or partially-nested NER, limiting its application in medical reports where more complex expressions of symptoms are possible.

Currently, NER is effective for straightforward tagging tasks but often struggles with the variability, longer spans, and implicit entity relations within medical text. Additionally, pathology reports often contain meta-annotations like negations (such as 'there is no evidence of rejection') and uncertainties (such as 'there are features suggestive of chronic tubulointerstitial nephritis'). Standard medical NER models like medspaCy20 prove too limited, and while more advanced models like MedCAT21 can handle meta-annotations, they do not capture the variability of entity descriptions.
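The meta-annotation problem can be sketched with simple trigger-phrase rules in the spirit of NegEx-style algorithms. The trigger lists here are illustrative and far smaller than the validated lexicons that tools like medspaCy and MedCAT use, and real systems also handle trigger scope and termination:

```python
# Illustrative trigger phrases only; production systems use much larger,
# validated lexicons plus scope handling (e.g. NegEx-style rules in medspaCy).
NEGATION_TRIGGERS = ["no evidence of", "there is no", "negative for"]
UNCERTAINTY_TRIGGERS = ["suggestive of", "possibly", "cannot exclude"]

def classify_mention(sentence: str, entity: str) -> str:
    """Label an entity mention as 'negated', 'uncertain', or 'affirmed'
    based on trigger phrases appearing before it in the sentence."""
    s = sentence.lower()
    pos = s.find(entity.lower())
    if pos == -1:
        raise ValueError(f"entity {entity!r} not found in sentence")
    prefix = s[:pos]
    if any(t in prefix for t in NEGATION_TRIGGERS):
        return "negated"
    if any(t in prefix for t in UNCERTAINTY_TRIGGERS):
        return "uncertain"
    return "affirmed"
```

Applied to the two report snippets above, 'rejection' would be labelled negated and 'chronic tubulointerstitial nephritis' uncertain; it is exactly this variability of expression that rule sets struggle to enumerate exhaustively.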

NER does offer some advantages over the QA approach, mainly in its ability to capture precise spans consistently for simpler entities such as numbers and common single words. However, the span-based focus presents challenges for some entities, particularly when a single span could correspond to multiple entities. Even simple entities like binary variables do not always have explicit answers and sometimes require inference from contextual information. Moreover, the fundamental question of defining appropriate spans per entity remains hard to formalise into a set of rules.

A more flexible approach is required and so we now turn our attention to the use of large language models.


Large language models (LLMs) for pathology reports

Large language models (LLMs) have been applied to a variety of pathology reports, with different focuses such as:

  • domain-specific tokenisation
  • active learning, data augmentation
  • large datasets
  • localised models

PathologyBERT22 introduced a custom tokeniser to address limitations of WordPiece tokenisers, which can break down medical terms in ways that lose semantic meaning (for example, 'carcinoma' into ['car', 'cin', 'oma']). Using 340,492 unstructured histopathology reports from 67,136 patients at Emory University Hospital between 1981 and 2021, alongside a labelled set of 6,681 reports from 3,155 patients, they demonstrated greater coverage of pathology-specific terminology than previous clinical language models like BlueBERT23 and ClinicalBERT. Mu et al.24 generated diagnostically relevant embeddings from 11,000 bone marrow pathology synopses using a BERT model enhanced with active learning.

The active learning approach minimised manual annotation efforts while achieving strong performance, with only 350 annotated synopses required to reach an F1 score plateau, outperforming random sampling strategies. Zeng and colleagues25 trained a BERT-based NER model on 1,438 annotated US pathology reports and tested it on a separate dataset of 55 reports from the UAE. They found that data augmentation strategies, such as mention replacement, synonym replacement, label-wise token replacement, and segment shuffling, led to more accurate entity recognition for cancer grades, subtypes, and lesion positions compared to long short-term memory (LSTM) models.
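The WordPiece fragmentation issue noted above can be reproduced with a greedy longest-match-first tokeniser over two toy vocabularies. Both vocabularies are invented for illustration; PathologyBERT's actual vocabulary is far larger:

```python
def wordpiece_tokenise(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first WordPiece tokenisation of a single word.
    Continuation pieces are prefixed with '##', as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known sub-piece covers this position
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies: a general-domain one that fragments 'carcinoma',
# and a pathology-aware one that keeps the term whole.
GENERAL_VOCAB = {"car", "##cin", "##oma"}
PATHOLOGY_VOCAB = {"carcinoma"}
```

With the general vocabulary, 'carcinoma' fragments into ['car', '##cin', '##oma'], losing the term's semantic unity; with the pathology vocabulary it survives as a single token, which is the motivation for domain-specific tokenisation.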

Lu et al.26 analysed two key datasets:

  1. A dataset of 93,039 reports with CPT codes.
  2. A larger, general dataset of 749,136 reports, used to examine various classification tasks.

Their analysis revealed differences in entity complexity across categories, with position-related entities showing high variability while cancer grades had more constrained vocabularies. They also found that medical domain pretraining did not consistently improve performance.


Question answering LLM approach

Question answering (QA) frames information extraction as a reading comprehension task, where models must locate and extract relevant information from input text to answer specific questions. While NER outputs are limited to entity-type pairs, QA can generate free-form answers that combine multiple pieces of information, reason across paragraphs, and handle implicit relationships between concepts. Unlike NER, QA therefore performs a degree of information synthesis.

Alongside BERT-backbone QA models, the field has also explored using generative LLMs for NER by reformulating NER as a text generation task27, 28. This approach includes input sentences in pre-defined prompt templates, treating entity annotation as a fill-in-the-blank (text generation) problem as opposed to traditional sequence labelling.
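A minimal sketch of that prompt-template idea, assuming a line-per-entity output format. The template wording is illustrative; the cited papers each use their own prompt designs and output conventions:

```python
# Illustrative prompt template for generative NER; the exact wording and
# output format differ between published approaches such as GPT-NER.
TEMPLATE = (
    "Extract all entities of type '{entity_type}' from the report below.\n"
    "Return one entity per line, or 'none' if there are no matches.\n\n"
    "Report: {report}\n"
    "Entities:"
)

def build_ner_prompt(report: str, entity_type: str) -> str:
    """Fill the template with a report and a target entity type."""
    return TEMPLATE.format(entity_type=entity_type, report=report)

def parse_response(response: str) -> list[str]:
    """Turn the model's line-per-entity reply back into a list of strings."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return [] if lines == ["none"] else lines
```

The prompt string would be sent to any generative model; the parser then recovers a structured entity list from free text, which is where this approach starts to blur into QA.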

More recent methods like NuNER29 and NuExtract30 have extended these ideas by combining BERT-based models with data generated from GPT-style models and by fine-tuning generative models for structured output extraction, respectively. These prompt-based NER models occupy the hazy space between NER and QA - more precisely, between extractive QA (where answers must appear in the text) and open generative QA (where answers can be generated based on the text). QA is useful because it can capture document-level context, but it is not without flaws: it is more computationally expensive and, when answers are generated rather than extracted, prone to making things up31.

15 Neumann, M., King, D., Beltagy, I. & Ammar, W. Scispacy: fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).

16 Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).

17 Alsentzer, E. et al. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019).

18 Kraljevic, Z. et al. Multi-domain clinical natural language processing with medcat: the medical concept annotation toolkit. Artificial intelligence in medicine 117, 102083 (2021).

19 Zaratiana, U., Tomeh, N., Holat, P. & Charnois, T. Gliner: Generalist model for named entity recognition using bidirectional transformer. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 5364–5376 (2024).

20 Eyre, H. et al. Launching into clinical space with medspacy: a new clinical text processing toolkit in python. In AMIA Annual Symposium Proceedings, vol. 2021, 438 (2022).

21 Kraljevic, Z. et al. Multi-domain clinical natural language processing with medcat: the medical concept annotation toolkit. Artificial intelligence in medicine 117, 102083 (2021).

22 Santos, T. et al. Pathologybert-pre-trained vs. a new transformer language model for pathology domain. In AMIA annual symposium proceedings, vol. 2022, 962 (American Medical Informatics Association, 2022).

23 Peng, Y., Chen, Q. & Lu, Z. An empirical study of multi-task learning on bert for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, 205–214 (2020).

24 Mu, Y. et al. A bert model generates diagnostically relevant semantic embeddings from pathology synopses with active learning. Communications medicine 1, 11 (2021).

25 Zeng, K. G. et al. Improving information extraction from pathology reports using named entity recognition. Research Square (2023).

26 Lu, Y. et al. Assessing the impact of pretraining domain relevance on large language models across various pathology reporting tasks. medRxiv 2023–09 (2023).

27 Wang, S. et al. Gpt-ner: Named entity recognition via large language models. arXiv preprint arXiv:2304.10428 (2023).

28 Ashok, D. & Lipton, Z. C. Promptner: Prompting for named entity recognition. arXiv preprint arXiv:2305.15444 (2023).

29 Bogdanov, S., Constantin, A., Bernard, T., Crabbé, B. & Bernard, E. Nuner: Entity recognition encoder pre-training via llm-annotated data. arXiv preprint arXiv:2402.15343 (2024).

30 NuMind. Nuextract: A foundation model for structured extraction. https://numind.com/nuextract (2024). Accessed: 05-12-2024.

31 Hicks, M. T., Humphries, J. & Slater, J. Chatgpt is bullshit. Ethics and Information Technology 26, 38 (2024).


Last edited: 20 May 2025 4:23 pm