Parsing semi-structured data to structured

In this project, semi-structured data refers to a single string containing multiple entities related to different components or attributes. Parsing such data has been studied extensively, from rule-based linguistic methods (1950s) to neural networks (2000s) and, more recently, large language models (LLMs).

Common natural language processing (NLP) pre-processing steps include:

tokenization - splitting text into units (words, phrases, or sub-words)
stemming/lemmatization - normalising words to their root forms
stop word analysis - filtering out common words that add little semantic value
dependency parsing and part-of-speech (POS) tagging - building a parse tree of the grammatical relations between words and labelling each word with its word class (noun, verb, adjective, and so on)
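
As a rough illustration of the first three steps above, the following is a minimal sketch using only the Python standard library. The stop-word list, the suffix rules, and the function names (`tokenize`, `naive_stem`, `preprocess`) are assumptions made for this example and are not taken from the project; a real pipeline would more likely rely on an established NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list would be much larger.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip a couple of common English suffixes.

    This is a crude stand-in for a real stemmer or lemmatizer and only
    serves to show where normalisation sits in the pipeline.
    """
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then normalise the remaining tokens."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in kept]


print(preprocess("The sizes of lesions in the liver"))
# → ['size', 'lesion', 'liver']
```

Even this toy version shows the trade-off: the output is more uniform than the input, but stop words and word endings that may have carried meaning are gone.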

These steps fall under language-based intrinsic terminology matching, distinct from the string-based and semantic approaches discussed later. Although these methods reduce text variation, they must be applied carefully to avoid losing critical information. Acronyms should also be handled at this stage, usually by lookup in a reference list. This is typically done after normalisation and stop word processing, because an acronym can be treated as an already standardised token.
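
One possible reading of the acronym step is a simple lookup applied to tokens that have already been normalised. The sketch below assumes this interpretation; the reference list `ACRONYMS` and the function `expand_acronyms` are hypothetical and exist only for illustration.

```python
# Hypothetical reference list mapping normalised acronyms to their expansions.
ACRONYMS = {
    "mri": "magnetic resonance imaging",
    "ct": "computed tomography",
}


def expand_acronyms(tokens, reference=ACRONYMS):
    """Replace each token found in the reference list; leave others unchanged."""
    return [reference.get(token, token) for token in tokens]


# Tokens are assumed to be lowercased and stop-word filtered already.
print(expand_acronyms(["mri", "scan", "liver"]))
# → ['magnetic resonance imaging', 'scan', 'liver']
```

Because the lookup is an exact match on whole tokens, running it after normalisation avoids having to list every casing or punctuation variant of each acronym.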

For our mapping task, POS tagging helps identify attributes (such as 'property' or 'site'). Synonym recognition also plays a role, potentially aided by rule-based systems that map recurring phrases (such as 'per mm' to 'property').
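
A rule-based mapping of recurring phrases to attributes could look something like the sketch below. Only the 'per mm' to 'property' rule comes from the text; the `RULES` structure and the `label_attributes` function are assumptions made for this example, not the project's implementation.

```python
import re

# Each rule pairs a phrase pattern with the attribute it signals.
# Only the 'per mm' -> 'property' rule is taken from the text above.
RULES = [
    (re.compile(r"\bper\s+mm\b", re.IGNORECASE), "property"),
]


def label_attributes(text, rules=RULES):
    """Return (matched phrase, attribute) pairs for every rule that fires."""
    found = []
    for pattern, attribute in rules:
        for match in pattern.finditer(text):
            found.append((match.group(0), attribute))
    return found


print(label_attributes("5 mitoses per mm in the left lobe"))
# → [('per mm', 'property')]
```

Keeping the rules in a plain list makes it straightforward to add further phrase patterns as recurring wording is discovered in the data.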

Last edited: 20 May 2025 4:14 pm
