Does this product perform in line with the manufacturer’s claims?
As part of your procurement exercise, you’ll need to scrutinise the performance claims made by the manufacturer about the product.
Metrics should be aligned to intended use
The performance expected of a supervised learning model will vary in line with its intended use.
For example, in the case of classification models used for diagnosis, there’s an important trade-off between:
- sensitivity - the proportion of actual positive cases correctly identified
- specificity - the proportion of actual negative cases correctly identified
The trade-off depends on weighing up the healthcare consequences and health economic implications of missing a diagnosis versus over-diagnosing. This trade-off may vary at different stages of a care pathway.
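To make these two metrics concrete, the following sketch computes sensitivity and specificity from a set of binary test results. It is an illustration only: the labels and predictions are made up, not drawn from any real product.

```python
# Sensitivity and specificity from binary predictions.
# Labels and predictions are synthetic, for illustration only.
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])  # actual outcomes
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 1])  # model's calls

tp = np.sum((y_pred == 1) & (y_true == 1))   # positives correctly found
tn = np.sum((y_pred == 0) & (y_true == 0))   # negatives correctly found
fp = np.sum((y_pred == 1) & (y_true == 0))   # over-diagnoses
fn = np.sum((y_pred == 0) & (y_true == 1))   # missed diagnoses

print(f"sensitivity = {tp / (tp + fn):.2f}")  # 0.80 here
print(f"specificity = {tn / (tn + fp):.2f}")  # 0.80 here
```

In this toy example, one missed diagnosis lowers sensitivity and one over-diagnosis lowers specificity; which of the two matters more depends on the clinical consequences in your pathway.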
Different metrics shed light on different aspects of model performance, and all have limitations; some will be more appropriate than others for a given use case.
Performance of classification models
Classification models often provide a probability between 0 and 1 that a case is positive, as opposed to a binary result. Discrete classification into positive and negative is obtained by setting a threshold on the probability. If the value exceeds the threshold, the model classifies the case as positive. If the value does not exceed the threshold, the case is classified as negative.
You should pay attention to the chosen threshold, particularly as performance metrics will change according to where the threshold has been set. Because of this dependence on the threshold, the Area Under the Curve (AUC) of the receiver operating characteristic is a helpful measure: a single metric that summarises model performance across all possible thresholds, rather than at one chosen threshold.
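As a minimal sketch, assuming scikit-learn is available, the example below shows the same model’s sensitivity and specificity shifting as the threshold moves, and the single AUC figure that summarises performance across thresholds; all numbers are synthetic.

```python
# How the sensitivity/specificity balance moves with the threshold,
# and the threshold-independent AUC. Assumes scikit-learn is installed;
# labels and probabilities are synthetic examples.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.6, 0.4, 0.3, 0.2, 0.1,
                   0.8, 0.55, 0.05, 0.35])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    sens = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    spec = np.sum((y_pred == 0) & (y_true == 0)) / np.sum(y_true == 0)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")

# AUC: the probability the model ranks a random positive case above a
# random negative case -- the same whichever threshold is later chosen.
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}")   # 0.92 here
```

An AUC of 1.0 indicates perfect separation of positive and negative cases, while 0.5 is no better than chance.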
Performance of regression models
Whilst classification models predict discrete quantities (classes), regression models predict continuous quantities, such as how many residential care beds will be needed for patients discharged from hospital next week.
Performance metrics for regression models are affected differently by the presence of outliers in the data. You should consider how much of an issue outliers are for your use case, and then prioritise metrics accordingly.
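For example, mean absolute error (MAE) and root mean squared error (RMSE) respond very differently to a single extreme week; the bed-demand figures below are invented for illustration.

```python
# MAE versus RMSE in the presence of an outlier. The bed-demand
# figures are invented, for illustration only.
import numpy as np

actual    = np.array([12, 15, 11, 14, 13, 40])  # final week is an outlier
predicted = np.array([13, 14, 12, 15, 12, 16])  # model misses the spike

errors = actual - predicted
mae  = np.mean(np.abs(errors))          # treats all errors equally
rmse = np.sqrt(np.mean(errors ** 2))    # squares errors, so the single
                                        # outlier dominates the score
print(f"MAE  = {mae:.1f}")   # about 4.8
print(f"RMSE = {rmse:.1f}")  # about 9.8
```

If occasional extreme values matter clinically, an outlier-sensitive metric such as RMSE may be appropriate; if they are largely noise, MAE gives a steadier picture.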
Validating the model’s performance
You should request details of validation tests that have been performed, and should expect to see a form of retrospective validation. This entails testing the model on previously collected data that the model has not seen. The purpose of this is to test whether the model can generalise: that is, whether it can carry across its predictive performance from the data it was trained on to new data.
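A minimal sketch of such a retrospective hold-out test, assuming scikit-learn and substituting a synthetic dataset for previously collected patient data:

```python
# Retrospective validation: hold back data the model never sees during
# training, then measure performance on it. Assumes scikit-learn; the
# dataset is synthetic, standing in for previously collected data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)

# Split once: the held-out set plays the role of unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Generalisation check: performance on data the model has not seen.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC = {auc:.2f}")
```

A large gap between performance on the training data and on the held-out data is a warning sign that the model may not generalise.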
Ensuring the model is safe
Investigating the model’s safety credentials is key - you need to be confident that the model is:
- robust
- fair (one simple check is sketched after this list)
- explainable
- resilient against attempts to compromise an individual’s privacy
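One common way to begin probing the ‘fair’ criterion is to compare a headline metric, such as sensitivity, across patient subgroups. The sketch below uses entirely synthetic labels, predictions and group assignments.

```python
# Probing fairness by comparing sensitivity across subgroups.
# Labels, predictions and group assignments are synthetic assumptions.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

for g in np.unique(group):
    mask = group == g
    tp = np.sum((y_pred[mask] == 1) & (y_true[mask] == 1))
    fn = np.sum((y_pred[mask] == 0) & (y_true[mask] == 1))
    # A large gap in sensitivity between groups is a warning sign.
    print(f"group {g}: sensitivity = {tp / (tp + fn):.2f}")
```

A gap between groups does not prove unfairness on its own, but it is a clear prompt for further questions to the manufacturer.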
Assessing comparative performance
For any reported performance to be meaningful, it must be compared to the current state of play. For example, how does model performance compare with the product it will replace, or with the human decision-making it will augment?
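As an illustration, the sketch below scores a candidate model and the current decision-making process on the same held-out cases; all figures are synthetic assumptions, not real results.

```python
# Comparative check: score the candidate model and the current decision
# process against the same ground truth. All figures are synthetic.
import numpy as np

y_true       = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
model_pred   = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1])  # candidate product
current_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])  # existing process

def sensitivity_specificity(y, pred):
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return tp / (tp + fn), tn / (tn + fp)

for name, pred in [("model", model_pred), ("current", current_pred)]:
    sens, spec = sensitivity_specificity(y_true, pred)
    print(f"{name:8s} sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Scoring both on identical cases is what makes the comparison meaningful: a headline figure quoted without a baseline tells you little about the improvement the product would actually deliver.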