Bias in medical imaging AI: Checkpoints and mitigation

This year, the medical imaging AI industry was shaken by research showing that an AI system could identify patients’ racial identity even if not trained to do so. Despite rigorous testing, the authors could not figure out how AI could ‘learn’ without being ‘taught’.

The media regularly mention healthcare AI and bias in the same sentence, often concerning race or gender. Regulators are also acknowledging the issue: proposals such as the recent EU AI Act include requirements to address possible biases in algorithm development.

Although physicians are increasingly using AI-based tools in their workflow, the influence of technology on their diagnostic or treatment decisions is understudied. Nonetheless, we clearly don’t want the results of AI medical imaging solutions to be unreliable or have unintended effects for some patient populations.

In this article, we – a clinical expert and a data scientist – reflect on how bias can be introduced in our algorithms and propose methods to mitigate it.

Understanding bias

We commonly define bias as a form of prejudice against a person or group, possibly leading to unfair treatment. In artificial intelligence, bias is an overloaded term used to describe several issues. It is important to remember that:

A biased dataset contains a disproportionately large quantity of one type of data;
A biased algorithm consistently underperforms for a specific category of data.
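
Both definitions can be checked with simple counts. The sketch below is a minimal illustration, not a validated fairness audit: the records, group names, and the choice of sensitivity as the performance measure are our own assumptions for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (group, true label, model prediction); 1 = nodule present.
records = [
    ("male", 1, 1), ("male", 1, 1), ("male", 1, 1), ("male", 0, 0),
    ("male", 1, 1), ("male", 0, 0), ("female", 1, 0), ("female", 1, 1),
]

def group_shares(rows):
    """Fraction of the dataset belonging to each group (dataset bias check)."""
    counts = Counter(group for group, _, _ in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def sensitivity_by_group(rows):
    """True-positive rate per group (algorithmic bias check)."""
    hits, positives = defaultdict(int), defaultdict(int)
    for group, label, prediction in rows:
        if label == 1:
            positives[group] += 1
            hits[group] += int(prediction == 1)
    return {group: hits[group] / positives[group] for group in positives}

shares = group_shares(records)
sensitivity = sensitivity_by_group(records)
print(shares)
print(sensitivity)
```

In this toy set, one group dominates the data and the model also detects fewer true nodules in the smaller group. The two often go together, but they are measured separately.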

A famous example of algorithmic bias is Amazon’s AI-based recruiting tool, which made headlines in 2018 when it was scrapped for being sexist. Amazon had trained the algorithm to rate applicants based on previous selections. As is not uncommon in the tech industry, men had been hired more frequently than women. So, the model found a (false) correlation between employability and gender. As a result, when presented with a female applicant, it would lower her score.

We’ll probably never create perfect algorithms. Still, we know (some of) the different ways in which bias can creep into an AI model. Moreover, research offers means to mitigate that bias. Our approach is to start matching the two: places where bias manifests and mitigation methods.

To set the stage, we’ll first outline the main steps in the AI development process.

AI development basics

Development starts with understanding clinical needs or problems. If we can address these with AI medical solutions, we follow the steps below to build them and put them in the hands of physicians. We’ll zoom in on the data, annotations, modelling, and clinical practice for this article.

Data, in our case, consists of large sets of chest CT scans. Scans are collected, and then annotated by radiologists, who mark their findings, for instance, by outlining a lung nodule.

The largest portion of the data is used to train the AI algorithm. For lung nodule detection, as an example, training requires repeating the process below – tens of thousands of times:

Provide a chest CT as input and ask the model to outline any lung nodules;
Compare the model’s output with the radiologist’s annotation;
Adjust the model parameters to decrease the difference between the two slightly.

Gradually, the model’s prediction gets closer to the truth. The model ‘learns’ to recognise lung nodules, as a trainee would after seeing a fellow radiologist review many scans with nodules.
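
The three-step loop can be sketched in a few lines. This is a toy stand-in under loud assumptions: a two-parameter linear model replaces the nodule detector, plain numbers replace CT scans and annotations, and the learning rate and number of passes are arbitrary values chosen for the example.

```python
# Toy stand-in for the three training steps: a linear model y = w*x + b
# replaces the nodule detector, and numbers replace CT scans and annotations.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
annotations = [1.0, 3.0, 5.0, 7.0, 9.0]  # hypothetical "ground truth": y = 2x + 1

w, b = 0.0, 0.0        # model parameters, starting from scratch
learning_rate = 0.02   # hypothetical step size

def mean_squared_error(w, b):
    """Average squared difference between model output and annotation."""
    return sum((w * x + b - y) ** 2 for x, y in zip(inputs, annotations)) / len(inputs)

initial_error = mean_squared_error(w, b)

for _ in range(1000):                        # repeat many times
    for x, y in zip(inputs, annotations):
        prediction = w * x + b               # 1. ask the model for its output
        difference = prediction - y          # 2. compare with the annotation
        w -= learning_rate * difference * x  # 3. adjust the parameters slightly
        b -= learning_rate * difference

final_error = mean_squared_error(w, b)
print(round(w, 3), round(b, 3), final_error < initial_error)
```

A real detector has millions of parameters and a far more elaborate comparison step, but the rhythm of predict, compare, and adjust is the same.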

A subset of the data, held back from training, serves for internal validation, where we assess the model’s performance on a real-world population. A whole new dataset is used for the external, independent clinical evaluation.

There are several moments in this process when bias can manifest. We’ll go through them by following the route below, but we’re aware that our overview may not be exhaustive.

(Note that the data gathering, labelling, and modelling cycle restarts when updating the model after deployment.)

The bias in the data

Data is top of the list when it comes to ways of introducing bias in algorithms. This may happen during training, internal validation, testing, or when re-training the model to improve or update it. To explain why and what we can do about it, we first define data diversity and share an example of data-induced bias.

Diversity and balance

What does a diverse dataset of medical images look like? There are different ways to break down diversity; for the purpose of this piece, we propose three levels:

Individual: considering different biological factors, such as age, sex, race, and comorbidities.
Population: reflecting diverse disease prevalence, access to healthcare, and culture (including religion). It is particularly relevant to verify that datasets are representative of historically marginalised groups.
Technical: containing images originating from different scanner types and vendors, using varied acquisition or reconstruction parameters.

Once we acquire diverse data, it is also essential to consider balance, meaning the even distribution of the relevant features across the dataset.
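
One rough way to put a number on balance is to compare the smallest and largest category for each metadata field. The sketch below assumes hypothetical field names (`sex`, `age_band`, `vendor`) and made-up scans; the ratio itself is just one simple heuristic among many.

```python
from collections import Counter

# Hypothetical scan metadata; field names and values are illustrative only.
scans = [
    {"sex": "F", "age_band": "50-59", "vendor": "A"},
    {"sex": "M", "age_band": "60-69", "vendor": "A"},
    {"sex": "M", "age_band": "60-69", "vendor": "A"},
    {"sex": "M", "age_band": "70-79", "vendor": "B"},
    {"sex": "F", "age_band": "60-69", "vendor": "A"},
    {"sex": "M", "age_band": "60-69", "vendor": "A"},
]

def balance_ratio(rows, field):
    """Smallest category count divided by the largest: 1.0 means perfectly even."""
    counts = Counter(row[field] for row in rows)
    return min(counts.values()) / max(counts.values())

report = {field: balance_ratio(scans, field) for field in ("sex", "age_band", "vendor")}
for field, ratio in report.items():
    print(f"{field}: {ratio:.2f}")
```

A low ratio points to a field worth rebalancing. Note that the ratio cannot see categories that are missing from the dataset altogether, which is a diversity problem rather than a balance problem.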

An example of data-induced bias

This research provides a notable example of how gender imbalance in a medical imaging dataset influences AI performance. When training a model on a male-majority dataset, the performance on a female-majority test set was poor, and the other way around. Yet training the model on a balanced dataset did not result in worse performance when testing on a female-majority test set. The findings suggest that a balanced training set yields a more generalisable algorithm.

The access issue

Obtaining raw medical data that is diverse and balanced is a major challenge. We often rely on publicly available datasets – (anonymised) medical images that patients have agreed to make available for research or product development. Unfortunately, these datasets are not always representative of different demographics.

Public data often originates from centralised clinical trials, typically in one geographical area and one or more institutions. The issue of underrepresented groups in clinical trials is known: a study of a decade of oncology trials in the US found gaps in race reporting; disparities in lung cancer trials are also the topic of this insightful podcast, all the more relevant during Lung Cancer Awareness Month.

Within the EU, we face an additional challenge: the limited availability of large, curated training datasets. Our colleague Ricardo Roovers recently wrote an article in which he explained how EU data protection and privacy regulations restrict the availability of datasets.

A confidence-based system

For commercial companies to be able to build medical devices that advance healthcare, we must address data access. Dr Mackenzie Graham makes a strong case for a data-sharing system based on confidence (rather than on trust) and characterised by:

Transparency, allowing individuals to judge whether companies are using their data in a way that is consistent with the system’s rules;
Accountability, through laws and regulations;
Representation, requiring regulators, politicians, and policymakers to make the public aware of the importance of their health data in developing health technologies while addressing possible concerns.

It’s evident that one party can’t check all the boxes alone. Getting to a confidence-based data-sharing system requires a collective effort, and we are ready to join it.

Annotations are only human

Medical imaging AI “learns” from radiologists through their annotations. There may be (sometimes substantial) variability between experts who evaluate medical images, resulting in biased labels and segmentations.
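
Variability between readers can be quantified before it reaches the model. A common overlap measure for segmentations is the Dice coefficient; the sketch below applies it to two made-up 4x4 outlines and is an illustration only, not a description of any specific annotation tool.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat lists of 0/1."""
    overlap = sum(a and b for a, b in zip(mask_a, mask_b))
    size = sum(mask_a) + sum(mask_b)
    return 1.0 if size == 0 else 2 * overlap / size

# Hypothetical 4x4 nodule outlines from two radiologists, flattened row by row.
reader_1 = [0, 0, 0, 0,
            0, 1, 1, 0,
            0, 1, 1, 0,
            0, 0, 0, 0]
reader_2 = [0, 0, 0, 0,
            0, 1, 1, 1,
            0, 1, 1, 1,
            0, 0, 0, 0]

agreement = dice(reader_1, reader_2)
print(round(agreement, 2))
```

A score of 1.0 means the two outlines coincide; lower values indicate that the readers disagree on where the nodule ends, which is exactly the variability that can end up in the labels.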

At Aidence, we facilitate annotations through an in-house tool for physicians to mark their findings in a time-efficient manner. During the OpenRegulatory Conference, Astrid Ottosen presented a quality control plan for an annotation task, similar to how we organise annotations at Aidence. Its phases are an initial evaluation of the annotator, onboarding or training, and monitoring performance over time.

Machine learning presents further options to help mitigate bias, such as:

Semi-automated labels

The output of a trained model with acceptable performance (perhaps not for clinical use, but good enough for development) can serve as a possible annotation. The task of the radiologist is to confirm or correct this output. For example, a lung nodule detection model with good results would be used to outline nodules on raw scans before the radiologist reviews them. This diminishes the bias introduced by the annotators.

Annotation sessions using active learning

The algorithm selects the examples for which it needs annotations. The radiologist then looks at those examples and annotates only the difficult ones.
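
There are several ways for the algorithm to choose its examples. The sketch below assumes one common strategy, uncertainty sampling, in which the scans with the least decisive predictions are queued first. The scan IDs, probabilities, and annotation budget are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy of a predicted nodule probability p (0 = certain, 1 = maximally unsure)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_annotation(predictions, budget):
    """Return the scan IDs the model is least sure about, most uncertain first."""
    ranked = sorted(predictions, key=lambda scan_id: entropy(predictions[scan_id]), reverse=True)
    return ranked[:budget]

# Hypothetical model outputs: scan ID -> predicted probability of a nodule.
predictions = {"scan_01": 0.97, "scan_02": 0.52, "scan_03": 0.08, "scan_04": 0.35}
queue = select_for_annotation(predictions, budget=2)
print(queue)
```

Scans on which the model is already confident stay out of the queue, so the radiologist's time goes to the cases that are most informative for the next training round.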

More data

More data makes it more likely that under-represented categories will make it to the training set.

Optimal balance in modelling

Here is where we’ll need to dive a bit deeper into data science.

Bias and variance

To discuss bias in modelling, we must understand it in relation to variance. High bias is an oversimplification of reality; think back to the recruitment algorithm that found a correlation between gender and employability. Variance, on the other hand, is an over-sensitivity of the model to specific features. For a biased model, big changes in the input result in moderate changes in the output. But with high variance, a slight difference in input has a significant impact on the output.
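
A small simulation can make the contrast tangible. The sketch below uses the textbook statistical reading of the two terms: we re-draw a noisy training set many times and measure how far a model's average prediction is from the truth (bias) and how much its prediction jumps between draws (variance). The quadratic "truth", the noise level, and both toy models are assumptions made for the example.

```python
import random
import statistics

random.seed(0)

def truth(x):
    return x ** 2  # hypothetical "real" relationship

xs = [i / 10 for i in range(11)]
x_test, noise_sd, trials = 0.9, 0.3, 500

def mean_model(train):
    """High bias: ignores the input and always predicts the average label."""
    average = statistics.fmean(y for _, y in train)
    return lambda x: average

def nearest_neighbour_model(train):
    """High variance: copies the label of the single closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def bias_and_variance(build_model):
    """Estimate squared bias and variance at x_test over many re-drawn training sets."""
    predictions = []
    for _ in range(trials):
        train = [(x, truth(x) + random.gauss(0, noise_sd)) for x in xs]
        predictions.append(build_model(train)(x_test))
    bias_squared = (statistics.fmean(predictions) - truth(x_test)) ** 2
    return bias_squared, statistics.pvariance(predictions)

mean_bias, mean_var = bias_and_variance(mean_model)
nn_bias, nn_var = bias_and_variance(nearest_neighbour_model)
print(f"mean model: bias^2={mean_bias:.3f}, variance={mean_var:.3f}")
print(f"1-NN model: bias^2={nn_bias:.3f}, variance={nn_var:.3f}")
```

The averaging model is stable but systematically wrong, while the nearest-neighbour model is right on average but swings with every noisy label. Practical models sit somewhere between these two extremes.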

Here is an example of a Google model trained to classify photos of leaves as healthy or unhealthy. It is built on a diverse and balanced dataset with high-quality annotations, and it performs well. However, when the authors make an imperceptible change to an image of a healthy leaf (i.e., changing pixel values without compromising the image structure), the model changes its result from healthy to unhealthy. This is high variance.

Bias and variance. Source: Machine Learning for Science Team, ml-learnings.org

The tradeoff between a model’s ability to minimise bias and variance requires data scientists to decide how to get an optimal balance. As these are human choices, they may (unintentionally) induce bias. (For a more detailed explanation of this tradeoff, we recommend this article.)

The bias-variance tradeoff. Source: National Center for Biotechnology Information – Fundamentals of Clinical Data Science

Mitigation techniques

Mitigating bias during algorithm development is still an open research area; there is no magic solution to all problems. Machine learning engineers must pay attention to the balance between bias and variance and actively tune their models. These are some possible techniques:

Increasing the importance of a set of medical images in the training dataset

If a dataset contains more CT scans of women than men, one option is to increase the importance attached to the male scans to counterbalance the dataset.
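
One simple way to do this is to weight each scan inversely to the size of its group, so that both groups carry the same total weight in the training objective. The sketch below is a minimal illustration with made-up group sizes and loss values; the function names are ours.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Per-sample weights so that every group contributes equally in total."""
    counts = Counter(groups)
    n_groups, n_samples = len(counts), len(groups)
    return [n_samples / (n_groups * counts[g]) for g in groups]

def weighted_loss(per_scan_losses, weights):
    """Weighted average of per-scan losses, as used in a weighted training objective."""
    return sum(w * l for w, l in zip(weights, per_scan_losses)) / sum(weights)

# Hypothetical training set: six scans of women, two of men.
sex = ["F", "F", "F", "F", "F", "F", "M", "M"]
weights = inverse_frequency_weights(sex)

# Hypothetical per-scan losses: the model does worse on the smaller group.
losses = [0.1] * 6 + [0.5] * 2
plain = sum(losses) / len(losses)
balanced = weighted_loss(losses, weights)
print(round(plain, 2), round(balanced, 2))
```

With the weights applied, the poorer performance on the male scans counts for more in the loss, so the training process has a stronger incentive to fix it.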

Causal modelling

In causal modelling, we look for causal relationships between features and outcomes. Simply put, we know lung cancer can be caused by smoking habits (among other factors), but not by belonging to a specific gender. So, we would only allow a lung nodule detection model to consider the relevant features as input (e.g., smoking habits) and leave the irrelevant ones out (e.g., gender).

It is, however, difficult to determine, and then extract from a CT scan, only those features that have a causal relationship with lung nodules.

A confidence score

If a model only saw very few lung scans with tuberculosis during training, it will not be very certain of its prediction when presented with a scan from an area where the disease is prevalent. So, together with the prediction, it could provide a score that reflects how close the scan is to its training data. A low confidence score would serve as a flag for both the clinician and the AI vendor, allowing them to reduce the risks of unreliable use and improve the model’s performance. (To be precise, this is a variance rather than a bias issue, and the approach of estimating a model’s uncertainty in this way is known as ‘Bayesian modelling’.)
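
As a rough illustration of such a score, the sketch below measures how far a new case lies from its nearest training example in a feature space. This is a deliberately simple distance-based stand-in, not Bayesian modelling, and the two-dimensional features, the scale, and the flagging threshold are all hypothetical.

```python
import math

def confidence_score(features, training_features, scale=1.0):
    """Score in (0, 1]: 1.0 when the case coincides with a training example,
    falling towards 0 as the distance to the nearest one grows."""
    nearest = min(math.dist(features, seen) for seen in training_features)
    return math.exp(-nearest / scale)

# Hypothetical 2-D feature vectors summarising training scans.
training_features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.3, 0.2)]

familiar_case = (0.18, 0.15)
unfamiliar_case = (2.0, 1.8)  # e.g. a disease pattern rarely seen in training

threshold = 0.5               # hypothetical flagging threshold
for name, case in [("familiar", familiar_case), ("unfamiliar", unfamiliar_case)]:
    score = confidence_score(case, training_features)
    flag = "review" if score < threshold else "ok"
    print(f"{name}: score={score:.2f} -> {flag}")
```

The prediction itself is untouched; the score only tells the user how much the case resembles what the model has seen before.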

AI in the real world

All of the above considered, the most thorough development process cannot guarantee a flawless algorithm. This is why we can’t emphasise the importance of post-market surveillance and performance monitoring enough. Biases we are not aware of or could not even predict may surface during clinical use. A recent example is the COVID-19 pandemic, which resulted in new manifestations of lung damage on chest CTs.
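
Performance monitoring can start very simply. The sketch below assumes a hypothetical setup in which each AI result is later marked as agreeing or disagreeing with the radiologist, and raises a flag when agreement over a rolling window drops below a baseline. The class name, window size, baseline, and tolerance are all illustrative choices.

```python
from collections import deque

class PerformanceMonitor:
    """Tracks agreement between AI output and radiologist over a rolling window."""

    def __init__(self, window, baseline, tolerance):
        self.results = deque(maxlen=window)  # oldest results drop out automatically
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, ai_agrees_with_radiologist):
        self.results.append(bool(ai_agrees_with_radiologist))

    def agreement_rate(self):
        return sum(self.results) / len(self.results) if self.results else None

    def needs_review(self):
        """True once a full window falls below baseline minus tolerance."""
        if len(self.results) < self.results.maxlen:
            return False
        return self.agreement_rate() < self.baseline - self.tolerance

monitor = PerformanceMonitor(window=10, baseline=0.9, tolerance=0.1)
for _ in range(10):
    monitor.record(True)            # a stable period
stable_flag = monitor.needs_review()

for _ in range(4):
    monitor.record(False)           # a shift, e.g. unfamiliar lung patterns
shifted_flag = monitor.needs_review()
print(stable_flag, shifted_flag, monitor.agreement_rate())
```

Running one such monitor per site, scanner type, or patient group would help surface problems that an overall average hides.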

Equally relevant to the safe and effective use of AI is investigating its impact on radiologists’ clinical decision-making; this is a topic we’ll soon explore as part of our AI award programme. In the UK, we are also working on a multi-centre clinical audit to assess the performance of our lung nodule algorithms in different geographies and trusts.

When it comes to mitigating AI bias, there’s a lot we can and should do. But, as an AI company, we are only one piece of the puzzle. To build state-of-the-art, safe and robust medical algorithms, we must work with data owners, regulators, policymakers, and others, and we are increasingly doing so. Tackling bias is crucial to ensuring that AI and other healthcare technologies serve all patients, regardless of race, gender, hospital, location, or other factors.