AI in healthcare: why enough quality data trumps good models

When you’re a data scientist, conferences are a great time to reload on state-of-the-art knowledge – and random goodies.

The latest one for me and the team was the hyper-targeted MIDL, which draws only a few hundred attendees. Previous editions had left us inspired to make our product even better.

However, I was underwhelmed.

Most new research presented small improvements to approaches that were already known. After speaking to other attendees, I quickly realised I was far from the only one feeling this way. Interestingly, research on traditional computer vision tasks like detection and segmentation was particularly affected. It felt disconnected from clinical practice.

Some notable research presented during MIDL

  • Lossau, T., et al. “Dynamic Pacemaker Artifact Removal (DyPAR) from CT Data using CNNs.” (2018).

Download the academic paper; watch the presentation at MIDL.

  • Souza, R., et al. “A Hybrid, Dual Domain, Cascade of Convolutional Neural Networks for Magnetic Resonance Image Reconstruction.” (2018).

Download the academic paper; watch the presentation at MIDL.

Academia is data starved

All these papers share one simple problem: they use (very) small datasets. This problem is widespread in the biomedical data science community, as medical specialists who can expertly annotate health data are expensive. It is troubling because data from actual hospital patients contains lots of variation that cannot be captured in really small datasets, so model performance does not translate to real-world clinical practice.

This problem is well known. Initiatives like the Open Health Care by NHS England attempt to resolve this by making huge amounts of data available to everyone. I can only applaud such ambitions, but they solve only half of the problem. At this very moment there are already huge publicly available raw datasets like the NLST for CT Chest scans (Level C on the scale proposed by Hugh Harvey).

The real issue is getting annotated data.

Data matters

To create annotations for medical datasets, researchers need money. The reason is twofold: medical specialists are expensive to hire, and annotation software costs money to build. Academia does not have the funds to spend on this, but companies do.

Good and enough data trumps good models. 

This common piece of wisdom is something all data scientists experience every day, and we at Aidence are no exception. For our latest model improvement, we tried a broad range of approaches, including the latest research from MIDL and ICML. Yet the performance increase we achieved was minimal. In the end, we chose the more costly option of getting radiologists to annotate more data. The extra annotated scans were much more effective than the latest research approaches.

Let there be data

This leaves us in a paradoxical situation. We need academia to find new approaches, but its findings cannot be evaluated for generalisability due to a lack of data. Companies, on the other hand, do have access to annotated data, but they are occupied with building a safe and certified product around the model and integrating it into hospitals.

Neither of these parties is going to solve this on their own.

On the one hand, it is unlikely that academia will free up funding to create publicly available annotated datasets. While some datasets have been published, more are needed. On the other hand, correct annotations on specific datasets are a huge competitive advantage for AI startups. For that reason, Aidence and its peers won’t make their datasets available, to avoid creating their own competition.

A third party needs to step in.

I see this as a perfect opportunity for governments to stimulate the development of AI software. Governments are neutral in the market and are eager to provide funding for AI systems. Data needs to come from a large pool of patients to ensure enough variation, and hospital networks cannot pull this off on their own.

To overcome this, governments can act as a mediator between hospitals and collect anonymised data from different PACS systems. The focus should be on creating annotated datasets rather than on building a unified infrastructure or ensuring interoperability, because that would only slow down the process. Researchers and professionals who want to use the annotated data are more than capable of pulling data from different sources and cleaning it up.

To conclude: governments should fund the annotation of anonymised data by medical specialists and make that data publicly available to everyone.

The example to follow

Two big annotated datasets were released by the National Institutes of Health (NIH) under Ronald Summers. Check out these useful links:

About Kay

Kay Lamerigts

Kay Lamerigts is Data Scientist at Aidence
