AI in medical imaging: Why enough quality data trumps good models

When you’re a data scientist, conferences are a great time to reload on state-of-the-art knowledge – and random goodies. The latest one for me and the team was the hyper-targeted MIDL, which only counts a few hundred attendees, and the previous editions left us inspired to make our product even better.

However, I was underwhelmed.

Most new research presented small improvements to approaches that were already known. After speaking to other attendees, I quickly realised I was far from the only one feeling this way. Interestingly, research on traditional computer vision tasks like detection and segmentation was particularly affected. It felt disconnected from clinical practice.

Why was that? All answers seemed to lead to one thing: quality data.

Academia is data starved

Many papers shared one simple problem: they used (very) small datasets. This problem is widespread in the biomedical data science community, because medical specialists who can expertly annotate health data are expensive. It is troubling: data from actual hospital patients contains a lot of variation that very small datasets cannot capture, so model performance does not translate to real-world clinical practice.

This problem is well known. Initiatives like the Open Health Care by NHS England attempt to resolve this by making huge amounts of data available to everyone. I can only applaud such ambitions, but they solve only half of the problem. At this very moment, there are already huge publicly available raw datasets like the NLST for CT Chest scans (Level C on the scale proposed by Hugh Harvey).

The real issue is getting annotated data.

Data matters

To create annotations for medical datasets, researchers need money, for two reasons: medical specialists are expensive to hire, and annotation software costs money to build. Academia does not have the funds to spend on this, but companies do.

Good and enough data trumps good models.

This common piece of wisdom is something all data scientists experience every day, and so do we at Aidence. For our latest model improvement, we tried a broad range of different approaches, including the latest research from MIDL and ICML. Yet, the performance increase we achieved was minimal. In the end, we chose the more costly option of getting radiologists to annotate more data. The extra annotated scans were much more effective than the latest research approaches.

Let there be data

This leaves us in a paradoxical situation. We need academia to find new approaches, but findings cannot be evaluated for generalisability due to a lack of data. Companies, on the other hand, do have access to annotated data, but they are occupied with building a safe and certified product around the model and integrating it into the hospitals. Neither of these parties is going to solve this on their own.

It is unlikely that academia will free up funding to create publicly available annotated datasets. While some datasets have been published, more are needed. At the same time, correct annotations on specific datasets are a huge competitive advantage for AI startups. For that reason, Aidence and its peers won’t make their datasets available, since doing so would create their own competition.

It’s necessary for a third party to step in.

I see this as a perfect opportunity for governments to stimulate the development of AI software. Governments are neutral in the market and are eager to provide funding for AI systems. Data needs to come from a large pool of patients to ensure enough variation, and hospital networks cannot pull this off on their own.

To overcome this, governments can act as a mediator between hospitals and collect anonymised data from their different PACS installations. The focus should be on creating annotated datasets rather than on building a unified infrastructure or ensuring interoperability, because that would only slow down the process. Researchers and professionals who want to use the annotated data are more than capable of pulling data from different sources and doing data cleanup.

To conclude, governments should fund the annotation of anonymised data by medical specialists and make the resulting data publicly available to everyone.


Notable research presented during the conference:

  • Lossau, T., et al. “Dynamic Pacemaker Artifact Removal (DyPAR) from CT Data using CNNs.” (2018). Download the academic paper; watch the presentation at MIDL.
  • Souza, R., et al. “A Hybrid, Dual Domain, Cascade of Convolutional Neural Networks for Magnetic Resonance Image Reconstruction.” (2018). Download the academic paper; watch the presentation at MIDL.

Two big annotated datasets were released by the National Institutes of Health (NIH) under Ronald Summers.

About Kay

Kay Lamerigts was Machine Learning Engineer at Aidence. After finishing his bachelor’s in computer science, Kay found his passion during his master’s in artificial intelligence at Utrecht University (the Netherlands). Specialising in deep learning, he looked for companies that were using AI where he could do research for his master’s thesis. After successfully completing his degree, he chose to stay with Aidence as a machine learning engineer.
