When you’re a data scientist, conferences are a great time to reload on state-of-the-art knowledge – and random goodies. The latest one for me and the team was the hyper-targeted MIDL, which only counts a few hundred attendees, and the previous editions left us inspired to make our product even better.
This time, however, I was underwhelmed.
Most new research presented small improvements to approaches that were already known. After speaking to other attendees, I quickly realised I was far from the only one feeling this way. Interestingly, research on traditional computer vision tasks like detection and segmentation was particularly affected. It felt disconnected from clinical practice.
Why was that? All answers seemed to lead to one thing: quality data.
Academia is data starved
Many papers shared one simple problem: they used (very) small datasets. This problem is widespread in the biomedical data science community, as medical specialists who can expertly annotate health data are expensive. It is also troubling: data from actual hospital patients contains a lot of variation that cannot be captured in very small datasets, so model performance does not translate to real-world clinical practice.
This problem is well known. Initiatives like the Open Health Care by NHS England attempt to resolve it by making huge amounts of data available to everyone. I can only applaud such ambitions, but they solve only half of the problem. There are already huge, publicly available raw datasets, like the NLST for chest CT scans (Level C on the scale proposed by Hugh Harvey).
The real issue is getting annotated data.
To create annotations for medical datasets, researchers need money. The reason is twofold: medical specialists are expensive to hire, and annotation software costs money to build. Academia does not have the funds to spend on this, but companies do.
Enough good data trumps good models.
This common piece of wisdom is something all data scientists experience every day, and so do we at Aidence. For our latest model improvement, we tried a broad range of different approaches, including the latest research from MIDL and ICML. Yet, the performance increase we achieved was minimal. In the end, we chose the more costly option of getting radiologists to annotate more data. The extra annotated scans were much more effective than the latest research approaches.
Let there be data
This leaves us in a paradoxical situation. We need academia to find new approaches, but findings cannot be evaluated for generalisability due to a lack of data. Companies, on the other hand, do have access to annotated data, but they are occupied with building a safe and certified product around the model and integrating it into hospitals. Neither party is going to solve this on its own.
It is unlikely that academia will free up funding to create publicly available annotated datasets. While some datasets have been published, more are needed. At the same time, correct annotations on specific datasets are a huge competitive advantage for AI startups. For that reason, Aidence and its peers won’t make their datasets available, as doing so would create their own competition.
It’s necessary for a third party to step in.
I see this as a perfect opportunity for governments to stimulate the development of AI software. Governments are neutral in the market and are eager to provide funding for AI systems. Data needs to come from a large pool of patients to ensure enough variation, and hospital networks cannot pull this off on their own.
To overcome this, governments can act as mediators between hospitals and collect anonymised data from different PACS systems. The focus should be on creating annotated datasets rather than on building a unified infrastructure or ensuring interoperability, because that will only slow down the process. Researchers and professionals who want to use the annotated data are more than capable of pulling data from different sources and cleaning it up.
To conclude, governments should fund the annotation of anonymised data by medical specialists and make the data publicly available to everyone.
Notable research presented during the conference:
- Lossau, T., et al. “Dynamic Pacemaker Artifact Removal (DyPAR) from CT Data using CNNs.” (2018). Download the academic paper; watch the presentation at MIDL.
- Souza, R., et al. “A Hybrid, Dual Domain, Cascade of Convolutional Neural Networks for Magnetic Resonance Image Reconstruction.” (2018). Download the academic paper; watch the presentation at MIDL.
Two big annotated datasets were released by the National Institutes of Health (NIH) under Ronald Summers. Check out these useful links:
- DeepLesion dataset and published academic paper.
- CXR8 dataset and published academic paper.