The MIDAS touch

February 2, 2024

IISc-ICMR-ARTPARK collaboration seeks to build gold-standard medical datasets representative of the Indian population

Image courtesy: Pixabay/Mitrey

The year 1895. German scientist Wilhelm Conrad Röntgen noticed a strange spark inside his laboratory apparatus.

The spark, which led to the discovery of X-rays, not only got Röntgen the first Nobel Prize in Physics in 1901 but also gave a fillip to medicine. Since then, several imaging techniques such as CT, ultrasound, and MRI have helped doctors better identify and treat diseases.

Fifteen years ago, a commentary highlighted the shortage of imaging experts in India. Now, in 2023, the situation does not seem to have improved. India still has very few radiologists, which leads to longer wait times and to delayed diagnoses and treatments.

But there is one solution: AI.

AI has been under a skewed spotlight recently because of ChatGPT. But the use of AI in medicine is promising and has proved to be useful in many ways. AI tools have been used to diagnose tumours and interpret medical scans with high accuracy. However, to train all these AI models to diagnose diseases, we need data. Not just data, ‘good’ data.

“Good data is that set of data that helps answer your [research or medical] question with minimal failure,” says Debnath Pal, Professor at the Department of Computational and Data Sciences, IISc. According to Debnath, biology is a complex field, and the data required to answer a question of interest – such as diagnosing a specific disease – should be carefully collected.

There is a need for collecting good medical imaging data, such as CT and MRI scans, and X-ray images, especially from the Indian population.

It is for this reason that IISc and the Indian Council of Medical Research (ICMR) have entered into a collaboration. Called MIDAS (Medical Imaging Data Sets) India, it aims to establish institutional mechanisms for collecting, managing, and facilitating the use of medical imaging data from across the country.

“This project is funded by ICMR and will work through a hubs-and-spokes model where IISc is the nodal centre,” explains Debnath.

The AI and Robotics Technology Park (ARTPARK) at IISc has also been roped in as a technology partner to oversee the building of the platform.

MoU signing between IISc, ICMR and ARTPARK to launch the MIDAS project in 2022
(Photo: KG Haridasan)

“We need indigenous data that represents our country’s population. Data collection can be done in many ways by asking hospitals to carefully collect data. But for the data to flow through, we need a technology platform to host this pool of data and ARTPARK is providing this platform,” explains Raghu Dharmaraju, Chief Executive Officer at ARTPARK. The goal is to have institutions from across the country gather new data or filter existing imaging data, and then upload it to a centralised database that will be built by IISc and ARTPARK. A set of experts will then vet the data to see if they fit certain criteria before they are included.

Each hub institution will work on collecting data related to a specific disease and will gather data working alongside the spokes. The collaboration also seeks to outline a set of ‘gold standard’ guidelines for what kind of data should be collected, and how. “We will have inclusion and exclusion criteria, how many males and females should be there, how many positive and negative cases [for a disease], how many early stage and late-stage cases, and so on. We are essentially using the clinical setting to construct a dataset,” Raghu explains.
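The kind of composition check Raghu describes can be pictured with a short sketch. The article does not say how MIDAS will implement its criteria, so everything below is an assumption made for illustration: the field names (`sex`, `diagnosis`), the target shares, and the 5% tolerance are all hypothetical.

```python
from collections import Counter

def composition(cases, field):
    """Share of cases for each value of `field` (e.g. sex, diagnosis, stage)."""
    counts = Counter(case[field] for case in cases)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def unmet_targets(cases, targets, tolerance=0.05):
    """Return (field, value, actual_share, target_share) for every target
    whose actual share differs from the target by more than `tolerance`."""
    problems = []
    for field, wanted in targets.items():
        actual = composition(cases, field)
        for value, share in wanted.items():
            got = actual.get(value, 0.0)
            if abs(got - share) > tolerance:
                problems.append((field, value, got, share))
    return problems

# Hypothetical example: four cases checked against a 50/50 split on each field.
cases = [
    {"sex": "F", "diagnosis": "positive"},
    {"sex": "M", "diagnosis": "positive"},
    {"sex": "M", "diagnosis": "negative"},
    {"sex": "M", "diagnosis": "negative"},
]
targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "diagnosis": {"positive": 0.5, "negative": 0.5},
}
for field, value, got, share in unmet_targets(cases, targets):
    print(f"{field}={value}: {got:.0%} of cases, target {share:.0%}")
```

In this toy example the diagnosis split meets its target but the sex split does not, which is the sort of gap a hub could correct while it is still collecting data.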

Apart from academia and industry, legal and policy experts, doctors and data scientists have also been roped in to put these standards together. “We are now drafting a comprehensive report and will come up with a set of standards. That will be used as the overarching guidelines,” Raghu adds. “The main thing is that this is not being done for regular clinical practice; this [data] is for research, innovation, and validation [of AI models].”

Currently, even though some medical institutions are putting in efforts to build digital databases of medical images, there is no unified protocol or set of guidelines for how these images should be collected. Each centre has its own approach to collecting and storing data. If there is no centralised oversight, such repositories could become accessible only to some people. This can hinder access to medical images for researchers.

Debnath points out that the closest match to this kind of data bank elsewhere is the UK Biobank, a large-scale platform in which carefully curated and anonymised biomedical data from 500,000 participants is made available for researchers studying a variety of diseases and treatments. This includes full body MRI scans, whole genome sequences, dozens of blood biomarkers, data from physical activity monitors, and more. About 10,000 different variables have been collected for each participant, creating a gold mine of data to tap into for research.

“What we are envisaging is to develop a health research reference dataset ecosystem which is going to help significantly in developing quality [and] robust AI-ML tools for public health consumption,” says Harpreet Singh, Scientist and Head, Division of Biomedical Informatics, ICMR-AIIMS Computational Genomics Centre.

The ICMR-IISc team hopes that their collaboration will bypass existing problems of collecting and preserving good data. More importantly, it will also take steps to ensure patients’ privacy, according to Raghu and Debnath.

Each scan or dataset that will be fed into the database will be anonymised by removing any information that can be used to identify a patient. This is done by assigning coded information to each image, called labelling. Labelling is also important for training AI models that can help scan such images for medical applications.
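One common way to do this kind of anonymisation is sketched below. The article does not describe the method MIDAS uses, so this is only an illustration under stated assumptions: the field names, the `subject_code` key, and the use of a keyed hash to derive the code are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names; a real record would follow the project's own schema.
IDENTIFYING_FIELDS = {"patient_name", "patient_id", "date_of_birth", "address"}

def anonymise(record, secret_key):
    """Return a copy of `record` with identifying fields removed and a coded ID added."""
    # A keyed hash yields the same code for the same patient every time,
    # without storing the original identifier alongside the image.
    digest = hmac.new(secret_key, record["patient_id"].encode("utf-8"), hashlib.sha256)
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["subject_code"] = digest.hexdigest()[:12]
    return cleaned

# Hypothetical record: the clinical label survives, the identifiers do not.
record = {
    "patient_name": "Example Name",
    "patient_id": "HOSP-0001",
    "date_of_birth": "1980-01-01",
    "modality": "CT",
    "label": "positive",
}
print(sorted(anonymise(record, b"demo-key-not-for-real-use")))
```

The diagnostic label is kept because it is what an AI model learns from, while the coded ID lets several images from one patient be grouped without revealing who the patient is.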

“In India, a lot of foreign companies are [deploying] applications for cancer detection, TB detection and so on. But right now, we do not have a very strong mechanism to evaluate those tools and compare them with tools developed in India. These datasets from MIDAS can be used by foreign companies to test their accuracy and effectiveness, and can be compared with the tools developed in India,” Singh says.

As a first step in this collaboration, researchers at AIIMS Delhi have already begun collecting images of oral cancer and precancer cases, and labelling them.

“As of now, we have labelled 5,000 images and are expanding to reach a bigger target: 50,000 images,” says Deepika Mishra, Additional Professor at the Department of Oral Pathology, CDER at AIIMS Delhi, who oversees the data collection. “This is a work in progress. We have been looking at seven-year-old data in our repository, which were collected as part of routine scans. Now, as soon as the gold standards and the standard operating procedures are met, the data bank will be ready,” Deepika continues.

The goal is that once all the data is collected and organised, any authorised institution or researcher can put in a request to access the data that they need, whether it is for a study testing the effects of a new treatment for a disease, or a startup trying to build an AI model for diagnosis.

Efforts like these can greatly help the research community at large. For instance, in 2021, the UK Biobank released the whole genome sequences of 200,000 participants, to help researchers uncover links between DNA and disease. In a similar vein, datasets available through MIDAS can accelerate medical research in India using data that represents the Indian population.