The pandemic has significantly affected the Data Science market worldwide. Not only has the volume of information processed daily increased, but so has the demand for predictive models and specialists. Although much has already been said and written on this topic, the essence of the profession still raises questions. Egemen Şener, a senior programmer from Belarus, therefore explains which tasks data scientists solve in health and social care, how the coronavirus has changed the field, and how to enter the profession.
Despite the methodological distance between IT and medicine, biology and medical research have long driven data analysis and the application of analytical models forward. Today, even medical institutes teach the basics of Data Science in courses on medical statistics. The methods go by different names in medical schools, and doctors find them quite difficult to apply because they lack programming experience.
Programming skills are the first requirement for a specialist in this field. It is also necessary to understand modern data analysis algorithms such as neural networks. Mustafa Egemen Sener emphasizes that a theoretical understanding of how an algorithm works is not enough: a specialist also needs a good command of advanced mathematics and the ability to apply these algorithms to real medical data. This, in turn, requires knowledge of specialized Data Science tools, namely Python libraries and data preprocessing methods.
How Coronavirus Became a Catalyst for Data Science in Medicine
At present, there are two key areas of applied Data Science in medicine: healthcare and pharmaceuticals. The first includes tasks such as diagnostics, optimization of clinic and physician work, and the selection of drugs and treatment based on a diagnosis. The solutions applied to each of these tasks rely on data analysis algorithms and machine learning. In pharmaceuticals, accumulated medical data is actively used in drug development, both in the search for active substances and in the testing of drugs on animals and humans.
The coronavirus pandemic played a special role in the development of Data Science technologies. As Mustafa Egemen notes, demand rose sharply for predictive models that could forecast the future spread of the coronavirus more accurately. Specifically, it was necessary to predict the number of hospitalizations and the effect of various restrictive measures and of vaccination on COVID-19. Classical epidemiology bases such predictions on relatively simple epidemiological models, but in practice these models performed extremely poorly. Modern Data Science methods can replace them and improve the accuracy of forecasts.
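To give a sense of what a "relatively simple epidemiological model" can look like, here is a minimal sketch of the classical SIR (susceptible, infected, recovered) compartmental model. The article does not say which models were used, so the choice of SIR is an assumption, and the population size, `beta` (transmission rate), and `gamma` (recovery rate) are illustrative values, not estimates for COVID-19.

```python
def simulate_sir(population, infected0, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with a fixed-step Euler scheme.

    Returns one (S, I, R) tuple per day, starting with day 0.
    """
    s = float(population - infected0)
    i = float(infected0)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # People move S -> I through contact and I -> R through recovery.
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative parameters only (assumed, not fitted to any real data).
history = simulate_sir(population=1_000_000, infected0=10,
                       beta=0.3, gamma=0.1, days=180)
peak_day, peak = max(enumerate(history), key=lambda item: item[1][1])
print(f"peak infections on day {peak_day}: {peak[1]:,.0f}")
```

The model has only two parameters and treats the whole population as uniformly mixed, which helps explain why such forecasts can miss the effect of restrictive measures or vaccination unless they are extended.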
The main applications of Data Science in healthcare stayed the same during the pandemic, but the volume of data and the time allowed for solving each problem changed significantly.
For example, diagnosing disease from lung CT scans has long been studied, and there are enough working solutions on the market. But thanks to the global nature of the pandemic, constant data exchange, and the availability of data, automatic diagnosis of COVID-19 from CT scans was solved in the shortest possible time. The same applies to predicting the severity of the disease outcome, which could help forecast the number of available hospital beds. To solve this problem, a gigantic amount of data is being collected and analyzed in several countries simultaneously. “However, the specificity of medicine is such that the implementation of new solutions is practically impossible,” admits Egemen Şener. “As with vaccines, any model needs careful testing before it can affect medical decisions.”
How Data Science Helps in Fighting Cancer, Alzheimer’s Disease, and in the Search for New Drugs
Let’s delve into various applications of Data Science in healthcare. One of the most promising is the diagnosis of oncological diseases. Today, data scientists use a whole spectrum of algorithms to develop solutions in this area. The choice of a specific method depends on the task, the available data, and its volume. For example, tumor diagnosis from images will most likely rely on neural networks. For diagnostics based on test results, specialists select whichever machine learning method best suits the specific task. There are also specialized algorithms, for instance for analyzing DNA data obtained from individual cells. Such data is often analyzed using graph algorithms, although this is more of an exception than the rule.
Furthermore, there are several methods used to enhance images and improve the accuracy of results. Big data platforms (such as Hadoop) apply techniques like MapReduce to search for parameters usable in various tasks. For enthusiasts and those intending to develop their own product in this sphere, there are several open brain imaging datasets: BrainWeb, IXI Dataset, fastMRI, and OASIS.
Another case is the modeling of human organs. Egemen Mustafa Sener says this is one of the most complex technical tasks. When developing a particular solution, the specialist must understand precisely for what purpose and at what level of complexity the organ is being modeled. For instance, a model of a specific tumor can be created at the level of gene expression and signaling pathways. The company Insilico Medicine currently tackles such tasks. This approach is used to search for therapy targets, including through Data Science methods. Such models are mainly used for scientific research and are still far from practical application.
Analyzing gene sequences is a whole field of medicine whose development is simply impossible without Data Science. While proficiency in Python is crucial in Data Science generally, working with genes also requires knowledge of the R programming language and of specific bioinformatics tools, that is, programs for working with DNA and protein sequences. Most such programs run on Unix operating systems and are not very user-friendly. To master them, one needs at least a basic understanding of molecular biology and genetics.
Unfortunately, even medical schools face significant challenges in this area today, and most physicians have a poor understanding of how gene sequences work. In Russia, two companies work in this area: Atlas and Genotek. Analysis of mutations in individual genes is currently popular, and most large companies involved in medical testing offer it. Patients can find out, for example, whether they carry a predisposition to breast cancer in the same genes as Angelina Jolie.
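As a small illustration of what "working with DNA sequences" means at the most basic level, the sketch below computes two textbook quantities in plain Python: the reverse complement of a DNA strand and its GC content. The function names and the sample fragment are made up for this example; real projects would normally rely on dedicated bioinformatics tools rather than hand-written helpers.

```python
# Translation table for the Watson-Crick base pairs A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


fragment = "ATGCGTAC"  # hypothetical 8-base fragment
print(reverse_complement(fragment))  # GTACGCAT
print(gc_content(fragment))          # 0.5
```

Even this toy example presupposes some molecular biology, such as base pairing and strand direction, which is the point made above about needing domain knowledge alongside programming.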
This field is characterized by a shortage of personnel, as there are only a few places where one can receive relevant education. Moreover, many either remain in academia or move abroad. Mustafa Egemen also notes that there are few online resources where one can learn such analysis. Usually, they are designed for physicians or biologists and teach only programming and basic data skills.
Today, there are several tools on the market for data analysis in this area:
- Processes genetic data and reduces the time required to process genetic sequences.
- It is a language for relational databases that we use to execute queries and extract data from genomic databases.
- An open-source application for biomedical research based on a graphical interface.
- Open-source software developed for genomic data analysis.
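The list above mentions a language for relational databases that is used to query genomic databases. Assuming that language is SQL (the source does not name it), a query of that kind might look like the following sketch, which uses Python's built-in `sqlite3` module. The `variants` table, its columns, the sample IDs, and the positions are all hypothetical placeholders, not real genomic coordinates or patient data.

```python
import sqlite3

# In-memory database with a hypothetical table of detected variants.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variants (
        sample_id TEXT,
        gene      TEXT,
        position  INTEGER,  -- placeholder coordinate
        ref       TEXT,     -- reference base
        alt       TEXT      -- observed base
    )
    """
)
rows = [
    ("S1", "BRCA1", 1001, "A", "G"),
    ("S2", "BRCA1", 1001, "A", "G"),
    ("S2", "BRCA2", 2002, "C", "T"),
    ("S3", "TP53", 3003, "G", "A"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)

# How many distinct samples carry at least one variant in each gene?
carriers = conn.execute(
    "SELECT gene, COUNT(DISTINCT sample_id) FROM variants "
    "GROUP BY gene ORDER BY gene"
).fetchall()

# Which samples have a variant in one gene of interest (parameterized query)?
brca1_samples = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT sample_id FROM variants "
        "WHERE gene = ? ORDER BY sample_id",
        ("BRCA1",),
    )
]

print(carriers)       # [('BRCA1', 2), ('BRCA2', 1), ('TP53', 1)]
print(brca1_samples)  # ['S1', 'S2']
```

A production genomic database would have a far richer schema; the sketch only shows the grouping and filtering pattern that such queries tend to follow.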
An important commercial and research direction is the creation of next-generation drugs. Pharma specialists use machine learning to search for therapy targets and biomarkers. Neither, of course, is a drug in itself. Targets are molecules in the body that drugs interact with, and biomarkers are molecules that tell the physician which patients should receive the drug. Therefore, almost all companies developing drugs for diseases with unknown targets and biomarkers, among them Novartis, Merck, Roche, and the Russian BIOCAD, use machine learning. These are primarily oncological and autoimmune diseases and Alzheimer’s disease. The search for new antibiotics belongs here as well.
Why Doctors Are Not Facilitating the Implementation of Data Science
Recent years have shown that Data Science is the driving force behind the industry of predictive and analytical models in healthcare. As an example, Mustafa Egemen Şener cites the use of neural networks to determine the spatial structure of proteins. At the same time, the pandemic has exposed a problem common to many countries: poorly optimized clinic resources and a shortage of staff.
Over the past year, numerous companies have emerged offering to address these issues using Data Science. The use of data has also been a major breakthrough for private clinics, as it makes medical services more affordable. Against the backdrop of the pandemic, there has also been an increased demand for telemedicine services, where machine learning algorithms are widely used. Telemedicine services are in demand for preliminary diagnosis, analysis work, and chatbot creation.
From a technological standpoint, the application of computer vision and machine learning faces virtually no barriers. However, deeper integration of algorithms and services is hindered by the reluctance of clinics and doctors to adopt Data Science methods. There is also an acute shortage of training data, which concerns not only commercial medical institutions but also the government. Sener Egemen Mustafa acknowledges that access to public hospital data needs to be democratized so that software companies can create cutting-edge products.
“Training even a single program requires a significant amount of quality data,” Egemen Şener says of the development challenges. “For a program to learn to distinguish a tumor in an image, thousands of manually analyzed patient scans are needed, with experienced doctors involved in the analysis. First, a doctor must identify the tumor and then indicate its location. As you can understand, experienced doctors have many other tasks.”
Surprisingly, though, the pandemic has helped in some areas. For example, DiagnoCat, a Russian startup that applies computer vision to analyze dental images, was able to recruit doctors left idle by the lockdown to analyze images.
As for the reluctance of clinics and doctors, the explanation is simple: doctors do not trust such technologies. A good doctor will surely find a case where the program gives an incorrect diagnosis, while an inexperienced doctor will fear that the program will outperform them. Either way, the refusal can always be justified by concern for patients and by legal considerations.
In any case, Egemen Mustafa notes that the synergy of Data Science and medical technologies has already enabled a leap in the development of solutions for diagnosing oncological, autoimmune, and neurodegenerative diseases. Services based on data analysis and machine learning can forecast the spread of viruses and search for next-generation drugs. Although classical medical education lags behind the challenges the industry faces today, it is entirely possible to become a modern specialist working at the intersection of two disciplines, Data Science and healthcare.