The earth is not flat, and Machine Learning works
Carlos Eduardo Beluzo, Federal Institute of São Paulo, PhD Candidate in Demography Program at University of Campinas and research scientist at the Population Studies Center (NEPO)1, 2
Luciana Correia Alves, Institute of Philosophy and Human Sciences at University of Campinas and research scientist at the Population Studies Center (NEPO)1, 3
Believe it or not, Machine Learning (ML) is science, and it works. Just as we know that the earth is not flat, we know this because science tells us so, not because of opinion or common sense. ML is a subfield of Artificial Intelligence and has been an object of scientific research since the 1950s. The term “machine learning” was popularized by Arthur Lee Samuel, who defined it as “the field of study that gives computers the ability to learn without being explicitly programmed”. We can also better understand ML methods through the following definition:
“Machine Learning methods seek to automatically learn meaningful relationships and patterns from examples and observations” (Pattern recognition and machine learning, Christopher M. Bishop, 2006).
ML is now widespread: it is used in industry and in almost all fields of science, including engineering, medicine, and demography. It is used to efficiently extract relevant and accurate knowledge from huge datasets that humans cannot feasibly view and interpret, and for which other methods are not suitable. A search for the term “machine learning” on the Nature journal website retrieves more than 19,000 records from almost all subjects; the same search retrieves more than 9,000 records from PLOS ONE and 124,000 from IEEE Xplore. On the other hand, there are only a few records in the major journals in demography, which is understandable: in order to use ML methods, one first needs to understand them and, of course, evaluate their feasibility.
Although ML is a consolidated field of science, it is very common to hear researchers make false statements about it. It seems that people may be afraid of the new, or dislike things that they do not yet understand. We discuss some of these statements next, but first, here is a good analogy for understanding ML’s purpose:
“You will find it difficult to describe your mother’s face accurately enough for your friend to recognize her in a supermarket. But if you show him a few of her photos, he will immediately spot the tell-tale traits he needs … This is what we want our technology to emulate. Unable to define certain objects or concepts with adequate accuracy, we want to convey them to the machine by way of examples … however, the computer has to be able to convert the examples into knowledge.” (An Introduction to Machine Learning, Miroslav Kubat, 2017).
“Machine learning and Statistics are the same”. This claim is as incorrect as saying that Demographic Methods and Statistics are the same. To this false and shallow statement, one good answer is: “Oh, yes, that’s why all the Machine Learning labs and journals in the world are now being transformed into Statistics labs and journals”. ML is built upon statistics, but the two are not the same. ML is designed to handle huge datasets efficiently, which makes it a very important tool given the volume and availability of data in recent decades. Technically, statistical models are designed to evaluate relationships between variables through inference and hypothesis tests, while ML models are designed to make the most accurate predictions possible. Besides that, many ML methods do not require assumptions about the distribution of the data. These differences help to tell the two apart, but they may not be enough, so considering the purpose of your research is a better way to decide which one is more appropriate. For example, if you just want to create an algorithm that can predict neonatal mortality with high accuracy, or use data to determine whether newborns are likely to contract certain types of diseases, ML is likely to be the better approach, especially when you have a huge dataset to explore. Some gains can also be achieved on old problems, such as generating accurate small-area population forecasts. Otherwise, if you are trying to make inferences from data or prove a relationship between socioeconomic variables and a certain event, or if you just want to evaluate socioeconomic determinants of health on small datasets, a statistical model is likely the better approach.
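The difference in purpose described above can be made concrete with a small sketch. It uses the same straight-line model twice: once to ask an inference question (is the slope distinguishable from zero?) and once to ask a prediction question (how large is the error on rows the model never saw?). The data are synthetic and the function names (fit_line, slope_t_statistic, holdout_mae) are hypothetical, chosen only for this illustration; it is not a method taken from any specific study.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def slope_t_statistic(xs, ys):
    """Inference view: estimate the slope, its standard error and t statistic."""
    intercept, slope = fit_line(xs, ys)
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, std_err, slope / std_err

def holdout_mae(xs, ys, test_every=4):
    """Prediction view: fit on some rows, score on rows held out from fitting."""
    pairs = list(zip(xs, ys))
    train = [p for i, p in enumerate(pairs) if i % test_every != test_every - 1]
    test = [p for i, p in enumerate(pairs) if i % test_every == test_every - 1]
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    return sum(abs(y - (intercept + slope * x)) for x, y in test) / len(test)

# Synthetic data (an assumption for illustration only):
# y = 2 + 0.5 * x plus a small, fixed noise pattern.
noise = [0.3, -0.2, 0.1, -0.4, 0.2]
xs = list(range(20))
ys = [2 + 0.5 * x + noise[i % 5] for i, x in enumerate(xs)]

slope, std_err, t_stat = slope_t_statistic(xs, ys)
mae = holdout_mae(xs, ys)
print(f"inference:  slope={slope:.3f}, std_err={std_err:.3f}, t={t_stat:.1f}")
print(f"prediction: hold-out mean absolute error={mae:.3f}")
```

The model is identical in both cases; what changes is the question asked of it, which is why purpose is a more useful guide than the name of the method.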
“Machine learning is a black box”, and the answer to this one is: “So is your smartphone”. Many researchers say that ML is a black box because they do not understand how it works, or because they believe that you cannot explain how the resultant models were built. But you can. To be fair, for some ML methods this is a very difficult and complex assessment, and in a few situations (deep learning, for example) it may actually be unfeasible. However, this does not make all other ML methods impossible to explain, and if your problem has hard requirements on model explanation, do not use such a method; choose another. It is not a simple task, but it is possible to track how the model was built and to explain its outputs based on the input features. In fact, many ML libraries already provide implementations for this purpose. When you build a model (a mathematical equation) by applying statistical methods, you define and follow the approach to building it. When it is finished, you already know how it works, because you built it. In ML, the algorithm does this for you, and the resultant model is the best one that algorithm could find. In fact, human interaction when building a model is susceptible to human error or bias. On the other hand, when using ML, if there is a bias, it is a dataset problem, and this problem can happen with both approaches; it is not only an ML problem. We need to keep in mind that we can still use these “hard-to-understand machine learning methods” with confidence because they were all designed using the rigor of the scientific method. So, believe it or not, they work.
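As one illustration of explaining a model's outputs from its input features, the sketch below trains a very small model (a one-rule "decision stump") and then applies permutation importance: shuffle one feature at a time and measure how much accuracy drops. Permutation importance is our choice of example here, not a technique named in the text, and the data, feature meanings and function names are all hypothetical. For simplicity the importance is computed on the training rows. Libraries such as scikit-learn offer a ready-made version (sklearn.inspection.permutation_importance).

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def predict(model, row):
    feature, threshold, label_if_below = model
    return label_if_below if row[feature] <= threshold else 1 - label_if_below

def fit_stump(X, y):
    """Learn one rule: the (feature, threshold, label) with best training accuracy."""
    best = None  # (accuracy, feature, threshold, label_if_below)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # candidate split between neighbouring values
            for label_if_below in (0, 1):
                model = (j, threshold, label_if_below)
                acc = accuracy(y, [predict(model, row) for row in X])
                if best is None or acc > best[0]:
                    best = (acc,) + model
    return best[1:]

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(model, row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(y, [predict(model, r) for r in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 could be a birth weight in kg,
# feature 1 an unrelated code (1 to 6); label 1 marks the higher-risk group.
X = [[1.2, 3], [1.5, 1], [1.8, 5], [2.0, 2], [2.2, 6], [2.4, 4],
     [2.9, 2], [3.1, 5], [3.3, 1], [3.5, 6], [3.8, 3], [4.0, 4]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

model = fit_stump(X, y)
importances = permutation_importance(model, X, y)
print("learned rule (feature, threshold, label if below):", model)
print("permutation importance per feature:", importances)
```

The learned rule can be read directly, and the importance scores show which input the predictions depend on: shuffling a feature the model ignores leaves accuracy unchanged, while shuffling the feature it relies on lowers it.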
“How much better are machine learning methods compared to traditional methods?”. We can answer this with another question: “And how much better are traditional methods compared to ML methods?”. Unless you plan to compare methods as part of your research, there is no sense in running experiments with ML against traditional methods just to compare the results. You do not need to validate an ML method against a traditional one, and if you do have to validate it, comparing it against a traditional method is not the way: the theory behind both has already been established, and both are valid. Besides that, ML methods have their own metrics for evaluating results. It is true that in some experimental designs you may not need ML, although you can still use it with confidence; you just need to decide based on feasibility and purpose.
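To give an idea of what such evaluation metrics look like, here is a minimal sketch of four that are commonly reported for classification models: accuracy, precision, recall and F1. The labels below are made up for illustration, and the function names are hypothetical; in practice these metrics are computed on data held out from training.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the cases predicted positive, how many really were.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the cases that really were positive, how many were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up held-out labels: 1 = the event occurred, 0 = it did not.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which metric matters depends on the purpose: when the event of interest is rare, accuracy alone can look high even if the model misses most of the cases that matter, so recall and precision are usually reported alongside it.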
“Machine learning is useless”. This is a statement based on personal beliefs or personal experience, i.e., a false statement taken as true without any scientific evidence. All the big tech companies, such as Meta, Google, Microsoft, IBM, and Netflix, are using ML methods and achieving incredible results with them, including generating population and demographic insights for their business. The same is true for banks, insurance companies, airline companies, health institutes, and many other research centers around the world.
Good reasons to start learning Machine Learning
If you are working with huge datasets, you should consider using ML methods. It’s not us saying this; it’s science. You do not need to know statistics deeply to start learning and using ML (although this would be very helpful). The algorithms are very well encapsulated, and you do not need to prove the algorithm behind a method (unless this is an object of your research). You can just read the documentation to understand how it works and how to use it adequately. ML is not a passing fashion; it is not an adventure, and it was not created by a small group of unqualified people. Instead, Machine Learning is the result of hard work by a large number of scientists around the world over the past six decades. In spite of this, you can still hold on to your beliefs.
1 Co-PI in the project “Decision-Making Support Platform Based on Visual Analytics and Machine Learning to Subsidize Public Politics Focused on Gestational Health”, funded by the Bill & Melinda Gates Foundation and the Brazilian Ministry of Health.
2 Co-PI in the project “Data Science Applied to Epidemiological and Demographic Information as a Strategy to Simulation and Malaria Vigilance Monitoring in the Brazilian Amazon”, funded by the Bill & Melinda Gates Foundation and the Brazilian Ministry of Health.
3 PI in the project “Data Science Applied to Epidemiological and Demographic Information as a Strategy to Simulation and Malaria Vigilance Monitoring in the Brazilian Amazon”, funded by the Bill & Melinda Gates Foundation and the Brazilian Ministry of Health.