Using Data? Master the Science in Data Science
This past November, Avi Patchava and Paul Meinshausen participated in panel discussions at the Data Hack Summit in Bangalore and were encouraged to see Kunal Jain’s keynote on developing the data science ecosystem in India. They have been a part of the community for several years and have worked in data science roles across start-ups, management consulting, industry, and venture capital. They share the strong conviction that an important area for development in the ecosystem is data science’s intellectual infrastructure.
In the early stages of development, data science is often mistaken for a thin layer of popular statistical tools and packaged algorithms that are applied bluntly to a problem of choice. As data science evolves as a discipline in India, it will take more robust shape as an intellectual approach to discovering, as well as building, solutions for tough problems across business and society. The early stages of data science emphasize the size and diversity of data. As data science evolves, we see more focus on scientific rigor: forming hypotheses, making deliberate decisions on model design, and drawing causal inferences.
We still get a lot of confused responses when we talk about the science in data science. One of the most common questions is: “What do Apache Spark and neural network models have to do with laboratories and experiments?” So, in this article, we list the concepts of science that are at the core of productive data science.
Table of Contents
- Model Thinking – Understand the role and meaning of models
- The Hypothesis – Deploy the power of hypothesis-led learning
- The Data Generating Process – Know what it is that you seek to model
- Searching for the Mechanism – The how and why of a model’s performance
- Replicability, Reproducibility, Generalizability – Push for enduring impact
1) Model Thinking – Understand the role and meaning of models
Model thinking is at the very heart of data science. We use models to understand or predict the parts of the world that we want to make decisions about in our businesses. For example, at InMobi, Avi and team use models to predict behaviours of mobile users – such as whether they will click an advert, download an app, or purchase an ecommerce product. These models directly inform whether InMobi will bid for the opportunity to show an advertisement to a given user.
George Box famously said that “All models are wrong but some are useful.” Once you have understood the lesson of intellectual humility Box was emphasizing, it is important to think about why only some models are useful and how you can make sure your models are among that useful subset. Models that are formed unconsciously or uncritically (think of cliches, stereotypes, and poorly considered KPIs, all of which are automatic mental models) are usually not very useful.
We evaluate a model in different ways depending on what we want to do with it: description, explanation, prediction, prescription. Data scientists who do their work like scientists describe their models clearly, both the parts of the world that are included in the model and the parts that are left out. They also explicitly identify decisions their model can inform and acknowledge the kinds of decisions that the model will not help inform.
For example, Avi was once involved in building an ML-driven control system for a metallurgical process in a steel production plant. The control system was midstream in the production value chain: it sat in one of 8 sub-plants, each performing a different activity required to take steel all the way from raw materials to slabs ready for transport.
There were 3 major complications:
- There were potentially 10,000+ features of highly-granular sensor data from the full production process;
- There were at least 5 sub-plants upstream in the process with many variables that could potentially have downstream effects;
- The model first had to predict some intermediate variables before it could prescribe values for the control system variables, which would be adjusted dynamically.
The team made progress by sketching out the control system on a whiteboard, and showing clearly which types of variables would be used in the model and at what level of granularity. Also, the team was upfront about what assumptions the model needed to make about the data-generating process (more on this term below) given it would use a limited dataset. Further, the team was clear on which variables would enter the prediction system, and which variables were inputs into the final recommendation system.
2) The Hypothesis – Deploy the power of hypothesis-led learning
Hypotheses are an important tool for achieving useful models. The growth of computational power has been a boon to our science; our models have become incredibly sophisticated and complex. We are able to test many more possibilities for the inputs, the form, or the configurations of our models and we can look for many potential relationships that our models might represent.
But effective data scientists do not let computational power drown the role of hypotheses in model building. Sometimes it is not possible to cost-effectively try every combination of inputs, forms and configurations. We need to know where to start. Moreover, when we auto-test many ideas we are vulnerable to that very human inclination to see patterns where they do not exist.
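The risk of seeing patterns that do not exist can be illustrated with a small simulation. The sketch below is only an illustration on invented random data: the function names, sample sizes, and seed are assumptions made for this example. It correlates a random target with features that are pure noise and reports the strongest correlation found.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlation(n_rows, n_features, seed=0):
    """Largest |correlation| between a random target and pure-noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is generated independently of the target.
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


# Hypothetical settings: 50 observations, 1 versus 500 candidate features.
one_idea = strongest_noise_correlation(n_rows=50, n_features=1)
many_ideas = strongest_noise_correlation(n_rows=50, n_features=500)
print(f"best of 1 noise feature:    |r| = {one_idea:.2f}")
print(f"best of 500 noise features: |r| = {many_ideas:.2f}")
```

None of these features has any real relationship with the target, yet the best of many candidates will typically look far more convincing than a single candidate does. A hypothesis stated before the search narrows the set of candidates and makes a strong result harder to obtain by chance alone.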
Data scientists who think like scientists are able to take advantage of small data. Small data is where a smaller set of experiences suggests a certain idea or hypothesis. Sometimes humans have themselves become trained algorithms by virtue of having experienced many instances of the event in question. They cannot explain how they know a likely outcome (e.g., whether a consumer will be a defaulter, or whether their child is lying) but through accumulation of wide data from many situations, they become a ‘human compass’ with accurate intuitions and hunches.
So speak to the humans who have accumulated relevant experience; hear their intuitions. Once you have developed a model from this kind of research, you can start to build technology that can collect and create more data to fit and validate your models. Real scientists do not just use data; they also collect new data based on well specified hypothetico-deductive models.
3) The Data Generating Process – Know what it is that you seek to model
Models go hand-in-hand with data. Sometimes you form the models first and then go looking for data that matches your model and lets you test it empirically and eventually deploy it. Sometimes you have a bunch of data and you explore it to develop a model of the phenomena the data represents. In either case, it is critically important to remember that the data only ever represents the world; it is not the world itself.
Since data is only a representation of the world, it is important to think clearly about where it came from and how the way it was generated might limit or mislead your models.
For example, when Paul was at Housing.com, an important problem for the Data Science Lab was developing valuation models for real estate. The model hypothesized that the features of a house (number of bedrooms/bathrooms, size, etc.) and the features of the locality the house is in (proximity to public transport, safety, proximity to schools, etc.) contributed to the value people assign to it. Prices and people’s perceptions of the value of a house are also influenced by the houses around it.
At the beginning, those models relied on the assumption that Housing’s data reliably represented the houses in a given locality and city. Then the team discovered that their data collectors did not sample houses in a random or representative way. Instead, young bachelor data collectors would often go to brokers who specialized in renting flats to young bachelors. The result was an oversampling of certain kinds of properties that led to poor predictive power for properties that did not come from that population. The team was able to deal with the problem because they paid close attention to the processes (human and technological) that generated their data.
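A toy simulation shows how a skewed collection process can hurt predictions. This is a minimal sketch on invented numbers, not Housing.com's data or model: the two segments, their price rates, and the single-feature linear model are all assumptions made for illustration, and the oversampling is pushed to the extreme of collecting only one segment.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(model, xs, ys):
    """Average absolute gap between the line's predictions and the prices."""
    intercept, slope = model
    return sum(abs(intercept + slope * x - y) for x, y in zip(xs, ys)) / len(xs)


def make_listings(rng, n, size_range, rate):
    """Hypothetical listings: price = rate * size + noise."""
    sizes = [rng.uniform(*size_range) for _ in range(n)]
    prices = [rate * s + rng.gauss(0, 1000) for s in sizes]
    return sizes, prices


rng = random.Random(42)
# Two invented segments with different sizes and price dynamics.
bachelor_sizes, bachelor_prices = make_listings(rng, 200, (300, 600), 50)
family_sizes, family_prices = make_listings(rng, 200, (900, 2000), 80)

# Biased collection: the training data contains bachelor flats only.
model = fit_line(bachelor_sizes, bachelor_prices)

mae_bachelor = mean_abs_error(model, bachelor_sizes, bachelor_prices)
mae_family = mean_abs_error(model, family_sizes, family_prices)
print(f"error on the sampled segment:   {mae_bachelor:,.0f}")
print(f"error on the unsampled segment: {mae_family:,.0f}")
```

The model looks accurate on the kind of property it was trained on and does much worse on the segment the collectors never visited. Reporting error per segment, rather than one overall number, is one way such a gap becomes visible.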
4) Searching for the Mechanism – The how and why of a model’s performance
A model is often treated as a box: you add inputs, you get outputs. The mechanism is the inner causal process within the box that converts inputs into outputs. Some models are built to accurately reflect or simulate the how in the data-generating process.
Though some models are explicitly built not to mimic the how of the data-generating process (e.g. the predictive models for click behaviours above), often models work better if they are constructed so as to mimic the way the real-world mechanism works. The more you try to understand this, the more you can strengthen your model design.
Sometimes models work well, even brilliantly. Sometimes they just will not deliver no matter how many ideas you try. Some people move from one model to the next, blindly hoping for a bullseye on the next shot, knowing they have many arrows yet in their quiver.
By asking why something has worked (that is, looking for deeper clues in the data, developing hypotheses about potential causes, and testing them with further exploration), you follow a more reliable path to success. Each step is better informed by what you learned from the previous one. If you land on something that works brilliantly, and you understand why it works so well, you are less likely to have succeeded by fluke (a fluke often fails the next time around).
For models involving a high number of variables (i.e. high dimensionality), it will be challenging to understand or accurately map the mechanism. Contrast this with classical physics equations, such as Newton’s laws, where the limited number of variables means the mechanism is easier to identify. However, even in high-dimensionality situations, the model design can seek to capture the structure of the underlying problem. Convolutional neural networks, for example, have been effective because they are structured to work with grids of pixels, while recurrent neural networks are structured to work with sequential data such as time series.
At InMobi, the underlying dynamics of our global marketplace are changing rapidly. It is very easy to find a model, or an arbitrary modelling technique, that works well on last month’s data but fails to perform in the following month once deployed. We push ourselves to understand the why and how of a new technique before we are truly convinced, even if all performance metrics look green.
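The gap between last month's performance and next month's can be reproduced with a few lines of simulation. The sketch below uses invented click-through rates with a built-in upward drift and a deliberately naive model. It is not InMobi's data or method, and every number in it is an assumption made for illustration.

```python
import random


def mean_abs_error(prediction, values):
    """Average absolute gap between one fixed prediction and observed values."""
    return sum(abs(prediction - v) for v in values) / len(values)


rng = random.Random(7)
# Hypothetical daily click-through rates that drift upward over 60 days.
daily_ctr = [0.05 + 0.001 * day + rng.gauss(0, 0.002) for day in range(60)]

last_month = daily_ctr[:30]  # data available when the model is built
next_month = daily_ctr[30:]  # data seen only after deployment

# Naive "model": always predict the average rate observed last month.
prediction = sum(last_month) / len(last_month)

error_last_month = mean_abs_error(prediction, last_month)
error_next_month = mean_abs_error(prediction, next_month)
print(f"error on last month's data: {error_last_month:.4f}")
print(f"error on next month's data: {error_next_month:.4f}")
```

Evaluating on a later period than the one used for fitting, instead of on a random slice of the same month, gives an estimate closer to what deployment will look like. It does not explain why the drift happens, which is where the search for the mechanism comes in.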
5) Replicability, Reproducibility, Generalizability – Push for enduring impact
People (this includes product managers and executives) love clear answers and good stories with happy endings. The best scientists learn to resist the urge to put making their customers happy ahead of maintaining a critical mindset on the world. As Ibn al-Haytham put it, “The duty of man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads and … attack it from every side. He should also suspect himself as he performs his critical examination of it, so that he may avoid falling into either prejudice or leniency.”
If your model has worked where before it failed, recognise that you have learned something about the world. Maybe you have made a novel discovery. Ask why, and articulate your best understanding of why it has worked. Did you learn something surprising about the world? Update your overall view of how the world works.
You might have been trying to map a single data-generating process but there will be learnings across similar types of process – is your learning consistent with what you know about other processes? Or, if your process is part of a wider system, ask yourself: how do my learnings on this process affect my understanding of the wider system? What is generalizable? What are the implications? Remind yourself that one novel discovery in science often has wide ramifications, across many other fields of science.
Data science that aims to be science doesn’t leave work buried in files and folders of distributed code. Your work should be checked and verified and corrected on a regular basis. But if you truly want that to happen you have to do the extra work to document it and make it transparent, accessible, and extensible – all the successes and especially the failures too.
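One lightweight way to make work checkable is to write a small record for every experiment, including the ones that fail. The sketch below is one possible shape for such a record, not an established standard: the field names, the example values, and the choice of a SHA-256 fingerprint for the input data are all assumptions made for illustration.

```python
import hashlib
import json


def data_fingerprint(raw_bytes):
    """SHA-256 digest of the exact input data, so a rerun can verify it."""
    return hashlib.sha256(raw_bytes).hexdigest()


def make_run_record(hypothesis, params, seed, raw_data, metrics, outcome):
    """Bundle what someone else would need to check or repeat a run."""
    return {
        "hypothesis": hypothesis,
        "params": params,
        "seed": seed,
        "data_sha256": data_fingerprint(raw_data),
        "metrics": metrics,
        "outcome": outcome,
    }


def to_log_line(record):
    """Serialize a record as one JSON line with a stable key order."""
    return json.dumps(record, sort_keys=True)


# Hypothetical failed experiment; all values are placeholders.
record = make_run_record(
    hypothesis="Locality features improve price predictions",
    params={"model": "linear", "features": ["size", "locality"]},
    seed=42,
    raw_data=b"size,locality,price\n450,A,22500\n",
    metrics={"holdout_mae": 1234.5},
    outcome="failed: no improvement over the size-only baseline",
)
print(to_log_line(record))
```

Appending one such line per run to a shared log keeps failures next to successes, and the stored seed and data fingerprint give a reviewer a starting point for reproducing the result.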
Finally, remember that failure is inevitable in your journey. For every new idea you have, perhaps only one in five will deliver. Do not be disheartened. Failure is an opportunity for learning, especially if you began with a hypothesis. So why is this called science? Because discovering genuine and enduring insights is difficult, and that difficulty is a huge part of the journey.
In conclusion, data science is a mindset and an approach that is far more than manipulating maths and writing clever code. Remind yourself that you are a scientist, and consequently you will be a much better data scientist.
About the Authors:
Paul Meinshausen is Data Scientist in Residence at Montane Ventures, an early stage VC fund based in India. He is a cofounder at PaySense, a mobile fintech startup in Mumbai, and was previously the Chief Data Officer. He has held data science positions at Housing.com, Teradata, the University of Chicago, and the U.S. Department of Defense, and conducted research in Psychology at Harvard University.
Avi Patchava is Vice-President of Data Sciences, Machine Learning and Artificial Intelligence at InMobi – a leading Indian company in the world of Mobile AdTech. Previously, he was with McKinsey & Co driving large-scale machine learning initiatives in sectors such as Banking, Automotive, and Manufacturing. His background is in economics and the social sciences, with Master’s degrees from the University of Oxford and the London School of Economics.