By Jonas Dias, Head of Data Science at Evergen – featured in Towards Data Science Magazine 

Although data science became popular with the advances in machine learning and AI, science is a much broader topic. In the past few years, machine learning, artificial intelligence, and ultimately data science have become the buzzwords of the industry. Of course, there is a reason for this phenomenon. New algorithms and new hardware have made complex prediction systems affordable for many companies. It’s not hard to find an industry use case about how a company overcame a massive problem using a thousand-layer convolutional neural network. And really, this is a good thing. Who has never wanted a crystal ball?

“Just fit it, fit it, fit it, fit it…”

No one wants to be defeated in the market, and the Internet is full of tutorials on how to train your model. You just have to be careful about the overfitting monster, right? No.

Don’t get me wrong. There is plenty of valuable content on the Web, and it’s important to learn. However, based on my experience, if you want to be a successful data scientist, you need to go beyond the machine learning recipes. Here are three high-level skills I believe all data scientists need to master (or develop):

  1. The scientific method
  2. How to create value for the business
  3. How to communicate findings appropriately

On the one hand, data scientists with a solid academic background are usually good at the scientific method. However, they often get too excited delving into interesting research and may forget about what the business really needs. On the other hand, data scientists who grow up in industry tend to neglect the rigour of the scientific method.

The third skill is really something apart. I strongly believe it depends on the experience of the individual, and it may even be associated with his or her personality.

The good news is that even if you lack one of these skills, you can learn it and develop yourself.

Today, I’m going to talk about the first one: the scientific method.

There is a reason why we put the word science in ‘data science’. If you have no clue about how to conduct a scientific experiment, you should learn (even if you are not a data scientist, because science is for everybody). A data scientist must understand what science is about and know how to criticise it properly. This includes knowing how to scrutinise your own scientific work.

Science made with data

That is what data science is about in general terms. We use historical observations to support or reject hypotheses. In this sense, one may state that data science is evidence-based research. People have been doing this in other fields (like medicine) for more than a century. Thus, when you are doing a data science task, think of it within the scientific method:

  1. What is your question?
  2. What is your hypothesis?
  3. How are you going to test your hypothesis?
  4. How do the results support your claims?

Besides, if new insight or evidence arrives in the middle of the process, you can always start the process again. It’s called the scientific experiment lifecycle!

“But a machine learning pipeline doesn’t look like the scientific method,” you may say. I fully disagree. Let’s take a vanilla example:

  • Question: What causes traffic jams?
  • Hypothesis: Rain causes traffic jams.
  • Method: Train an ML model to forecast traffic based on historical rain data. Predict traffic with unseen observations and measure the error. Also, estimate traffic based on historical averages as a baseline model and measure the error.
  • Analysis: Compare the error distributions. Is the error of the ML model significantly lower than the baseline error?

I stress ‘significantly lower’ because you cannot just say ‘it’s 5% lower, so I proved my hypothesis’. You had better use hypothesis testing, a statistical tool that, unfortunately, many data scientists don’t know how to use.
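As one possible illustration of such a test, the sketch below runs a one-sided paired permutation (sign-flip) test on per-observation errors. The choice of test, the function name `paired_permutation_test`, the 0.05 threshold, and the error values are all assumptions made for this example; a paired t-test or a Wilcoxon signed-rank test would be equally reasonable candidates depending on your data.

```python
import random
import statistics


def paired_permutation_test(model_errors, baseline_errors,
                            n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test.

    H0: within each observation, model and baseline errors are exchangeable.
    H1: the model's errors are lower on average.
    Returns (observed mean of baseline - model, p-value).
    """
    diffs = [b - m for m, b in zip(model_errors, baseline_errors)]
    observed = statistics.fmean(diffs)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under H0 the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.fmean(flipped) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical absolute errors on the same ten unseen observations.
model_errors = [3.1, 4.0, 2.2, 5.3, 3.8, 2.9, 4.4, 3.0, 2.5, 3.6]
baseline_errors = [6.0, 5.1, 4.9, 7.2, 3.5, 6.3, 5.8, 4.1, 5.0, 6.6]

observed, p_value = paired_permutation_test(model_errors, baseline_errors)
alpha = 0.05  # significance level, fixed before looking at the results

print(f"mean error reduction: {observed:.2f}")
print(f"p-value: {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The test is paired because both predictors are scored on the same observations, so the per-observation differences are what carry the evidence. Note that rejecting the null hypothesis here only says the model predicts better than the baseline; it does not by itself show that rain causes traffic jams.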

My method may not be the best way to test whether rain causes traffic jams. Really, it’s just one possible method, and, again, a vanilla example. My point is more about how you should think of a data science task. Instead of pressing the ‘fit button’ because we need to predict traffic, we should reason about what we are trying to answer, how we will do it and whether our outcomes are actually better than something simpler or even random.

By now, I hope you agree with me that data science and machine learning are NOT the same. Besides the high-level skills I mentioned, there are many other technical subjects that a data scientist should master. But this is a topic for another blog post.
