Top 5 Must-Read Answers – What does a Data Scientist do on a Daily Basis?
Overview
- What does a data scientist do on a day-to-day basis? A popular and must-know question
- We explore this question through 5 detailed and insightful answers from experienced data scientists
Introduction
I’m a curious person by nature. Whenever I come across a concept I haven’t heard of before, I can’t wait to dig in and find out how it works. This has come in quite handy in my own data science journey.
But before I landed my first break in data science, I was always curious about what data scientists actually did every day. Was I supposed to simply build models all the time? Or was the oft-quoted saying about spending 70-80% of our time cleaning data actually true?
I’m sure you have asked (or at least wondered) about this too. The role of a data scientist might be the “sexiest job of the 21st century”, but what does that entail on a day-to-day basis?
I decided to research this. I wanted to expand my horizons and understand how data scientists look at their role in different domains (such as NLP). This helped me gain a broader understanding of our role and why we should always read different perspectives when it comes to data science.
So, here is a list of the top 5 answers to help you get a sense of what the typical routine of a data scientist is. Prepare to be surprised – building models isn’t the primary (or only) function in a data scientist’s day-to-day tasks!
I also encourage you to take part in a discussion on this question here. This will enrich your current understanding of what a data scientist does and your thoughts will foster a discussion among our community!
Note: I have taken the answers verbatim from Quora and added my thoughts right at the beginning of each answer. This will help you get a good perspective of what the answer covers without diluting the author’s thoughts. Enjoy!
Machine Learning is Very Process Oriented – Mike West
I like this answer because it’s crisp, to-the-point and simple. The author has even designed a flow diagram and explained his thought process in a wonderfully illustrated way. Here is his answer in full:
A Percentage-wise Breakdown of a Data Scientist’s Day-to-Day Role – Vinita Silaparasetty
I really like the use of visualization by Vinita. The percentage-wise description of each data science task is helpful and insightful. Vinita has also leaned on her experience to explain the step-by-step work a data scientist does. It’s a must-read answer!
Contrary to popular belief, Data Science is not all glamour. The following survey results by CrowdFlower accurately sum up a typical day for a Data Scientist:
There is a lot of backtracking involved. Sometimes you even need to be able to predict what consequences removing/adding a variable might have.
- Collecting Datasets: Data is the lifeline of Data Science, so we spend plenty of time curating it. On rare occasions, some projects might already have plenty of data
- Cleaning & Organizing Data: This is the most time consuming and crucial step in the entire process. It has a great impact on the final results. Usually, after this step, the once large amount of data reduces and so we may need to collect more data for effective training
- Data Mining: It is the practice of examining large pre-existing databases in order to generate new information. Once data is organized and stored in databases, we can finally begin to derive value from it by finding patterns within the data
- Building Training Sets & Test Sets: Once we have a decent amount of data, we need to split it into the training set and the test set. A training set is a set of data used to discover potentially predictive relationships. It contains all the information about the expected output. A test set is a set of data used to assess the strength and utility of a predictive relationship. It contains mixed variables
- Refining Algorithms: We start with a skeletal algorithm. It is very basic and defines roughly what output is expected. After a few sessions, the accuracy, precision, etc. are recorded and the algorithm is refined to maximize its efficiency
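To make the train/test split and the accuracy/precision bookkeeping from the last two bullets concrete, here is a minimal sketch in plain Python. It is an illustration added for this article, not part of Vinita's answer: the toy dataset, the one-threshold "skeletal" model and the helper names (train_test_split, fit_threshold, accuracy, precision) are all assumptions chosen to keep the example self-contained.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(pairs):
    """Share of (predicted, actual) pairs that match."""
    return sum(p == a for p, a in pairs) / len(pairs)


def precision(pairs):
    """Of everything predicted positive (1), the share that is truly positive."""
    actual_for_predicted_positive = [a for p, a in pairs if p == 1]
    if not actual_for_predicted_positive:
        return 0.0
    hits = sum(a == 1 for a in actual_for_predicted_positive)
    return hits / len(actual_for_predicted_positive)


def predict(threshold, x):
    """Skeletal model: label 1 when the feature reaches the threshold."""
    return 1 if x >= threshold else 0


def fit_threshold(train):
    """Pick the threshold (among training values) with the best training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(
        candidates,
        key=lambda t: accuracy([(predict(t, x), y) for x, y in train]),
    )


# Hypothetical toy data: one numeric feature, label is 1 when the feature is >= 20.
rows = [(x, 1 if x >= 20 else 0) for x in range(40)]

train, test = train_test_split(rows)
threshold = fit_threshold(train)

# The test set is only used to score the model, never to fit it.
test_pairs = [(predict(threshold, x), y) for x, y in test]
print("learned threshold:", threshold)
print("test accuracy:", accuracy(test_pairs))
print("test precision:", precision(test_pairs))
```

In a real project the model and metrics would usually come from a library rather than hand-written helpers, but the workflow is the same: fit on the training set, record the metrics on the held-out test set, refine, and repeat.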
Data Scientist Perspective from a Small-Sized Company – Justin Fister
This is a superb answer and one I can relate to. Note that machine learning, the most anticipated aspect of a data scientist’s job, only occupies 5% of the total time! Just like Vinita, he has also explained his tasks in terms of percentage. Here is Justin’s view:
The “Data Scientist” is a bit of a Myth – Tim Kiely
The author, Tim Kiely, uses a Venn diagram to explain what data science is. Just take a look at this Venn diagram below – it will blow your mind. Tim additionally talks about what data scientists are supposed to be by taking a somewhat contradictory view of the general definition. Here is Tim’s answer:
The “Data Scientist” is a bit of a myth, in my opinion. Not to say they aren’t out there but they are far rarer than is popularly understood and are more of the exception than the rule.
I liken it to the “Web Master” title of the dot-com bubble – these supposed people who could do full stack programming, front end development, marketing, everything. All of those roles/skills were always specialized and remain so today.
“Data Scientists” are supposed to be database architects, understand distributed computing, have a deep understanding of statistics AND some area of business or field expertise. That’s asking a lot when any one of those skill sets can take a career to build.
The Data Scientists I’ve worked with typically have a Ph.D. in A.I. or Machine learning and are effective communicators, which gives them the ability to direct the analysts, DevOps people, programmers and DBA’s at their disposal to solve problems with data-driven solutions. They outline the desired solution and leave it to their teams to fill in the gaps.
Machine Learning Engineer Working on NLP Tasks – Evan Pete Walsh
Let’s drill down into a particular specialization of machine learning. One of my favorites – Natural Language Processing (NLP)! I wanted to bring out a machine learning engineer’s view here (a role every data scientist should become familiar with). Check out Evan’s full response:
Currently working on NLP, for the most part, including intent classification and entity extraction. Here’s a typical day for me:
- Get to work, pull up GitHub and check on the ZenHub board (kind of like Jira, except way cooler). I had some models that were training last night on our servers and I should have gotten an email that they finished. I did!
- I’ll probably spend a few minutes testing those new models and then tweak some parameters, then restart the training process
- The rest of the day I’m usually head-down coding, either working on a back-end Python application that will supply the AI for one of our products, or implementing a new algorithm that I want to try out
- For example, recently I read a paper on coupled simulated annealing (CSA), and I wanted to try it out on tuning the parameters for XGBoost as an alternative to a grid search. CSA is a generalized form of simulated annealing (SA), which is an algorithm for optimizing a function that doesn’t use any information on the derivative of the function
- Unfortunately, I couldn’t find an implementation in Python, so I decided to write my own. Two days later, I had submitted my first package to PyPI!
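For readers who have not met simulated annealing before, here is a minimal sketch of the plain (uncoupled) algorithm that Evan describes as the basis of CSA. It is not CSA itself and it is not Evan's package. The objective below is a hypothetical stand-in for a validation loss over two tuning parameters, so no XGBoost model is actually trained, and all names and default values are assumptions.

```python
import math
import random


def simulated_annealing(objective, start, step=0.5, t_start=1.0,
                        t_end=1e-3, cooling=0.95, moves_per_temp=20, seed=0):
    """Minimize `objective` using only function values, no derivatives."""
    rng = random.Random(seed)
    current = list(start)
    current_cost = objective(current)
    best, best_cost = list(current), current_cost

    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            # Propose a random neighbour of the current point.
            candidate = [x + rng.gauss(0.0, step) for x in current]
            cost = objective(candidate)
            delta = cost - current_cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, cost
                if cost < best_cost:
                    best, best_cost = list(candidate), cost
        temperature *= cooling
    return best, best_cost


def toy_validation_loss(params):
    """Hypothetical loss surface with its minimum at (3.0, -1.0)."""
    a, b = params
    return (a - 3.0) ** 2 + (b + 1.0) ** 2


start = [0.0, 0.0]
best, best_cost = simulated_annealing(toy_validation_loss, start)
print("best parameters found:", best)
print("loss at best parameters:", best_cost)
```

To tune a real model, the objective would instead train and validate the model for each candidate parameter set, which is why each evaluation is expensive and why search strategies other than an exhaustive grid are attractive.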
End Notes
The data scientist role is truly multi-faceted, isn’t it? A LOT of aspiring data scientists assume that they will primarily be building models all day long but that simply isn’t the case.
There are all sorts of tasks involved in a typical data science project which you’ll find yourself working on day-to-day. I quite like that because it opens up avenues to learn new concepts and apply them in the real world.
I’ll be posting some more career-related articles on Analytics Vidhya, so stay tuned and keep learning!
4 thoughts on "Top 5 Must-Read Answers – What does a Data Scientist do on a Daily Basis?"
Rutvij Bhutaiya says: June 28, 2019 at 6:04 pm
Hi Shubham, nice article collecting views from experienced people in the industry. It's true that most data science tasks involve data cleaning. Here are my views on the data cleaning part. After data collection is complete, I store the data in an Excel file. I love working in MS Excel, so here is what I do: I clean 50%-60% of the data in MS Excel and then load the file into R. In RStudio, I start on data cleaning again, mainly data normalization. Then I do EDA and chart analysis. If I see outliers (depending on the project objective), I revisit the data normalization task. Then come the following tasks, like modeling and prediction. Hope this helps!
Shubham Singh says: June 29, 2019 at 1:06 am
Thank you so much for sharing your views. This will surely help the community.
Jyoti Kulkarni says: July 02, 2019 at 11:38 am
Hi Rutvij, is that all a Data Scientist does? Data cleansing, outlier removal, and then data normalization? Then what is the difference between a data analyst and a data scientist? As a data scientist, why would one end up doing the data cleansing activities? Also, if you work on test data and implement the model on the rest of the data, what's the guarantee that the effort you have put in will work correctly?
Rutvij B says: August 22, 2019 at 10:58 pm
Hi Jyoti, apologies for the late reply. I believe there are no right or wrong answers. Most data scientists have their own style and process for building models. Domain knowledge and clarity on the objective are the two important things that make one data scientist better than another. If the dataset were perfect, any algo/stats expert could build the models, which is not the case. For example, say you are a data scientist at a telecom company working on a customer churn report, and your dataset contains 30 variables. A data analyst would clean the data, normalize it, etc. But a data scientist would choose and work on the best 10-15 variables, which he/she analyses for better output. And if you give the same dataset to another data scientist, he'll come up with a different 18-20 variables that he believes fit the output, based on his domain knowledge. Hope this clarifies your doubts; however, I am directly taking up your questions.