- What does a data scientist do on a day-to-day basis? A popular and must-know question
- We analyze this question from a data scientist’s perspective through the lens of 5 detailed and insightful answers from experienced data scientists
I’m a curious person by nature. Whenever I come across a concept I haven’t heard of before, I can’t wait to dig in and find out how it works. This has come in quite handy in my own data science journey.
But before I landed my first break in data science, I was always curious about what data scientists actually did every day. Was I supposed to simply build models all the time? Or was the oft-quoted saying about spending 70-80% of our time cleaning data actually true?
I’m sure you have asked (or at least wondered) about this too. The role of a data scientist might be the “sexiest job of the 21st century”, but what does that entail on a day-to-day basis?
I decided to research this. I wanted to expand my horizons and understand how data scientists look at their role in different domains (such as NLP). This helped me gain a broader understanding of our role and why we should always read different perspectives when it comes to data science.
So, here is a list of the top 5 answers to help you get a sense of what the typical routine of a data scientist is. Prepare to be surprised – building models isn’t the primary (and only) function in a data scientist’s day-to-day tasks!
I also encourage you to take part in a discussion on this question here. This will enrich your current understanding of what a data scientist does and your thoughts will foster a discussion among our community!
Note: I have taken the answers verbatim from Quora and added my thoughts right at the beginning of each answer. This will help you get a good perspective of what the answer covers without diluting the author’s thoughts. Enjoy!
Machine Learning is Very Process Oriented – Mike West
I like this answer because it’s crisp, to-the-point and simple. The author has even designed a flow diagram and explained his thought process in a wonderfully illustrated way. Here is his answer in full:
Machine learning engineers spend a ton of time in the first two pictures (or stages). The fun part is really in the third stage but it’s only a small part of what happens in the real world.
Some key things to keep in mind about data science in the real world:
- Almost all applied machine learning is supervised. That means we build models against structured datasets
- Data wrangling is a large part of what happens in the real world
- When you hear the word supervised, think classification and regression. Most of my models are classification problems
- Model building is about 20% of my work. Yep, that’s it!
- Many small and medium-sized companies don’t use deep learning at all. Why? Because structured data algorithms like XGBoost win every time
- Everything I do is programmatic
- Most real-world data resides in relational databases. It will be your job to craft queries to pull out the data you need
- Big Data is unstructured data. If you have to build your models against big data, then you’ll need to learn another set of skills
- The cloud is here to stay. I use BigQuery for my really large structured data. Most large models can’t be built on your laptop
- Computers are monolingual. They only speak numbers. When you pass data to your model, you are passing a highly structured, well cleansed numerical dataset
I really like the use of visualization by Vinita. The percentage-wise description of each data science task is helpful and insightful. Vinita has also leaned on her experience to explain the step-by-step work a data scientist does. It’s a must-read answer!
Contrary to popular belief, Data Science is not all glamour. The following survey results by CrowdFlower accurately sum up a typical day for a Data Scientist:
There is a lot of backtracking involved. Sometimes you even need to be able to predict what consequences removing/adding a variable might have.
- Collecting Datasets: Data is the lifeline of Data Science, so we spend plenty of time curating it. On rare occasions, some projects might already have plenty of data
- Cleaning & Organizing Data: This is the most time consuming and crucial step in the entire process. It has a great impact on the final results. Usually, after this step, the once large amount of data reduces and so we may need to collect more data for effective training
- Data Mining: It is the practice of examining large pre-existing databases in order to generate new information. Once data is organized and stored in databases, we can finally begin to derive value from it by finding patterns within the data
- Building Training Sets & Test Sets: Once we have a decent amount of data, we need to split it into the training set and the test set. A training set is a set of data used to discover potentially predictive relationships. It contains all the information about the expected output. A test set is a set of data used to assess the strength and utility of a predictive relationship. It contains mixed variables
- Refining Algorithms: We start with a skeletal algorithm. It is very basic and defines roughly what output is expected. After a few sessions, the accuracy, precision, etc. are recorded and the algorithm is refined to maximize its efficiency
Data Scientist Perspective from a Small-Sized Company – Justin Fister
This is a superb answer and one I can relate to. Note that machine learning, the most anticipated aspect of a data scientist’s job, only occupies 5% of the total time! Just like Vinita, he has also explained his tasks in terms of percentage. Here is Justin’s view:
- NLP-related tasks (15%). It’s no surprise that PaperRater’s automated proofreading technology requires heavy use of parsers, taggers, regular expressions, and other NLP goodies as part of the core algorithms and feedback modules
- Machine learning (5%). This tends to be the most enjoyable part. Data cleaning, feature extraction/engineering/selection, and model building
- Reporting and analytics (10%). Running queries, reviewing analytics, and assisting with strategic decision making
- Data management (5%). Setting up and managing database servers including MySQL, Redis, and MongoDB. Larger projects may require Hadoop or Spark
- General software development (40%). Many data scientists’ have a background in computer science, so expect to pitch in if you have an applicable background. API integration, web-development, and wherever else I can add value. Even at an AI startup, most of the development is not going to involve AI
- Other (25%). This includes a wide variety of tasks, including blog posts, marketing, management, technical documentation, technical support, website copy, emails, meetings, etc.
The “Data Scientist” is a bit of a Myth – Tim Kiely
The author, Tim Kiely, uses a Venn diagram to explain what data science is. Just take a look at this Venn diagram below – it will blow your mind. Tim additionally talks about what data scientists are supposed to be by taking a somewhat contradictory view of the general definition. Here is Tim’s answer:
The “Data Scientist” is a bit of a myth, in my opinion. Not to say they aren’t out there but they are far rarer than is popularly understood and are more of the exception than the rule.
I liken it to the “Web Master” title of the dot-com bubble – these supposed people who could do full stack programming, front end development, marketing, everything. All of those roles/skills were always specialized and remain so today.
“Data Scientists” are supposed to be database architects, understand distributed computing, have a deep understanding of statistics AND some area of business or field expertise. That’s asking a lot when any one of those skill sets can take a career to build.
The Data Scientists I’ve worked with typically have a Ph.D. in A.I. or Machine learning and are effective communicators, which gives them the ability to direct the analysts, DevOps people, programmers and DBA’s at their disposal to solve problems with data-driven solutions. They outline the desired solution and leave it to their teams to fill in the gaps.
Machine Learning Engineer Working on NLP Tasks – Evan Pete Walsh
Let’s drill down into a particular specialization of machine learning. One of my favorites – Natural Language Processing (NLP)! I wanted to bring out a machine learning engineer’s view here (a role every data scientist should become familiar with). Check out Evan’s full response:
Currently working on NLP, for the most part, including intent classification and entity extraction. Here’s a typical day for me:
- Get to work, pull up GitHub and check on the ZenHub board (kind of like Jira, except way cooler). I had some models that were training last night on our servers and I should have gotten an email that they finished. I did!
- I’ll probably spend a few minutes testing those new models and then tweak some parameters, then restart the training process
- The rest of the day I’m usually head-down coding, either working on a back-end Python application that will supply the AI for one of our products, or implementing a new algorithm that I want to try out
- For example, recently I read a paper on coupled simulated annealing (CSA), and I wanted to try it out on tuning the parameters for XGBoost as an alternative to a grid search. CSA is a generalized form of simulated annealing (SA), which is an algorithm for optimizing a function that doesn’t use any information on the derivative of the function
- Unfortunately, I couldn’t find an implementation in Python, so I decided to write my own. Two days later, I had submitted my first package to PyPI!
The data scientist role is truly multi-faceted, isn’t it? A LOT of aspiring data scientists assume that they will primarily be building models all day long but that simply isn’t the case.
There are all sorts of tasks involved in a typical data science project which you’ll find yourself working on day-to-day. I quite like that because it opens up avenues to learn new concepts and apply them in the real world.
I’ll be posting some more career-related articles on Analytics Vidhya, so stay tuned and keep learning!