It’s Data Science Myth-Busting Time!
Transitions into data science are tough, even scary! And it is not because you need to learn maths, statistics, and programming. You need to do that, but you also need to battle out the myths you hear from people around you and find your own path through them!
I had my own share of these “popular” myths and they made my life difficult. Here are the few I had heard:
“You need a Ph.D to have a chance of becoming a data scientist. Two is even better!”
“Participate in data science competitions, that will tell you how the industry works.”
“You need tons of computational resources to build deep learning models. You can only get that at the top tech firms.”
I had promised myself that once I have seen them through – I’ll help others by debunking these myths. And there are multiple myths floating around that add a false aura around data science roles. Don’t fall for them!
These myths often make you feel like only geniuses can work in data science. This is just not true. Whether you’re a recent graduate, an experienced professional, or a leader, it’s important to understand how data science works and you will find your place in the industry.
This article is my revenge on those myths!
I have also done some clustering 🙂 and presented the myths in three types – Career-related myths, Tools and Framework related myths, and Data Science role-related myths. Read about them and make sure you don’t fall for them!
If you’re looking for a role in data science and are struggling to break through, make sure you check out this awesome Ascend Pro program! It’s a perfect blend of learning from experienced instructors and practical hands-on projects – an unmissable opportunity.
You can also check out my article on the 13 common mistakes amateur data scientists make. It’s full of resources so you don’t want to miss that!
Data Science Career-Related Myths
1. Ph.D is Mandatory to Become a Data Scientist
Holding a Ph.D. degree is an amazing achievement. It requires years of hard work and dedication. I have utmost respect for people who are willing to put in that amount of effort.
But is it compulsory to do a Ph.D in order to become a data scientist? This is a heavily role dependent question. There are several layers to peel off here so let’s get down to it.
Breaking down the myth
To understand this, let’s broadly divide the role of a data scientist into two categories:
- Applied Data Science Role
- Research Role
It’s important to understand the distinction between these two roles. The Applied Data Science is primarily about working with existing algorithms and understanding how they work. In other words, it’s all about applying these techniques in your project. You DO NOT need a Ph.D for this role.
Most folks fit into the above category. Most of the openings and job descriptions you see or hear about are for these roles.
But what if you are more interested in a research role? Then yes, you might need a Ph.D. Creating new algorithms from scratch, researching them, writing scientific papers, etc. – these fit a Ph.D candidate’s mindset. It also helps if the Ph.D adds to the domain you want to work in. For example, a Ph.D in linguistics will be immensely helpful for a career in NLP.
Another misunderstood aspect of a Ph.D is the opportunity cost. It’s a massive commitment from your side – both mentally and financially. Rachel Thomas wrote about this question here and I suggest taking a look. It’s a well-balanced point of view from a leading researcher in this field.
As Rachel mentions in her post, there are tons of data science pioneers who don’t hold a Ph.D:
- Jeremy Howard, co-founder of fast.ai
- Mariya Yao, author of the popular ‘Applied Artificial Intelligence’ book
- Devaki Raj, co-founder of CrowdAI
So which role do you see yourself in? That’s a critical question to answer before you jump into data science.
2. A Full-Time Data Science Degree is a Must for Making the Transition
Much like the Ph.D dilemma, this is another myth I’ve seen aspiring data science professionals fall for. With the amount of interest data science has generated in the last few years, how would you differentiate yourself from the competition? Splashing out money to get a degree seems like a good starting point. It’s an understandable reaction.
Here’s the good news – this is a myth perpetuated predominantly by institutes.
Breaking down the myth
In a vast and complex field like data science, practical experience is king. There are numerous projects you can pick up and work on right now. Or find a problem you are passionate about solving and see if data science techniques can be applied there.
There are plenty of resources available online for learning. Books, MOOCs, blogs, videos, and so on. Start from Analytics Vidhya’s learning path. It’s completely free, comprehensive in nature (contains all the aforementioned resources), and provides a structure around your learning – an invaluable feature.
Because of the lack of formal education in this field, transitioning into data science boils down to sheer hard work, discipline and practical experience. Those are the differentiating factors a recruiting manager will look at.
Greg Brockman, the co-founder and CTO of OpenAI, doesn’t even hold a college degree!
3. All your previous Work Experience will Translate to the Data Science Domain
You have a solid 5-10 years of experience in *some* industry. You are a well-respected professional who’s calling the cards. But you’ve recently become enamoured with data science and all it can do for your business and career. You can’t wait to bring all that experience to your new field.
Sounds familiar? Good. But if you are of the thought that your entire experience will translate to your new role, I suggest you re-think.
Breaking down the myth
There are two sides to this story:
- You are changing your domain entirely to get into data science
- You are sticking to your previous domain, but are looking for a data science role
Let’s understand the implications of each of these points.
Changing your Domain Entirely
If you are changing your domain entirely (for example, coming into data science after years of software testing), your work experience will most likely count for nothing. Not only are you switching to an entirely new line of work, you are also looking for a new role. When the recruiter looks at your resume, the first thought is – “what value will he/she add to the organization/project?”. Unfortunately, the answer in this case is usually close to zero.
Why? Because as a newcomer, you don’t have any experience of how that domain works. When you’re given real-world data, how will you work with it if you don’t understand how those features impact the final decision?
This is a reality most people skim over or prefer not to face. It’s an entirely wrong way to go about your career switch and will only end up harming your prospects. Understand the situation, talk to people who have made this switch, and align your expectations accordingly. Going in blind into such a big decision is a one-way ticket to failure.
Staying in the same Domain
Coming to scenario #2 – what should you expect if you’re staying in the same domain but switching to data science? Then your prospects look a lot sunnier. You have the added advantage of knowing the industry. You should already be aware of the nuances present in the domain so you will understand the data you’re working with. That is a MASSIVE benefit. The recruiting manager will factor that into the final decision.
I highly recommend the second scenario if at all possible. Stay in the same field you have always worked with and understand how you can apply data science there.
This article by Kunal Jain contains plenty of practical tips and tricks to help you overcome a lack of experience in this field.
4. Necessary to have a Computer Science/Mathematics/Statistics/Programming Background
This is essentially a continuation of the first two myths we covered. Most of the folks you’ll come across in data science will have an engineering/computer science background. They’ll have experience in at least have one, if not more, of the below fields:
- Computer Science
Does this mean someone coming from a completely different background can’t get into data science?
Breaking down the myth
Absolutely not! Trust me, I speak from experience. I spent 5 years working in a learning and development role before transitioning into data science. It can be done.
However, there are certain things you’ll need to consider that people coming from this background already have. Data science is a nuanced field comprising of several aspects. As a beginner, you will be required to learn concepts from scratch. It will often be a frustrating experience. Your technical colleagues might know more. They might be ahead of you initially at every turn.
And that’s where the dedication and discipline characteristics we spoke about earlier come into play.
For example, folks with a computer science background will already have a handle on how programming works. They can switch between languages relatively easily. I, on the other hand, struggled initially with learning R. I didn’t know left from right when it came to coding. But I persisted and eventually broke through.
There is nothing to suggest you can’t do that too. Once you have decided this is the field for you, you should pull out all the stops. Still sceptical? Then check out these inspiring stories from folks with a non-technical background who successfully made the transition:
- Deepak Vadithala – From a Paper Delivery Boy to a Lead Data Engineer & QlikView Luminary
- Step-by-Step Process of How I Became a Machine Learning Expert in 10 Months
- Marios Michailidis’ Inspiring Story of a Non-Programmer to No. 1 on Kaggle
Data Science Tools and Frameworks-Related Myths
5. Learning a Tool is Enough to Become a Data Scientist
Python or R – which tool should you learn? If I got a penny each time I came across this question..
There is a widely held belief that mastering data science is about learning how to apply techniques in Python or R. Or any other tool. That tool has become the central point around which all other data science functions revolve.
The assumption (or myth) is that being able to write code using existing libraries (numpy, scikit-learn, caret, etc.) should be enough to label yourself an expert. Why aren’t the job offers rolling in yet?
That one really drives recruiting managers right up the wall.
Breaking down the myth
Data science requires a combination of multiple skills. Programming is not at the centre of the data science spectrum – it is just one part of a whole. Let’s divide the spectrum of skills into two parts:
- Technical qualities
- Non-technical qualities or soft skills
Understanding how a certain technique works will help you become a better data scientist. This is why we encourage everyone to learn algorithms from scratch. Learn how changing a certain parameter will impact the final model. This will eventually pay off when you’re working on a large-scale project in the industry.
The margin for error and experimentation is slim where stakeholders come into the picture. We have plenty of articles on our blog explaining machine learning and deep learning techniques from the ground up. Go through them and try to understand and replicate the code yourself.
It will be an invaluable addition to your skillset.
Soft skills often get overlooked by aspiring data scientists. They certainly aren’t taught in any online courses or offline classrooms. And yet these are qualities interviewers look for.
- Problem-solving skills
- Structured thinking
- Communication skills
How can you pick up these skills?! By following a disciplined approach and practising every chance you get. Here are a few resources for your perusal:
- The Art of Structured Thinking and Analysis
- Tools for Improving Structured Thinking
- Must for Data Scientists & Analysts: Brain Training for Analytical Thinking
- 20 Challenging Job Interview Puzzles
- Tips to Train your Mind for Analytical Thinking
6. Deep Learning Requires Computational Power that Only Top Companies Have
Deep learning seems to propagate more myths than any other branch of data science I have come across, including:
- DL is another way of saying machine learning
- You need a research background to start out in deep learning
- Deep learning doesn’t have many practical applications
But the most common myth I’ve heard – you need a considerable amount of hardware to perform deep learning tasks. When I first heard about deep learning, I pictured a room full of IBM supercomputers being operated by dozens of data scientists.
Breaking down the myth
Don’t get me wrong, a deep learning model will always perform more efficiently when it has a powerful hardware setup to run on. But you don’t need a supercomputer to work with deep learning. It just might take longer than expected to train the model on your machine.
What if the data you’re working with is huge? Running it locally might not work. Google, as always, has the answer to that. Google Colab is a free cloud service that doubles as a coding notebook. But here’s the best part – Colab comes with FREE GPU support!
That’s right, you can leverage the power of a GPU for free and run your deep learning model there. Colab runs through your web browser, so there’s no computational cost to your machine. What else could a deep learning enthusiast ask for?
Here are a couple of resources to acquaint you with aspects of deep learning most people won’t teach:
- 12 Frequently Asked Questions on Deep Learning
- Why are GPUs Necessary for Training Deep Learning Models?
7. Once Built, AI Systems will Continue to Evolve and Generalize by Themselves
Hollywood has done its best to showcase AI systems in the form of robots who possess human-level intelligence. Movies like Blade Runner and Terminator became cult classics for this very reason. The consensus they portray is that once an AI system has been built, it will continue to work and evolve by itself.
So once you build a model for say, fraud detection, the model will adapt to any changes thrown at it. If the entire financial landscape changed, or new features were added to the data, it’s expected that the system will continue to function equally well.
Breaking down the myth
This state, that AI systems evolve by themselves, is called Artificial General Intelligence (AGI). Unfortunately, we are not at that state yet. Not even close.
We are very much in the narrow stage right now. The models we build cannot generalize to other tasks, or even to big changes in the data. Let’s take the example of last year’s GDPR policy. Would an existing system adapt to the changes without manual human intervention? Are the systems we build smart enough to incorporate the ethics aspect? Can an AI system, built to recommend products to the customer, integrate a new product without any prior information about it?
That’s why labelling images in an object detection problem is such a crucial task. That level of intelligence isn’t available to machines yet.
Yes, DeepMind and other similar high-level research organizations have definitely made progress in this regard. You might have come across the news about neural networks creating their own neural networks. But these developments are way too few and far between. And we haven’t figured out how to get them into commercial applications.
I strongly recommend watching the below video by Michael B. Jordan regarding the state of AI currently and the challenges we face. It’s a great reference point if you can bring it up in an interview situation. 🙂
Data Science Role-Related Myths
8. Universe of AI Jobs
What comes to your mind when you think about the perfect artificial intelligence team? The most common answer I have heard is to get as many top data scientists as possible. Who wouldn’t like a bunch of top talent working on the same project?
But that just isn’t a practical solution for most of us. One “unicorn” data scientist is difficult to source, let alone tons of them. And will your team of data scientists be aware of the non-AI parts of a project? Do they understand how hardware works?
Simply put – an artificial intelligence project has a universe of jobs attached to it. It isn’t only limited to the role of a data scientist.
Breaking down the myth
Applied artificial intelligence is a complex field. It requires working with different disciplines across the length and breadth of the project. A plethora of interdisciplinary roles exist:
- Data Engineer
- Data Analyst
- AI/ML Engineer
- Data Scientist
- Business Analyst
- Domain expert (for example, a self-driving car project will have mechanical engineers and car hardware experts)
- IoT Specialist
- Data Science Manager/Decision-Maker
- Decision Scientist
- Software Engineer
- UX Designer
- Project Manager
Note that the roles and number of folks staffed will vary depending on the project. The idea I’m trying to get across is that AI isn’t a cut and dry field. It’s not a straightforward path. If someone tries to sell you on a project that is staffed with just data scientists, it might be time to sound the alarm bells.
This is especially relevant for people operating in a senior role (team leaders, managers, CxOs, etc.). It’s VERY important to understand each role in order to create a successful project.
I recommend taking the below course to fully grasp how an AI project works. This includes how to hire the perfect AI team, and other intricate details every AI leader (or even enthusiast) should know:
You can also read about the jobs that might be impacted as AI continues to grow.
9. Data Science is Only About Building Predictive Models
Being able to predict an event is a powerful thing. And that’s what stands out to newcomers in data science. Building models that can predict what a customer will buy next sounds like a must-have skill, right?
In fact, when I describe data science or machine learning to a non-technical person, their first reaction is quite similar. The hype around this field is unprecedented. Apparently, a data scientist is only building predictive models all day at work.
Is that not what DJ Patil meant when he described the role of a data scientist as the “sexiest job of the 21st century”? Well, not quite.
Breaking down the myth
There are multiple layers in a data science project. The model building part is just a speck in the overall data science lifecycle (I’ll cover the different roles in data science in the next section). To give you a general idea, the steps involved in a typical data science lifecycle are:
- Understanding the problem statement
- Hypothesis building
- Data collection
- Verifying the data
- Data cleaning
- Exploratory analysis
- Designing the model
- Testing/Verifying the model
- If an error is found, head back to the verification or cleaning stage
- Putting it into production (deploying the model)
Nothing is as straightforward as they teach you in a classroom or a course. Experience is the best way to learn how a project works. Try talking to someone who has seen the end-to-end process. Even better, get an internship and get a first-hand account of what makes a data science project tick.
Additionally, data science isn’t limited to simply making predictions. I’m sure you’ve come across the market-basket analysis concept. It’s a combination of clustering techniques and association rules. Or how about anomaly detection? The ability to figure out outliers in the data. There is so much to learn!
10. Participating in Data Science Competitions Translates to Real-Life Projects
Data science competitions are an excellent stepping stone in your data science journey. You get to practice your skills on a dataset, showcase it to the world, and even stand a chance to win prizes.
These hackathons and competitions have increased multi-fold in the last 4-5 years as more and more people want a piece of the data science cake. Most aspiring data science professionals include these competitions on their resume.
The problem from an interview standpoint? Recruiters have started paying less and less attention to this aspect of your portfolio.
Breaking down the myth
There are plenty of reasons why recruiters don’t consider your competition experience. I’ll laser that down to one:
Real-world projects are an entirely different beast compared to what you see in competitions.
Data science competitions have clean and almost spotless datasets. If there are missing values, you can impute them using a plethora of techniques. What matters is the accuracy of your model, not the way you got there.
Real-world projects have end-to-end pipelines which involve working with a bunch of people. Most of us will always have to work with messy and untidy data. The old saying about spending 70-80% of your time just collecting and cleaning data is true. Tasks like data cleaning and feature engineering will take up the majority of your time.
This LinkedIn post is an excellent read on the standard methodology one can use for analytical models. You can also refer to the section above where we spoke about the different stages involved in a typical data science project.
Additionally, we can’t just build a stacked complex ensemble model. Clients demand transparency so the simpler model usually wins out. Interpretability is key in a corporate setting. The project is accountable for the model behaving poorly.
As I mentioned in this article, getting a good score on a competition leaderboard is excellent for measuring your learning progress, but interviewers will want to know how you can optimize your algorithm for impact, not for the sake of increasing accuracy. Talk to data science experts, try to understand how these projects work, build your network in the domain of your choice, and try to structure your thoughts to align accordingly.
11. Data Collection is a Breeze, the Focus should be on Building Models
We’ll wrap up this article with another myth around building models. This is a conversation I had with a fresher data scientist recently:
Pranav Dar: What’s your favorite part of a data science role apart from designing models?
Fresher DS: I like the feature engineering part.
PD: Sounds fair. How do you usually collect data for your projects?
Fresher DS: Um, I usually just download it from one of the open-source platforms.
PD: OK..but what if the data is skewed or biased? How do you verify the identity of the data? And what will you do when you’re asked to collect data from multiple sources that require database skills?
Fresher DS: I hadn’t thought about that..
That, unfortunately, is a conversation I have on a far too regular basis. Most experienced data science professionals are well aware of this situation as well. Expect to be tested on this subject thoroughly in an interview.
Breaking down the myth
Data is being generated at an unprecedented pace but collecting and cleaning it isn’t getting any easier. Without building a pipeline to collect the data, your data science project is going nowhere. Typically, this is the role of a data engineer (but data scientists are expected to know this function as well).
I cannot overstate the importance of the data collection step. Collecting honest and accurate data is imperative to your final model working well. As Wikipedia puts it, “The goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the questions that have been posed”.
There are too many sources of data available. How do you connect to each? What data format do you receive from each? What’s the cost of data collection from each of these sources? This is a microcosm of the kind of questions you’ll need to ask in a real-world setting.
Roles like database manager, database architect and data engineer have taken on a new level of importance. Maintaining the integrity of the data and the aforementioned pipeline is as important as any other task that succeeds it.
I had initially planned to make this a short piece. But the importance of busting these misconceptions took precedence over everything else. There are a lot more myths following data science like a shadow.
I would love to hear your thoughts on this article and if there are other myths that you have come across (or believed yourself). Let’s empower the new wave of data science professionals with resources to overcome the mistakes we have made!You can also read this article on our Mobile APP