Starting your First Data Science Project? Here are 10 Things You Must Absolutely Know
Can you imagine navigating through a city without Google Maps? It feels like an alien concept! We have no sense of direction and all paths seem to lead away from where we want to go.
That’s often what the first data science project feels like. I can personally attest to this and I know most data science enthusiasts are caught like a deer in the headlights when they’ve based their learning entirely on just online courses.
Building a machine learning model in Python is great – but doing that in the industry is a different kettle of fish altogether. If you feel that learning Python and the basics of machine learning is going to land you your first data science project or make you a data science rockstar, you’ll be in for a shock.
For me, this reality hit home when I joined an organization as a data scientist. Building a machine learning model was not enough anymore (not even close). There were tons of other things, such as data collection, cleaning, exploration, and a lot more tough work, which I had earlier ignored.
A few things I realized quickly – problem-solving skills, creativity, a structured thinking approach, and good storytelling skills will be more helpful than just applying a novel algorithm. Trust me, don’t take this lightly!
In this article, I will be sharing 10 key points that I wish I had known when I started my data science career. I hope this will help you out in your own data science journey.
There is a big difference between the data science we learn in courses and self-practice and the data science we practice in the industry. I’d recommend going through these free courses to build an understanding of analytics, machine learning, and artificial intelligence:
- Introduction to AI/ML Free Course | Mobile app
- Introduction to AI/ML for Business Leaders | Mobile app
- Introduction to Business Analytics Free Course | Mobile app
1. Hypothesis Generation is More Important Than You Think
Oh boy – if I could shout this from the rooftops, I would scream at the top of my lungs. Hypothesis generation is such a crucial step in a data science project. And yet almost all data science newcomers are ill-prepared for it.
The almighty question at the beginning of any data science project should be – what is the hypothesis behind your analysis?
Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.
If you go with a non-hypothesis-driven approach, you’ll be bound to look at hundreds or even thousands of variables without any prior knowledge of which ones matter. This is an extremely hard task for an analyst, right?
A hypothesis-driven approach is much more productive. You’ll first form a hypothesis or an assumption and then accordingly note down the potential variables you’ll need for the analysis. These variables may or may not be available. After this activity, you’ll finally go through the data and select the required variables. If the variable is not available, then you can opt for feature engineering or finding new ways to collect the data.
This hypothesis is the base of your whole project, so don’t hesitate to put in the time and effort, and to ask your team members for help. In the industry, you’ll be working with several teams to come up with these hypotheses.
For example, let’s say you are part of the data science team that is working on a fraud detection model at an insurance organization. Here, you’ll be working with the operations team, the leadership team, your supervisors, and perhaps even the sales agents. Your team will have to work with ALL these departments to come up with the hypotheses and figure out which variables you have (or can collect) to validate all these hypotheses.
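One lightweight way to keep track of this exercise is to write each hypothesis next to the variables it needs and check them against what the data actually contains. The sketch below is only illustrative: the hypotheses, the variable names, and the set of available columns are all hypothetical and are not taken from any real insurance dataset.

```python
# Hypothetical hypotheses for an insurance fraud model, each mapped to the
# variables needed to test it. None of these names come from a real dataset.
hypotheses = {
    "Claims filed soon after a policy starts are riskier": [
        "policy_start_date", "claim_date"],
    "Agents with many rejected claims submit riskier claims": [
        "agent_id", "claim_status"],
    "Unusually large claims for the policy type are riskier": [
        "claim_amount", "policy_type"],
}

# Columns assumed to exist in the data we already have (also hypothetical).
available_columns = {"policy_start_date", "claim_date", "claim_amount", "agent_id"}


def missing_variables(hypotheses, available):
    """Return, per hypothesis, the required variables that are not available."""
    gaps = {}
    for statement, needed in hypotheses.items():
        missing = [name for name in needed if name not in available]
        if missing:
            gaps[statement] = missing
    return gaps


gaps = missing_variables(hypotheses, available_columns)
for statement, missing in gaps.items():
    print(f"{statement}: need to engineer or collect {missing}")
```

Every hypothesis that shows up in `gaps` is a candidate for feature engineering or for a conversation with the team that owns the missing data.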
I found this great discussion on hypothesis generation – you can read more about it here.
2. Knowledge of Data Science Tools is Good; the Ability to Break Down Business Problems is Priceless
Data Science tools will come and go but the basics will stick forever.
There is an endless number of tools out there to build your data science project. Tools like SPSS and SAS had their golden time and now R and Python have taken the limelight. Julia is now being talked about as a challenger to both. The competition never ends.
Learning the tool takes the least time but learning about the domain and business problems can take years of experience. The knowledge of the domain will help you in hypothesis generation, data analysis, feature engineering and finally conveying the results as a great story to the stakeholders.
Let’s say you joined an e-commerce company as a data scientist. You are part of the team tasked with building a recommendation engine for their retail products. If you have no idea how the business works, what the different variables at play are, etc., how in the world will you proceed?
You need to work on understanding the business, what the different aspects of the business are, what exactly the problem is, and then break that down into a DATA PROBLEM. Your structured thinking skills will help you out massively here.
3. Be Prepared To Do a LOT of Data Cleaning
Data Cleaning is the task that can “make or break” your whole analysis.
“Data” is the crux of all problem solving and analysis. If you feed dirty data into your model, it’s pretty obvious that it will spit out useless results. Therefore, you should not shy away from spending time making your data rich in value.
While starting out, we usually practice on simple datasets that are publicly available, but these are as far away from real-world data as you can imagine. The industry isn’t a hackathon setting where you’ll get mostly clean data with well-defined outcomes. You would need to do all of this as a team (or yourself) – including spending A LOT of time on data cleaning.
The most common data cleaning activities include missing value imputation, outlier treatment, encoding categorical features, etc. These may sound rudimentary to you but these can literally make or break your data science project.
The real-world data may contain errors that are unique to the dataset which you may have to fish out using manual rules. An efficient data scientist never misses out on data eyeballing. 🙂
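To make the common cleaning activities mentioned above concrete, here is a minimal sketch in plain Python, so that each step stays visible. The records, the income ceiling, and the column names are invented for illustration; in practice you would more likely do the same with a library such as pandas, and the right imputation or capping rule depends on your data.

```python
from statistics import median

# Toy records invented for illustration; None marks a missing value.
rows = [
    {"age": 25, "income": 40_000, "city": "Pune"},
    {"age": None, "income": 52_000, "city": "Delhi"},
    {"age": 31, "income": 1_000_000, "city": "Pune"},
    {"age": 40, "income": 48_000, "city": "Mumbai"},
]

# 1. Missing value imputation: fill missing ages with the median observed age.
known_ages = [r["age"] for r in rows if r["age"] is not None]
age_median = median(known_ages)
for r in rows:
    if r["age"] is None:
        r["age"] = age_median

# 2. Outlier treatment: cap income at a ceiling. The ceiling here is an
#    arbitrary assumption; in practice it would come from domain knowledge
#    or from a percentile of the data.
INCOME_CAP = 100_000
for r in rows:
    r["income"] = min(r["income"], INCOME_CAP)

# 3. Encoding categorical features: one-hot encode the city column.
cities = sorted({r["city"] for r in rows})
for r in rows:
    city = r.pop("city")
    for c in cities:
        r[f"city_{c}"] = 1 if c == city else 0

for r in rows:
    print(r)
```

Printing the rows at the end is a small version of the data eyeballing mentioned above: it is often the quickest way to spot a rule that did something you did not intend.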
4. Fail to Explore; Prepare to Fail
Data Exploration is the most underrated step in data science.
The most crucial step that beginners miss out on is simply data exploration. It is fundamental to the process of data analysis and it can help you gain crucial insights at the beginning of your data science project.
Data Exploration is usually the first step in any kind of data analysis. This activity helps you understand the dataset at a broader level. It helps in uncovering patterns and characteristics that are usually hidden in plain sight.
A good data exploration exercise will bring out information about the variables as well as their relationships and their effect on our results. I personally find this step to be very enjoyable as you get to be the detective here and it includes a lot of visualization too!
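As a rough illustration of what a first pass at exploration can look like, the sketch below summarizes two variables and measures how strongly they move together. The numbers are made up, and the helper names `summarize` and `pearson` are my own; a real exploration would add plots and cover many more variables.

```python
from statistics import mean, median, stdev

# Toy data invented for illustration: advertising spend and resulting sales.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 58, 79, 102]


def summarize(name, values):
    """Print basic descriptive statistics for one numeric variable."""
    print(f"{name}: min={min(values)}, max={max(values)}, "
          f"mean={mean(values):.1f}, median={median(values)}, "
          f"stdev={stdev(values):.1f}")


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summarize("ad_spend", ad_spend)
summarize("sales", sales)
r = pearson(ad_spend, sales)
print(f"correlation(ad_spend, sales) = {r:.3f}")
```

A correlation close to 1 here only says the two variables rise together in this toy data; it is a prompt for further detective work, not a conclusion.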
5. Model Deployment is the Key – Learn Software Engineering
If you don’t like coding, I have some bad news for you. And yes, there is no getting away from learning programming if you want to be successful in data science.
You have made a data model successfully. Now what?
Let’s take a moment to ponder the above question. After tons of hard work, you have finally created a model with high accuracy in your Jupyter notebook. What’s the next step? Will you just send the Jupyter notebook to your clients? What are the additional things you need to take care of?
This is a crucial roadblock that every data scientist hits in his or her first industry project, because as a beginner you never need to deploy your model. So what should you do?
It is important that you learn some basic software engineering and computer science skills. Learn everything you can about version control, how to write neat and tidy code, how to use GitHub, etc. All of this ties into your data science skillset.
Learning Flask and Django can be a great starting point. Here’s a great project to get started.
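Before reaching for a web framework, it helps to pull the model out of the notebook and into a plain function with a clear input and output contract. The sketch below shows one possible shape for that, using only the standard library. The `MODEL` coefficients, the feature names, and the `handle_request` helper are all hypothetical stand-ins; a real project would load a trained model from a file and expose this function through a route in Flask, Django, or a similar framework.

```python
import json

# Hypothetical coefficients standing in for a trained model. In a real
# project these would be loaded from a saved model file, not hard-coded.
MODEL = {"intercept": 2.0, "weights": {"age": 0.5, "income": 0.001}}


def predict(features):
    """Score one record; raise ValueError if a required feature is missing."""
    missing = [name for name in MODEL["weights"] if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    score = MODEL["intercept"]
    for name, weight in MODEL["weights"].items():
        score += weight * float(features[name])
    return score


def handle_request(body):
    """Take a JSON request body and return a JSON response string.

    A web framework route could be a thin wrapper around this function.
    """
    try:
        features = json.loads(body)
        return json.dumps({"prediction": predict(features)})
    except (ValueError, TypeError) as exc:
        # Covers malformed JSON, missing features, and non-numeric values.
        return json.dumps({"error": str(exc)})


print(handle_request('{"age": 30, "income": 50000}'))
print(handle_request('{"age": 30}'))
```

Keeping the scoring logic separate from the web layer also makes it easy to put under version control and to test without starting a server.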
6. A Data Scientist isn’t a Magic Bullet – Learn About Other Data-Based Fields
Data Science was termed the sexiest job of the 21st century and since then we have all been chasing it. But here’s the caveat – becoming a data scientist isn’t the be-all and end-all of your data science journey. It is essential that we explore other data-based roles as well.
A data science project covers a whole host of data-related roles, such as a data engineer, machine learning engineer, deep learning engineer, business analyst, data analyst, etc. The list goes on. A Data Scientist doesn’t build architecture for a big data system – a data engineer does. A data scientist doesn’t typically answer business-related questions – a business analyst does.
Note that these roles interchange and intertwine a lot depending on your project and your organization.
So before starting a data-based project, you can choose what you want to become. If you want to know more about the differences between different roles, you should definitely check out this article.
7. Believe Me, You Need a Benchmark Model
During my first regression project as a data scientist, I built a model using all the knowledge that I had learned. But I felt that the error was coming out high and the R-squared very low. After getting frustrated, I took this problem to my manager. He said – “How do you know the error is high? What is your benchmark score?”
A benchmark model is your basic run-of-the-mill model that gives you a reference score to beat. You don’t even need to know machine learning to build one. A benchmark for regression can be made by always predicting the simple mean, and a benchmark for classification can be made by always predicting the mode (though I encourage you to not do that in the industry!).
Let me give you an example from my previous data science project. We were working on a marketing analytics problem and while the data science team was busy trying to decipher which model to try out, my project manager fired up KNIME, built a simple regression model, and came up with a benchmark score. It took him 45 minutes to do this.
It can really be that simple but it’s such an effective way to put a benchmark in place and work from there.
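The mean and mode benchmarks described above take only a few lines. Here is a minimal sketch on made-up targets; the variable names and the choice of RMSE and accuracy as scores are my own assumptions, so substitute whatever metric your project is actually judged on.

```python
from collections import Counter
from statistics import mean

# Toy targets invented for illustration.
y_train_reg = [10.0, 12.0, 14.0, 16.0]
y_test_reg = [11.0, 15.0]

y_train_clf = ["no", "no", "yes", "no"]
y_test_clf = ["no", "yes", "no"]

# Regression benchmark: always predict the training mean.
mean_prediction = mean(y_train_reg)
rmse = (sum((y - mean_prediction) ** 2 for y in y_test_reg) / len(y_test_reg)) ** 0.5

# Classification benchmark: always predict the most frequent training class.
mode_prediction = Counter(y_train_clf).most_common(1)[0][0]
accuracy = sum(y == mode_prediction for y in y_test_clf) / len(y_test_clf)

print(f"mean baseline predicts {mean_prediction}, RMSE = {rmse:.2f}")
print(f"mode baseline predicts {mode_prediction!r}, accuracy = {accuracy:.2f}")
```

Any model you build afterwards has to beat these two numbers before its error can meaningfully be called low or high.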
8. Always Stay in Touch with Your Roots (Linear Regression May Help You More Than Advanced Neural Networks)
Have you seen anyone using an ax for slicing butter? Metaphorically, that’s what a lot of beginners do when starting out on their machine learning journey. You may be surprised, but a simple linear regression can sometimes give you a model that is more accurate and requires less computational power than a far more complex one.
It is always important that you understand the problem statement and the type of data you are dealing with, and that you ask yourself – What do I want to accomplish with the project? Do you want your model to deliver higher accuracy, or do you want a simple model that will help you with variable attribution?
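To show why a simple model helps with variable attribution, here is a minimal sketch of a one-variable linear regression fitted with the closed-form least squares formula. The data is invented so that the true relationship is known, and `fit_line` is just an illustrative helper name.

```python
from statistics import mean

# Toy data invented for illustration: y is exactly 3 + 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 7.0, 9.0, 11.0, 13.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    return intercept, slope


intercept, slope = fit_line(x, y)
# The slope reads directly as "one extra unit of x changes y by this much".
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

That direct reading of the slope is exactly what gets harder to obtain once the model becomes a deep neural network.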
Remember, most organizations that have a data science division likely won’t have the computational power to support complex models. The likes of Google and Facebook have skewed our perception of data science by pumping in money to build complex multi-layered deep neural networks – don’t fall for that trap.
9. No Data Science Project Can Succeed Without the Proper Infrastructure in Place
Like most industry projects, a data science project depends on a lot of external factors. In an organization, you must make sure that these factors support your needs for a successful project.
For example, imagine a traditional logistics company that plans to build a route optimization application for its transporters but doesn’t even have any architecture for tracking its fleet. This is one of the primary reasons why ~85% of data science projects end up failing. That’s a HUGE number and it’s because decision-makers don’t really understand how important the core infrastructure is before they splurge money on building a team.
Before starting out, executives and leaders can save a lot of time and effort by making sure that everything is in place when the team requires it.
10. Get Buy-In from Stakeholders Before You Launch a New Data Science Project
A project must have a clearly defined problem statement. It should list the expected results, and these should be the same for all the stakeholders. Without proper communication, the stakeholders and the data science team may end up with different expectations, which can send your project haywire.
Let me hearken back to my previous project as an example. Our data science team was told to “use data science to increase revenue by 25% without increasing costs more than 10%”. That is an incredibly vague problem statement! We had to sit with the project manager and the leadership team to understand the scope of the project, what we could use, and what we couldn’t, etc.
If we had blindly gone in and started working on the problem, we would inevitably have run into a blind alley.
It is always better to keep the stakeholders updated with proper communication in place. Otherwise, the project may take a different direction and ultimately lead to starting over again.
To conclude, in this article, I have listed 10 things or challenges that I faced when I started out as a data scientist. This is not an exhaustive list and I am sure there are some challenges that you have faced personally. Let me know in the comments so that it can help the community members who have just started out.
Also, I am listing down a few courses that are specially crafted to introduce beginners to the world of Artificial Intelligence, Machine Learning, and analytics:
I hope this article was useful to you. Please feel free to comment below if you think we missed something.