Hiring the Right Data Scientist – The Needle in a Haystack Problem

guest_blog 10 May, 2019 • 6 min read

Introduction

One of the most common questions asked these days is what makes a good data scientist. The simple answer – it depends. The long answer – someone who can lead all the phases of a data science project. For an even longer answer, read on.

A Data Science project is not just a hackathon competition where a ready-made dataset is provided and the success metric or the error to optimize is clearly laid out.

Source: IBM

So what’s different? Well, there are various phases in a data science project – getting the context of the problem, understanding the data, deep diving into it, understanding implementation and coding shortcomings, figuring out the right set of algorithms to use, coding those algorithms, evaluating their performance from both an engineering and a data science perspective, and optimization.

As you can imagine, a data science skillset is a mixture of what was traditionally called computer science and business analytics. Given the breadth and depth of the work, you are unlikely to find a person who knows all these aspects (let alone being good at them). Instead, it’s better to build a team that has a mix of people who specialize in the different areas required for the data science project.

In this article, we will look at what types of data scientists there are, how to find them, what the current hiring process is and what can be improved.

 

Table of Contents

  1. Types of Scientists
  2. Strategy of the talent pool
  3. How to find these super scientists
  4. The Current Process
  5. What can be improved?

 

Types of scientists

Given this prelude, I will categorize the existing talent pool in the market by skill set and knowledge along three dimensions – Context, Coding and Concept.

Context:

The context in which the problem is set.

  1. Where does the problem fit in the overall scheme of things?
  2. Who are the right stakeholders?
  3. What is success and is the project worth all the effort?
  4. Given the knowledge, what would be the right algorithm to use?
  5. Understanding data and shaping the technical problem (for example: which is the dependent variable and are there any post-processing steps after the initial modelling?)

Coding:

Simply put, R/Python or any other open source data science tool with which the person can analyze data, create features and build a model. It also covers working with the implementation team to get the code into a production environment.

Concept:

The depth of understanding of the technical solution: the ability to understand the algorithm in detail, some knowledge of the literature in this area, and the ability to do a literature survey and differentiate or adapt the solution to the given problem.

The size of the bubble in the above chart purely measures knowledge and depth of understanding the algorithms.

 

Strategy of the talent pool

Now that we understand how data science talent can be categorized, how do you, as a start-up, a mid-level company or an enterprise, match the right pool to the available job? Whom do you choose, and what weights do you give to each of the 3Cs at each stage of your company? Let’s examine this in a bit more detail.

Start-ups

If you are starting up or building a data science pool in your organization, chances are that the problem is not well defined and is still very blurry. The need of the hour could be breadth rather than depth. The balance could tilt towards geeky business analysts, data scientists and data engineers rather than algorithm specialists. Depending on the nature of the problem, the mix could be 30 – 40% geeky business analysts, with the rest divided between data scientists and data engineers.

Enterprises

Here I would assume that the problem is well defined. There may be existing data science solutions based either on machine learning or some other technique. The need of the hour may be to upgrade the solutions and get more of them into deployment. I would recommend this mix – 40% data scientists, 20% data engineers, 20% algorithm specialists and another 20% geeky business analysts.
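
As a rough illustration of turning such a mix into actual headcount, here is a minimal Python sketch. The function name `headcount`, the role labels and the team sizes are assumptions made for this example, and the largest-remainder rounding is just one reasonable way to handle teams that do not divide evenly. Treat the percentages as a starting point rather than a rule.

```python
from math import floor

# Enterprise mix suggested above, expressed as fractions of the team.
# The role labels are illustrative names chosen for this sketch.
ENTERPRISE_MIX = {
    "data_scientist": 0.40,
    "data_engineer": 0.20,
    "algorithm_specialist": 0.20,
    "geeky_business_analyst": 0.20,
}


def headcount(mix, team_size):
    """Split team_size into whole-person counts per role.

    Uses largest-remainder rounding: every role first gets the floor of
    its exact share, then any leftover seats go to the roles with the
    largest fractional parts.
    """
    exact = {role: share * team_size for role, share in mix.items()}
    counts = {role: floor(value) for role, value in exact.items()}
    leftover = team_size - sum(counts.values())
    by_remainder = sorted(
        mix, key=lambda role: exact[role] - counts[role], reverse=True
    )
    for role in by_remainder[:leftover]:
        counts[role] += 1
    return counts


if __name__ == "__main__":
    for size in (10, 7):
        print(size, headcount(ENTERPRISE_MIX, size))
```

For a team of 10 the split is exact (4, 2, 2, 2). For sizes that do not divide evenly, the counts still add up to the team size, which is the only property the rounding step is meant to guarantee.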

For R&D

For organizations that want to have a research division, the mix could shift towards algorithm specialists. They can afford to have fewer data scientists and business analysts. The idea here is that the organization aims to contribute more to research journals and wants to mark its space in certain areas or specializations.

But sometimes during this search for talent, we also come across what I like to call “Super Scientists”. Finding a super data scientist is 10 times tougher than finding a full stack developer, which is why there is no industry tag for them. There is also a fundamental mistake of evaluating data scientists only in terms of their knowledge of ML or Python (or any other tool). This yardstick effectively measures only the efficiency of the stretch from modeling to model delivery and leaves the other parts to mere chance. Salary is not a reliable yardstick for finding these super scientists either, since very few companies realize their potential and so few would have paid them a premium.

 

How to find these super scientists

Before we see how to find them, let’s take a look at what a super scientist is capable of doing.

  1. Independently understand the business problem – for example, they don’t think of themselves as building a recommender system but rather as solving a consumer engagement problem
  2. Go beyond the problem specified to them to where the issue really is – sometimes you are building a model because the business, a stakeholder or your manager thought so, not because the problem really needs it
  3. Understand the data – the unfortunate mundane task of writing multiple select * statements and just living with the data for a couple of weeks
  4. Identify the algorithm or come up with one. With time in hand, do a little literature review to learn how the world is handling a similar problem
  5. Start the core modeling work with data and algorithms, staying open to what fits the problem as opposed to what fits your resume
  6. Write the final execution code that can be delivered to the deployment team
  7. Report and make sure things are running right and, if they are not, have the drive to make the model or the algorithm better
  8. Present the results to whomever the point needs to be proven to
  9. Final rollout and monitoring
  10. Optimization of the results and ensuring progress

As you can see, all 10 steps are important.

 

The Current Process

Currently, most hiring organizations evaluate data scientists only on points 4 and 5, in the form of an interview discussion. There too, the focus ends up being too much on knowledge and too little on the application itself. How do you get your code into production? Can you streamline your pipeline to work with the hardware (and even software) that already exists in the organization? These are critical questions that I feel are not asked enough in interviews.

The remaining 8 steps are more or less left to chance. It’s important that we start innovating on how we test a person’s usefulness for a job rather than how much the person knows.

 

What can be improved?

Case studies can be a key instrument in testing all 10 points. They can be presented as real data science problems that would show up on the job. For example, instead of interviewing on collaborative filtering, one can give a statement such as: we want to show or send the right items to the right set of users. Then we can evaluate how the candidate arrives at a solution and how the person thinks about success metrics, KPIs, etc. Create a scenario where the interviewer plays the role of a business or problem owner and see how the candidate reacts to constraints – be it data or implementation. Then deep dive into programming and algorithms.

 

End Notes

This is my humble take on building a data science team and recruiting elusive super scientists. Now, time to find the needle in the haystack!

If you have ever been in a hiring role, what has your experience been like? On the other hand, folks looking for a data scientist role – what are some of the challenges you have faced in your journey? Use the comments section below to let me know!

 

About the Author

Mathangi is currently building a Data Science team at PhonePe. She has a proven track record of 13+ years in building world-class data science solutions and products. She has worked extensively on building chatbots and productizing text mining insights. She has 6 patent grants and 20+ patents pending in the areas of intuitive customer service, indoor positioning and user profiles. She is adept across machine learning, text mining and NLP technologies & tools.

Responses From Readers

Nakul Gupta 21 Jun, 2018

Could not agree more, brilliant article for hiring the right people.

Dharika reddy 21 Jun, 2018

man could u please suggest me where to join data science Msc in India where placements would be there?

Sean Fynn 21 Jun, 2018

Great article. This mirrors my experience and your points/questions are bang on.

Harsha 22 Jun, 2018

That's an interesting segregation of roles. Few pointers to add , a successful team is a composition of business analysts (accumen), set the solution conceptual (critical thinking), handle data (data engineer), model (algo specialists), inferences (businesses intelligence / consulting) and operationalize while coordinating clients (Project Management) roles. It also needs selling ability understand true needs which solve client purposes than building fancy techniques.

Ajay Thakur 01 Oct, 2018

Very informative article! India is currently facing a shortage of data scientists and it is difficult to hire a good data scientist.

vicky vally 15 Oct, 2018

Thanks for this great information..very useful article..
