avcontentteam — Updated On October 15th, 2020


  • Contains a list of widely asked interview questions based on machine learning and data science
  • The primary focus is to learn machine learning topics with the help of these questions
  • Crack data scientist job profiles with these questions



Careful! These questions can make you think THRICE!

Machine learning and data science are being looked at as the drivers of the next industrial revolution. This also means that numerous exciting startups are looking for data scientists. What could be a better start for your career?

Still, getting into these roles is not easy. You obviously need to get excited about the idea, the team and the vision of the company. You might also face some really difficult technical questions on your way. The set of questions asked depends on what the startup does. Do they provide consulting? Do they build ML products? You should always find this out before beginning your interview preparation.

To help you prepare for your next interview, I’ve prepared a list of 40 plausible and tricky questions that are likely to come your way in interviews. If you can answer and understand these questions, rest assured, you will put up a tough fight in your job interview.

Note: The key to answering these questions is a concrete, practical understanding of ML and the related statistical concepts. You can get that know-how in our course ‘Introduction to Data Science’!

Or how about learning how to crack data science interviews from someone who has conducted hundreds of them? Check out the ‘Ace Data Science Interviews’ course taught by Kunal Jain and Pranav Dar.


40 Interview Questions asked at Startups in Machine Learning / Data Science


Interview Questions on Machine Learning

Q1. You are given a training data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)


Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?


Q3. You are given a data set. The data set has missing values that lie within 1 standard deviation of the median. What percentage of data would remain unaffected? Why?


Q4. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?
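
A minimal sketch of why the 96% figure may say little on its own, assuming (the question does not state this) that only about 4% of the patients actually have cancer. The labels and the always-negative `predict_majority` baseline are hypothetical and exist only for illustration:

```python
# Hypothetical labels: 1 = cancer, 0 = no cancer. The 4% positive rate is an
# assumption made for illustration; the question gives no class balance.
y_true = [1] * 40 + [0] * 960


def predict_majority(n):
    """A 'model' that ignores its input and always predicts the majority class."""
    return [0] * n


y_pred = predict_majority(len(y_true))

# Confusion-matrix counts, treating cancer (1) as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # true positive rate (sensitivity)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.96
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Under that assumption, a model that never detects a single cancer case still reaches 96% accuracy, so measures that look at the minority class (recall, specificity, F-measure) are likely to be more informative.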


Q5. Why is naive Bayes so ‘naive’?


Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.


Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than with the decision tree model. Can this happen? Why?


Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is that the company’s delivery team isn’t able to deliver food on time. As a result, customers get unhappy, and to keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?


Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?


Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?


Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models performs better than the benchmark score. Finally, you decide to combine those models. Ensembled models are known to return high accuracy, but you are out of luck. What did you miss?

Q12. How is kNN different from k-means clustering?


Q13. How are True Positive Rate and Recall related? Write the equation.
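
For reference, writing TP for the number of true positives and FN for the number of false negatives (standard confusion-matrix notation, not defined in the question itself), the two quantities are given by the same expression:

```latex
\[
\text{True Positive Rate} = \text{Recall} = \frac{TP}{TP + FN}
\]
```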


Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. To improve it, you remove the intercept term, and your model R² goes from 0.3 to 0.8. Is it possible? How?


Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether this is true? Without losing any information, can you still build a better model?


Q16. When is Ridge regression favorable over Lasso regression?


Q17. A rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that the decrease in the number of pirates caused climate change?


Q18. While working on a data set, how do you select important variables? Explain your methods.


Q19. What is the difference between covariance and correlation?
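
The relationship between the two can be written compactly. Assuming X and Y are random variables with standard deviations \sigma_X and \sigma_Y (notation introduced here for illustration), correlation is covariance rescaled so that it is unit-free and bounded:

```latex
\[
\operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big],
\qquad
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y},
\qquad
-1 \le \rho_{X,Y} \le 1
\]
```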


Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?


Q21. Both being tree-based algorithms, how is random forest different from the gradient boosting algorithm (GBM)?


Q22. Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e., how the tree decides which variable to split on at the root node and the succeeding nodes?


Q23. You’ve built a random forest model with 10000 trees. You were delighted to get a training error of 0.00. But the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?


Q24. You’ve got a data set to work with having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?


Q25. What is a convex hull? (Hint: Think SVM)


Q26. We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn’t. How?


Q27. What cross-validation technique would you use on a time series data set? Is it k-fold or LOOCV?
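
One scheme that respects time order is forward chaining, where the training window expands and each test block lies strictly after it. The sketch below is illustrative only: `forward_chaining_splits` and its parameters are hypothetical names written for this example, not part of any library.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs in time order.

    Fold i trains on the first i blocks and tests on block i + 1, so the
    model is never evaluated on observations older than its training data.
    """
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for i in range(1, n_folds + 1):
        train_end = i * block
        # The last fold absorbs any leftover observations.
        test_end = n_samples if i == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical series of 6 yearly observations, indexed 0..5 in time order.
for train_idx, test_idx in forward_chaining_splits(n_samples=6, n_folds=5):
    print("train:", train_idx, "test:", test_idx)
```

With 6 observations and 5 folds, the first fold trains on index 0 and tests on index 1, and the last fold trains on indices 0 to 4 and tests on index 5. Plain k-fold, by contrast, would shuffle past and future observations together.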


Q28. You are given a data set consisting of variables having more than 30% missing values. Let’s say, out of 50 variables, 8 variables have more than 30% of their values missing. How will you deal with them?


Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?


Q30. What do you understand by Type I vs Type II error?


Q31. You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked to get poor test accuracy. What went wrong?


Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?


Q33. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?


Q34. Explain machine learning to me like a 5 year old.


Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?


Q36. Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?


Q37. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?


Q38. When does regularization become necessary in machine learning?


Q39. What do you understand by the bias-variance trade-off?


Q40. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.


End Notes

You might have been able to answer all the questions, but the real value is in understanding them and generalizing your knowledge to similar questions. If you have struggled with these questions, no worries; now is the time to learn, not perform. Right now, you should focus on learning these topics scrupulously.

These questions are meant to give you wide exposure to the types of questions asked at startups in machine learning. I’m sure these questions will leave you curious enough to do deeper topic research on your own. If you are planning for it, that’s a good sign.

Did you like reading this article? Have you appeared in any startup interview recently for a data scientist profile? Do share your experience in the comments below. I’d love to hear about it.

Looking for a job in analytics? Check out currently hiring jobs in machine learning and data science.

33 thoughts on "40 Interview Questions asked at Startups in Machine Learning / Data Science"

kavitha says: September 16, 2016 at 7:09 am
thank you so much manish
Gianni says: September 16, 2016 at 7:12 am
Thank you Manish, very helpfull to face on the true reality that a long long journey wait me :-)
Prof Ravi Vadlamani says: September 16, 2016 at 7:46 am
Good collection compiled by you Mr Manish ! Kudos ! I am sure it will be very useful to the budding data scientists whether they face start-ups or established firms.
Srinivas says: September 16, 2016 at 9:05 am
Thank you Manish.Helpful for Beginners like me.
Analytics Vidhya Content Team says: September 16, 2016 at 9:14 am
Welcome :)
Analytics Vidhya Content Team says: September 16, 2016 at 9:15 am
Hi Gianni Good to know, you found them helpful! All the best.
Analytics Vidhya Content Team says: September 16, 2016 at 9:17 am
Hi Gianni, I am happy to know that these question would help you in your journey. All the best.
Analytics Vidhya Content Team says: September 16, 2016 at 9:17 am
Hi Kavitha, I hope these questions help you to prepare for forthcoming interview rounds. All the best.
Analytics Vidhya Content Team says: September 16, 2016 at 9:20 am
Hi Prof Ravi, You are right. These questions can be asked anywhere. But, with growth in machine learning startups, facing off ML algorithm related question have higher chances, though I have laid emphasis on statistical modeling as well.
chibole says: September 16, 2016 at 9:35 am
It seems Stastics is at the centre of Machine Learning.
chibole says: September 16, 2016 at 9:36 am
* stastics = Statistics
Analytics Vidhya Content Team says: September 16, 2016 at 9:39 am
Hi Chibole, True, statistics in an inevitable part of machine learning. One needs to understand statistical concepts in order to master machine learning.
chibole says: September 16, 2016 at 9:55 am
I was wondering, do you recommend for somebody to special in a specific field of ML? I mean, it is recommended to choose between supervised learning and unsupervised learning algorithms, and simply say my specialty is this during an interview. Shouldn't organizations recruiting specify their specialty requirements too? ....and thank you for the post.
Analytics Vidhya Content Team says: September 16, 2016 at 10:40 am
Hi Chibole, It's always a good thing to establish yourself as an expert in a specific field. This helps the recruiter to understand that you are a detailed oriented person. In machine learning, thinking of building your expertise in supervised learning would be good, but companies want more than that. Considering, the variety of data these days, they want someone who can deal with unlabeled data also. In short, they look for someone who isn't just an expert in operating Sniper Gun, but can use other weapons also if needed.
Karthikeyan Sankaran says: September 16, 2016 at 1:10 pm
Hi Manish - Interesting & Informative set of questions & answers. Thanks for compiling the same.
Nicola says: September 16, 2016 at 7:22 pm
Hi, really an interesting collection of answers. From a merely statistical point of view there are some imprecisions (e.g. Q40), but it is surely useful for job interviews in startups and bigger firms.
Raju says: September 17, 2016 at 1:45 am
I think you got Q3 wrong. It was to calculate from median and not mean. how can assume mean and median to be same
Raju says: September 17, 2016 at 1:53 am
Don't bother.....Noted .....you assumed normal distribution....
Analytics Vidhya Content Team says: September 17, 2016 at 4:11 am
Hi Nicola, Thanks for sharing your thoughts. Tell me more about Q40. What's about it?
Amit Srivastava says: September 17, 2016 at 7:07 am
Great article. It will help in understanding which topics to focus on for interview purposes.
chinmaya Mishra says: September 17, 2016 at 1:13 pm
Dear Kunal, Few queries i have regarding AIC 1)why we multiply -2 to the AIC equation 2)where this equation has been built. Rgds
Sampath says: September 18, 2016 at 5:33 pm
Hi Manish, Great Job! It is a very good collection of interview questions on machine learning. It will be a great help if you can also publish a similar article on statistics. Thanks in advance
Analytics Vidhya Content Team says: September 19, 2016 at 12:34 pm
Hi Sampath, Thanks for your suggestion. I'll surely consider it in my forthcoming articles.
Analytics Vidhya Content Team says: September 19, 2016 at 12:35 pm
Hi Amit, Thanks for your encouraging words! The purpose of this article is to help beginners understand the tricky side of ML interviews.
Analytics Vidhya Content Team says: September 19, 2016 at 12:36 pm
Most Welcome ! :)
KARTHI V says: September 20, 2016 at 3:32 pm
Hi Manish, Kudos to you!!! Good Collection for beginners I Have small suggestion on Dimensionality Reduction,We can also use the below mentioned techniques to reduce the dimension of the data.
1. Missing Values Ratio: Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.
2. Low Variance Filter: Similarly to the previous technique, data columns with little changes in the data carry little information. Thus all data columns with variance lower than a given threshold are removed. A word of caution: variance is range dependent; therefore normalization is required before applying this technique.
3. High Correlation Filter: Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical columns and between nominal columns as the Pearson’s Product Moment Coefficient and the Pearson’s chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.
4. Random Forests / Ensemble Trees: Decision Tree Ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers. One approach to dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features. Specifically, we can generate a large set (2000) of very shallow trees (2 levels), with each tree being trained on a small fraction (3) of the total number of attributes. If an attribute is often selected as best split, it is most likely an informative feature to retain. A score calculated on the attribute usage statistics in the random forest tells us ‒ relative to the other attributes ‒ which are the most predictive attributes.
5. Backward Feature Elimination: In this technique, at a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. The classification is then repeated using n-2 features, and so on. Each iteration k produces a model trained on n-k features and an error rate e(k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.
6. Forward Feature Construction: This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance.
Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns.
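
The low variance and high correlation filters described in the comment above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the commenter's implementation: the helper names, the toy columns and both thresholds are hypothetical, and min-max scaling is just one possible choice of normalization.

```python
from statistics import pvariance


def min_max_scale(column):
    """Rescale a column to [0, 1] so variances are comparable across columns."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def low_variance_filter(columns, threshold):
    """Keep columns whose variance after min-max scaling exceeds the threshold."""
    return {name: col for name, col in columns.items()
            if pvariance(min_max_scale(col)) > threshold}


def high_correlation_filter(columns, threshold):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, col in columns.items():
        # Pearson's coefficient does not change when a column is rescaled,
        # so the raw values are used here.
        if all(abs(pearson(col, other)) <= threshold for other in kept.values()):
            kept[name] = col
    return kept


# Hypothetical toy data: "b" is roughly 2 * "a", and "d" is constant.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d": [7.0, 7.0, 7.0, 7.0, 7.0],
}

# Thresholds of 0.01 and 0.9 are arbitrary choices for this example.
varied = low_variance_filter(columns, threshold=0.01)
selected = high_correlation_filter(varied, threshold=0.9)
print(list(selected))  # ['a', 'c']
```

The variance filter runs first so that constant columns, for which the correlation coefficient is undefined, never reach the correlation step.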
Ramit Pandey says: September 22, 2016 at 7:28 am
Hi Manish , After going through these question I feel I am at 10% of knowledge required to pursue career in Data Science . Excellent Article to read. Can you Please suggest me any book or training online which gives this much deep information . Waiting for your reply in anticipation . Thanks a million
Rahul Jadhav says: September 22, 2016 at 5:36 pm
Amazing Collection Manish! Thanks a lot.
NikhilS says: September 25, 2016 at 3:44 pm
An awesome article for reference. Thanks a ton Manish sir for the share. Please share the pdf format of this blog post if possible. Have also taken note of Karthi's input!
vinaya says: October 05, 2016 at 9:19 am
ty manish...its an awsm reference...plz upload pdf format also...thanks again
Prasanna says: October 07, 2016 at 5:32 am
Great set of questions Manish. BTW.. I believe the expressions for bias and variance in question 39 is incorrect. I believe the brackets are messed. Following gives the correct expressions. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Sidd says: November 07, 2016 at 6:42 am
Really awesome article thanks. Given the influence young, budding students of machine learning will likely have in the future, your article is of great value.
