40 Interview Questions asked at Startups in Machine Learning / Data Science

avcontentteam 15 Oct, 2020

Overview

  • A list of widely asked interview questions on machine learning and data science
  • The primary focus is to learn machine learning topics with the help of these questions
  • Use these questions to prepare for data scientist job interviews

 

Introduction

Careful! These questions can make you think THRICE!

Machine learning and data science are seen as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists. What better start could there be for your budding career!

However, getting into these roles is not easy. You obviously need to get excited about the idea, the team and the vision of the company. You might also find some really difficult technical questions on your way. The set of questions asked depends on what the startup does. Do they provide consulting? Do they build ML products? You should always find this out prior to beginning your interview preparation.

To help you prepare for your next interview, I’ve prepared a list of 40 plausible & tricky questions which you are likely to come across in interviews. If you can answer and understand these questions, rest assured, you will put up a tough fight in your job interview.

Note: A key to answering these questions is a concrete, practical understanding of ML and related statistical concepts. You can get that know-how in our course ‘Introduction to Data Science’!

Or how about learning how to crack data science interviews from someone who has conducted hundreds of them? Check out the ‘Ace Data Science Interviews‘ course taught by Kunal Jain and Pranav Dar.


 

Interview Questions on Machine Learning

Q1. You are given a training data set with 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)
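
One practical direction, among others such as dropping low-variance or highly correlated columns first, is to learn a PCA projection in mini-batches so the full matrix never has to fit in memory. A minimal sketch with scikit-learn’s IncrementalPCA, where the file name "train.csv", the chunk size and the "target" column name are all assumptions for illustration:

```python
import pandas as pd
from sklearn.decomposition import IncrementalPCA

# Keep, say, 100 components; in practice the number would be chosen from explained variance
ipca = IncrementalPCA(n_components=100)

# Fit on the file in chunks so the full 1M x 1000 matrix never sits in memory
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"])  # hypothetical target column name
    ipca.partial_fit(X)

# A second pass would then call ipca.transform(chunk) to produce the reduced data
```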

 

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

 

Q3. You are given a data set. The data set has missing values that spread within 1 standard deviation of the median. What percentage of data would remain unaffected? Why?
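
One common way to reason about this, under the assumption that the data are approximately normally distributed (so the mean and median coincide), is the empirical rule:

$P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68$, which leaves roughly $1 - 0.68 \approx 32\%$ of the observations outside the $\pm 1\sigma$ band.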

 

Q4. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

 

Q5. Why is naive Bayes so ‘naive’?

 

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.

 

Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?

 

Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company’s delivery team isn’t able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?

 

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

 

Q10. You are given a data set. The data set contains many variables, some of which you know are highly correlated. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

 

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decided to combine those models. Though ensemble models are known to return higher accuracy, you are out of luck. What did you miss?

 
Q12. How is kNN different from k-means clustering?
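
A tiny, purely illustrative sketch of the key difference: kNN is supervised and needs labels, while k-means is unsupervised and does not. The data and labels below are toy values invented for the example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

# kNN is supervised: it needs the labels y to make predictions
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# k-means is unsupervised: it only groups X, no labels involved
km = KMeans(n_clusters=2, n_init=10).fit(X)

print(knn.predict(X[:3]), km.labels_[:3])
```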

 

Q13. How are True Positive Rate and Recall related? Write the equation.
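
For quick reference, both quantities reduce to the same ratio of confusion-matrix counts:

$$\text{TPR} = \text{Recall} = \frac{TP}{TP + FN}$$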

 

Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, you remove the intercept term and your model R² becomes 0.8 from 0.3. Is it possible? How?

 

Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check if he’s right? Without losing any information, can you still build a better model?
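
One common diagnostic is the Variance Inflation Factor (VIF). A rough sketch with statsmodels on toy data, where the column names and the rule-of-thumb threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy predictors where x3 is (almost) a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2,
                  "x3": x1 + x2 + rng.normal(scale=0.01, size=100)})

X_const = sm.add_constant(X)  # VIF is normally computed with the intercept included
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # x1, x2, x3 show very large VIFs; values above roughly 5-10 are a common red flag
```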

 

Q16. When is Ridge regression favorable over Lasso regression?
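
As a refresher, the two estimators differ only in the penalty added to the least-squares loss: an L2 penalty for ridge and an L1 penalty for lasso.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j}\beta_j^2, \qquad \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j}|\beta_j|$$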

 

Q17. A rise in the global average temperature led to a decrease in the number of pirates around the world. Does that mean the decrease in the number of pirates caused the climate change?

 

Q18. While working on a data set, how do you select important variables? Explain your methods.

 

Q19. What is the difference between covariance and correlation?

 

Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

 

Q21. Both being tree-based algorithms, how is random forest different from the gradient boosting algorithm (GBM)?

 

Q22. Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e. how the tree decides which variable to split on at the root node and the succeeding nodes?
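
As a pointer, CART-style classification trees typically pick the variable and threshold that give the largest decrease in an impurity measure, commonly Gini impurity or entropy, where $p_k$ is the proportion of class $k$ in a node:

$$\text{Gini}(t) = 1 - \sum_{k} p_k^2, \qquad \text{Entropy}(t) = -\sum_{k} p_k \log_2 p_k$$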

 

Q23. You’ve built a random forest model with 10,000 trees. You were delighted to get a training error of 0.00, but the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?

 

Q24. You’ve got a data set to work on having p (number of variables) > n (number of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?

 

Q25. What is a convex hull? (Hint: Think SVM)

 

Q26. We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn’t. How?
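
A tiny illustration with pandas (the "city" column is a made-up example): one-hot encoding adds one column per category, while label encoding keeps a single integer column.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

one_hot = pd.get_dummies(df["city"])             # one new column per category
label = df["city"].astype("category").cat.codes  # a single column of integer codes

print(one_hot.shape, label.shape)  # (4, 3) versus (4,)
```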

 

Q27. What cross-validation technique would you use on a time series data set? Is it k-fold or LOOCV?
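
For context, scikit-learn’s TimeSeriesSplit implements the forward-chaining idea: each fold validates on observations that come strictly after the ones it trains on. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Each fold trains only on observations that occur before the ones it validates on
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "validate:", val_idx)
```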

 

Q28. You are given a data set consisting of variables having more than 30% missing values. Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

 

Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

 

Q30. What do you understand by Type I vs Type II error?

 

Q31. You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked by the poor test accuracy. What went wrong?

 

Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

 

Q33. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?
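
For reference, for two points $x$ and $y$ in $p$ dimensions the two distances are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{p}|x_i - y_i|$$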

 

Q34. Explain machine learning to me like a 5 year old.

 

Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

 

Q36. Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?

 

Q37. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?

 

Q38. When does regularization become necessary in machine learning?

 

Q39. What do you understand by the bias-variance trade-off?
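
For reference, the standard decomposition of the expected squared prediction error at a point $x$, with irreducible noise variance $\sigma^2$, is:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$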

 

Q40. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the statement.
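
As a reference point, the two fitting criteria can be written side by side, with $x_i$ the feature vectors and $\beta$ the coefficients:

$$\hat{\beta}_{\text{OLS}} = \arg\min_{\beta}\sum_{i}\big(y_i - x_i^\top\beta\big)^2, \qquad \hat{\beta}_{\text{logistic}} = \arg\max_{\beta}\sum_{i}\big[y_i\log p_i + (1 - y_i)\log(1 - p_i)\big], \quad p_i = \frac{1}{1 + e^{-x_i^\top\beta}}$$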

 

End Notes

You might have been able to answer all the questions, but the real value is in understanding them and generalizing your knowledge to similar questions. If you have struggled with these questions, no worries, now is the time to learn and not perform. You should focus right now on learning these topics thoroughly.

These questions are meant to give you broad exposure to the types of questions asked at startups in machine learning. I’m sure these questions will leave you curious enough to do deeper research on these topics on your own. If you are planning for it, that’s a good sign.

Did you like reading this article? Have you appeared in any startup interview recently for a data scientist profile? Do share your experience in the comments below. I’d love to know about your experience.

Looking for a job in analytics? Check out currently hiring jobs in machine learning and data science.



Responses From Readers


kavitha 16 Sep, 2016

thank you so much manish

Gianni 16 Sep, 2016

Thank you Manish, very helpful in facing the reality that a long, long journey awaits me :-)

Prof Ravi Vadlamani 16 Sep, 2016

Good collection compiled by you, Mr Manish! Kudos! I am sure it will be very useful to budding data scientists, whether they face start-ups or established firms.

Srinivas 16 Sep, 2016

Thank you Manish. Helpful for beginners like me.

chibole 16 Sep, 2016

It seems Stastics is at the centre of Machine Learning.

chibole 16 Sep, 2016

* stastics = Statistics

Karthikeyan Sankaran 16 Sep, 2016

Hi Manish - Interesting & Informative set of questions & answers. Thanks for compiling the same.

Nicola 16 Sep, 2016

Hi, really an interesting collection of answers. From a merely statistical point of view there are some imprecisions (e.g. Q40), but it is surely useful for job interviews in startups and bigger firms.

Raju 17 Sep, 2016

I think you got Q3 wrong. It was to calculate from the median and not the mean. How can we assume the mean and median to be the same?

Raju 17 Sep, 2016

Don't bother... Noted... you assumed a normal distribution.

Amit Srivastava 17 Sep, 2016

Great article. It will help in understanding which topics to focus on for interview purposes.

chinmaya Mishra 17 Sep, 2016

Dear Kunal, I have a few queries regarding AIC: 1) Why do we multiply by -2 in the AIC equation? 2) Where does this equation come from? Rgds

Sampath 18 Sep, 2016

Hi Manish, Great Job! It is a very good collection of interview questions on machine learning. It will be a great help if you can also publish a similar article on statistics. Thanks in advance

KARTHI V 20 Sep, 2016

Hi Manish, kudos to you!!! Good collection for beginners. I have a small suggestion on dimensionality reduction: we can also use the techniques below to reduce the dimension of the data.

1. Missing Values Ratio: Data columns with too many missing values are unlikely to carry much useful information. Thus, data columns with a number of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.

2. Low Variance Filter: Similar to the previous technique, data columns with little change in the data carry little information. Thus, all data columns with variance lower than a given threshold are removed. A word of caution: variance is range dependent; therefore normalization is required before applying this technique.

3. High Correlation Filter: Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical columns and between nominal columns as Pearson’s product-moment coefficient and Pearson’s chi-square value respectively. Pairs of columns with a correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.

4. Random Forests / Ensemble Trees: Decision tree ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers. One approach to dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features. Specifically, we can generate a large set (2000) of very shallow trees (2 levels), with each tree being trained on a small fraction (3) of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain. A score calculated on the attribute usage statistics in the random forest tells us, relative to the other attributes, which are the most predictive attributes.

5. Backward Feature Elimination: In this technique, at a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. The classification is then repeated using n-2 features, and so on. Each iteration k produces a model trained on n-k features and an error rate e(k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

6. Forward Feature Construction: This is the inverse of Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance. Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns.

Ramit Pandey 22 Sep, 2016

Hi Manish, after going through these questions I feel I am at 10% of the knowledge required to pursue a career in Data Science. Excellent article to read. Can you please suggest any book or online training which gives this much depth of information? Waiting for your reply in anticipation. Thanks a million.

Rahul Jadhav 22 Sep, 2016

Amazing Collection Manish! Thanks a lot.

NikhilS 25 Sep, 2016

An awesome article for reference. Thanks a ton Manish sir for the share. Please share the pdf format of this blog post if possible. Have also taken note of Karthi's input!

vinaya 05 Oct, 2016

Thank you Manish... it's an awesome reference... please upload a PDF format also... thanks again

Prasanna 07 Oct, 2016

Great set of questions, Manish. BTW, I believe the expressions for bias and variance in question 39 are incorrect. I believe the brackets are messed up. The following gives the correct expressions: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Sidd 07 Nov, 2016

Really awesome article thanks. Given the influence young, budding students of machine learning will likely have in the future, your article is of great value.

jatin pal singh 16 Jul, 2017

very nice article. keep it up

Manchun Kumar 07 Sep, 2017

Thank you for sharing this great list of data science interview questions and answers. I am looking for more questions and answers.

Arpit 14 Oct, 2017

Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, you remove the intercept term and your model R² becomes 0.8 from 0.3. Is it possible? How? How can we simply remove the intercept term? Does it make any sense?


Anand Rahul 17 Nov, 2017

Thank you for your awesome curated list. However, at one point you mentioned that the random forest overfits with an increasing number of trees. According to the course Introduction to Statistical Learning by Stanford University, the authors mention that a random forest does not overfit with an increasing number of trees.


Mock Interview Online 08 Jan, 2018

Hello, really an interesting collection of answers. These questions can be asked anywhere. But with the growth in machine learning startups, the chances of facing ML algorithm-related questions are higher. Thanks