Ankit Gupta — April 30, 2017

## Introduction

Machine Learning is one of the most sought-after skills these days. If you are a data scientist, you need to be good at Machine Learning – no two ways about it. As part of DataFest 2017, we organized various skill tests so that data scientists could assess themselves on these critical skills. The tests covered Machine Learning, Deep Learning, Time Series problems and Probability. This article lays out the solutions to the machine learning skill test. If you missed any of the other skill tests, you can still check out the questions and answers through the articles linked above.

More than 1,350 people registered for the Machine Learning skill test. The test was designed to assess your conceptual knowledge of machine learning and make you industry-ready. If you missed the live test, you can still read this article to find out how you could have answered the questions correctly. Also, check out our online training in machine learning.

Here are the leaderboard rankings of all the participants in the Machine Learning skill test.

These questions, along with hundreds of others, are part of our ‘Ace Data Science Interviews’ course. It’s a comprehensive guide, with tons of resources, to crack data science interviews and land your dream role! And if you’re just starting your data science journey, check out our most comprehensive program to master Machine Learning.

## Overall Scores

You can access the final scores here. More than 210 people participated in the machine learning skill test and the highest score obtained was 36. Here are a few statistics about the distribution.

Mean Score: 19.36

Median Score: 21

Mode Score: 27

## Useful Resources on Machine Learning

Machine Learning basics for a newbie

Deep Learning vs. Machine Learning – the essential differences you need to know!

Applied Machine Learning Course

Introduction to Data Science Course

Ace Data Science Interviews Course

## Machine Learning Questions & Solutions

Question Context

A feature F1 can take the values A, B, C, D, E and F, and represents the grade of students from a college.

1) Which of the following statements is true in the following case?

A) Feature F1 is an example of a nominal variable.
B) Feature F1 is an example of an ordinal variable.
C) It doesn’t belong to any of the above categories.
D) Both of these

2) Which of the following is an example of a deterministic algorithm?

A) PCA

B) K-Means

C) None of the above

3) [True or False] The Pearson correlation between two variables can be zero, but their values can still be related to each other.

A) TRUE

B) FALSE
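
A zero Pearson correlation only rules out a *linear* relationship. A minimal NumPy sketch (with made-up data) shows a variable that is completely determined by another, yet has zero correlation with it:

```python
import numpy as np

# y is a deterministic (quadratic) function of x, so the two are clearly
# related -- yet Pearson correlation, which only measures linear
# association, comes out exactly zero for this symmetric example.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0
```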

4) Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?

1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3
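
The distinction between the two update rules can be made concrete in a few lines. Below is a minimal sketch (synthetic data and hypothetical learning rates, not part of the original question) of a one-parameter least-squares fit: batch GD uses all samples per update, while SGD updates on one sample at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

# Full-batch gradient descent: every update uses all samples.
w = 0.0
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # gradient of mean squared error
    w -= 0.1 * grad

# Stochastic gradient descent: each update uses a single sample.
w_sgd = 0.0
for epoch in range(20):
    for i in rng.permutation(len(x)):
        grad_i = 2 * (w_sgd * x[i] - y[i]) * x[i]
        w_sgd -= 0.01 * grad_i

print(round(w, 2), round(w_sgd, 2))  # both close to the true slope 3.0
```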

5) Which of the following hyperparameter(s), when increased, may cause a random forest to overfit the data?

1. Number of Trees
2. Depth of Tree
3. Learning Rate

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1,2 and 3

6) Imagine you are working with “Analytics Vidhya” and you want to develop a machine learning algorithm which predicts the number of views on articles.

Your analysis is based on features such as author name, the number of articles written by the same author on Analytics Vidhya in the past, and a few other features. Which of the following evaluation metrics would you choose in this case?

1. Mean Square Error
2. Accuracy
3. F1 Score

A) Only 1

B) Only 2

C) Only 3

D) 1 and 3

E) 2 and 3

F) 1 and 2

7) Given below are three images (1,2,3). Which of the following option is correct for these images?

A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.

B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.

C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.

D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

8) Below are the 8 actual values of the target variable in the train file.

[0,0,0,1,1,1,1,1]

What is the entropy of the target variable?

A) -(5/8 log(5/8) + 3/8 log(3/8))

B) 5/8 log(5/8) + 3/8 log(3/8)

C) 3/8 log(5/8) + 5/8 log(3/8)

D) 5/8 log(3/8) – 3/8 log(5/8)
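
Option A is the standard entropy formula applied to this target. As a quick check, a few lines of Python compute it directly (using log base 2, which gives entropy in bits):

```python
import math

target = [0, 0, 0, 1, 1, 1, 1, 1]

def entropy(values):
    # Shannon entropy: -sum(p * log2(p)) over the class probabilities
    n = len(values)
    probs = [values.count(v) / n for v in set(values)]
    return -sum(p * math.log2(p) for p in probs)

print(round(entropy(target), 4))  # 0.9544
```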

9) Let’s say you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data.

You want to apply one-hot encoding (OHE) on the categorical feature(s). What challenges may you face if you apply OHE on a categorical variable of the train dataset?

A) All categories of the categorical variable are not present in the test dataset.

B) The frequency distribution of categories is different in the train dataset as compared to the test dataset.

C) Train and test always have the same distribution.

D) Both A and B

E) None of these

10) The skip-gram model is one of the best models used in the Word2vec algorithm for word embeddings. Which of the following models depicts the skip-gram model?

A) A

B) B

C) Both A and B

D) None of these

11) Let’s say you are using activation function X in the hidden layers of a neural network. At a particular neuron, for a given input, you get the output “-0.0001”. Which of the following activation functions could X represent?

A) ReLU

B) tanh

C) SIGMOID

D) None of these
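
The output ranges of the three candidates settle this: ReLU outputs are never negative and sigmoid outputs lie strictly in (0, 1), so only tanh, with range (-1, 1), can produce -0.0001. A small stdlib check:

```python
import math

def relu(z):
    # Range [0, inf): negative inputs are clipped to 0
    return max(0.0, z)

def sigmoid(z):
    # Range (0, 1): always strictly positive
    return 1 / (1 + math.exp(-z))

z = -0.0001
# tanh has range (-1, 1), so it is the only one that can output -0.0001
print(relu(z), sigmoid(z), math.tanh(z))
```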

12) [True or False] LogLoss evaluation metric can have negative values.

A) TRUE
B) FALSE
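
The answer follows from the formula: with predicted probabilities in (0, 1), every log term in the loss is non-positive, so the negated sum is always >= 0. A short sketch of the per-observation loss:

```python
import math

# Log-loss for one observation: -(y*log(p) + (1-y)*log(1-p)).
# Since p lies in (0, 1), both log terms are <= 0, so the loss is >= 0:
# it can approach 0 but can never be negative.
def log_loss(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(log_loss(1, 0.9), 4))   # confident and correct: small loss
print(round(log_loss(1, 0.1), 4))   # confident and wrong: large loss
```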

13) Which of the following statements is/are true about “Type-1” and “Type-2” errors?

1. Type1 is known as false positive and Type2 is known as false negative.
2. Type1 is known as false negative and Type2 is known as false positive.
3. Type1 error occurs when we reject a null hypothesis when it is actually true.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 1 and 3

F) 2 and 3

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects?

1. Stemming
2. Stop word removal
3. Object Standardization

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) 1,2 and 3

15) Suppose you want to project high dimensional data into lower dimensions. The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have applied both algorithms respectively on data “X” and you got the datasets “X_projected_PCA” , “X_projected_tSNE”.

Which of the following statements is true for “X_projected_PCA” & “X_projected_tSNE” ?

A) X_projected_PCA will have interpretation in the nearest neighbour space.

B) X_projected_tSNE will have interpretation in the nearest neighbour space.

C) Both will have interpretation in the nearest neighbour space.

D) None of them will have interpretation in the nearest neighbour space.

16) In the images above, which of the following is/are examples of multi-collinear features?

A) Features in Image 1

B) Features in Image 2

C) Features in Image 3

D) Features in Image 1 & 2

E) Features in Image 2 & 3

F) Features in Image 3 & 1

17) In the previous question, suppose you have identified multi-collinear features. Which of the following action(s) would you perform next?

1. Remove both collinear variables.
2. Instead of removing both variables, we can remove only one variable.
3. Removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression.

A) Only 1

B) Only 2

C) Only 3

D) Either 1 or 3

E) Either 2 or 3

18) Adding a non-important feature to a linear regression model may result in:

1. Increase in R-square
2. Decrease in R-square

A) Only 1 is correct

B) Only 2 is correct

C) Either 1 or 2

D) None of these
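
A useful fact here is that ordinary least squares can only use an extra column to reduce (or at worst leave unchanged) the residual sum of squares, so unadjusted R-squared never decreases when a feature is added. A NumPy sketch with synthetic data (the feature names and data here are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

def r_squared(features, y):
    # Ordinary least squares with an intercept column
    X = np.column_stack([np.ones(len(y))] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_base = r_squared([x], y)
r2_junk = r_squared([x, rng.normal(size=50)], y)  # add an irrelevant feature
print(r2_junk >= r2_base)  # True: R-squared never decreases
```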

19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively.

Now, you add 2 to all values of X (i.e. the new values become X+2), subtract 2 from all values of Y (i.e. the new values are Y-2) and Z remains the same. The new coefficients for (X, Y), (Y, Z) and (X, Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3?

A) D1= C1, D2 < C2, D3 > C3

B) D1 = C1, D2 > C2, D3 > C3

C) D1 = C1, D2 > C2, D3 < C3

D) D1 = C1, D2 < C2, D3 < C3

E) D1 = C1, D2 = C2, D3 = C3

F) Cannot be determined
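
Pearson correlation is invariant to adding a constant to either variable (a shift changes the mean but not the centered products), so all three coefficients stay the same. A quick NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
z = rng.normal(size=100)

c1 = np.corrcoef(x, y)[0, 1]
c2 = np.corrcoef(y, z)[0, 1]
c3 = np.corrcoef(x, z)[0, 1]

# Shift X by +2 and Y by -2; Z stays the same
d1 = np.corrcoef(x + 2, y - 2)[0, 1]
d2 = np.corrcoef(y - 2, z)[0, 1]
d3 = np.corrcoef(x + 2, z)[0, 1]

print(np.allclose([c1, c2, c3], [d1, d2, d3]))  # True
```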

20) Imagine you are solving a classification problem with a highly imbalanced class. The majority class is observed 99% of the time in the training data.

Your model has 99% accuracy on the test data predictions. Which of the following is true in such a case?

1. Accuracy metric is not a good idea for imbalanced class problems.
2. Accuracy metric is a good idea for imbalanced class problems.
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren’t good for imbalanced class problems.

A) 1 and 3

B) 1 and 4

C) 2 and 3

D) 2 and 4
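
To see why accuracy is misleading here, consider a toy classifier that simply always predicts the majority class (illustrative numbers matching the 99% imbalance in the question):

```python
# With a 99% majority class, always predicting the majority class scores
# 99% accuracy, while recall for the minority class is 0.
y_true = [0] * 99 + [1]          # 1% minority class
y_pred = [0] * 100               # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy, minority_recall)  # 0.99 0.0
```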

21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the individual models.

Which of the following statements is / are true for weak learners used in ensemble model?

1. They don’t usually overfit.
2. They have high bias, so they cannot solve complex learning problems
3. They usually overfit.

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) Only 1

E) Only 2

F) None of the above

22) Which of the following options is/are true for K-fold cross-validation?

1. Increase in K will result in higher time required to cross validate the result.
2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K.
3. If K=N, then it is called Leave one out cross validation, where N is the number of observations.

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1,2 and 3

#### Question Context 23-24

Cross-validation is an important step in machine learning for hyperparameter tuning. Let’s say you are tuning the hyperparameter “max_depth” for a GBM, selecting it from 10 different depth values (all greater than 2) using 5-fold cross-validation.

The time taken by the algorithm to train on 4 folds (with max_depth 2) is 10 seconds, and prediction on the remaining fold takes 2 seconds.

Note: Ignore hardware dependencies from the equation.

23) Which of the following option is true for overall execution time for 5-fold cross validation with 10 different values of “max_depth”?

A) Less than 100 seconds

B) 100 – 300 seconds

C) 300 – 600 seconds

D) More than or equal to 600 seconds

E) None of the above

F) Can’t estimate

24) In the previous question, suppose you train the same algorithm to tune 2 hyperparameters, namely “max_depth” and “learning_rate”.

You want to select the right value of “max_depth” (from the given 10 depth values) and “learning_rate” (from 5 given learning rates). In such a case, which of the following will represent the overall time?

A) 1000-1500 seconds

B) 1500-3000 seconds

C) More than or equal to 3000 seconds

D) None of these
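
The arithmetic behind both timing questions can be written out directly (assuming, as a lower bound, that every max_depth value trains as fast as depth 2; deeper trees will only take longer, which is why the answers read "more than or equal to"):

```python
# One cross-validation fold: 10 s training + 2 s prediction = 12 s.
time_per_fold = 10 + 2
folds = 5

# Q23: 10 candidate max_depth values, 5 folds each
total_q23 = time_per_fold * folds * 10

# Q24: 10 max_depth values x 5 learning rates, 5 folds each
total_q24 = time_per_fold * folds * 10 * 5

print(total_q23, total_q24)  # 600 3000
```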

25) Given below is a scenario for training error TE and Validation error VE for a machine learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

| H | TE  | VE  |
|---|-----|-----|
| 1 | 105 | 90  |
| 2 | 200 | 85  |
| 3 | 250 | 96  |
| 4 | 105 | 85  |
| 5 | 300 | 100 |

Which value of H will you choose based on the above table?

A) 1

B) 2

C) 3

D) 4

E) 5

26) What would you do in PCA to get the same projection as SVD?

A) Transform data to zero mean

B) Transform data to zero median

C) Not possible

D) None of these
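
PCA matches the SVD projection once the data is mean-centered, since the right singular vectors of the centered matrix are exactly the eigenvectors of its covariance matrix. A NumPy sketch with synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3)) + 5.0          # data with a nonzero mean

Xc = X - X.mean(axis=0)                      # transform data to zero mean

# PCA direction: top eigenvector of the covariance matrix (eigh sorts
# eigenvalues in ascending order, so the last column is the largest)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
pc1 = eigvecs[:, -1]

# SVD of the centered matrix: first right singular vector
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
sv1 = Vt[0]

# Same direction, up to an arbitrary sign flip
print(np.allclose(np.abs(pc1), np.abs(sv1)))  # True
```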

#### Question Context 27-28

Assume there is a black box algorithm which takes training data with multiple observations (t1, t2, t3, …, tn) and a new observation (q1). The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.

You can think of this black box algorithm as being the same as 1-NN (1-nearest neighbor).

27) It is possible to construct a k-NN classification algorithm based on this black box alone.

Note: n (the number of training observations) is very large compared to k.

A) TRUE

B) FALSE

28) Instead of using the 1-NN black box, we want to use a j-NN (j > 1) algorithm as the black box. Which of the following options is correct for finding k-NN using j-NN?

1. j must be a proper factor of k
2. j > k
3. Not possible

A) 1

B) 2

C) 3

29) Suppose you are given 7 scatter plots 1-7 (left to right) and you want to compare the Pearson correlation coefficients between the variables of each scatter plot.

Which of the following is in the right order?

1. 1 < 2 < 3 < 4
2. 1 > 2 > 3 > 4
3. 7 < 6 < 5 < 4
4. 7 > 6 > 5 > 4

A) 1 and 3

B) 2 and 3

C) 1 and 4

D) 2 and 4

30) Which of the following options is/are true for the interpretation of log-loss as an evaluation metric?

1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
2. For a particular observation, the classifier assigns a very small probability for the correct class then the corresponding contribution to the log-loss will be very large.
3. The lower the log-loss, the better the model.

A) 1 and 3

B) 2 and 3

C) 1 and 2

D) 1,2 and 3

#### Question Context 31-32

Below are five samples given in the dataset. Note: Visual distance between the points in the image represents the actual distance.

31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest neighbor)?

A) 0

B) 0.4

C) 0.8

D) 1

32) Which of the following values of K will have the least leave-one-out cross-validation accuracy?

A) 1NN

B) 3NN

C) 4NN

D) All have same leave one out error

33) Suppose you are given the data below and you want to apply a logistic regression model to classify it into the two given classes. You are using logistic regression with L1 regularization, where C is the regularization parameter and w1 & w2 are the coefficients of x1 and x2.

Which of the following options is correct when you increase the value of C from zero to a very large value?

A) First w2 becomes zero and then w1 becomes zero

B) First w1 becomes zero and then w2 becomes zero

C) Both become zero at the same time

D) Neither can become zero even for a very large value of C

34) Suppose we have a dataset which can be fit with 100% accuracy with the help of a decision tree of depth 6. Now consider the points below and choose an option based on them.

Note: All other hyperparameters are the same and other factors are not affected.

1. Depth 4 will have high bias and low variance
2. Depth 4 will have low bias and low variance

A) Only 1

B) Only 2

C) Both 1 and 2

D) None of the above

35) Which of the following options can be used to reach the global minimum in the k-Means algorithm?

1. Try to run the algorithm for different centroid initializations
2. Adjust the number of iterations
3. Find out the optimal number of clusters

A) 2 and 3

B) 1 and 3

C) 1 and 2

D) All of above

36) Imagine you are working on a project which is a binary classification problem. You trained a model on the training dataset and got the confusion matrix below on the validation dataset. Based on this confusion matrix, choose which of the statements below is/are correct.

1. Accuracy is ~0.91
2. Misclassification rate is ~ 0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95

A) 1 and 3

B) 2 and 4

C) 1 and 4

D) 2 and 3

37) For which of the following hyperparameters is a higher value better for the decision tree algorithm?

1. Number of samples used for split
2. Depth of tree
3. Samples for leaf

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

E) Can’t say

#### Context 38-39

Imagine you have a 28 * 28 image and you run a 3 * 3 convolutional layer on it with an input depth of 3 and an output depth of 8.

Note: Stride is 1 and you are using same padding.

38) What is the dimension of the output feature map when you are using the given parameters?

A) 28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth
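
With “same” padding (P = 1 for a 3 × 3 kernel) and stride 1, the spatial size is preserved, and the output depth equals the number of filters. The standard output-size formula, as a short sketch:

```python
def conv_output_size(size, kernel, stride, pad):
    # Standard convolution output formula: floor((W - F + 2P) / S) + 1
    return (size - kernel + 2 * pad) // stride + 1

# 28x28 input, 3x3 kernel, stride 1, "same" padding (pad = 1), 8 filters
side = conv_output_size(28, kernel=3, stride=1, pad=1)
depth = 8   # output depth = number of filters
print(side, side, depth)  # 28 28 8
```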

39) What are the dimensions of the output feature map when you are using the following parameters?

A)  28 width, 28 height and 8 depth

B) 13 width, 13 height and 8 depth

C) 28 width, 13 height and 8 depth

D) 13 width, 28 height and 8 depth

40) Suppose we were plotting visualizations for different values of C (the penalty parameter) in the SVM algorithm. For some reason, we forgot to tag the C values with the visualizations. In that case, which of the following options best explains the C values for the images below (1, 2, 3, left to right, so the C values are C1 for image 1, C2 for image 2 and C3 for image 3) in the case of an rbf kernel?

A) C1 = C2 = C3

B) C1 > C2 > C3

C) C1 < C2 < C3

D) None of these

## End Notes

I hope you enjoyed the questions and were able to test your knowledge of machine learning. If you have any questions or doubts, feel free to post them below.

Check out all the upcoming events here.

###### Ankit Gupta

Ankit is currently working as a data scientist at UBS and has solved complex data mining problems in many domains. He is eager to learn more about data science and machine learning algorithms.

## 11 thoughts on "40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017]"

###### Amit Srivastava says: May 02, 2017 at 3:15 am
For question 25, wouldn't Occam's Razor suggest choosing option 2? It gives the same VE, but with a lower hyperparameter value. Considering that we should keep our hyperparameters, and hence our model, simpler, wouldn't option 2 be the choice? Option 4 may be overfitting the training data. Reply

###### Ankit Gupta says: May 02, 2017 at 3:29 am
Hi Amit, it is true what you are saying, but here the hyperparameter H doesn't have any interpretation. So in such a case you should choose the one which has lower training and validation error and also a close match between them. Best! Ankit Gupta Reply

###### Quan says: May 02, 2017 at 8:22 am
Hi, why is the correct answer for question 28 "Not possible"? For example, to construct a 6-NN classifier from a 2-NN one, we can perform 2-NN three times, each time discarding the two previous results. Therefore the correct answer here should be "J must be a proper factor of K". Am I missing something here? Reply

###### Ankit Gupta says: May 02, 2017 at 9:21 am
Hi Quan, thanks for noticing it. It was marked incorrectly. Reply

###### adita says: May 06, 2017 at 4:24 am
I think the correct answer for 4 should be the option which mentions both options 1 and 3. Reply

###### Ankit Gupta says: May 06, 2017 at 5:31 am
Hi Adita, in GD we use the entire training data for a single step, so the 3rd option is not possible. Best! Ankit Gupta Reply

###### Jerry says: May 06, 2017 at 1:57 pm
The answer explanation for problem 3 is a little confusing. It is a limitation of Pearson correlation that it can only check whether two variables are linearly correlated; it is not able to check for non-linear correlation. Reply

###### Ankit Gupta says: May 06, 2017 at 2:12 pm
Hi Jerry, yes, you are right. The answer to this question was explaining the same thing, but I wrote the explanation a little more simply. Thanks for noticing. Best! Ankit Gupta Reply

###### Yannis says: May 12, 2017 at 9:32 am
It would be interesting to add the option J < k. I think this can be a solution too. Thus, "J must be a proper factor of K" is not a strict condition; it is just a sub-case of J < k. Reply

###### Nicolás Tagle says: September 05, 2017 at 9:10 pm
I think 5) is not correct; an increase in the number of trees could lead to overfitting, contrary to the statement "Increase in the number of trees will cause underfitting." Reply