**Data Science** is getting more popular by the day, with data scientists using **Artificial Intelligence** and **Machine Learning** to solve challenging and complex problems. It is one of the hottest fields, and many people aspire to enter it. According to a recent survey, the number of opportunities related to Data Science increased during the **COVID-19** pandemic. Ever wondered what it takes to ace a data science interview at startups and top product-based companies like Amazon?

So, I have curated a list of more than 30 questions, spanning Probability and Statistics, Machine Learning, and Deep Learning, that I have faced in several data science interviews. The questions and answers suit beginners as well as intermediate and advanced learners, and cover topics such as decision trees and the types, benefits, and disadvantages of the Naive Bayes classifier. These topics underpin the techniques that data scientists and data analysts use for building models, exploratory data analysis, data cleaning, and data mining.

This article comprises over 30 data science interview questions which are broadly divided into three sections:

**Probability, Statistics, and Machine Learning Algorithms**

**Deep Learning**

**Coding Questions**

*This article was published as a part of the **Data Science Blogathon**.*

(A) We assume the missing values as the mean of all values.

(B) We ignore the missing features.

(C) We integrate the posterior probabilities over the missing features.

(D) Drop the features completely.

**Answer: (C)**

Explanation: Here, we don’t use general methods of handling missing values; instead, we integrate the posterior probabilities over the missing features for better predictions.

(A) For a very large value of K, points from other classes may be included in the neighborhood.

(B) For a very small value of K, the algorithm is very sensitive to noise.

(C) KNN is used only for classification problem statements.

(D) KNN is a lazy learner.

**Answer: (C)**

Explanation: We can use KNN for both regression and classification problem statements. In classification, we predict the majority class among the K nearest neighbors, while in regression, we predict the average of the target values of the K nearest neighbors.
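To make the two uses of KNN concrete, here is a minimal sketch in plain Python. The toy data, the single numeric feature, and the helper names (`k_nearest`, `knn_classify`, `knn_regress`) are assumptions made for illustration, not part of any library.

```python
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training rows closest to `query` (one feature, absolute distance)."""
    return sorted(train, key=lambda row: abs(row[0] - query))[:k]

def knn_classify(train, query, k):
    # Classification: majority vote among the labels of the k nearest neighbors.
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    # Regression: mean of the target values of the k nearest neighbors.
    targets = [target for _, target in k_nearest(train, query, k)]
    return sum(targets) / len(targets)

# Hypothetical toy data: (feature, label) pairs and (feature, target) pairs.
class_data = [(1.0, "red"), (1.5, "red"), (2.0, "red"), (6.0, "blue"), (7.0, "blue")]
reg_data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (10.0, 100.0)]

predicted_label = knn_classify(class_data, query=1.8, k=3)
predicted_value = knn_regress(reg_data, query=2.1, k=3)
print(predicted_label, predicted_value)
```

Because nothing is fitted ahead of time and all the work happens at query time, the same sketch also shows why KNN is called a lazy learner.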

(A) Outliers should always be identified and removed from a dataset.

(B) Outliers can never be present in the test set.

(C) An outlier is a data point that is significantly close to other data points.

(D) The nature of our business problem determines how outliers are used.

**Answer: (D)**

Explanation: The nature of a business problem often determines how outliers are used. For example, in problems with class imbalance, such as credit card fraud detection, records of the fraud class are very few compared with the non-fraud class, and those rare records are exactly what we want to detect.

| X | 1 | 20 | 30 | 40 |
|---|---|----|----|----|
| Y | 1 | 400 | 800 | 1300 |

(A) 27.876

(B) 32.650

(C) 40.541

(D) 28.956

**Answer: (D)**

Explanation: Use the ordinary least squares method.
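The question stem is not reproduced here, so the following is only a sketch under an assumption: that the question asks for the least-squares slope of a straight line through the origin, y = m·x. Under that assumption the closed form is m = Σxy / Σx², and it reproduces option (D) for the data in the table.

```python
# Data from the table above.
x = [1, 20, 30, 40]
y = [1, 400, 800, 1300]

# Least-squares slope for y = m * x (no intercept): m = sum(x*y) / sum(x^2).
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 84001
sum_xx = sum(xi * xi for xi in x)               # 2901
slope_through_origin = sum_xy / sum_xx

print(round(slope_through_origin, 3))  # 28.956
```

A fit that also estimates an intercept gives a different slope, so the no-intercept assumption matters here.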

(A) Supervised Learning.

(B) Unsupervised Learning.

(C) Reinforcement Learning.

(D) Both (A) and (B).

**Answer: (C)**

Explanation: Here, the robot learns from the environment by receiving rewards for positive actions and penalties for negative actions.

(A) Decision tree is only suitable for the classification problem statement.

(B) In a decision tree, the entropy of a node decreases as we go down the decision tree.

(C) In a decision tree, entropy determines purity.

(D) Decision tree can be used only for numeric-valued and continuous attributes.

**Answer: (B)**

Explanation: Entropy helps to determine the impurity of a node, and as we go down the decision tree, entropy decreases.

(A) An attribute having high entropy

(B) An attribute having high entropy and information gain

(C) An attribute having the lowest information gain.

(D) An attribute having the highest information gain.

**Answer: (D)**

Explanation: We first select the attribute with the maximum information gain.
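As a rough illustration of how entropy and information gain drive the choice of a split, the sketch below compares two hypothetical splits of the same parent node. The labels and the helper names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node with 5 "yes" and 5 "no" examples.
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 5, ["no"] * 5]                              # pure children
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]   # still mixed

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)
```

The attribute that produces `split_a` would be chosen first, since it has the higher information gain and leaves child nodes with lower entropy than the parent.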

(A) Euclidean distance.

(B) Manhattan distance.

(C) Minkowski distance.

(D) Hamming distance.

**Answer: (D)**

Explanation: Hamming distance counts the positions at which two equal-length strings differ. It was designed for comparing binary data strings, which makes it suitable for categorical variables.
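A minimal sketch of Hamming distance, assuming both inputs have the same length. The example records are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Binary strings: they differ at the third and fifth positions.
d_bits = hamming_distance("10110", "10011")

# Categorical records (hypothetical): only the size attribute differs.
d_cats = hamming_distance(["red", "S", "cotton"], ["red", "M", "cotton"])

print(d_bits, d_cats)
```

Since it only checks whether values match, it needs no ordering or arithmetic on the categories, unlike Euclidean or Manhattan distance.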

(A) towards region R1.

(B) towards region R2.

(C) No shift in decision boundary.

(D) It depends on the exact value of priors.

**Answer: (B)**

Explanation: Since the prior for w1 is greater than the prior for w2, the decision boundary shifts towards region R2, which keeps the decision regions in proportion to the prior probabilities.

(A) These are types of regularization methods to solve the overfitting problem.

(B) Lasso Regression is a type of regularization method.

(C) Ridge regression shrinks the coefficient to a lower value.

(D) Ridge regression lowers some coefficients to a zero value.

**Answer: (D)**

Explanation: Ridge regression never drops any feature; instead, it shrinks the coefficients. However, Lasso regression drops some features by making the coefficient of that feature zero. Therefore, the latter is used as a Feature Selection Technique.
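The difference can be seen in a deliberately simplified setting: one feature, no intercept, and closed-form solutions. This is a sketch under those assumptions, with made-up data and a hypothetical penalty `lam`, not how a library solver works in general.

```python
def soft_threshold(value, lam):
    """Shrink `value` towards zero by `lam`, clipping at exactly zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_one_feature(x, y, lam):
    """Closed-form OLS, ridge, and lasso coefficients for y ~ w * x (no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    ols = sxy / sxx
    ridge = sxy / (sxx + lam)                # minimises 0.5*SSE + 0.5*lam*w^2
    lasso = soft_threshold(sxy, lam) / sxx   # minimises 0.5*SSE + lam*|w|
    return ols, ridge, lasso

# Hypothetical weakly related feature and target.
x = [1.0, 2.0, 3.0]
y = [0.1, -0.1, 0.2]

ols, ridge, lasso = fit_one_feature(x, y, lam=1.0)
print(ols, ridge, lasso)
```

With this penalty the ridge coefficient is smaller than the OLS coefficient but still non-zero, while the lasso coefficient is exactly zero, which is the behavior that lets Lasso act as a feature selector.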

(A) A zero correlation does not necessarily imply independence between variables.

(B) Correlation and covariance values are the same.

(C) The covariance and correlation always have the same sign.

(D) Correlation is the standardized version of Covariance.

**Answer: (B)**

Explanation: Correlation is defined as covariance divided by standard deviations and, therefore, is the standardized version of covariance.
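The relationship between the two quantities can be checked numerically. The sketch below uses made-up height and weight values and plain-Python helpers. It shows that rescaling a variable changes the covariance but not the correlation, and that zero covariance does not imply independence.

```python
import math

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical measurements.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 58, 66, 80]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)      # 100 times smaller than cov_cm
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)    # unchanged by the unit change

# y depends entirely on x, yet the covariance is zero.
xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]
cov_dependent = covariance(xs, ys)

print(cov_cm, cov_m, corr_cm, corr_m, cov_dependent)
```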

(A) one predictor and one or more response variables are related.

(B) several predictors and several response variables are related.

(C) one response and one or more predictors are related.

(D) All of these are correct.

**Answer: (C)**

Explanation: In the regression problem statement, we have several independent variables but only one dependent variable.

(A) True

(B) False

(C) Can’t be determined

(D) None of these

**Answer: (A)**

Explanation: If a particular attribute value never appears in the training dataset, its estimated conditional probability is zero. Because Naive Bayes multiplies these probabilities together, a single zero makes the whole product zero, which is known as the zero-probability problem.
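The zero-probability problem is easy to reproduce by hand. The sketch below estimates a conditional probability from counts. The word list, the vocabulary size, and the `likelihood` helper are hypothetical, and the `alpha` parameter implements Laplace (add-one) smoothing, a common remedy that is an addition here rather than something the explanation itself covers.

```python
from collections import Counter

def likelihood(value, observed_values, vocabulary_size, alpha=0.0):
    """Estimate P(value | class) from counts, with optional additive smoothing."""
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * vocabulary_size)

# Hypothetical words seen in the "spam" class during training.
spam_words = ["offer", "win", "offer", "prize"]
vocabulary_size = 4  # assumed vocabulary: offer, win, prize, meeting

# "meeting" never occurs in the spam class, so the raw estimate is zero.
p_unsmoothed = likelihood("meeting", spam_words, vocabulary_size)

# With alpha = 1 every value gets a small non-zero probability.
p_smoothed = likelihood("meeting", spam_words, vocabulary_size, alpha=1.0)

print(p_unsmoothed, p_smoothed)
```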

(A) Bagging decreases the variance of the classifier.

(B) Boosting helps to decrease the bias of the classifier.

(C) Bagging combines the predictions from different models and then finally gives the results.

(D) Bagging and Boosting are the only available ensemble techniques.

**Answer: (D)**

Explanation: Apart from bagging and boosting, there are various other ensemble techniques, such as stacking, the extra-trees classifier, and the voting classifier.

(A) Bayes classifier works on the Bayes theorem of probability.

(B) Bayes classifier is an unsupervised learning algorithm.

(C) Bayes classifier is also known as maximum apriori classifier.

(D) It assumes the independence between the independent variables or features.

**Answer: (A)**

Explanation: Bayes classifier internally uses the concept of the Bayes theorem for doing the predictions for unseen data points.

(A) It is the ratio of true positive to false negative predictions.

(B) It is the measure of how accurately a model can identify positive classes out of all the positive classes present in the dataset.

(C) It is the measure of how accurately a model can identify true positives from all the positive predictions that it has made.

(D) It is the measure of how accurately a model can identify true negatives from all the positive predictions that it has made.

**Answer: (C)**

Explanation: Precision is the ratio of true positive and (true positive + false positive), which means that it measures, out of all the positive predicted values by a model, how precisely a model predicted the truly positive values.
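A short sketch of the computation, using made-up labels and a hypothetical helper `precision_recall`. Recall is included only to contrast it with precision.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical ground truth and predictions: 3 TP, 2 FP, 1 FN.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```

Here 5 examples are predicted positive and 3 of them are truly positive, so precision is 3/5, while 3 of the 4 actual positives are found, so recall is 3/4.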

(A) High bias means that the model is underfitting.

(B) High variance means that the model is overfitting.

(C) Bias and variance are inversely proportional to each other.

(D) All of the above

**Answer: (D)**

Explanation: A model with high bias is unable to capture the underlying patterns in the data and consistently underestimates or overestimates the true values, which means that the model is underfitting. A model with high variance is overly sensitive to the noise in the data and may produce vastly different results for different samples of the same data. Therefore it is important to maintain the balance of both variance and bias. As they are inversely proportional to each other, this relationship between bias and variance is often referred to as the bias-variance trade-off.

(A) Random forest

(B) SVM (Support Vector Machine)

(C) Logistic regression

(D) Both A and B

**Answer: (D)**

Explanation: Support Vector Machines (SVMs) and Random Forests are two popular machine-learning algorithms that can be used for both classification and regression tasks.

(A) It is computationally expensive.

(B) It can get stuck in local minima.

(C) It requires a large amount of labeled data.

(D) It can only handle numerical data.

**Answer: (B)**

Explanation: It can get stuck in local minima.

(A) RMSprop.

(B) Adagrad.

(C) Adam.

(D) Nesterov.

**Answer: (C)**

Explanation: Adam, a popular deep learning optimizer, combines momentum with adaptive learning rates.
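The two ingredients can be seen in a bare-bones, single-parameter version of the Adam update. This is a sketch with the commonly used default values for `beta1`, `beta2`, and `eps`. The function `adam_minimize`, the learning rate, and the quadratic objective are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a 1-D function with Adam, given its gradient function."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # adaptive part: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

The `m` term is what Adam shares with momentum-based methods, and dividing by the square root of `v_hat` is what it shares with RMSprop-style adaptive methods.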

(A) Hyperbolic Tangent.

(B) Sigmoid.

(C) Softmax.

(D) Rectified Linear Unit (ReLU).

**Answer: (A)**

Explanation: The Hyperbolic Tangent activation function gives output in the range (-1, 1), which is symmetric about zero.
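The output ranges are easy to check numerically. The sketch below evaluates tanh, sigmoid, and ReLU on a few hypothetical inputs using only the standard library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]

tanh_out = [math.tanh(z) for z in inputs]     # values in (-1, 1), centred on 0
sigmoid_out = [sigmoid(z) for z in inputs]    # values in (0, 1), centred on 0.5
relu_out = [relu(z) for z in inputs]          # values in [0, inf)

for name, values in [("tanh", tanh_out), ("sigmoid", sigmoid_out), ("relu", relu_out)]:
    print(name, [round(v, 3) for v in values])
```

Only the tanh outputs mirror each other around zero for inputs of opposite sign. Sigmoid and ReLU never produce negative values.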

(A) It resembles Recurrent Neural Networks (RNNs), which have feedback loops.

(B) It uses the radial basis function as an activation function.

(C) While outputting, it considers the distance of a point with respect to the center.

(D) The output given by the Radial basis function is always an absolute value.

**Answer: (A)**

Explanation: A radial basis function network does not resemble an RNN. It is a feed-forward artificial neural network whose units respond to the distance of an input point from a center rather than to a weighted sum of the inputs.

(A) When you want to quickly build a prototype using neural networks.

(B) When you want to implement simple neural networks in your initial learning phase.

(C) When doing critical and intensive research in any field.

(D) When you want to create simple tutorials for your students and friends.

**Answer: (C)**

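Explanation: Keras is a high-level API built on top of TensorFlow, so it suits prototyping, learning, and tutorials. For critical and intensive research it is generally not preferred; TensorFlow itself provides both high-level and low-level APIs and therefore offers finer control.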

(A) Deep Learning algorithms work efficiently on a high amount of data and require high computational power.

(B) Feature Extraction needs to be done manually in both ML and DL algorithms.

(C) Deep Learning algorithms are best suited for an unstructured set of data.

(D) Deep Learning is a subset of machine learning

**Answer: (B)**

Explanation: Usually, in deep learning algorithms, feature extraction happens automatically in hidden layers.

(A) Increase the number of iterations

(B) Use dimensionality reduction techniques

(C) Use cross-validation technique to reduce underfitting

(D) Use data augmentation techniques to increase the amount of data used.

**Answer: (D)**

Explanation: Options A and B can be used to reduce overfitting in a model. Option C is just used to check if there is underfitting or overfitting in a model but cannot be used to treat the issue. Data augmentation techniques can help reduce underfitting as it produces more data, and the noise in the data can help in generalizing the model.

(A) Artificial neurons are similar in operation to biological neurons.

(B) Training time for a neural network depends on network size.

(C) Neural networks can be simulated on conventional computers.

(D) The basic units of neural networks are neurons.

**Answer: (A)**

Explanation: An artificial neuron does not operate like a biological neuron. An artificial neuron computes a weighted sum of its inputs plus a bias and then applies an activation function to produce its output, whereas a biological neuron works through axons, synapses, and so on.

(A) AND

(B) OR

(C) NOR

(D) XOR

**Answer: (D)**

Explanation: A perceptron always gives a linear decision boundary. However, implementing the XOR function requires a non-linear decision boundary.
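This can be demonstrated with the classic perceptron learning rule. The sketch below, with a hypothetical helper `train_perceptron` and an arbitrary cap of 50 epochs, trains a single perceptron on the AND and XOR truth tables and reports whether an error-free pass over the data was ever reached.

```python
def train_perceptron(samples, epochs=50, lr=1.0):
    """Train a 2-input perceptron; return (converged, weights)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            prediction = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            update = lr * (target - prediction)
            if update != 0:
                # Perceptron rule: move the boundary towards the misclassified point.
                w1 += update * x1
                w2 += update * x2
                b += update
                errors += 1
        if errors == 0:
            return True, (w1, w2, b)
    return False, (w1, w2, b)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged, and_weights = train_perceptron(AND)
xor_converged, xor_weights = train_perceptron(XOR)
print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is linearly separable, so the rule finds a separating line after a few epochs. No line separates the XOR classes, so every epoch contains at least one mistake regardless of the epoch cap.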

(A) Local Minima.

(B) Oscillations.

(C) Slow convergence.

(D) All of the above.

**Answer: (D)**

Explanation: The learning rate decides how fast or slow the optimizer moves towards the global minimum. With an inappropriate learning rate we may never reach the global minimum: the optimizer can get stuck at a local minimum, oscillate around the minimum, or converge very slowly.
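Two of these failure modes show up even on the simplest possible objective. The sketch below runs plain gradient descent on the assumed function f(x) = x², whose gradient is 2x, with three hypothetical learning rates. A convex function has no local minima, so that case is not illustrated here.

```python
def gradient_descent(lr, x0=5.0, steps=50):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x^2 is 2x
    return x

x_slow = gradient_descent(lr=0.001)   # too small: barely moves from 5.0
x_good = gradient_descent(lr=0.1)     # reasonable: ends close to the minimum at 0
x_osc = gradient_descent(lr=1.0)      # too large: flips between +5 and -5 forever

print(x_slow, x_good, x_osc)
```

Each step multiplies x by (1 - 2·lr), which is why a tiny rate shrinks x very slowly and a rate of 1.0 only flips its sign.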

```
import numpy as np
n_array = np.array([1, 0, 2, 0, 3, 0, 0, 5, 6, 7, 5, 0, 8])
res = np.where(n_array == 0)[0]  # indices of the zero entries
print(res.sum())  # sum of those indices
```

(A) 25

(B) 26

(C) 6

(D) None of these

**Answer: (B)**

Explanation: np.where(n_array == 0)[0] gives the array of indices at which n_array is zero, here [1, 3, 5, 6, 11], and these indices sum to 26.

```
import numpy as np
p = [[1, 0], [0, 1]]
q = [[1, 2], [3, 4]]
result1 = np.cross(p, q)
result2 = np.cross(q, p)
print((result1==result2).shape[0])
```

(A) 0

(B) 1

(C) 2

(D) Code is not executable.

**Answer: (C)**

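Explanation: For 2-element vectors, np.cross returns one scalar per row, so result1 is [2, -3] and result2 is [-2, 3]; the cross product is anti-commutative, not commutative. The element-wise comparison is therefore [False, False], an array of shape (2,), and shape[0] is 2.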

```
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(2))
print(s.size)
```

(A) 0

(B) 1

(C) 2

(D) Answer is not fixed due to randomness.

**Answer: (C)**

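Explanation: np.random.randn(2) returns two samples from the “standard normal” distribution. The size attribute counts the elements of the Series, so the output is 2 regardless of the random values.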

```
import numpy as np
student_id = np.array([1023, 5202, 6230, 1671, 1682, 5241, 4532])
i = np.argsort(student_id)
print(i[5])
```

(A) 2

(B) 3

(C) 4

(D) 5

**Answer: (D)**

Explanation: argsort() returns the indices that would sort the array in ascending order. Here that index array is [0, 3, 4, 6, 1, 5, 2], so i[5] is 5.

```
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print(s.ndim)
```

(A) 1

(B) 2

(C) 0

(D) 3

**Answer: (A)**

Explanation: The ndim attribute returns the number of dimensions of the object. A pandas Series is one-dimensional, so the output is 1.

```
import numpy as np
my_array = np.arange(6).reshape(2,3)
result = np.trace(my_array)
print(result)
```

(A) 2

(B) 4

(C) 6

(D) 8

**Answer: (B)**

Explanation: arange(6) gives a 1-D array with values from 0 to 5, and reshape(2, 3) turns it into the 2-D array [[0, 1, 2], [3, 4, 5]]. trace then sums the elements on the main diagonal: 0 + 4 = 4.

```
import numpy as np
# np.linalg is already reachable through the numpy import
a = np.array([[1, 0], [1, 2]])
print(type(np.linalg.det(a)))
```

(A) INT

(B) FLOAT

(C) STR

(D) BOOL

**Answer: (B)**

Explanation: np.linalg.det returns a floating-point value (numpy.float64) even for an integer matrix, so the printed type is a float. The determinant here is 2, computed as a float.

You have now gone through over 30 important data science interview questions that I’m sure have helped you gain knowledge and confidence to ace your next data science interview! These multiple-choice questions have covered topics spanning from Probability and Statistics to Machine Learning and Deep Learning and are suitable for beginners, intermediate, and advanced learners. The article emphasizes the importance of understanding the fundamental concepts and techniques in data science for succeeding in data science interviews.

Do check out our other articles covering important interview questions on SQL, Time Series, Data Science and Machine Learning.

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*
