The idea of creating machines that learn by themselves (i.e., artificial intelligence) has been driving humans for decades now. Unsupervised learning and clustering are the keys to fulfilling that dream. Unsupervised learning provides more flexibility but is more challenging as well. This skill test will focus on clustering techniques.

Clustering plays an important role in drawing insights from unlabeled data. Clustering machine learning algorithmsÂ classify large datasets in similar groups, which improves various business decisions by providing a meta-understanding. Recently deep learning models with neural networks are also used in clustering. In this article , you will get understanding about the clustering interview questions that will help you to clear interviews.

In this skill test, we tested our community on clustering techniques. A total of 1566 people registered for this skill test. If you missed taking the test, we have provided questions and answers. Here is your opportunity for you to find out how many questions you could have answered correctly. These can also be useful as a part of data science interview questions.

Below is the distribution of scores to help you evaluate your performance:

You can view the leaderboard here. More than 390 people participated in the skill test; the highest score was 33. Here are a few statistics about the distribution.

**Overall distribution**

Mean Score: 15.11 | Median Score: 15 | Mode Score: 16

Many people wish to be data scientists and data analysts these days and wonder if they can achieve it without a background in computer science. Be rest assured, that is totally possible! There are plenty of resources, courses, and tutorials available online that cover various data science topics, such as data analysis, data mining, big data, data analytics, data modeling, data visualization, and more. Here are some of our best recommended online resources on clustering techniques.

- An Introduction to Clustering and different methods of clustering
- Getting your clustering right (Part I)
- Getting your clustering right (Part II)

If you are just getting started with Unsupervised Learning, here are some comprehensive resources to assist you in your journey:

- Machine Learning Certification Course for Beginners
- The Most Comprehensive Guide to K-Means Clustering Youâ€™ll Ever Need
- Certified AI & ML Blackbelt+ Program

**Classification****Clustering****Reinforcement Learning****Regression**

Options:

A. 2 Only

B. 1 and 2

C. 1 and 3

D. 2 and 3

E. 1, 2, and 3

F. 1, 2, 3, and 4

**Solution: (E)**

Generally, movie recommendation systems cluster the users in a finite number of similar groups based on their previous activities and profile. Then, at a fundamental level, people in the same cluster are made similar recommendations.

In some scenarios, this can also be approached as a classification problem for assigning the most appropriate movie class to the user of a specific group of users. Also, a movie recommendation system can be viewed as a reinforcement learning problem where it learns from its previous recommendations and improves future recommendations.

**Regression****Classification****Clustering****Reinforcement Learning**

Options:

A. 1 Only

B. 1 and 2

C. 1 and 3

D. 1, 2 and 3

E. 1, 2 and 4

F. 1, 2, 3 and 4

**Solution: (E)**

At the fundamental level, Sentiment analysis classifies the sentiments represented in an image, text, or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment score of, say, 1 to 10 for a corresponding image, text, or speech.

Another way of looking at sentiment analysis is to consider it using a reinforcement learning perspective where the algorithm constantly learns from the accuracy of past sentiment analysis performed to improve future performance.

A. True

B. False

**Solution: (A)**

Decision trees (and also random forests)can also be used for clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.

**Capping and flouring of variables****Removal of outliers**

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

**Solution: (A)**

Removal of outliers is not recommended if the data points are few in number. In this scenario, the capping and flouring of variables is the most appropriate strategy.

Options:

A. 0

B. 1

C. 2

D. 3

**Solution: (B)**

At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.

A. Yes

B. No

**Solution: (B)**

K-Means clustering algorithm instead converses on local minima, which might also correspond to the global minima in some cases but not always. Therefore, itâ€™s advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.

However, note that itâ€™s possible to receive the same clustering results from K-means by setting the same seed value for each run. But that is done by simply making the algorithm choose the set of the same random no. for each run.

Options:

A. Yes

B. No

C. Canâ€™t say

D. None of these

**Solution: (A)**

When the K-Means machine learning model has reached the local or global minima, it will not alter the assignment of data points to clusters for two successive iterations.

**For a fixed number of iterations.****The assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum.****Centroids do not change between successive iterations.****Terminate when RSS falls below a threshold.**

Options:

A. 1, 3 and 4

B. 1, 2 and 3

C. 1, 2 and 4

D. All of the above

**Solution: (D)**

All four conditions can be used as possible termination conditions in K-Means clustering:

- This condition limits the runtime of the clustering algorithm, but in some cases, the quality of the clustering will be poor because of an insufficient number of iterations.
- Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
- This also ensures that the algorithm has converged at the minima.
- Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of the desired quality after termination. Practically, combining it with a bound on the number of iterations to guarantee termination is a good practice.

**K- Means clustering algorithm****Agglomerative clustering algorithm****Expectation-Maximization clustering algorithm****Diverse clustering algorithm**

Options:

A. 1 only

B. 2 and 3

C. 2 and 4

D. 1 and 3

E. 1,2 and 4

F. All of the above

**Solution: (D) **

Out of the options given, only the K-Means clustering algorithm and EM clustering algorithm have the drawback of converging at local minima.

Options:

A. K-means clustering algorithm

B. K-medians clustering algorithm

C. K-modes clustering algorithm

D. K-medoids clustering algorithm

**Solution: (A)**

Out of all the options, the K-Means clustering algorithm is most sensitive to outliers as it uses the mean of cluster data points to find the cluster center.

Options:

A. There were 28 data points in the clustering analysis

B. The best no. of clusters for the analyzed data points is 4

C. The proximity function used is Average-link clustering

D. The above dendrogram interpretation is not possible for K-Means clustering analysis

**Solution: (D)**

A dendrogram is not possible for K-Means clustering analysis. However, one can create a cluster gram based on K-Means clustering analysis.

**Creating different models for different cluster groups.****Creating an input feature for cluster ids as an ordinal variable.****Creating an input feature for cluster centroids as a continuous variable.****Creating an input feature for cluster size as a continuous variable.**

Options:

A. 1 only

B. 1 and 2

C. 1 and 4

D. 3 only

E. 2 and 4

F. All of the above

**Solution: (F)**

Creating an input feature for cluster ids as ordinal variables or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data. But for clustering in a single dimension, all of the given methods are expected to convey meaningful information to the regression model. For example, to cluster people in two groups based on their hair length, storing clustering IDs as ordinal variables and cluster centroids as continuous variables will convey meaningful information.

A. Proximity function used

B. of data points used

C. of variables used

D. B and c only

E. All of the above

**Solution: (E)**

Change in either of the proximity function, no. of data points, or no. of variables will lead to different clustering results and hence different dendrograms.

Options:

A. 1

B. 2

C. 3

D. 4

**Solution: (B)**

Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram is 2, therefore, two clusters will be formed.

Options:

A. 2

B. 4

C. 6

D. 8

**Solution: (B)**

The decision of the no. of clusters that can best depict different groups can be chosen by observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can transverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

**Data points with outliers****Data points with different densities****Data points with round shapes****Data points with non-convex shapes**

Options:

A. 1 and 2

B. 2 and 3

C. 2 and 4

D. 1, 2 and 4

E. 1, 2, 3 and 4

**Solution: (D)**

The K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different, and the data points follow non-convex shapes.

**Single-link****Complete-link****Average-link**

Options:

A. 1 and 2

B. 1 and 3

C. 2 and 3

D. 1, 2 and 3

**Solution: (D)**

All three methods, i.e., single link, complete link, and average link, can be used for finding dissimilarity between two clusters in hierarchical clustering( can be found in the Python library scikit-learn).

**Clustering analysis is negatively affected by the multicollinearity of features****Clustering analysis is negatively affected by heteroscedasticity**

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of them

**Solution: (A)**

Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by the multicollinearity of features/ variables used in clustering as the correlated feature/ variable will carry extra weight on the distance calculation than desired.

**Context for Question 19: Given are six points with the following attributes**

A.

B.

C.

D.

**Solution: (A)**

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined to be the minimum distance between any two points in the different clusters. For instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.

**Context for Question 20: Given are six points with the following attributes**** **

A.

B.

C.

D.

**Solution: (B)**

For the single link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum distance between any two points in the different clusters. Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5}. This is because the dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.

**Context for Question 21: Given are six points with the following attributes**** **

A.

B.

C.

D.

**Solution: (C)**

For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the average of the pairwise proximities between all pairs of points in the different clusters. This is an intermediate approach between MIN and MAX. This is expressed by the following equation:

Here, the distance between some clusters. dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 âˆ— 1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 âˆ— 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(6âˆ—1) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.

**Context for Question 22: Given are six points with the following attributes**** **

A.

B.

C.

D.

**Solution: (D)**

Ward method is a centroid method. The centroid method calculates the proximity between two clusters by calculating the distance between the centroids of clusters. For Wardâ€™s method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. The results of applying Wardâ€™s method to the sample data set of six points. The resulting clustering is somewhat different from those produced by MIN, MAX, and group average.

Options:

A. 1

B. 2

C. 3

D. 4

**Solution: (C)**

The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other clusters. The number of clusters for which the silhouette coefficient is highest represents the best choice of the number of clusters.

Options:

A. Imputation with mean

B. Nearest Neighbor assignment

C. Imputation with Expectation Maximization algorithm

D. All of the above

**Solution: (C)**

All of the mentioned techniques are valid for treating missing values before clustering analysis, but only imputation with the EM algorithm is iterative in its functioning.

**Note: Soft assignment can be considered as the probability of being assigned to each cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1)****Which of the following algorithm(s) allows soft assignments?**

**Gaussian mixture models****Fuzzy K-means**

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of these

**Solution: (C)**

Both, Gaussian mixture models and Fuzzy K-means allow soft assignments.

**C1: {(2,2), (4,4), (6,6)}**

Options:

A. C1: (4,4), C2: (2,2), C3: (7,7)

B. C1: (6,6), C2: (4,4), C3: (9,9)

C. C1: (2,2), C2: (0,0), C3: (5,5)

D. None of these

**Solution: (A)**

Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)

Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)

Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)

Hence, C1: (4,4), C2: (2,2), C3: (7,7)

**C1: {(2,2), (4,4), (6,6)}**

Options:

A. 10

B. 5*sqrt(2)

C. 13*sqrt(2)

D. None of these

**Solution: (A)**

Manhattan distance between centroid C1, i.e., (4, 4) and (9, 9) = (9-4) + (9-4) = 10

**If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line****If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line**

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

**Solution: (A)**

If the correlation between the variables V1 and V2 is 1, then all the data points will be in a straight line. Hence, all three cluster centroids will form a straight line as well.

Options:

A. In distance calculation, it will give the same weights for all features

B. You always get the same clusters. If you use or donâ€™t use feature scaling

C. In Manhattan distance, it is an important step, but in Euclidean distance, it is not

D. None of these

**Solution: (A)**

Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weights (in KG) with a range of 55-110 and height (in inches) with a range of 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading as the range of weight is much higher than that of height. Therefore, it is necessary to bring them to the same scale so that they have equal weightage on the clustering result.

Options:

A. Elbow method

B. Manhattan method

C. Ecludian method

D. All of the above

E. None of these

**Solution: (A)**

Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesnâ€™t give a much better modeling of the data.

**K-means is extremely sensitive to cluster center initializations****Bad initialization can lead to Poor convergence speed****Bad initialization can lead to bad overall clustering**

Options:

A. 1 and 3

B. 1 and 2

C. 2 and 3

D. 1, 2 and 3

**Solution: (D)**

All three of the given statements are true. K-means is extremely sensitive to cluster center initialization. Also, bad initialization can lead to Poor convergence speed as well as bad overall clustering.

**Try to run the algorithm for different centroid initialization****Adjust the number of iterations****Find out the optimal number of clusters**

Options:

A. 2 and 3

B. 1 and 3

C. 1 and 2

D. All of above

**Solution: (D)**

All of these are standard practices that are used in order to obtain good clustering results.

Options:

A. 5

B. 6

C. 14

D. Greater than 14

**Solution: (B)**

Based on the above results, the best choice of the number of clusters using the elbow method is 6.

Options:

A. 2

B. 4

C. 6

D. 8

**Solution: (C)**

Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the optimal clustering number of grid cells in the study area should be 2, at which the value of the average silhouette coefficient is the highest. However, the SSE of this clustering solution (k = 2) is too large. At k = 6, the SSE is much lower. In addition, the value of the average silhouette coefficient at k = 6 is also very high, which is just lower than k = 2. Thus, the best choice is k = 6.

**Specify the number of clusters****Assign cluster centroids randomly****Assign each data point to the nearest cluster centroid****Re-assign each point to the nearest cluster centroid****Re-compute cluster centroids**

Options:

A. 1, 2, 3, 5, 4

B. 1, 3, 2, 4, 5

C. 2, 1, 3, 4, 5

D. None of these

**Solution: (A)**

The methods used for initialization in K means are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method **randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean as** the centroid of the clusterâ€™s randomly assigned points.

Options:

A. All the data points follow two Gaussian distribution

B. All the data points follow n Gaussian distribution (n >2)

C. All the data points follow two multinomial distribution

D. All the data points follow n multinomial distribution (n >2)

**Solution: (C)**

In the EM algorithm for clustering itâ€™s essential to choose the same no. of clusters to classify the data points into the no. of different distributions they are expected to be generated from and also, the distributions must be of the same type.

**Both start with random initializations****Both are iterative algorithms****Both have strong assumptions that the data points must fulfill****Both are sensitive to outliers****The expectation-maximization algorithm is a special case of K-Means****Both require prior knowledge of the no. of desired clusters****The results produced by both are non-reproducible**

Options:

A. 1 only

B. 5 only

C. 1 and 3

D. 6 and 7

E. 4, 6 and 7

F. None of the above

**Solution: (B)**

All of the above statements are true except the 5th as instead K-Means is a special case of EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.

**For data points to be in a cluster, they must be in a distance threshold to a core point****It has strong assumptions for the distribution of data points in the dataspace****It has substantially high time complexity of order O(n3)****It does not require prior knowledge of the no. of desired clusters****It is robust to outliers**

Options:

A. 1 only

B. 2 only

C. 4 only

D. 2 and 3

E. 1 and 5

F. 1, 3 and 5

**Solution: (D)**

DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions for the distribution of data points in the data space. DBSCAN has a low time complexity of order O(n log n) only.

Options:

A. [0,1]

B. (0,1)

C. [-1,1]

D. None of the above

**Solution: (A)**

The lowest and highest possible values of the F score are 0 and 1, where 1 means that every data point is assigned to the correct cluster, and 0 means that the precession and/ or recall of the clustering analysis are both 0. In clustering analysis, a high value of F score is desired.

Options:

A. 3

B. 4

C. 5

D. 6

**Solution: (D)**

Here,

True Positive, TP = 1200

True Negative, TN = 600 + 1600 = 2200

False Positive, FP = 1000 + 200 = 1200

False Negative, FN = 400 + 400 = 800

Therefore,

Precision = TP / (TP + FP) = 0.5

Recall = TP / (TP + FN) = 0.6

Hence,

F1 = 2 * (Precision * Recall)/ (Precision + recall) = 0.54 ~ 0.5

You have successfully completed our skill test focused on conceptual and practical knowledge of clustering fundamentals and its various techniques. I hope taking this test and finding the solutions has helped you gain knowledge and boost your confidence in the topic.

Hope you article and get understanding about the clustering interview questions.

If you are preparing for a data science job interview, I suggest you also check out our guides of important interview questions on logistic regression, SQL, tensor flow, k-nearest neighbor, and naive bayes.

Lorem ipsum dolor sit amet, consectetur adipiscing elit,