Interview Questions on Exploratory Data Analysis (EDA)

Radhika Last Updated : 24 Jun, 2022


This article was published as a part of the Data Science Blogathon.

Introduction

Are you aspiring to become a data analyst or data scientist but struggling to crack the interviews? Getting a break in the data science field can be tough, doubly so if you are a fresher. So it’s better to be prepared before facing the interviews. There are several rounds one has to clear to land a data science job, and one of the most important is the technical round. But what kind of questions can be asked in the technical round? How can you prepare, and what resources should you refer to?

This article lists nine questions that are likely to come up in the technical round for a data science job.

I have seen candidates fail interviews because, although they had good knowledge of models, they did not give enough importance to Exploratory Data Analysis. They failed to understand the balance between EDA and modeling. So, if you can understand and answer these EDA interview questions, rest assured, you will put up a tough fight in your job interview.

Happy learning and Good luck Guys !!

Questions:

1. What is the lifecycle of the data science project?

Data Collection
Exploratory Data Analysis
Model Training and Testing
Results Analysis from the models.

EDA Interview Questions - Data Science Lifecycle

2. What is the Difference between Univariate, Bivariate, and Multivariate analysis?

Univariate – When we analyze one variable at a time, it is called univariate data analysis. This analysis aims to describe the variable in question and find patterns that exist within it. Example: height of students.
Bivariate – Bivariate data involves two different variables. The analysis of this type of data deals with the relationship between the two variables, where one of them is often the target variable. Example: temperature and ice cream sales in the summer season.
Multivariate – Analyzing three or more variables together is categorized under multivariate data analysis. It is similar to bivariate analysis but involves more than two variables, typically several independent variables alongside the target.
Example: data for house price prediction.

3. Mention the two kinds of target variables for predictive modeling.

The two kinds of target variables are:

Numerical/Continuous variable – Variables whose values can be any number within a range. At prediction time, new values are not bound to fall within the range seen so far.
For example: Height of students – 5; 5.1; 6; 6.7; 7; 4.5; 5.11
Here the values lie in the range (4, 7).
The height of a new student may or may not fall within this range.
Categorical variable – Variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group on the basis of some qualitative property.
A categorical variable that can take on exactly two values is termed a binary variable or a dichotomous variable. Categorical variables with more than two possible values are called polytomous variables.
For example: Exam Result: Pass, Fail (binary categorical variable)
The blood type of a person: A, B, O, AB (polytomous categorical variable)

4. How to perform univariate analysis for numerical and categorical variables?

For the Numerical variables:
One can plot a box-and-whisker plot and a KDE plot to better understand the data; below is an example of the Age column plotted using both.

The box plot and KDE plot both show that most of the population is roughly between 25 and 50 years old, and the mean age of the population is about 38 years. The KDE curve has most of its mass on the left and a long tail to the right (a right-skewed distribution), which shows that a larger share of the population was between 20 and 30 years old and very few older people were in the sample. This can be verified from the box plot too, as the box is not symmetric and leans toward Q1.

For the Categorical variables:
Bar plots and pie charts are a great way to analyze categorical variables. The two plots represent the number (in the bar chart) and the proportion (in the pie chart) of individuals opting for each Course_Type.

Here, the bar plot and pie chart show that the “Course” Course_Type was the most common, with 51.3% of people subscribing to such courses, followed by the “Program” Course_Type. The “Degree” Course_Type was the least common, with only 0.3% subscribing to such courses.
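
The numbers behind these plots can also be computed directly. Below is a minimal sketch using only the Python standard library; the `ages` and `course_types` lists and both helper functions are hypothetical and invented for illustration, not the dataset shown in the plots.

```python
from collections import Counter
from statistics import mean, quantiles

# Hypothetical sample data, invented for illustration only.
ages = [22, 25, 27, 29, 31, 34, 38, 41, 45, 52, 60, 73]
course_types = ["Course", "Program", "Course", "Degree", "Course", "Program"]


def describe_numeric(values):
    """Return the summary numbers that a box plot visualises, plus the mean."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
    }


def describe_categorical(values):
    """Return (count, percentage) per category, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.most_common()}


age_summary = describe_numeric(ages)
course_summary = describe_categorical(course_types)
print(age_summary)
print(course_summary)
```

In this made-up sample the mean is larger than the median, which is what one would expect when a few large values stretch the right tail of the distribution.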

5. How to perform Bivariate analysis for Numerical-numerical, Categorical-Categorical, and Numerical-Categorical variables?

Univariate analysis is the analysis of one (“uni”) variable, while bivariate analysis is the analysis of exactly two variables. It is one of the simplest forms of statistical analysis and is used to find out whether there is a relationship between two sets of values.

Though bivariate analysis can be performed on any two variables, in predictive modeling it is usually performed between an independent variable and the dependent (target) variable.

Numerical-Numerical – Here, one of the numerical variables is the target variable and the other is an independent numerical variable. A scatter plot is a great way to understand the relationship between two numerical variables. In the example shown, Sales is the target numerical variable plotted on the y-axis against the User_Traffic numerical variable on the x-axis.

The scatter plot shows that Sales and User_Traffic move together: as User_Traffic increases, Sales also increases roughly linearly.

Categorical-Categorical – Here, one of the categorical variables is the target variable and the other is an independent categorical variable. In the example below, the target variable is default next month, represented by either 0 or 1, plotted against the independent categorical variable education.

Bivariate analysis of two categorical variables can easily be done with the help of double bar or stacked bar charts. In the above example, we can see that defaulters (represented by 1, in orange) are highest in number for High School, then University, and then for the Others category, even though they are so few in number.
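
A stacked bar chart is a picture of a contingency table (cross-tabulation), so it can help to build that table explicitly. The sketch below uses only the standard library; the `records` list and the `crosstab` and `default_rate` helpers are hypothetical and do not come from the dataset in the chart.

```python
from collections import Counter

# Hypothetical (education, default_next_month) records, invented for illustration.
records = [
    ("High School", 1), ("High School", 0), ("High School", 1),
    ("University", 0), ("University", 1), ("University", 0), ("University", 0),
    ("Others", 0), ("Others", 1),
]


def crosstab(pairs):
    """Count rows for every (category, target) combination."""
    counts = Counter(pairs)
    rows = sorted({cat for cat, _ in pairs})
    cols = sorted({target for _, target in pairs})
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


def default_rate(table):
    """Share of target == 1 within each category (assumes 1 appears in the data)."""
    return {row: cells[1] / sum(cells.values()) for row, cells in table.items()}


table = crosstab(records)
rates = default_rate(table)
print(table)
print(rates)
```

Looking at the row-wise rate as well as the raw counts avoids mistaking a large category for a risky one.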

Numerical-Categorical – Here, one variable is numerical and the other is categorical, and either can be the target. In such cases, bar plots or strip plots are a great way of understanding the data. Below is an example of a bar plot and a strip plot where sales, the numerical target variable, is on the y-axis and course_domain, the categorical variable, is on the x-axis.

Here, the bar plot shows that the “Business” course_domain gives the highest sales, followed by Finance, then Development, with the least sales coming from the Software course_domain. The corresponding strip plot shows that, for Business, the minimum sale value is quite high compared with the other categories and the maximum sale value is lower than for the others, yet overall it gives the most sales.
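
The same comparison can be made numerically by grouping the numerical variable by category. This is a small sketch under assumed data: the `rows` list and the `summarise_by_group` helper are hypothetical, and the figures are invented rather than read off the plots.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (course_domain, sales) rows, invented for illustration.
rows = [
    ("Business", 120), ("Business", 150),
    ("Finance", 90), ("Finance", 130),
    ("Development", 80),
    ("Software", 40), ("Software", 60),
]


def summarise_by_group(pairs):
    """Group the numeric values by category and summarise each group."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {
            "total": sum(values),   # what a summed bar plot shows
            "mean": mean(values),
            "min": min(values),     # the spread a strip plot reveals
            "max": max(values),
        }
        for category, values in groups.items()
    }


summary = summarise_by_group(rows)
for domain, stats in summary.items():
    print(domain, stats)
```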

6. What are the different tests that are used for verifying analysis/hypothesis for numerical-numerical, categorical-categorical, and numerical-categorical variables?

For numerical-numerical data, a correlation matrix is used to understand how strongly the independent variables are correlated with the target variable.

In this example, Sales is the target variable, and the variables whose correlation values are nearest to either +1 or -1 are the most strongly correlated with Sales. User_Traffic, in the orange box, has a correlation value of +0.8.
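
Each cell of a correlation matrix is a pairwise correlation coefficient. As a rough sketch, assuming the matrix uses the Pearson coefficient (the usual default, though the article does not say so), one cell can be computed by hand as below. The `user_traffic` and `sales` values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
user_traffic = [100, 150, 200, 250, 300]
sales = [12, 15, 21, 24, 33]

r = pearson_r(user_traffic, sales)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one; the coefficient says nothing about non-linear relationships.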

For Categorical-Numerical variable testing, the T-test or Z-test is mostly used, depending on whether the number of observations is below 30 (T-test) or above 30 (Z-test). If the categorical column has more than two categories, the ANOVA test is preferred over the T/Z-test.
These tests produce p-values, which help in deciding whether to reject the null hypothesis made for the columns being tested.
For Categorical-Categorical variables, the chi-square test is used. There are two types of chi-square tests:
The chi-square goodness of fit test determines whether sample data matches a population.
The chi-square test for independence checks whether two categorical variables are related to each other.
A very small chi-square test statistic means that your observed data fits the expected data extremely well. A very large chi-square test statistic means that the data does not fit very well. In the test for independence, the expected data is what we would see if the two variables were unrelated, so a small statistic gives little evidence of a relationship, while a large statistic suggests that there is a relationship.
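
For the test of independence, the statistic compares each observed count with the count expected if the two variables were unrelated. The following is a minimal sketch of that calculation on a hypothetical 2x2 table; the `chi_square_independence` helper is illustrative, applies no continuity correction, and stops at the statistic rather than computing a p-value.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two categories, columns are target 0 and 1.
observed = [[30, 10], [10, 30]]
stat, dof = chi_square_independence(observed)
print(stat, dof)

# For 1 degree of freedom, the critical value at the 0.05 level is 3.841.
CRITICAL_VALUE_DOF1 = 3.841
print("reject independence" if stat > CRITICAL_VALUE_DOF1 else "fail to reject")
```

In practice a statistics library would normally be used to obtain the p-value as well; the hand calculation is only meant to show where the statistic comes from.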

Note: If the p-value ≤ 0.05, that indicates strong evidence against the null hypothesis, so you reject the null hypothesis. If the p-value > 0.05, that indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

For more understanding of tests, one should be thoroughly familiar with basic statistics concepts.

7. During the data preprocessing step, how should one treat missing/null values? How will you deal with them?

There are three types of missing data:

MCAR: Missing Completely At Random. It is the highest level of randomness. This means that the variable with the missing values is not dependent on any other variable/feature values. An example of MCAR is a weighing scale that ran out of batteries. Some of the data will be missing simply because of bad luck.
MAR: Missing At Random. This means that the missing values in any column/feature are dependent on other feature values. For example, when placed on a soft surface, a weighing scale may produce more missing values than when placed on a hard surface. Such data are thus not MCAR. If, however, we know the surface type and if we can assume MCAR within the type of surface, then the data are MAR.
MNAR: Missing Not At Random. Data missing not at random is a more serious issue, and in this case it is advisable to examine the data gathering process further and understand the reason behind the missing data. For example, the weighing scale mechanism may wear out over time, producing more missing data as time progresses, but we may fail to note this. If the heavier objects are measured later in time, then we obtain a distorted distribution of the measurements. MNAR also includes the possibility that the scale produces more missing values for the heavier objects. As another example, if most people refuse to answer some particular questions, what was the reason? Was it an unclear question or some other issue? Finding out helps in making better business decisions and saves modeling time, as the basic issue might lie here.

Now that we have seen the types of missing data, we should check what percentage of information is missing for each feature.
If the data is Missing Completely At Random, missing values of even 20% can be ignored, but for the other two types the missing values should not be ignored.
As a rule of thumb, a feature whose data is MCAR with a high missing percentage is better dropped and not included in the modeling part. A feature whose data is MAR with a high missing percentage can also be dropped, but if the percentage is low it shouldn’t be dropped. In most cases, features with a missing percentage of more than 10% are advised not to be included in modeling.
Once unnecessary features have been dropped or ignored, the features that still have missing values should be treated using imputation. Imputation is the process of filling in missing data using statistical methods; it replaces the missing data with an estimated value based on other available information.
If the missing values in a column or feature are numerical, they can be imputed with the mean of the complete cases of the variable. The mean can be replaced by the median if the feature is suspected to have outliers. For a categorical feature, the missing values can be replaced by the mode of the column.
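
The steps above (check the missing percentage, then impute with the mean, median or mode) can be sketched in a few lines. This example assumes missing entries are represented as `None`; the `missing_percentage` and `impute` helpers and all sample values are hypothetical.

```python
from statistics import mean, median, mode


def missing_percentage(values):
    """Percentage of entries that are missing (represented here as None)."""
    return 100 * sum(v is None for v in values) / len(values)


def impute(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical columns, invented for illustration only.
income = [30, None, 50, 40, None, 400]          # 400 looks like an outlier
city = ["Pune", "Delhi", None, "Pune", "Pune"]

print(missing_percentage(income))
print(impute(income, "median"))   # median is less affected by the outlier
print(impute(city, "mode"))       # most frequent category for categorical data
```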

8. What is an outlier and how to identify them?

An outlier is an observation that is distant from the other observations. Outliers sometimes represent errors in measurement or bad data collection, sometimes reflect variables not considered when collecting the data, and sometimes are a genuine part of the data distribution. Hence they might skew the results and lead to misleading insights.
There is no single method to detect outliers, as every dataset is different. A good practice is for you (the data analyst) to inspect the unfiltered, basic observations and decide whether a value is an outlier or not based on domain knowledge.

After performing the above step, one can understand the data and look at outliers using these two methods:

Box plot: In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers will appear separate from the plot. (Source: Wikipedia)

Scatter plot: A scatter plot graphs points on two axes using Cartesian coordinates. By graphing the points this way, we can visually identify points that fall outside the expected grouping. These points are likely to be outliers. (Source: Wikipedia)

Outliers can be dropped if they are garbage values. Example: height of an adult = 0 ft. This cannot be true, as an adult’s height cannot be zero, so this value can be removed. Outliers with extreme values can also be removed. For example, if all the data points are clustered between 0 and 10 but one point lies at 100, then we can remove this point. If you cannot drop outliers, you can normalize the data. This way, the extreme data points are pulled into a similar range.
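
Box plots conventionally draw as separate points any values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The article does not state this rule, so treat the sketch below as one common convention rather than the only definition; the `iqr_outliers` helper and the data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Points clustered between 1 and 10, with a single point at 100.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
flagged = iqr_outliers(data)
print(flagged)
```

Flagging a value this way only marks it as a candidate; whether to drop it still depends on the domain knowledge discussed above.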

9. How can the data be normalized?

Data can be normalized by either transforming the data or by scaling the data down in a particular range.

Transformation – If the data is right-skewed, a log transformation is a common way to bring it closer to a normal distribution, and if the data is left-skewed, an exponential (or power) transformation helps in bringing it closer to a normal distribution.
Scaling – There are two widely used scalers:
- Normalization (Min-Max Scaler): This scales the data down to the range 0 to 1, where the minimum value corresponds to 0 and the maximum to 1.
  
  A value is normalized as follows: y = (x – min) / (max – min), where min and max are the minimum and maximum values of the feature that x belongs to.

- Standardization (Standard Scaler): This scaler rescales the data so that the mean becomes 0 and the standard deviation becomes 1, which turns a normal distribution into the standard normal distribution.

A value is standardized as follows: y = (x – mean) / standard_deviation

Note: If the distribution of the quantity is normal, then it should be standardized, otherwise, the data should be normalized. Standardization can give values that are both positive and negative centered around zero. It may be desirable to normalize data after it has been standardized.
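
Both scaling formulas are short enough to write out directly. The sketch below is a plain-Python illustration with hypothetical helper names and made-up values; it uses the population standard deviation and assumes the feature is not constant, since a constant feature would cause a division by zero.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalization: y = (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: y = (x - mean) / standard_deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, invented for illustration only.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = min_max_scale(feature)
standardized = standardize(feature)
print(normalized)
print(standardized)
```

After min-max scaling the smallest value is 0 and the largest is 1; after standardization the values are centered on zero and can be negative.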

Voila !!

End Notes !!

EDA constitutes a major part of the interview questions. I hope this was helpful. Do let me know here if there are more important EDA interview questions that you think I forgot to add to this article.

Thank you 🙂

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Hemanth

Thank you so much for sharing this information.

Aabha

All the questions are explained so well . Really thanks for sharing

Nibedita

Thank you for the great article. Just one question and i might be wrong, shouldn't the t-test be done if number of observations are less than 30 and Z test if observation is greater than 30?

Sanjeev kumar

Photos are not loading properly. It will be easier to understand with photos that you have uploaded. Please look into it.

p koteswara rao

For Categorical-Numerical variable testing, the T-test or Z-test is mostly used depending upon whether the number of observations is below(Z-test) or above 30(T-test). This statement is wrong ,the number of observations is below 30 ( t-test) small sample test and above 30 (Z test)large sample test. please make it correct.
