*This article was published as a part of the Data Science Blogathon.*

“The more, the merrier”.

It is a perfect saying for the amount of analysis done on any dataset.

As more and more people opt for a career in Data Science, the need grows for a fast-track way to guide each of them along the path. I learned Python as my base and then gradually added skills that helped me grow in the data science domain.

In this post, I will cover the important steps and Python functions you can use for Exploratory Data Analysis (EDA) on any dataset.

Okay, today’s plan is to run our fingers through the data and figure out as much as we can, all in an optimized way. I am writing this article to share user-defined functions that help shorten EDA coding time.

The most important steps to follow in a project are:

- Importing the data
- Data validation
  - Column datatype
  - Imputing null/missing values
- Data exploration (EDA)
  - Univariate
  - Bivariate
  - Multivariate
- Feature Engineering
- Transformation/Scaling
- Model building (applying machine-learning algorithms) and tuning
- Score calculation

From the above, we will cover the functions for EDA. If you run into any issues while using them, or need help with any other part, please let me know in the comments. There are several ways to implement each step, but I have chosen the most general one.

## Index

- Introduction
- Univariate analysis
- Bi-variate analysis
- Multi-variate analysis
- Helpful functions
- Summary

__Introduction__

The most important and time-consuming part of any analytics problem is understanding the data. It is better to spend time studying the data rather than coding the same thing again and again.

The functions we are going to build today are pretty general and you can adapt them as per your requirement.

**The pseudo-code for a user-defined function in Python is:**

**Function Definition:**

    def func_name(parameters):  # function name and parameters
        "function_steps"
        function_commands
        return [return_value]

**Function call:**

    func_name(parameters)
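As a concrete instance of the pattern above, here is a minimal sketch of a small helper that summarises a list of numbers. The name `describe_values` and its return format are made up for illustration; they are not part of the functions shared later in this post.

```python
from statistics import mean, median


def describe_values(values):
    """Return a few basic summary statistics for a list of numbers."""
    # function commands: compute each statistic once
    summary = {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }
    return summary


# function call: pass the data as the parameter
result = describe_values([2, 4, 4, 10])
print(result)
```

The definition is written once, and the call can then be repeated for any list of numbers.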

**Functions for Univariate analysis**:

Moving on to EDA, we can define a function once and call it by passing a feature name from the dataset as a parameter. The univariate functions below read from a global DataFrame named `df1`, so load your dataset into that name (or adapt the functions) before calling them. A notebook on GitHub demonstrates the implementation of all the functions described below: https://github.com/r-pant/data-hacks/blob/master/big%20mart%20sales/file1.ipynb

#### Categorical:

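The function below draws a count plot for the feature passed to it, with the bars ordered by frequency.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    def plot_cat(var, l=8, b=5):
        plt.figure(figsize=(l, b))
        sns.countplot(x=df1[var], order=df1[var].value_counts().index)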
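A count plot is easier to read alongside the numbers behind it. The sketch below is an assumed companion helper, not something from the linked notebook: the name `cat_summary` and the toy DataFrame are hypothetical, and the DataFrame is passed in as a parameter instead of being read from `df1`.

```python
import pandas as pd


def cat_summary(data, var):
    """Return counts and percentages per category, most frequent first."""
    counts = data[var].value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# toy data, made up for illustration
toy = pd.DataFrame({"size": ["S", "M", "M", "L", "M", "S"]})
summary = cat_summary(toy, "size")
print(summary)
```

Because `value_counts` sorts by frequency, the rows come out in the same order as the bars of the count plot.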

#### Continuous:

For a simple distribution plot of a continuous feature:

    def plot_cont(var, l=8, b=5):
        plt.figure(figsize=(l, b))
        # histplot with kde=True replaces the deprecated distplot
        sns.histplot(x=df1[var], kde=True, stat="density")
        plt.xlabel(var)

To view a detailed KDE plot annotated with summary statistics:

    # KDE plot annotated with mean, median, min/max and standard deviation
    def plot_cont_kde(var, l=8, b=5):
        mini = df1[var].min()
        maxi = df1[var].max()
        ran = maxi - mini
        mean = df1[var].mean()
        skew = df1[var].skew()
        kurt = df1[var].kurtosis()
        median = df1[var].median()
        st_dev = df1[var].std()
        points = mean - st_dev, mean + st_dev

        fig, axes = plt.subplots(1, 2)
        sns.boxplot(data=df1, x=var, ax=axes[0])
        sns.histplot(x=df1[var], kde=True, stat="density", ax=axes[1], color='#ff4125')
        # the markers below sit on the baseline (y = 0) of the density plot
        sns.lineplot(x=list(points), y=[0, 0], color='black', label="std_dev", ax=axes[1])
        sns.scatterplot(x=[mini, maxi], y=[0, 0], color='orange', label="min/max", ax=axes[1])
        sns.scatterplot(x=[mean], y=[0], color='red', label="mean", ax=axes[1])
        sns.scatterplot(x=[median], y=[0], color='blue', label="median", ax=axes[1])
        fig.set_size_inches(l, b)
        plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(
            (round(points[0], 2), round(points[1], 2)), round(kurt, 2), round(skew, 2),
            (round(mini, 2), round(maxi, 2), round(ran, 2)), round(mean, 2), round(median, 2)))
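The statistics that end up in the plot title can also be collected without drawing anything, which is handy when you want to compare several features in a table. This is a minimal sketch; the helper name `cont_summary` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def cont_summary(series):
    """Collect the statistics used in the annotated KDE plot."""
    mean = series.mean()
    st_dev = series.std()  # sample standard deviation (ddof=1)
    return {
        "min": series.min(),
        "max": series.max(),
        "range": series.max() - series.min(),
        "mean": mean,
        "median": series.median(),
        "std_band": (mean - st_dev, mean + st_dev),
        "skew": series.skew(),
        "kurtosis": series.kurtosis(),
    }


# made-up, symmetric sample
stats = cont_summary(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]))
print(stats)
```

A skew close to zero together with a mean close to the median suggests a roughly symmetric feature.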

**Functions for Bi-variate analysis**:

Bi-variate analysis is very helpful for finding correlation patterns and for testing our hypotheses. It helps us draw inferences and build features to feed into our model.

__Categorical-Categorical__:

    def BVA_categorical_plot(data, tar, cat):
        '''
        Takes data and two categorical variables,
        calculates the chi2 significance between the two variables
        and shows the result with a countplot and a crosstab.
        '''
        from scipy.stats import chi2_contingency

        # isolating the variables
        data = data[[cat, tar]].copy()

        # forming a crosstab
        table = pd.crosstab(data[tar], data[cat])

        # performing chi2 test on the full contingency table
        chi, p, dof, expected = chi2_contingency(table)

        # checking whether results are significant
        sig = p < 0.05

        # plotting grouped plot
        sns.countplot(x=cat, hue=tar, data=data)
        plt.title("p-value = {}\n difference significant? = {}\n".format(round(p, 8), sig))

        # plotting percent stacked bar plot
        ax1 = data.groupby(cat)[tar].value_counts(normalize=True).unstack()
        ax1.plot(kind='bar', stacked=True, title=str(ax1))
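To see what the chi-square test measures, the statistic can be computed by hand from a crosstab: compare the observed counts with the counts expected if the two variables were independent. The sketch below uses only NumPy; the function name `chi2_statistic` and the toy table are assumptions for illustration, and it returns the uncorrected Pearson statistic without a p-value.

```python
import numpy as np


def chi2_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    total = observed.sum()
    # expected counts if the two variables were independent
    expected = row_totals @ col_totals / total
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof, expected


# toy 2x2 crosstab, made up for illustration
chi2, dof, expected = chi2_statistic([[10, 20], [20, 10]])
print(chi2, dof)
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so its statistic can differ from the uncorrected value computed here.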

__Categorical-Continuous:__

Here, I have used two functions: one to calculate the p-value of a two-sample z-test and the other to plot the relationship between our features.

    def TwoSampleZ(X1, X2, sigma1, sigma2, N1, N2):
        '''
        Takes the mean, standard deviation and number of observations of two samples
        and returns the p-value of a two-sample Z-test.
        '''
        from numpy import sqrt, abs
        from scipy.stats import norm
        ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2)
        z = (X1 - X2)/ovr_sigma
        pval = 2*(1 - norm.cdf(abs(z)))
        return pval

    def Bivariate_cont_cat(data, cont, cat, category):
        # creating 2 samples
        x1 = data[cont][data[cat] == category]     # rows in the given category
        x2 = data[cont][~(data[cat] == category)]  # all remaining rows

        # calculating descriptives
        n1, n2 = x1.shape[0], x2.shape[0]
        m1, m2 = x1.mean(), x2.mean()    # means
        std1, std2 = x1.std(), x2.std()  # standard deviations

        # calculating p-value
        z_p_val = TwoSampleZ(m1, m2, std1, std2, n1, n2)

        # table
        table = pd.pivot_table(data=data, values=cont, columns=cat, aggfunc='mean')

        # plotting
        plt.figure(figsize=(15, 6), dpi=140)

        # barplot
        plt.subplot(1, 2, 1)
        sns.barplot(x=[str(category), 'not {}'.format(category)], y=[m1, m2])
        plt.ylabel('mean {}'.format(cont))
        plt.xlabel(cat)
        plt.title('\n z-test p-value = {}\n {}'.format(z_p_val, table))

        # boxplot
        plt.subplot(1, 2, 2)
        sns.boxplot(x=cat, y=cont, data=data)
        plt.title('categorical boxplot')
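The z-test formula can be checked on summary statistics alone, without SciPy. The following sketch uses the standard library's `erfc`; the function name `two_sample_z_pvalue` and the numbers passed to it are made up for illustration.

```python
from math import erfc, sqrt


def two_sample_z_pvalue(m1, m2, s1, s2, n1, n2):
    """Two-sided p-value of a two-sample z-test from summary statistics."""
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    z = (m1 - m2) / standard_error
    # for a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return erfc(abs(z) / sqrt(2))


# made-up summary statistics for two groups of 100 observations each
p_same = two_sample_z_pvalue(50.0, 50.0, 10.0, 10.0, 100, 100)
p_diff = two_sample_z_pvalue(50.0, 55.0, 10.0, 10.0, 100, 100)
print(p_same, p_diff)
```

Identical means give a p-value of 1, while a difference that is large compared with the combined standard error pushes the p-value towards 0. Using `erfc` also avoids the loss of precision that `1 - cdf` can suffer for large values of `z`.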

__Continuous-Continuous:__

    # Defining a function to relate two columns: it cross-tabulates them and
    # adds 'perc', the share of the second Col2 value among the first two
    def corr_2_cols(Col1, Col2):
        res = pd.crosstab(df1[Col1], df1[Col2])
        # res = df1.groupby([Col1, Col2]).size().unstack()
        res['perc'] = res[res.columns[1]] / (res[res.columns[0]] + res[res.columns[1]])
        return res
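For two genuinely continuous features, a common numeric summary is the Pearson correlation coefficient. The sketch below is offered under that assumption and is not the author's function; `cont_cont_corr` and the toy data are hypothetical.

```python
import pandas as pd


def cont_cont_corr(data, col1, col2, method="pearson"):
    """Correlation between two numeric columns, ignoring rows with missing values."""
    return data[col1].corr(data[col2], method=method)


# toy data, made up for illustration: weight grows linearly with height
toy = pd.DataFrame({"height": [150, 160, 170, 180], "weight": [50, 60, 70, 80]})
r = cont_cont_corr(toy, "height", "weight")
print(round(r, 4))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none. A scatter plot of the two columns is a useful visual companion.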

__Functions for Multi-variate analysis__:

    def Grouped_Box_Plot(data, cont, cat1, cat2):
        # boxplot of cont for each cat1 level, split by cat2
        sns.boxplot(x=cat1, y=cont, hue=cat2, data=data, orient='v')
        plt.title('Boxplot')
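The centre line of each box in a grouped box plot is the group median, so a small table of medians gives the numbers behind the picture. This is a sketch only; `grouped_median` and the column names in the toy data are assumptions for illustration.

```python
import pandas as pd


def grouped_median(data, cont, cat1, cat2):
    """Median of cont for every (cat1, cat2) combination, as a cat1 x cat2 table."""
    return data.groupby([cat1, cat2])[cont].median().unstack()


# toy data, made up for illustration
toy = pd.DataFrame({
    "outlet": ["A", "A", "A", "B", "B", "B"],
    "tier": ["low", "low", "high", "low", "high", "high"],
    "sales": [10, 20, 30, 40, 50, 70],
})
table = grouped_median(toy, "sales", "outlet", "tier")
print(table)
```

Rows correspond to the x-axis groups of the box plot and columns to the hue levels.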

__Summary__

*All the above functions help us cut the time and reduce redundancy in our code.*

*There will be times when you need to change the type of plot or add more detail to it. You can alter any function as per your requirement. Do note: “Always follow a structure to complete your EDA”. I have shared above the steps you should follow while working with a dataset.*

*-Rohit*