Heart Disease Prediction Using Logistic Regression on UCI Dataset

Arvind N 03 Dec, 2023

This article was published as a part of the Data Science Blogathon.

Source: https://health.clevelandclinic.org/what-is-a-mild-heart-attack-and-is-it-a-big-deal-or-not/

Overview

Hi everyone!

In this article, we study, in detail, the hyperparameters, code and libraries used for heart disease prediction using logistic regression on the UCI heart disease dataset.

Link to the dataset: https://www.kaggle.com/arviinndn/heart-disease-prediction-uci-dataset/data

Source: https://archive.ics.uci.edu/ml/datasets/heart+disease

Overview
Importing Libraries
Data Exploration and Visualization
Standard Scaling
Train-Test Split
Model Fitting and Prediction
Hyperparameter Details
Conclusion
Frequently Asked Questions
Sources

Importing Libraries

#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Numpy: Numpy is an open-source Python library for handling n-dimensional arrays, with its core written in the C programming language. CPython, the reference Python interpreter, is also written in C. Importing Numpy lets Python code perform array computations in a fast and efficient manner. Numpy offers implementations of various mathematical functions, linear algebra routines and Fourier transforms. Numpy runs on a wide range of hardware and computing platforms and works well with GPU and distributed array libraries, although Numpy itself computes on the CPU. Because it is used from high-level Python code, these functionalities are easy to use.

Pandas: Pandas is a fast open-source data analysis tool built on top of Python. Pandas allows various data manipulation activities using DataFrame objects. The Pandas methods used in this study are explained in detail as we encounter them.

Matplotlib: Matplotlib is a Python library for plotting publication-quality static and interactive graphs. Matplotlib plots can be exported to various file formats, can work with third-party packages and can be embedded in Jupyter notebooks. The Matplotlib methods used are explained in detail as we encounter them.

Seaborn: Seaborn is a statistical data visualization tool for Python built over Matplotlib. The library enables us to create high-quality visualizations in Python.

Data Exploration and Visualization

dataframe = pd.read_csv('heart_disease_dataset_UCI.csv')

The read_csv function from the Pandas library reads the heart disease dataset published by UCI, stored in *.csv (comma-separated values) format, into a dataframe. The DataFrame object is the primary Pandas data structure: a two-dimensional table with labelled axes along rows and columns. Various data manipulation operations can be applied to the dataframe along rows and columns.

dataframe.head(10)

The Pandas dataframe head(10) method lets us peek at the first 10 rows of the dataframe. This gives us an insight into the columns and into the type and values of the data stored in the dataframe.

dataframe.info()

The Pandas dataframe info() method reports the number of rows and the number of columns in the dataframe. The count of non-null entries per column, the data type of each column and the memory usage of the dataframe are also provided.

dataframe.isna().sum()

The Pandas dataframe isna() method marks each null entry as True, and chaining sum() counts those entries, giving the number of null values in each column.
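To make the null count concrete, here is a minimal sketch on a small made-up frame. The column names and values are hypothetical stand-ins, not rows from the UCI dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart disease data
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, np.nan],
    "target": [1, 1, 0, 0],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero has no missing entries and needs no imputation.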

plt.figure(figsize=(15,10))
sns.heatmap(dataframe.corr(),linewidth=0.01,annot=True,cmap="winter")
plt.show()

The plt.figure function creates a new Figure, which is the top-level container for all plot elements. figsize=(15,10) sets the figure size to 15 inches wide and 10 inches high.

The Seaborn heatmap API provides a colour-encoded plot for 2-D matrix data. The Pandas dataframe corr() method computes the pairwise correlation (the movement of two variables in relation to each other) of the columns in the dataframe. NA or null values are excluded by this method.

The correlation matrix lets us find positive and negative, strong and weak correlations between the various columns and the target variable. This can help us in feature selection: features that are only weakly correlated with the target are candidates for removal. Positive and negative correlations can also be used to describe model predictions. A positive correlation implies that as the value of one variable goes up, the value of the other variable also goes up. A negative correlation implies that as the value of one variable goes up, the value of the other variable goes down. Zero correlation implies that there is no linear relationship between the variables.

linewidth gives the width of the lines that divide the cells in the heatmap. Setting annot to True labels each cell with the corresponding correlation value. The cmap value defines the mapping from data values to the colour space.
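The sign conventions above can be checked with a small sketch. The toy columns below are invented for illustration: one rises with x and one falls with x.

```python
import pandas as pd

# Hypothetical toy data: one column rises with x, the other falls with x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],
    "falls_with_x": [10, 8, 6, 4, 2],
})

# corr() returns the pairwise Pearson correlation matrix by default
corr = toy.corr()
print(corr)
```

Here the correlation between x and rises_with_x is +1 and between x and falls_with_x is -1, the two extremes of the scale shown in the heatmap.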

dataframe.hist(figsize=(12,12))

The Pandas dataframe hist method plots the histogram of the different columns, with figsize equal to 12 inches wide and 12 inches high.

Standard Scaling

X = dataframe.iloc[:,0:13]
y = dataframe.iloc[:,13]

Next, we split our dataframe into features (X) and the target variable (y) using the integer-location based indexer 'iloc'. We select all the rows and the first 13 columns as X, and all the rows and the 14th column as the target variable y.

X = X.values
y = y.values

We extract a Numpy representation of X and y using the dataframe values property for our machine learning study.

from sklearn.preprocessing import StandardScaler
X_std=StandardScaler().fit_transform(X)

We use the scikit-learn (sklearn) library for our machine learning studies. Scikit-learn is an open-source Python library for predictive data analysis and machine learning, built on top of Numpy, SciPy and Matplotlib. The SciPy ecosystem is used for scientific computing and provides optimized modules for linear algebra, calculus, ODE solvers and fast Fourier transforms, among others. The sklearn preprocessing module implements functions such as scaling, normalizing and binarizing data. The StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every feature has mean zero and variance one. The fit_transform() method combines two steps: (i) fit(), which computes the scaling parameters (the mean and standard deviation of each feature), and (ii) transform(), which applies the scaling using the parameters found by fit(). Many machine learning algorithms work best when the features are on comparable scales, so standard scaling is one of the methods that can help improve the accuracy of machine learning models.
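As a rough check of what fit_transform() does, the sketch below compares StandardScaler with the manual formula (x - mean) / std. The array X_toy is a made-up example, not the article's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features (e.g. age and cholesterol), not the UCI data
X_toy = np.array([[50.0, 200.0],
                  [60.0, 240.0],
                  [70.0, 280.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)  # fit() learns mean/std, transform() applies them

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaler.mean_)   # per-feature means learned by fit()
print(X_scaled)
```

After scaling, each column has mean 0 and variance 1, and the fitted scaler can be reused on new rows with scaler.transform().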

Train-Test Split

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_std,y,test_size=0.25,random_state=40)

The sklearn model_selection module implements different data splitters (splitting into train and test sets, KFold train and test sets etc.), hyperparameter optimizers (searching over a grid to find optimal hyperparameters) and model validation utilities (evaluating the metrics of a cross-validated model etc.).

N.B. – KFold (K=10) cross-validation means splitting the train set into 10 parts. 9 parts are used for training while the remaining part is used for testing. Next, a different set of 9 parts is used for training while the remaining part is used for testing. This process is repeated until each part has served as the test set once. The average of the 10 accuracy scores on the 10 test sets is the mean KFold cross_val_score.
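The 10-fold procedure described above can be sketched as follows. This is an illustration on synthetic data from make_classification, not the article's dataset; the names X_demo, y_demo and the max_iter value are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 13 features, mirroring the shape of the heart data
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# 10 folds: each fold is used once as the test part, the other 9 for training
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kfold)

mean_score = scores.mean()  # average of the 10 per-fold accuracy scores
print(scores)
print(mean_score)
```

For a classifier, cross_val_score uses accuracy by default, so each entry of scores is the accuracy on one held-out fold.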

The train_test_split function from the sklearn model_selection module is used to split our features (X) and targets (y) into training and test sets. test_size = 0.25 specifies that 25% of the data is kept in the test set, while random_state = 40 ensures that the same training and test sets are generated every time the code is run. The split shuffles the data randomly, so fixing random_state makes the results reproducible.
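A short sketch of the reproducibility point, using a made-up array of 20 samples (X_demo and y_demo are hypothetical): two calls with the same random_state return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Same random_state twice -> the same shuffled split both times
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=40)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=40)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
```

With test_size=0.25 and 20 samples, 5 rows go to the test set and 15 to the training set.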

Model Fitting and Prediction

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Instantiate the model with every hyperparameter listed explicitly
# (multi_class is deprecated in scikit-learn 1.5 and later)
lr=LogisticRegression(C=1.0,class_weight='balanced',dual=False,
                      fit_intercept=True,intercept_scaling=1,max_iter=100,
                      multi_class='auto',n_jobs=None,penalty='l2',
                      random_state=1234,solver='lbfgs',tol=0.0001,
                      verbose=0,warm_start=False)

# Fit on the training set and predict on the test set
model1=lr.fit(X_train,y_train)
prediction1=model1.predict(X_test)

# Rows of cm are true labels, columns are predicted labels
cm=confusion_matrix(y_test,prediction1)
sns.heatmap(cm,annot=True,cmap='winter',linewidths=0.3,
            linecolor='black',annot_kws={"size":20})

TN=cm[0][0]
TP=cm[1][1]
FN=cm[1][0]
FP=cm[0][1]

print('Testing Accuracy for Logistic Regression:',(TP+TN)/(TP+TN+FN+FP))

The sklearn.metrics module includes score functions, performance metrics and distance metrics, among others. The confusion_matrix function summarizes the classification results in matrix form: each entry counts the test samples with a given true label (row) and predicted label (column).
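The layout of the matrix is easy to mix up, so here is a small sketch with hand-written labels (y_true and y_pred are invented for illustration). For 0/1 labels, scikit-learn places true negatives at [0][0] and true positives at [1][1].

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 actual negatives and 4 actual positives
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_pred)

# ravel() flattens row by row: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```

The diagonal holds the correct predictions, so accuracy is (tn + tp) divided by the total count.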

The sklearn linear_model module implements a variety of linear models such as linear regression, logistic regression, Ridge regression and Lasso regression. We import the LogisticRegression class for our classification study and instantiate a LogisticRegression object.

Hyperparameter Details

The parameter C specifies the inverse of the regularization strength, so smaller values of C mean stronger regularization. Regularization penalizes large coefficients to reduce overfitting. C=1.0 is the default value for LogisticRegression in the sklearn library.
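To see the effect of C, the sketch below fits the same model with a small and a large C on synthetic data and compares the size of the learned coefficients. The data, the C values and max_iter=5000 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart data (13 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # L2 norm of the coefficient vector: smaller under stronger regularization
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

A small C (strong regularization) shrinks the coefficients towards zero, while a large C lets them grow to fit the training data more closely.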

The class_weight='balanced' option assigns a weight to each class. If class_weight is unspecified (the default, None), every class has weight 1. class_weight='balanced' assigns class weights using the formula n_samples/(n_classes*np.bincount(y)). E.g. if n_samples=100, n_classes=2 and there are 50 samples belonging to each of the 0 and 1 classes, each class_weight = 100/(2*50) = 1.
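The balanced formula can be checked on an imbalanced example. The sketch below uses a made-up label vector with 75 negatives and 25 positives and compares the manual formula with scikit-learn's compute_class_weight helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 75 samples of class 0, 25 of class 1
y_demo = np.array([0] * 75 + [1] * 25)

# Manual formula: n_samples / (n_classes * np.bincount(y))
manual = len(y_demo) / (2 * np.bincount(y_demo))

# The same weights via scikit-learn's helper
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_demo)
print(manual)
print(weights)
```

The minority class gets the larger weight (2.0 against roughly 0.67 here), so its errors count more during fitting.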

dual = False is preferable when n_samples > n_features. The dual formulation is implemented only for the L2 penalty with the liblinear solver.

N.B. The liblinear solver uses a coordinate descent algorithm rather than gradient descent to find the optimal parameters of the logistic regression model. In gradient descent, all the parameters are updated at once in each step, whereas coordinate descent updates only one parameter at a time. In coordinate descent, we first initialize the parameter vector theta = [theta_0, theta_1, ..., theta_n]. In the k-th iteration, when theta_i is updated, the already-updated values theta_0^(k), ..., theta_(i-1)^(k) and the not-yet-updated values theta_(i+1)^(k-1), ..., theta_n^(k-1) are held fixed.
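The one-parameter-at-a-time idea can be sketched on a simpler problem. The example below applies coordinate descent to ordinary least squares, where each single-coordinate update has a closed form; it is not liblinear's actual algorithm for logistic regression, and the function name and data are hypothetical.

```python
import numpy as np

def coordinate_descent_least_squares(X, y, n_sweeps=200):
    """Minimize 0.5 * ||X @ theta - y||^2 by updating one theta_i at a time."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    for _ in range(n_sweeps):
        for i in range(n_features):
            # Residual with feature i's current contribution removed;
            # all other coordinates are held fixed at their latest values
            partial_residual = y - X @ theta + X[:, i] * theta[i]
            # Exact minimizer of the objective along coordinate i
            theta[i] = X[:, i] @ partial_residual / (X[:, i] @ X[:, i])
    return theta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_theta + 0.01 * rng.normal(size=50)

theta_cd = coordinate_descent_least_squares(X_demo, y_demo)
theta_exact, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(theta_cd)
print(theta_exact)
```

Within one sweep, coordinate i uses the freshly updated values of coordinates before it and the previous sweep's values of coordinates after it, which matches the update order described above.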

fit_intercept = True (default). Specifies whether a constant (bias or intercept) should be added to the decision function.

intercept_scaling = 1 (default). Applicable only when the solver is liblinear and fit_intercept = True. In this case [X] becomes [X, intercept_scaling], i.e. a synthetic feature with constant value equal to intercept_scaling is appended to [X]. The intercept becomes intercept_scaling * synthetic feature weight. The synthetic feature weight is subject to L1/L2 regularization like any other feature weight. To lessen the effect of regularization on the synthetic feature weight, and therefore on the intercept, a larger intercept_scaling value should be chosen.

max_iter = 100 (default). The maximum number of iterations allowed for the solver to converge.

multi_class = 'ovr', 'multinomial' or 'auto' (default). With 'ovr', a binary problem is fit for each label. 'auto' selects 'ovr' if the data is binary or if the solver is liblinear; otherwise it selects 'multinomial'. With 'multinomial', the loss minimised is the multinomial loss, even when the data is binary. This parameter is deprecated in scikit-learn 1.5 and later.

n_jobs (default = None). The number of CPU cores used when parallelizing computations for multi_class='ovr'. None means 1 core is used. -1 means all cores are used. Ignored when the solver is set to liblinear.

penalty: specifies the penalty norm (default = 'l2').

random_state: seeds the random number generator so that the same results are returned every time the model is run. In scikit-learn it is used to shuffle the data when the solver is 'sag', 'saga' or 'liblinear'.

solver: the choice of the optimization algorithm (default = 'lbfgs').

tol: tolerance for the stopping criteria (default = 1e-4).

verbose = 0 (default). Suppresses information output while the algorithm is running.

warm_start (default = False). When set to True, the solution from the previous call to fit is used as the initialization for the present call. This is not applicable for the liblinear solver.

Next, we call the fit method on the LogisticRegression object with (X_train, y_train) to learn the parameters of our logistic regression model. We then call the predict method with X_test, which uses the parameters learned by fit() to predict the class labels of the test samples.

We can calculate the confusion matrix to measure the accuracy of the model using the predicted values and y_test.

The parameters for the sns (seaborn) heatmap have been explained earlier. The linecolor parameter specifies the colour of the lines that will divide each cell. The annot_kws parameter passes keyword arguments to the matplotlib method – fontsize in this case.

Finally, we calculate the accuracy of our logistic regression model using the confusion matrix: (True Positive + True Negative) / Total number of test samples = 89.47%.
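As a sanity check of this formula, the sketch below computes accuracy from the diagonal of a confusion matrix and compares it with scikit-learn's accuracy_score. The labels are made up for illustration and do not reproduce the 89.47% figure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 6 of the 8 predictions are correct
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0])

cm_demo = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal: (TN + TP) / total
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()
print(accuracy_from_cm)
print(accuracy_score(y_true, y_pred))
```

Both routes give the same number, so accuracy_score is a convenient shortcut when the individual matrix entries are not needed.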

Conclusion

This brings us to the end of the article. In this article, we developed a logistic regression model for heart disease prediction using a dataset from the UCI repository. We focused on gaining an in-depth understanding of the hyperparameters, libraries and code used when defining a logistic regression model through the scikit-learn library.

Please write comments and reviews as applicable. Feedback is always welcome.

This article builds on the Heart disease dataset analysis posted on Analytics Vidhya. https://www.analyticsvidhya.com/blog/2022/02/heart-disease-prediction-using-machine-learning-2/

Thanks

My name is Narayanan Arvind and I am working as an AI/ML R&D Engineer, at IN-D by Intain. Connect with me on Linkedin: https://www.linkedin.com/in/arvind-narayanan-2b0632167/