Aakash93 Makwana — Updated On November 20th, 2023
Data Science Healthcare Intermediate Kaggle Statistics


With the rise of AI, we’ve come to rely more on data-driven decision-making to simplify the lives of working professionals. Whether it’s supply chain logistics or approving a loan for a customer, data holds the key. Leveraging the power of data science in the medical field can yield groundbreaking results. By analyzing vast amounts of modern medicine, data scientists can uncover patterns that can lead to discoveries and treatments. With the potential to revolutionize the healthcare industry, integrating data science into the medical domain is not just a good idea; it’s a necessity.

Modern Medicine | Data Science

Learning Objectives

In this article, we’ll explore how to analyze a medical dataset to create a model that predicts which medications a patient should take when faced with a specific diagnosis. It sounds intriguing, so let’s dive right in!

This article was published as a part of the Data Science Blogathon.


We will be downloading and using the Open Source dataset from kaggle:

Link here

The dataset contains

  • Age and gender of the patient
  • Diagnosis of the patient
  • Antibiotics used to treat patient
  • Dosage of the antibiotics in grams
  • Route of application of antibiotics
  • Frequency of usage of antibiotics
  • Duration of treatment using antibiotics in days
  • Indiction of antibiotics

Loading Libraries and Data set

Here, we shall import relevant libraries for the required for our exercise. Then we shall load the dataset into a dataframe and view few rows.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report

df = pd.read_csv("/kaggle/input/hospital-antibiotics-usage/Hopsital Dataset.csv")


Output | Modern Medicine | Data Science

The column names are pretty intuitive and they do make sense. Let us examine by looking at a few statistics.

Basic Statistics

# "Let's look at some stats"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 833 non-null    object
 1   Date of Data Entry  833 non-null    object
 2   Gender              833 non-null    object
 3   Diagnosis           833 non-null    object
 4   Name of Drug        833 non-null    object
 5   Dosage (gram)       833 non-null    object
 6   Route               833 non-null    object
 7   Frequency           833 non-null    object
 8   Duration (days)     833 non-null    object
 9   Indication          832 non-null    object
dtypes: object(10)
memory usage: 65.2+ KB


  • Every column is a string column
  • Let’s convert column Age, Dosage(gram), Duration(days) to numeric.
  • Let’s also convert Date of Data Entry into date time.
  • Indication has one null value, all other column don’t have null values.

Data Preprocessing

Let’s clean some columns. In the previous step, we saw that all columns are integers. So, for starters, we will convert Age, Dosage and Duration to numeric values. Similarly, we will convert the Date of Data Entry into datetime type. Instead of directly converting them, we will create new columns i.e. we will create a Age2 column that will be a numeric version of Age column and so on.

df['Age2'] = pd.to_numeric(df['Age'],errors='coerce')
df['Dosage (gram)2'] = pd.to_numeric(df['Dosage (gram)'],errors='coerce')
df['Duration (days)2'] = pd.to_numeric(df['Duration (days)'],errors='coerce')
df['Date of Data Entry2'] = pd.to_datetime(df['Date of Data Entry'],errors='coerce')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Age                  833 non-null    object        
 1   Date of Data Entry   833 non-null    object        
 2   Gender               833 non-null    object        
 3   Diagnosis            833 non-null    object        
 4   Name of Drug         833 non-null    object        
 5   Dosage (gram)        833 non-null    object        
 6   Route                833 non-null    object        
 7   Frequency            833 non-null    object        
 8   Duration (days)      833 non-null    object        
 9   Indication           832 non-null    object        
 10  Age2                 832 non-null    float64       
 11  Dosage (gram)2       831 non-null    float64       
 12  Duration (days)2     831 non-null    float64       
 13  Date of Data Entry2  831 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(10)
memory usage: 91.2+ KB

After converting, we see some null values. Let us look at that data.

df[(df['Dosage (gram)2'].isnull())
  | (df['Duration (days)2'].isnull())
  | (df['Age2'].isnull())
  | (df['Date of Data Entry2'].isnull())
  ]#import csv


Output 2 | Modern Medicine | Data Science

Seems like some garbage values in dataset. Let’s remove them.

Now, we will also substitute the values of new column in the older columns and drop the newly created columns

df = df[~((df['Dosage (gram)2'].isnull())
  | (df['Duration (days)2'].isnull())
  | (df['Age2'].isnull())
  | (df['Date of Data Entry2'].isnull()))

df['Age'] = df['Age2'].astype('int')
df['Dosage (gram)'] = df['Dosage (gram)2']
df['Date of Data Entry'] = df['Date of Data Entry2']
df['Duration (days)'] = df['Duration (days)2'].astype('int')
df = df.drop(['Age2','Dosage (gram)2','Date of Data Entry2','Duration (days)2'],axis=1)



Output 3 | Modern Medicine | Data Science

Looking at Some Stats

Now, we shall look at some statistics across all columns. We will use the describe function and pass the include argument to get the values statistics across all columns. Let us examine that





  • We can see that the age is 1, and the max is 90.
  • Similarly, looking at the Data Entry column – we know that it contains data for 19-Dec-2019 between 1 and 7 pm.
  • Gender has 2 unique values.
  • Dosage has a minimum value of 0.02 grams and a max weight of 960 grams (seems extreme).
  • Duration has a value of 1 and a max value of 28 days.
  • Diagnosis, Name of Drug, Route, Frequency, and Indication columns have different values. We will examine them.

Univariate EDA Route and Frequency Column

Here, we will try to examine the Route and Frequency column. We will use the value_counts() function.



IV      534
Oral    293
IM        4
Name: count, dtype: int64

BD     430
TDS    283
OD     110
QID      8
Name: count, dtype: int64


  • There are 3 different routes of Drug treatment. The most used Route of Drug treatment is IV, and the least used is IM.
  • There are 4 different Frequency of Drug treatment. BD is the most used Frequency of Drug treatment, and the least used is QID.

Univariate EDA Diagnosis Column

Here, we will examine the Diagnosis column. We will use the value_counts() function.




Upon running the above code, we see that there are 263 different values. Also, we can see that each patient can have multiple diagnosis (separated by comma). So, let us try to build a wordcloud to make more sense of this column. We will try to remove stopwords from the wordcloud to filter out unnecessary noise.

text = " ".join(diagnosis for diagnosis in adf.Diagnosis)
print ("There are {} words in the combination of all diagnosis.".format(len(text)))

stopwords = set(STOPWORDS)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')


There are 35359 words in the combination of all diagnosis.
combination of diagnosis | Modern Medicine | Data Science


  • There are 35359 words in the combination of all diagnosis. A lot of them will be repetitive.
  • Looking at the wordcloud, Chest infection and koch lung seem to be more frequent diagnosis.
  • We can see other conditions like – diabetes, myeloma, etc.

Let’s look at the top and bottom 10 words/phrases. We will use the Counter function within the collections library.

A = collections.Counter([i.strip().lower() 
  for i in text.split(',') if (i.strip().lower()) not in stopwords ])
print('Top 10 words/phrases')
print('\nBottom 10 words/phrases')


Top 10 words/phrases
[('col', 77),
 ('chest infection', 68),
 ('ihd', 55),
 ('copd', 40),
 ('hypertension', 38),
 ('ccf', 36),
 ('type 2 dm', 32),
 ("koch's lung", 28),
 ('ckd', 28),
 ('uraemic gastritis', 18)]

Bottom 10 words/phrases
[('poorly controlled dm complicated uti', 1),
 ('poorly controlled dm septic shock', 1),
 ('hypertension hepatitis', 1),
 ('uti hepatitis', 1),
 ('uti uti', 1),
 ('cff ppt by chest infection early col', 1),
 ('operated spinal cord', 1),
 ('hematoma and paraplegia he', 1),
 ('rvi with col glandular fever', 1),
 ('fever with confusion ccf', 1)]


  • Among the top 10 words/phrases, col and chest infections are more frequent. 77 patients are diagnosed with col, and 68 are diagnosed with chest infection.
  • Among the bottom 10 words/phrases, we see that each of the listed ones occurs only once.

Univariate EDA Indication Column

Here, we will examine the Indication column. We will use the value_counts() function.



chest infection            92
col                        32
uti                        30
type 2 dm                  25
prevention of infection    22
pad(lt u.l)                 1
old stroke                  1
fainting attack             1
cheat infection             1
centepede bite              1
Name: count, Length: 220, dtype: int64

chest infection            0.110843
col                        0.038554
uti                        0.036145
type 2 dm                  0.030120
prevention of infection    0.026506
pad(lt u.l)                0.001205
old stroke                 0.001205
fainting attack            0.001205
cheat infection            0.001205
centepede bite             0.001205
Name: proportion, Length: 220, dtype: float64

We see that there are 220 values of this Indication column. Seems like some spelling errors are prevalent as well – chest vs cheat. We can clean the same. For current exercise – let us consider top 25% indications.

top_indications = df['Indication'].value_counts(1).reset_index()
top_indications['cum_proportion'] = top_indications['proportion'].cumsum()
top_indications = top_indications[top_indications['cum_proportion']<0.25]


  Indication proportion cum_proportion
0 chest infection 0.110843 0.110843
1 col 0.038554 0.149398
2 uti 0.036145 0.185542
3 type 2 dm 0.030120 0.215663
4 prevention of infection 0.026506 0.242169

We will use this dataframe in the bivariate analysis and modeling exercise later.

Univariate EDA Name of Drug column

Here, we will examine this column. We will use the value_counts() function.

display(df['Name of Drug'].nunique())
display(df['Name of Drug'].value_counts())



Name of Drug
ceftriaxone                    221
co-amoxiclav                   162
metronidazole                   59
cefixime                        58
septrin                         37
clarithromycin                  32
levofloxacin                    31
amoxicillin+flucloxacillin      29
ceftazidime                     24
cefepime                        14
cefipime                        13
clindamycin                     12
rifaximin                       10
amikacin                         9
cefoperazone                     9
coamoxiclav                      9
meropenem                        8
ciprofloxacin                    7
gentamicin                       5
pen v                            5
rifampicin                       5
azithromycin                     5
cifran                           4
mirox                            4
amoxicillin                      4
streptomycin                     4
ceftazidine                      4
clarthromycin                    4
amoxicillin+flucoxacillin        3
cefoparazone+sulbactam           3
linezolid                        3
ofloxacin                        3
norfloxacin                      3
imipenem                         2
flucloxacillin                   2
ceftiaxone                       2
cefaziclime                      2
ceftriaxone+sulbactam            2
cefexime                         2
pipercillin+tazobactam           1
amoxicillin+flucloxiacillin      1
pentoxifylline                   1
menopem                          1
levefloxacin                     1
pentoxyfylline                   1
doxycyclin                       1
amoxicillin+flucoxiacillin       1
vancomycin                       1
cefteiaxone                      1
dazolic                          1
amoxicillin+flucloaxcin          1
amoxiclav                        1
doxycycline                      1
cefoperazone+sulbactam           1
nitrofurantoin                   1


  • There are 55 unique drugs.
  • Again, we see spelling errors – ceftriaxone vs ceftiaxone. The unique drug count may be lower.

Let us consider the top 5 drugs here. We will create a column cum_proportion to store the cumulative proportion that each drug is contributing.

top_drugs = (df['Name of Drug'].value_counts(1).reset_index())
top_drugs['cum_proportion'] = top_drugs['proportion'].cumsum()
top_drugs = top_drugs.head()


  Name of Drug proportion cum_proportion
0 ceftriaxone 0.265945 0.265945
1 co-amoxiclav 0.194946 0.460890
2 metronidazole 0.070999 0.531889
3 cefixime 0.069795 0.601685
4 septrin 0.044525 0.646209

The top 5 drugs are given to 64.6% of the patients.

PS : If we correct the spelling errors – this could be more than 64.6%.

We will use this dataframe in the bivariate analysis and modelling exercise later.

Bivariate Analysis Indication vs Name of Drug

Here, we will consider the top_indications and the top_drugs dataframes we created and try to see the distribution across them i.e. we compare top 5 drug name vs top 25% Indication. We will use pivot_table() function.

   &(df['Name of Drug'].isin(top_drugs['Name of Drug']))]
.pivot_table(index='Indication',columns='Name of Drug',values='Age',aggfunc='count')


Name of Drug cefixime ceftriaxone co-amoxiclav metronidazole septrin
chest infection 7.0 22.0 27.0 3.0 1.0
col 1.0 14.0 3.0 3.0 1.0
prevention of infection 2.0 6.0 3.0 2.0 1.0
type 2 dm 2.0 7.0 4.0 1.0 NaN
uti 3.0 9.0 2.0 NaN NaN


  • for a chest infection, co-amoxiclav is the most recommended medicine, followed by ceftriaxone.
  • Similar observations can be drawn for other indications.

Note: These medicines are prescribed under several other factors – like the Age, Gender, Diagnosis, and Medical history of the patient.

Bivariate Analysis Indication vs Age

Here, we will try to understand if a few conditions appear for older patients. We will consider the top_indications and examine the mean and median values of age.



  mean median count
chest infection 57.173913 61.5 92
col 48.031250 48.0 32
prevention of infection 42.136364 46.0 22
type 2 dm 63.560000 62.0 25
uti 50.233333 53.5 30


  • Indications – col and prevention of infection are more observed in younger patients.
  • uti is observed in mid-age patients
  • Indications: chest infection and type 2 dm are more observed in older patients.

Bivariate Analysis Name of Drug vs Age

Here, we will try to understand if specific drugs are used for older patients. We will consider the top_drugs and examine the mean and median values of age.

(df[df['Name of Drug'].isin(top_drugs['Name of Drug'])]
.groupby('Name of Drug')['Age']


  mean median count
Name of Drug      
septrin 44.513514 40.0 37
cefixime 43.137931 42.0 58
ceftriaxone 50.484163 49.0 221
metronidazole 53.661017 54.0 59
co-amoxiclav 56.518519 60.0 162


  1. septrin and cefixime is prescribed to younger patients
  2. ceftriaxone is prescribed to mid-age patients
  3. metronidazole and co-amoxiclav is prescribed to older patients

Modeling Approach

Here, we will try to help the Pharmacist or the Prescribing Doctor. The problem statement is to identify which Drug will be given to the patient basis the Diagnosis, Age and Gender column.

Few considerations and assumptions:

  1. For this exercise – we will consider only the top 5 drugs and mark the rest as Others.
  2. So, for each value of (Diagnosis, Age, Gender) – the objective is to identify the drug that will be recommended. The baseline model is (1/6) = 16.67% accuracy.

Let us see if we can try to beat that. We will create a copy of the dataframe for the modelling exercise.

adf = df.copy()
adf['Output'] = np.where(df['Name of Drug'].isin(top_drugs['Name of Drug']),
                    df['Name of Drug'],'Other')


Other            294
ceftriaxone      221
co-amoxiclav     162
metronidazole     59
cefixime          58
septrin           37
Name: count, dtype: int64

We have seen this distribution earlier as well. It means that the classes are not evenly distributed. We will need to keep this in mind while initializing the model.

Feature Engineering

We will try to capture more frequent words/ngrams in the diagnosis column. We will use the Count Vectorizer module.

vectorizer = CountVectorizer(max_features=150,stop_words='english',
X = vectorizer.fit_transform(adf['Diagnosis'].str.lower())


array(['abscess', 'acute', 'acute ge', 'af', 'aki', 'aki ckd',
       'aki ckd retro', 'alcoholic', 'anaemia', 'art', 'bite', 'bleeding',
       'bone', 'ca', 'cap', 'ccf', 'ccf chest', 'ccf chest infection',
       'ccf increased', 'ccf increased lft', 'ccf koch', 'ccf koch lung',
       'cerebral', 'cerebral infarct', 'chest', 'chest infection',
       'chest infection pre', 'chronic', 'ckd', 'ckd chest infection',
       'ckd retro', 'col', 'col portal', 'col portal hypertension',
       'copd', 'copd chest', 'copd chest infection', 'debility',
       'debility excessive', 'debility excessive vomitting', 'diabetes',
       'disease', 'disease renal', 'disease renal impairment', 'dm',
       'dm ihd', 'edema', 'effusion', 'encephalopathy', 'excessive',
       'excessive vomitting', 'excessive vomitting uraemic', 'failure',
       'fever', 'gastritis', 'gastritis hcv', 'gastritis hcv aki', 'ge',
       'general', 'general debility', 'general debility excessive',
       'gi bleeding', 'hcv', 'hcv aki', 'hcv aki ckd', 'hcv col', 'heart',
       'hepatic', 'hepatic encephalopathy', 'hepatitis', 'ht',
       'ht disease', 'ht disease renal', 'hypertension', 'ihd',
       'impairment', 'impairment koch', 'impairment koch lungs',
       'increased', 'increased lft', 'infarct', 'infection',
       'infection pre', 'infection pre diabetes', 'koch', 'koch lung',
       'koch lung copd', 'koch lungs', 'koch lungs ccf', 'left',
       'left sided', 'leg', 'lft', 'lung', 'lung copd', 'lung copd chest',
       'lungs', 'lungs ccf', 'lungs ccf increased', 'marrow', 'multiple',
       'multiple myeloma', 'multiple myeloma ckd', 'myeloma',
       'myeloma ckd', 'newly', 'old', 'pleural', 'pleural effusion',
       'pneumonia', 'portal', 'portal hypertension', 'pre',
       'pre diabetes', 'pulmonary', 'pulmonary edema', 'renal',
       'renal impairment', 'renal impairment koch', 'retro', 'right',
       'rvi', 'rvi stage', 'rvi stage ht', 'septic', 'septic shock',
       'severe', 'severe anaemia', 'severe anaemia multiple', 'shock',
       'sided', 'snake', 'snake bite', 'stage', 'stage ht',
       'stage ht disease', 'stroke', 'tb', 'type', 'type dm',
       'type dm ihd', 'uraemic', 'uraemic gastritis',
       'uraemic gastritis hcv', 'uti', 'uti type', 'uti type dm',
       'vomitting', 'vomitting uraemic', 'vomitting uraemic gastritis'],

The above list showcases the top 150 ngrams observed in the diagnosis column after removing the stopwords.

Dataset Creation

Here, we will create a single dataframe containing the features we just made and the age and gender column as input. We will use Label Encoder to convert the Drug Names into numeric values.

feature_df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
feature_df['Age'] = adf['Age'].fillna(0).astype('int')
feature_df['Gender_Male'] = np.where(adf['Gender']=='Male',1,0)

le = LabelEncoder()
feature_df['Output'] = le.fit_transform(adf['Output'])

Now, we will do a train test split. We will keep 20% of the data as the test set. We will use the random_state argument to ensure reproducibility.

X_train, X_test, y_train, y_test = train_test_split(
  test_size=0.2, random_state=42)


Here, I have tried using the Random Forest model. You can try with other models as well. We will use the random_state argument to ensure reproducibility. We will use the class_weight parameter since we had seen earlier that the classes are not evenly distributed.

clf = RandomForestClassifier(max_depth=6, random_state=0, 
clf.fit(X_train, y_train)

Let us see the accuracy and other metrics for the training dataset,

# accuracy on X_train data
final_accuracy = clf.score(X_train, y_train)
print("final_accuracy is : ",final_accuracy)

# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_train)
print(classification_report(y_train, clf_predict,


final_accuracy is :  0.411144578313253
               precision    recall  f1-score   support

        Other       0.84      0.30      0.44       236
     cefixime       0.16      0.86      0.27        49
  ceftriaxone       0.74      0.26      0.39       176
 co-amoxiclav       0.48      0.47      0.48       127
metronidazole       0.39      0.59      0.47        49
      septrin       0.45      0.93      0.61        27

     accuracy                           0.41       664
    macro avg       0.51      0.57      0.44       664
 weighted avg       0.64      0.41      0.43       664

Similarly, let us see accuracy and other metrics for the test dataset.

# accuracy on X_test data
final_accuracy = clf.score(X_test, y_test)
print("final_accuracy is : ",final_accuracy)

# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_test)
print(classification_report(y_test, clf_predict,


final_accuracy is :  0.38323353293413176
               precision    recall  f1-score   support

        Other       0.71      0.38      0.49        58
     cefixime       0.08      0.56      0.14         9
  ceftriaxone       0.36      0.09      0.14        45
 co-amoxiclav       0.64      0.71      0.68        35
metronidazole       0.31      0.40      0.35        10
      septrin       0.44      0.40      0.42        10

     accuracy                           0.38       167
    macro avg       0.42      0.42      0.37       167
 weighted avg       0.53      0.38      0.41       167

Key Observations:

  • The model is 41.11% accurate on train data and 38.32% on test data.
  • If we look at the f1-score, we see that for Other drug names – the training dataset had a value of 0.44, and the test dataset had a value of 0.49
  • We can also see that the test f1-score is lower for cefixime and ceftriaxone. While cefixime has only 9 samples, ceftriaxone has 45 samples. So, we need to analyze these data points to understand the scope of improving our feature sets.


In this article, we analyzed a medical dataset end to end. Then, we cleaned the dataset. We saw basic statistics, distribution, and even a wordcloud to understand the columns. Then, we created a problem statement to help the Pharmacist or the Prescribing Doctor with the modern medicine. It was to identify which Drug would be given to the patient basis the Diagnosis, Age, and Gender column.

Key Takeaways

  • We considered the top 150 words/bigrams/trigrams from the Diagnosis column for the modeling exercise. Let’s say we dissected the diagnosis column and created 150 features.
  • We had the Age and Gender feature as well. So, in total, we had 152 features.
  • Using these 152 features, we tried to determine which medicine/drug would be prescribed.
  • The baseline/random model was 16.67% accurate. Our model was 41.11% accurate on train data and 38.32% on test data. That is a significant improvement over the baseline model.

Some things to try out to improve the model performance:

  1. Try TF-IFD or embeddings to extract features.
  2. Try different models and use grid search and cross-validation to optimize accuracy.
  3. Try predicting more medicines.

Thanks for reading my article. Feel free to connect with me on LinkedIn to discuss this.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.