## Introduction

We are going to use the cloud Jupyter Notebook to ease the setting up of the environment and completely focus on creating a package. We have also used Object Orient Programming (OOP) fundamentals like Inheritance, objects, class, magic functions, and that I won’t be discussing how each part work. I am sharing my experience of creating Python packages while taking AWS Machine Learning Foundations It took me 10 minutes to create a Python package once I knew who to build it, so practise will make you better at coding packages.

Have you wondered how Python packages like Scikit-learn, pandas, and NumPy are built? They are all based on OOP to create complex and easy-to-use packages. Why do data scientists need to build python packages? In modern times OOP is a necessity and it helps users to build program and share it with the public or within organizations.

## Parent Class

So, without wasting time let’s dive into the coding and discuss our Parent class Distribution which will be used by Gaussian and Binomial Classes. We are using the Jupyter notebook magic function “%%witefile ” to create python files.

`%%writefile distributions/general.py`

The outputs of all Jupyter notebook cells are in bold

This will create a python file into the distributions folder and to make things simple you need to create a test folder that should contain your test files, a distribution folder that should contain all your package’s files, and a data folder that contains a .txt file containing data.

The Distribution class takes 2 arguments, mean and standard deviation, it also contains the read_data_file function to access the file and store it to the data variable.

```%%writefile distributions/general.py
class Distribution:
def __init__(self, mu=0, sigma=1):
"""
Generic distribution class for calculating and
visualizing a probability distribution.
Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats extracted from the data file
"""
self.mean = mu
self.stdev = sigma
self.data = []
"""
Function to read in data from a txt file. The txt file should have
one number (float) per line. The numbers are stored in the data attribute.
Args:
file_name (string): name of a file to read from
Returns:
None
"""
with open(file_name) as file:
data_list = []
while line:
data_list.append(int(line))
file.close()
self.data = data_list```
`writing distributions/general.py`

## Testing Distribution Class

Everything in this class is working smoothly. We have added mean, standard deviation, and loaded the random.txt file and you can see the results.

```from distributions.general import Distribution  # importing python file from directory
general1 = Distribution(20, 3)  # creating object with 20 mean, 30 standard deviation
print("Mean: {}, standard deviation: {}".format(general1.mean, general1.stdev))
## everything is working perfectly```
Mean: 20, standard deviation: 3
Data Loaded: [10, 23, 45, 12, 23, 45, 67, 100, 300, 250, 45, 68, 29, 59, 239, 934, 12, 321, 12, 32, 1]

## Gaussian Distribution

The Gaussian distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. – Wikipedia

### Mean

The mean of a list of numbers is the sum of all of the numbers divided by the number of numbers. Mean – Wikipedia

### Standard Deviation

This is a measure of the amount of variation of a set of values using the mean. The BMJ

### Probability density function

The parameter mu is the mean, while the parameter sigma is the standard deviation. The x is the value in a list.

### Gaussian Class

Without going deep into how inheritance work and how the magic function works, let’s discuss the functions:

1. Initialize the parent class Distribution
2. Create calculate_mean function -> calculate
mean from data
3. Create calculate_stdev function -> calculate
standard deviation from data
4. Create plot_histogram function -> display
histogram using matplotlib library
5. Create pdf function -> calculate
proabaility density function of data using mean and stdev
6. Create plot_histogram_pdf function -> plot the
normalized histogram of the data and a plot of the
probability
density function along the same range.
Gaussian distributions objects
8. Create magic function __repr__ -> output the
characteristics of the Gaussian instance
```%%writefile distributions/gaussian.py
import math
import matplotlib.pyplot as plt
from .general import Distribution

class Gaussian(Distribution):
"""
Gaussian distribution class for calculating and
visualizing a Gaussian distribution.

Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats extracted from the data file

"""
def __init__(self, mu=0, sigma=1):

Distribution.__init__(self, mu, sigma)

def calculate_mean(self):

"""
Function to calculate the mean of the data set.

Args:
None

Returns:
float: mean of the data set

"""

avg = 1.0 * sum(self.data) / len(self.data)

self.mean = avg

return self.mean

def calculate_stdev(self, sample=True):

"""
Function to calculate the standard deviation of the data set.

Args:
sample (bool): whether the data represents a sample or population

Returns:
float: standard deviation of the data set

"""

if sample:
n = len(self.data) - 1
else:
n = len(self.data)

mean = self.calculate_mean()

sigma = 0

for d in self.data:
sigma += (d - mean) ** 2

sigma = math.sqrt(sigma / n)

self.stdev = sigma

return self.stdev

def plot_histogram(self):
"""
Function to output a histogram of the instance variable data using
matplotlib pyplot library.

Args:
None

Returns:
None
"""
plt.hist(self.data)
plt.title('Histogram of Data')
plt.xlabel('data')
plt.ylabel('count')

def pdf(self, x):
"""
Probability density function calculator for the gaussian distribution.

Args:
x (float): point for calculating the probability density function

Returns:
float: probability density function output
"""

return (1.0 / (self.stdev * math.sqrt(2*math.pi))) * math.exp(-0.5*((x - self.mean) / self.stdev) ** 2)

def plot_histogram_pdf(self, n_spaces = 50):

"""
Function to plot the normalized histogram of the data and a plot of the
probability density function along the same range

Args:
n_spaces (int): number of data points

Returns:
list: x values for the pdf plot
list: y values for the pdf plot

"""

mu = self.mean
sigma = self.stdev

min_range = min(self.data)
max_range = max(self.data)

# calculates the interval between x values
interval = 1.0 * (max_range - min_range) / n_spaces

x = []
y = []

# calculate the x values to visualize
for i in range(n_spaces):
tmp = min_range + interval*i
x.append(tmp)
y.append(self.pdf(tmp))

# make the plots
fig, axes = plt.subplots(2,sharex=True)
axes.hist(self.data, density=True)
axes.set_title('Normed Histogram of Data')
axes.set_ylabel('Density')

axes.plot(x, y)
axes.set_title('Normal Distribution for n Sample Mean and Sample Standard Deviation')
axes.set_ylabel('Density')
plt.show()

return x, y

"""
Function to add together two Gaussian distributions

Args:
other (Gaussian): Gaussian instance

Returns:
Gaussian: Gaussian distribution

"""

result = Gaussian()
result.mean = self.mean + other.mean
result.stdev = math.sqrt(self.stdev ** 2 + other.stdev ** 2)

return result

def __repr__(self):

"""
Function to output the characteristics of the Gaussian instance

Args:
None

Returns:
string: characteristics of the Gaussian

"""

return "mean {}, standard deviation {}".format(self.mean, self.stdev)```
`writing distributions/gaussian.py `

### Experimenting

Testing __repr__ magic function.

```from distributions.gaussian import Gaussian

Gaussian(20,3) ## checking the object magic function```
`mean 20, standard deviation 3`

Initializing gaussian1 object with 25 mean and 2 standard deviations and then reading random.txt file from data folder.

```gaussian1 = Gaussian(25, 2)

Calculating probability function on 25 means and 2 Stdev on data

Then, calculating the mean and Stdev of data changes the original mean to 125.1 and Stdev to 210.77 and eventually changing our probability density function.

```print("Before pdf : ",round(gaussian1.pdf(25),5)) # checking pdf on initial mean and stdev

print("mean :",round(gaussian1.calculate_mean(),2)) # calculating mean of data
print("Stdev :",round(gaussian1.calculate_stdev(),2)) # calculating stdev of data

print("After pdf : ",round(gaussian1.pdf(75),5)) # calcuation pdf of data```
```Before pdf :  0.19947
mean : 125.1
Stdev : 210.77
After pdf :  0.00184```

Plotting histogram with PDF looks Awsome.

`gaussian1.plot_histogram_pdf();`

### The unittest

The unittest, a unit testing framework was originally inspired by JUnit and has a similar flavour as major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework, see the Documentation.

### Creating Gaussian Class Test File

Test-First is a great tool. It creates better understanding and productivity in the team. The result is high-quality code – both in terms of early success in finding bugs and implementing features correctly. – Gil Zilberfeld

We are going to use the unittest library to test all of our functions so that in the future if we make any changes we can detect errors within few seconds.

Creating TestGaussianClass and has all the functions to test the functions in Gaussian class. We have used the assertEqual method to hack the validity of functions. I have tested these values myself and then added them individually to test every possibility.

```%%writefile test/test_gaussian.py

import unittest
from distributions.gaussian import Gaussian

class TestGaussianClass(unittest.TestCase):
def setUp(self):
self.gaussian = Gaussian(35, 6)

def test_initialization(self):
self.assertEqual(self.gaussian.mean, 35, 'incorrect mean')
self.assertEqual(self.gaussian.stdev, 6, 'incorrect standard deviation')

self.assertEqual(self.gaussian.data,
[10, 23, 45, 12, 23, 45, 67, 100, 300, 250, 45, 68, 29, 59, 239, 934, 12, 321, 12, 32, 1], 'data not read in correctly')

def test_meancalculation(self):
self.assertEqual(self.gaussian.calculate_mean(),
sum(self.gaussian.data) / float(len(self.gaussian.data)), 'calculated mean not as expected')

def test_stdevcalculation(self):
self.assertEqual(round(self.gaussian.calculate_stdev(), 2), 210.77, 'sample standard deviation incorrect')
self.assertEqual(round(self.gaussian.calculate_stdev(0), 2), 205.69, 'population standard deviation incorrect')

def test_pdf(self):
self.assertEqual(round(self.gaussian.pdf(25), 5), 0.01658,
'pdf function does not give expected result')
self.gaussian.calculate_mean()
self.gaussian.calculate_stdev()
self.assertEqual(round(self.gaussian.pdf(75), 5), 0.00184,
'pdf function after calculating mean and stdev does not give expected result')

gaussian_one = Gaussian(25, 3)
gaussian_two = Gaussian(30, 4)
gaussian_sum = gaussian_one + gaussian_two

self.assertEqual(gaussian_sum.mean, 55)
self.assertEqual(gaussian_sum.stdev, 5)

if __name__ == '__main__':
unittest.main()```
`writing test/test_gaussian.py`

### Running Test

Let’s run our test file from the test folder using !python. As you can see all tests have passed. I did get multiple errors first and it helped me better understand how my Gaussian class is working at every level.

`!python test/test_gaussian.py`
```......
----------------------------------------------------------------------
Ran 6 tests in 0.002s

OK```

## Binomial Distribution

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). Binomial distribution – Wikipedia

### Binomial Class

We will be using the mathematical functions mention above to create mean, standard deviation, and probability density functions. We have done the hard work in Gaussian class now we just need to follow the pattern.

1. Initialize the probability and size variable – > p,n
2. Initialize the parent class Distribution – > calculating
mean and Stdev and adding it to the parent class.
3. Create calculate_mean function -> calculate
mean from data
4. Create calculate_stdev function -> calculate
standard deviation from data
5. Create replace_stats_with_data function
>
That will calculate, probability and size from data and
then update the new mean and standard deviation.
6. Create plot_bar function -> display
bar chart using the matplotlib library
7. Create pdf function -> calculate
the probability density function of data using mean and stdev
8. Create plot_bar_pdf function -> plot
the pdf of the binomial distribution
together with two Gaussian distributions objects
10. Create magic function __repr__ -> output
the characteristics of the Gaussian instance
```%%writefile distributions/binomial.py

import math
import matplotlib.pyplot as plt
from .general import Distribution

class Binomial(Distribution):
""" Binomial distribution class for calculating and
visualizing a Binomial distribution.

Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats to be extracted from the data file
p (float) representing the probability of an event occurring
n (int) the total number of trials

"""

def __init__(self, prob=.5, size=20):

self.p = prob
self.n = size
mu = self.calculate_mean()
sigma = self.calculate_stdev()
Distribution.__init__(self, mu, sigma)

def calculate_mean(self):

"""Function to calculate the mean from p and n

Args:
None

Returns:
float: mean of the data set

"""
self.mean = self.p*self.n

return self.mean

def calculate_stdev(self):

"""Function to calculate the standard deviation from p and n.

Args:
None

Returns:
float: standard deviation of the data set

"""
self.stdev = math.sqrt(self.n * self.p * (1 - self.p))

return self.stdev

def replace_stats_with_data(self):

"""Function to calculate p and n from the data set

Args:
None

Returns:
float: the p value
float: the n value

"""

self.n = len(self.data)
self.p = sum(self.data)/len(self.data)
self.mean = self.calculate_mean()
self.stdev = self.calculate_stdev()

return self.p,self.n

def plot_bar(self):
"""Function to output a histogram of the instance variable data using
matplotlib pyplot library.

Args:
None

Returns:
None
"""
plt.bar(x = ['0', '1'], height = [(1 - self.p) * self.n, self.p * self.n])
plt.title('Bar Chart of Data')
plt.xlabel('outcome')
plt.ylabel('count')

def pdf(self, k):
"""Probability density function calculator for the gaussian distribution.

Args:
k (float): point for calculating the probability density function

Returns:
float: probability density function output
"""
a = math.factorial(self.n) / (math.factorial(k) * (math.factorial(self.n - k)))
b = (self.p ** k) * (1 - self.p) ** (self.n - k)

return a * b

def plot_bar_pdf(self):

"""Function to plot the pdf of the binomial distribution

Args:
None

Returns:
list: x values for the pdf plot
list: y values for the pdf plot

"""

x = []
y = []

# calculate the x values to visualize
for i in range(self.n + 1):
x.append(i)
y.append(self.pdf(i))

# make the plots
plt.bar(x, y)
plt.title('Distribution of Outcomes')
plt.ylabel('Probability')
plt.xlabel('Outcome')

plt.show()

return x, y

"""Function to add together two Binomial distributions with equal p

Args:
other (Binomial): Binomial instance

Returns:
Binomial: Binomial distribution

"""

try:
assert self.p == other.p, 'p values are not equal'
except AssertionError as error:
raise

result = Binomial()
result.p = self.p
result.n = self.n + other.n
result.mean = self.calculate_mean()
result.stdev = self.calculate_stdev()
return result
pass

def __repr__(self):

"""Function to output the characteristics of the Binomial instance

Args:
None

Returns:
string: characteristics of the Gaussian

"""

return "mean {}, standard deviation {}, p {}, n {}".format(self.mean,self.stdev,self.p,self.n)```
`writing distributions/binomial.py`

### Experimenting

Testing __repr__ magic function

```from distributions.binomial import Binomial

Binomial(0.6,300)```
`mean 180.0, standard deviation 8.48528137423857, p 0.6, n 300`

B1 = Binomial(0.4, 20) # creating object with 0.4 probability, 30 size
print(“Mean: {}, standard deviation: {}”.format(B1.mean, round(B1.stdev,2)))

```Mean: 8.0, standard deviation: 2.19
Data Loaded : [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1]```

Testing pdf of the initial value of p 0.4 and n 20

Then, using replace_stats_with_data to calculate p and n of data and then recalculating PDF.

print(“Initial PDF”,round(B1.pdf(5), 5))

p, n = B1.replace_stats_with_data()
print(“Data p: {}, n: {}”.format(round(p,3),n))

print(“Data PDF”,round(B1.pdf(5), 5))

```Initial PDF 0.07465
Data p: 0.471, n: 17
Data PDF 0.06923```

Testing bar plot

`B1.plot_bar()`

Testing Probability Density Function bar plot.

`B1.plot_bar_pdf();`

### Binomial Class Test Function

Again, we are going to use the unittest library to test all of our functions so that in the future if we make any changes we can detect errors within few seconds.

Creating TestBinomialClass and has all the functions to test the functions in Binomial class. We have used the assertEqual method to hack the validity of functions. I have tested these values myself and then added them individually to test every possibility.

```%%writefile test/test_binomial.py

import unittest
from distributions.binomial import Binomial

class TestBinomialClass(unittest.TestCase):
def setUp(self):
self.binomial = Binomial(0.4, 20)

def test_initialization(self):
self.assertEqual(self.binomial.p, 0.4, 'p value incorrect')
self.assertEqual(self.binomial.n, 20, 'n value incorrect')

self.assertEqual(self.binomial.data,
[1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1], 'data not read in correctly')

def test_calculatemean(self):
mean = self.binomial.calculate_mean()
self.assertEqual(mean, 8)

def test_calculatestdev(self):
stdev = self.binomial.calculate_stdev()
self.assertEqual(round(stdev,2), 2.19)

def test_replace_stats_with_data(self):
p, n = self.binomial.replace_stats_with_data()
self.assertEqual(round(p,3), 0.471)
self.assertEqual(n, 17)

def test_pdf(self):
self.assertEqual(round(self.binomial.pdf(5), 5), 0.07465)
self.assertEqual(round(self.binomial.pdf(3), 5), 0.01235)

self.binomial.replace_stats_with_data()
self.assertEqual(round(self.binomial.pdf(5), 5), 0.06923)
self.assertEqual(round(self.binomial.pdf(3), 5), 0.00963)

binomial_one = Binomial(.4, 20)
binomial_two = Binomial(.4, 60)
binomial_sum = binomial_one + binomial_two

self.assertEqual(binomial_sum.p, .4)
self.assertEqual(binomial_sum.n, 80)

if __name__ == '__main__':
unittest.main()```
`writing test/test_binomial.py`

### Running Test

Running the test_binomial.py shows that no error was found during testing.

`!python test/test_binomial.py`
```.......
----------------------------------------------------------------------
Ran 7 tests in 0.002s

OK```

## Creating __init__.py Function

We need to create __init__.py file in the distributions folder to initial the Classes within the python file. This will help us call specific classes directly.

We have initiated Binomial and Gaussian class.

`%%writefile distributions/__init__.py`
```from .binomial import Binomial
from .gaussian import Gaussian```
`writing distributions/__init__.py`

## Creating setup.py Function

```setup.py function giving all necessary information about the package. It uses setuptools library to create package name, version, description, author name, etc. We have to create this file outside the distribution folder.
%%writefile setup.py
from setuptools import setup
setup(name='distributions',
version='0.2',
description='Gaussian and Binomial distributions',
packages=['distributions'],
author = "Abid Ali Awan",
author_email = "[email protected]",
zip_safe=False)```
`Writing setup.py`

## Directory

You can see how your directory should look like.

## Installing distributions Package

Let’s install the package by going into the Linux terminal and going into the work directory by using a cd. Then, use pip install . or pip install -U .

`(venv) root@deepnote:~/work # pip install -U .`

``````Processing /work
Building wheels for collected packages: distributions
Building wheel for distributions (setup.py) ... done
Created wheel for distributions: filename=distributions-0.2-py3-none-any.whl size=4800 sha256=39bc76cbf407b2870caea42b684b05efc15641c0583f195f36a315b3bc4476da
Stored in directory: /tmp/pip-ephem-wheel-cache-ef8q6wh9/wheels/95/55/fb/4ee852231f420991169c6c5d3eb5b02c36aea6b6f444965b4b
Successfully built distributions
Installing collected packages: distributions
Attempting uninstall: distributions
Found existing installation: distributions 0.2
Uninstalling distributions-0.2:
Successfully uninstalled distributions-0.2
Successfully installed distributions-0.2``````

## Testing our Package

Let’s test our package directly from the terminal and it seems like everything is working fine. Well done you have created your first python package.

```>>> from distributions import Gaussian
>>> from distributions import Binomial
>>>
>>> print(Gaussian(20,6))
mean 20, standard deviation 6
>>> print(Binomial(0.4,50))
mean 20.0, standard deviation 3.4641016151377544, p 0.4, n 50
>>>```

Do ask me questions in the comments and that’s it, I hope you enjoyed my article.

If you are still facing problems, I will be making my GitHub repo public after 15th August on a link. You can find more of my work on GitHub, Deepnote, and Medium. I share my articles on LinkedIn every week so follow me there if you want to know more about beginners’ hacks. • 