Bayesian Decision Theory – Discriminant Functions and Normal Density (Part 3)

CHIRAG GOYAL 31 May, 2021 • 6 min read

This article was published as a part of the Data Science Blogathon

Introduction

This is Part-3 of the 4-part blog series on Bayesian Decision Theory.

In the previous article, we discussed the generalized cases for taking decisions in the Bayesian Decision Theory. Now, in this article, we will cover some new concepts including Discriminant Functions and Normal Density in Bayesian Decision Theory.

The previous articles in this series are Part-1 and Part-2.

The topics covered in this article are:

1. Classifiers, Discriminant Functions, and Decision Surfaces

2. The Normal Density

Univariate normal density
Multivariate normal density

Let’s get started,

Pattern classifiers can be represented in many different ways. The most common representation is a set of discriminant functions g_i(x), i = 1, . . . , c. The classifier assigns a feature vector x to class w_i if, as in the decision rules discussed earlier,

g_i(x) > g_j(x) for all j ≠ i

Hence the classifier can be viewed as a network that computes the c discriminant functions and selects the state of nature corresponding to the largest discriminant.
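The rule above can be sketched in a few lines of Python. This is only an illustration: the three discriminant functions and the helper name `classify` are made up for the example and do not come from any particular model.

```python
from typing import Callable, Sequence


def classify(x: float, discriminants: Sequence[Callable[[float], float]]) -> int:
    """Return the index i whose discriminant value g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical discriminants for c = 3 classes; each one peaks at a different centre.
discriminants = [
    lambda x: -(x - 0.0) ** 2,
    lambda x: -(x - 2.0) ** 2,
    lambda x: -(x - 5.0) ** 2,
]

print(classify(0.4, discriminants))  # 0
print(classify(1.9, discriminants))  # 1
print(classify(4.0, discriminants))  # 2
```

Indices here start at 0, so index 0 plays the role of w₁ in the text.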


Fig. The functional structure of a general statistical pattern classifier includes d inputs and discriminant functions g_i(x). A subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident.

Image Source: Google Images

In the general case with risks, we can take g_i(x) = −R(a_i | x), so that the maximum discriminant function corresponds to the minimum conditional risk.

For the minimum-error-rate case, things simplify further by taking g_i(x) = P(w_i | x), so that the maximum discriminant function corresponds to the maximum posterior probability.
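As a small illustration of these two choices, the sketch below computes the conditional risks R(a_i | x) from a loss matrix and a set of posteriors, and checks that under a zero-one loss the risk-based discriminant and the posterior-based discriminant pick the same class. The posterior values and the function names are assumptions made for the example.

```python
def conditional_risks(loss, posteriors):
    """R(a_i | x) = sum over j of loss[i][j] * P(w_j | x)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def decide(discriminant_values):
    """Index of the largest discriminant value."""
    return max(range(len(discriminant_values)), key=discriminant_values.__getitem__)


# Assumed posteriors P(w_j | x) at some fixed x, for three classes.
posteriors = [0.2, 0.5, 0.3]

# Zero-one loss: no cost for a correct decision, unit cost for any error.
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

risks = conditional_risks(zero_one, posteriors)  # each risk equals 1 - P(w_i | x)
by_risk = decide([-r for r in risks])            # g_i(x) = -R(a_i | x)
by_posterior = decide(posteriors)                # g_i(x) = P(w_i | x)
print(by_risk, by_posterior)
```

With a different loss matrix the two rules need not agree, which is why the posterior form applies to the minimum-error-rate case.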

Thus the choice of discriminant functions is not unique. We can multiply all of them by the same positive constant, or shift them by the same additive constant, without influencing the decision. More generally, replacing every g_i(x) by f(g_i(x)), where f is a monotonically increasing function, leaves the classification unchanged. These observations can lead to significant analytical and computational simplifications. For example, each of the following choices modifies the discriminant function without tampering with the output decision:

g_i(x) = P(ω_i|x) = p(x|ω_i)P(ω_i) / sum(from j=1 to j=c): p(x|ω_j)P(ω_j)

g_i(x) = p(x|ω_i)P(ω_i)

g_i(x) = ln p(x|ω_i) + ln P(ω_i)

All three forms lead to the same decision rule.
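To make this concrete, here is a minimal sketch that evaluates all three forms on a grid of points and checks that they always pick the same class. It assumes a two-class problem with univariate normal class-conditional densities; the means, standard deviations, and priors are made-up values.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


# Assumed example: (mean, standard deviation) per class, and the priors.
params = [(0.0, 1.0), (3.0, 2.0)]
priors = [0.7, 0.3]


def g_joint(x):
    """g_i(x) = p(x | w_i) P(w_i)."""
    return [normal_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]


def g_posterior(x):
    """g_i(x) = P(w_i | x), the joint term divided by the evidence."""
    joint = g_joint(x)
    evidence = sum(joint)
    return [j / evidence for j in joint]


def g_log(x):
    """g_i(x) = ln p(x | w_i) + ln P(w_i)."""
    return [math.log(normal_pdf(x, m, s)) + math.log(p)
            for (m, s), p in zip(params, priors)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


xs = [-3.0 + 0.25 * k for k in range(41)]  # grid from -3 to 7
decisions = [(argmax(g_posterior(x)), argmax(g_joint(x)), argmax(g_log(x))) for x in xs]
all_agree = all(a == b == c for a, b, c in decisions)
print(all_agree)
```

The log form is usually the most convenient in practice, since products of small densities become sums.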

The aim of any decision rule is to divide the feature space into c decision regions, R₁, R₂, . . . , R_c. As discussed earlier, if g_i(x) > g_j(x) for all j ≠ i, then x is in R_i, and the decision rule assigns the feature vector x to the state of nature w_i. The regions are separated by decision boundaries, the surfaces in feature space where the largest discriminant functions are tied.


Fig. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R₂ is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution.

Image Source: Google Images

The Two Category Case

The two-category case is just a special instance of the multicategory case, but it has traditionally received separate treatment. A classifier that places a pattern in one of only two categories has a special name: a dichotomizer. Instead of using two discriminant functions g₁ and g₂ and assigning x to w₁ if g₁(x) > g₂(x), we can define a single discriminant function,

g(x) ≡ g₁(x) − g₂(x),

and use the following decision rule: decide w₁ if g(x) > 0; otherwise decide w₂.

Hence a dichotomizer can be seen as a system that computes a single discriminant function g(x) and classifies x according to the sign of the output. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two are particularly convenient:

g(x) = P(ω₁|x) − P(ω₂|x)

g(x) = ln(p(x|ω₁) / p(x|ω₂)) + ln(P(ω₁) / P(ω₂))
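The log-ratio form can be sketched as follows. The example assumes univariate normal class-conditional densities with equal variances and equal priors (all values are made up), in which case g(x) reduces to the linear function 2 − 2x and the decision boundary sits at x = 1.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Natural log of the univariate normal density."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def g(x, mu1=0.0, sigma1=1.0, prior1=0.5, mu2=2.0, sigma2=1.0, prior2=0.5):
    """g(x) = ln(p(x|w1) / p(x|w2)) + ln(P(w1) / P(w2)); defaults are assumed values."""
    log_likelihood_ratio = log_normal_pdf(x, mu1, sigma1) - log_normal_pdf(x, mu2, sigma2)
    return log_likelihood_ratio + math.log(prior1 / prior2)


def decide(x):
    """Return 1 for w1 if g(x) > 0, otherwise 2 for w2."""
    return 1 if g(x) > 0 else 2


print(decide(0.0), decide(3.0))  # 1 2
```

Changing the priors shifts the boundary towards the less likely class, because the term ln(P(ω₁)/P(ω₂)) adds a constant to g(x).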

Normal Density

By now we know that the structure of a Bayes classifier is determined by the class-conditional densities p(x|w_i) and the prior probabilities. Of the various density functions that have been investigated, none has received more attention than the multivariate normal density, largely because of its analytical tractability.

The rest of this article gives a brief exposition of the normal density, first in the univariate case and then in the multivariate case.

Univariate Normal density

The continuous univariate normal density p(x) is given by

p(x) = 1 / (√(2π) σ) · exp[ −(1/2) ((x − μ)/σ)² ]

The expected value of x (its average, or mean, over the feature space) is

μ ≡ E[x] = ∫ x p(x) dx, integrated from −∞ to ∞

and the variance is

σ² ≡ E[(x − μ)²] = ∫ (x − μ)² p(x) dx, integrated from −∞ to ∞

The density is completely specified by these two parameters, its mean and its variance. For brevity we write p(x) ~ N(μ, σ²), which is read as "x is normally distributed with mean μ and variance σ²".
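These definitions can be checked numerically. The sketch below uses a simple midpoint rule (chosen only for illustration) to confirm that the density integrates to about 1 and that the two integrals recover the mean and the variance; the values μ = 1 and σ = 2 are assumptions for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


mu, sigma = 1.0, 2.0                          # assumed example parameters
lo, hi = mu - 10 * sigma, mu + 10 * sigma     # the tails beyond this range are negligible

total = integrate(lambda x: normal_pdf(x, mu, sigma), lo, hi)
mean = integrate(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mu) ** 2 * normal_pdf(x, mu, sigma), lo, hi)

# Expect values close to 1, mu = 1 and sigma^2 = 4 respectively.
print(total, mean, var)
```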

Fig. Density curve of the normal distribution (CK-12 Foundation).

The entropy of a continuous distribution is given by

H(p(x)) = −∫ p(x) ln p(x) dx, integrated from −∞ to ∞

This is measured in nats; if log₂ is used instead, the unit is the bit. Entropy describes the fundamental uncertainty in the values of instances selected randomly from a distribution. For a discrete distribution it is non-negative, whereas the entropy of a continuous density, as defined above, can become negative when the density is very sharply peaked. As a matter of fact, the normal distribution has the maximum entropy of all distributions having a given mean and variance.
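As a rough numerical check, the sketch below evaluates −∫ p(x) ln p(x) dx for a normal density and compares it with the standard closed form ½ ln(2πeσ²) nats. It also compares against a uniform density with the same variance (width σ√12, entropy ln(width)), which should come out lower. The value σ = 1.5 and the midpoint integration are assumptions for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def entropy_nats(pdf, a, b, n=20000):
    """Midpoint-rule approximation of -integral of p(x) ln p(x) over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        p = pdf(a + (k + 0.5) * h)
        if p > 0.0:  # p ln p tends to 0 as p tends to 0
            total -= p * math.log(p) * h
    return total


sigma = 1.5  # assumed example value
h_normal = entropy_nats(lambda x: normal_pdf(x, 0.0, sigma), -10 * sigma, 10 * sigma)
h_closed = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Uniform density with the same variance: width = sigma * sqrt(12), entropy = ln(width).
h_uniform = math.log(sigma * math.sqrt(12.0))

print(h_normal, h_closed, h_uniform)
print(h_normal / math.log(2.0))  # the same entropy expressed in bits
```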

Why Is the Gaussian Important?

The central limit theorem states that the aggregate effect of a large number of small, independent random disturbances tends towards a Gaussian distribution. Many real-life patterns, from handwritten characters to speech sounds, can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, which makes the Gaussian a natural model for the class-conditional densities.
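A quick simulation illustrates the idea. In the sketch below, each sample is the sum of 30 small uniform disturbances (an arbitrary choice for the example). If the sums are approximately Gaussian, roughly 68% of them should fall within one standard deviation of the mean.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def noisy_measurement(n_disturbances):
    """Sum of independent uniform disturbances, each drawn from [-0.5, 0.5]."""
    return sum(random.uniform(-0.5, 0.5) for _ in range(n_disturbances))


n_disturbances, n_samples = 30, 20000
samples = [noisy_measurement(n_disturbances) for _ in range(n_samples)]

mean = sum(samples) / n_samples
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / n_samples)

# For a Gaussian, about 68.27% of the mass lies within one standard deviation.
within_one_std = sum(abs(s - mean) <= std for s in samples) / n_samples
print(mean, std, within_one_std)
```

Each uniform disturbance has variance 1/12, so the standard deviation of the sum should be close to √(30/12) ≈ 1.58.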

Multivariate Normal Density

The multivariate normal density in d dimensions is given by

p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp[ −(1/2) (x − μ)^t Σ⁻¹ (x − μ) ]

where,

x = d-component column vector

μ = d-component mean vector

Σ = d by d covariance matrix

|Σ| and Σ⁻¹ are the determinant and the inverse of Σ, respectively

(x – μ)^t is the transpose of (x – μ)
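For d = 2 the formula can be evaluated directly, since the determinant and inverse of a 2 by 2 matrix have simple closed forms. The sketch below is restricted to that case to stay dependency-free; the function name `mvn_pdf_2d` and the example matrices are assumptions for illustration.

```python
import math


def mvn_pdf_2d(x, mu, cov):
    """Bivariate normal density; cov must be a symmetric positive definite 2x2 matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^t Sigma^-1 (x - mu)
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    dim = 2
    norm = (2.0 * math.pi) ** (dim / 2) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm


identity = [[1.0, 0.0], [0.0, 1.0]]
peak = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], identity)  # equals 1 / (2 * pi)

correlated = [[2.0, 1.2], [1.2, 1.0]]                # assumed example covariance
print(peak, mvn_pdf_2d([1.0, 0.5], [0.0, 0.0], correlated))
```

When Σ is diagonal, the quadratic form splits into a sum of squared terms and the density factors into a product of univariate normal densities.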

Some basic prerequisites are

Inner product

a^t b = sum(from i=1 to i=d): a_i b_i

Mean

μ ≡ E[x] = ∫ x p(x) dx, integrated over the whole feature space

Covariance matrix

Σ ≡ E[(x − μ)(x − μ)^t] = ∫ (x − μ)(x − μ)^t p(x) dx, integrated over the whole feature space

If x_i is the i^th component of x, μ_i the i^th component of μ, and σ_ij the ij^th component of Σ, then

μ_i = E [ x_i]

and,

σ_ij= E [(x_i − μ_i)(x_j − μ_j)]

The covariance matrix plays a central role in this discussion. It is always symmetric and positive semidefinite. Here we restrict our attention to the case in which the covariance matrix is positive definite, so that its determinant is strictly positive and its inverse exists.

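The diagonal elements σ_ii are the variances of the respective x_i, and the off-diagonal elements σ_ij are the covariances of x_i and x_j. If x_i and x_j are statistically independent, then σ_ij = 0. For the multivariate normal density the converse also holds: if σ_ij = 0, then x_i and x_j are statistically independent.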
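The component-wise definitions translate directly into code. The sketch below estimates μ and Σ from a handful of made-up two-dimensional samples, replacing each expectation by an average over the samples (so it divides by n rather than n − 1), and then checks the symmetry and the determinant.

```python
def mean_and_covariance(samples):
    """Sample estimates of mu_i = E[x_i] and sigma_ij = E[(x_i - mu_i)(x_j - mu_j)]."""
    n, d = len(samples), len(samples[0])
    mu = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / n
            for j in range(d)]
           for i in range(d)]
    return mu, cov


# Made-up samples in which the second feature is roughly twice the first.
samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
mu, cov = mean_and_covariance(samples)

det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
print(mu)                        # component means
print(cov[0][0], cov[1][1])      # variances sigma_11 and sigma_22
print(cov[0][1], cov[1][0])      # covariances sigma_12 and sigma_21 are equal
print(det > 0)                   # positive determinant for this data set
```

The large positive σ₁₂ reflects the strong linear relationship between the two features in this toy data set.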

This ends today’s discussion!

In the next article, we will derive the discriminant functions for the normal density under different conditions, interpret each of them, and look at how these cases arise in real-life applications of Bayesian Decision Theory.

Discussion Problem

Determine the optimal decision boundary of the Naive Bayes classifier where w = {w₁, w₂}, p(x|w₁) = N(1, 1.5) and p(x|w₂) = N(2, 2.5). The prior probabilities are P(w₁) = 1/7 and P(w₂) = 6/7, and the loss matrix is [ [ 4, 3], [ 1, 2] ].

Note: Here N(x, y) indicates the normal density.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Is something not mentioned, or do you want to share your thoughts? Feel free to comment below and I’ll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
