**Introduction**

This is **Part-4** of the **four-part blog series** on **Bayesian Decision Theory**.

In the previous article, we discussed discriminant functions, classifiers, decision surfaces, and the normal density, covering both the univariate and the multivariate normal distribution. In this final article, we look at the discriminant functions for the normal density under different assumptions about the covariance matrix.

Links to the previous articles: **Part-1**, **Part-2**, and **Part-3**.

**Let’s get started,**

As discussed in the previous article, our discriminant function is given by,

**g_{i}(x) = ln p(x|ω_{i}) + ln P(ω_{i})**

Since the class-conditional density **p(x|ω_{i})** is assumed to follow the multivariate normal distribution N(μ_{i}, Σ_{i}), our discriminant function can be written as

g_{i}(x) = −1/2 (x − μ_{i})^{t} Σ_{i}^{−1} (x − μ_{i}) − (d/2) ln 2π − 1/2 ln |Σ_{i}| + ln P(ω_{i})
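As a rough illustration, the general discriminant function above can be evaluated numerically. The sketch below assumes NumPy; the function name `gaussian_discriminant` and all the example means, covariances, and priors are hypothetical values chosen only for demonstration.

```python
import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """g_i(x) = ln p(x|w_i) + ln P(w_i) for a multivariate normal class density."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    quad = diff @ np.linalg.solve(cov, diff)
    # slogdet returns (sign, log|cov|); cov is assumed positive definite.
    _, log_det = np.linalg.slogdet(cov)
    return (-0.5 * quad
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * log_det
            + np.log(prior))

# Hypothetical two-class example in two dimensions.
x = [1.0, 0.5]
g1 = gaussian_discriminant(x, mean=[0.0, 0.0],
                           cov=[[1.0, 0.0], [0.0, 1.0]], prior=0.5)
g2 = gaussian_discriminant(x, mean=[3.0, 3.0],
                           cov=[[2.0, 0.5], [0.5, 1.0]], prior=0.5)

# Assign x to the class with the larger discriminant value.
predicted_class = 1 if g1 > g2 else 2
print(g1, g2, predicted_class)
```

The three cases discussed next are simplifications of this one function under different assumptions on the covariance matrices.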

We will now examine this discriminant function in detail by dividing the covariance into different cases.

**Case-1 ( Σ_{i} = σ^{2}I )**

This case occurs when σ_{ij} = 0 for i ≠ j, i.e., the covariances are zero, and every feature has the same variance σ_{ii} = σ^{2}. For normally distributed features, this means the features are statistically independent, each with variance σ^{2}. Since |Σ_{i}| = σ^{2d} and the (d/2) ln 2π term are both independent of i, they add the same constant to every g_{i}(x), so they can be ignored.

Substituting our assumption in the normal discriminant function we get,

**g_{i}(x) = −||x − μ_{i}||^{2} / (2σ^{2}) + ln P(ω_{i})**

where the squared Euclidean norm is

**||x − μ_{i}||^{2} = (x − μ_{i})^{t}(x − μ_{i})**

Here we notice that our discriminant function is the sum of two terms: the squared distance from the mean, normalized by the variance, and the log of the prior. If x is equally near two different means, the decision favors the class with the higher prior.
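To make the role of the prior concrete, here is a minimal sketch of the Case-1 discriminant, assuming NumPy. The function name `case1_discriminant`, the means, the variance, and the priors are hypothetical. The test point is placed at equal distance from both means, so only the prior can break the tie.

```python
import numpy as np

def case1_discriminant(x, mean, sigma2, prior):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return -(diff @ diff) / (2.0 * sigma2) + np.log(prior)

# Hypothetical class means and shared variance.
means = np.array([[0.0, 0.0], [2.0, 0.0]])
sigma2 = 1.0
x = np.array([1.0, 0.0])  # equidistant from both means

# Equal priors: the two discriminant values tie.
equal = [case1_discriminant(x, m, sigma2, p)
         for m, p in zip(means, [0.5, 0.5])]

# Unequal priors: the class with the larger prior wins.
skewed = [case1_discriminant(x, m, sigma2, p)
          for m, p in zip(means, [0.8, 0.2])]

print(equal)
print(skewed)
```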

Expanding the squared norm, this can be rewritten as,

g_{i}(x) = −1/(2σ^{2}) [x^{t}x − 2μ_{i}^{t}x + μ_{i}^{t}μ_{i}] + ln P(ω_{i})

Here the quadratic term **x^{t}x** is the same for all i, so it can also be ignored. Hence, we arrive at the equivalent linear discriminant functions

**g_{i}(x) = w_{i}^{t}x + w_{i0}**

where

**w_{i} = (1/σ^{2}) μ_{i}**

and

**w_{i0} = −1/(2σ^{2}) μ_{i}^{t}μ_{i} + ln P(ω_{i})**

Here w_{i0} is the threshold or bias for the i^{th} category.
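The claim that dropping x^{t}x does not change the decision can be checked numerically. The sketch below assumes NumPy; `case1_linear_params` and the three-class example values are hypothetical. It compares the linear form with the distance-based form and shows that they differ by the same constant for every class.

```python
import numpy as np

def case1_linear_params(mean, sigma2, prior):
    """Return (w_i, w_i0) for the case Sigma_i = sigma^2 I."""
    mean = np.asarray(mean, dtype=float)
    w = mean / sigma2
    w0 = -(mean @ mean) / (2.0 * sigma2) + np.log(prior)
    return w, w0

# Hypothetical three-class problem in two dimensions.
means = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])
priors = np.array([0.5, 0.3, 0.2])
sigma2 = 0.5
x = np.array([0.3, -1.2])

params = [case1_linear_params(m, sigma2, p) for m, p in zip(means, priors)]
linear = np.array([w @ x + w0 for w, w0 in params])
quadratic = np.array([-((x - m) @ (x - m)) / (2.0 * sigma2) + np.log(p)
                      for m, p in zip(means, priors)])

# The two forms differ only by the class-independent term -x^t x / (2 sigma^2),
# so they always select the same class.
offsets = quadratic - linear
print(offsets)
print(np.argmax(linear) == np.argmax(quadratic))
```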

Such classifiers that use linear discriminant functions are often called linear machines.

For a linear machine, the decision surfaces are pieces of hyperplanes defined by the equation **g_{i}(x) = g_{j}(x)** for the two categories with the highest posterior probabilities.

Applying the above condition we get,

**w^{t}(x − x_{0}) = 0**

**w = μ_{i} − μ_{j}**

x_{0} = 1/2 (μ_{i} + μ_{j}) − { σ^{2} / ||μ_{i} − μ_{j}||^{2} } ln [P(ω_{i})/P(ω_{j})] (μ_{i} − μ_{j})

The above equation describes a hyperplane through the point x_{0} and orthogonal to the vector w. Since w = μ_{i} − μ_{j}, the hyperplane separating the two regions R_{i} and R_{j} is orthogonal to the line linking the means.

Further, if P(ω_{i}) = P(ω_{j}), the subtractive term in x_{0} vanishes, and the hyperplane is the perpendicular bisector of the line between the means. If the priors differ, x_{0} shifts away from the more likely mean.
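The behaviour of x_{0} can be sketched in a few lines, assuming NumPy. The helper names `case1_boundary` and `case1_discriminant`, and all numeric values, are hypothetical. The example computes x_{0} for equal and for unequal priors and checks that both discriminants agree at x_{0}.

```python
import numpy as np

def case1_discriminant(x, mean, sigma2, prior):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i)."""
    diff = x - mean
    return -(diff @ diff) / (2.0 * sigma2) + np.log(prior)

def case1_boundary(mu_i, mu_j, sigma2, p_i, p_j):
    """Return (w, x0) of the hyperplane w^t (x - x0) = 0."""
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (w @ w)) * np.log(p_i / p_j) * w
    return w, x0

# Hypothetical means and shared variance.
mu_i = np.array([0.0, 0.0])
mu_j = np.array([2.0, 0.0])
sigma2 = 1.0

# Equal priors: x0 is the midpoint of the two means.
_, x0_equal = case1_boundary(mu_i, mu_j, sigma2, 0.5, 0.5)

# Class i more likely: x0 moves toward mu_j, away from mu_i.
_, x0_skewed = case1_boundary(mu_i, mu_j, sigma2, 0.8, 0.2)

# On the boundary the two discriminants must be equal.
gap = (case1_discriminant(x0_skewed, mu_i, sigma2, 0.8)
       - case1_discriminant(x0_skewed, mu_j, sigma2, 0.2))

print(x0_equal, x0_skewed, gap)
```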

*Fig. As the priors change, the decision boundary through the point x_{0} shifts away from the more common class mean (two-dimensional Gaussian distributions)*

*Fig. As the priors change, the decision boundary through the point x_{0} shifts away from the more common class mean (one-dimensional Gaussian distributions)*

* Image Source: *link


**Case-2 ( Σ_{i} = Σ )**

This case rarely occurs in practice, but it is an important step in the transition from case-1 to the more general case-3. Here, the covariance matrices of all the classes are identical, but otherwise arbitrary. Once again we can ignore the (d/2) ln 2π and |Σ_{i}| terms, since both are independent of i. This leads us to the equation,

g_{i}(x) = −1/2(x − μ_{i})^{t}Σ^{−1}(x − μ_{i}) + ln P(ω_{i})

If the prior probabilities P(ω_{i}) are the same for all c classes, the ln P(ω_{i}) term can also be ignored, and the decision rule is to assign the feature vector x to the class with the nearest mean vector, where distance is measured by the squared Mahalanobis distance (x − μ_{i})^{t}Σ^{−1}(x − μ_{i}). If the prior probabilities are unequal, the decision is biased in favor of the a priori more likely class.
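With a shared covariance matrix, "nearest mean" means nearest in the sense of the quadratic form (x − μ_{i})^{t}Σ^{−1}(x − μ_{i}), known as the squared Mahalanobis distance, which can disagree with plain Euclidean distance. The sketch below assumes NumPy and equal priors; `mahalanobis_sq` and the example values are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mean
    return diff @ np.linalg.solve(cov, diff)

# Hypothetical shared covariance: large variance along the first feature,
# small variance along the second.
cov = np.array([[4.0, 0.0], [0.0, 0.25]])
means = np.array([[0.0, 0.0], [3.0, 1.0]])
x = np.array([2.0, 0.0])

euclid = [(x - m) @ (x - m) for m in means]
mahal = [mahalanobis_sq(x, m, cov) for m in means]

# Indices are 0-based: 0 is the first class, 1 is the second.
euclid_choice = int(np.argmin(euclid))
mahal_choice = int(np.argmin(mahal))
print(euclid_choice, mahal_choice)
```

In this example the second mean is closer in Euclidean terms, but the first mean is closer once the distance is scaled by the covariance, so the Case-2 rule picks the first class.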

Expanding the quadratic form **(x − μ_{i})^{t}Σ^{−1}(x − μ_{i})**, we again notice that the term **x^{t}Σ^{−1}x** is independent of i and can be ignored. After this term is dropped we get the linear discriminant function,

**g_{i}(x) = w_{i}^{t}x + w_{i0}**

**w_{i} = Σ^{−1}μ_{i}**

**w_{i0} = −1/2 μ_{i}^{t}Σ^{−1}μ_{i} + ln P(ω_{i})**

As the discriminant functions are linear, the resulting decision boundaries are again hyperplanes with the equation,

**w^{t}(x − x_{0}) = 0**

**w = Σ^{−1}(μ_{i} − μ_{j})**

x_{0} = 1/2 (μ_{i} + μ_{j}) − { ln [P(ω_{i})/P(ω_{j})] / (μ_{i} − μ_{j})^{t}Σ^{−1}(μ_{i} − μ_{j}) } (μ_{i} − μ_{j})

Compared with the one in case 1, this hyperplane is generally not orthogonal to the line between the means, because w = Σ^{−1}(μ_{i} − μ_{j}) is generally not in the direction of μ_{i} − μ_{j}. However, it still intersects that line at x_{0}, which lies halfway between the means if the priors are equal.
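Both properties follow directly from the formulas for w and x_{0}: with equal priors the log term is zero, so x_{0} is the midpoint, while Σ^{−1} generally rotates w away from μ_{i} − μ_{j}. A small sketch, assuming NumPy, with a hypothetical helper `case2_boundary` and made-up example values:

```python
import numpy as np

def case2_boundary(mu_i, mu_j, cov, p_i, p_j):
    """Return (w, x0) of the hyperplane w^t (x - x0) = 0 for a shared covariance."""
    diff = mu_i - mu_j
    w = np.linalg.solve(cov, diff)  # Sigma^{-1} (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / (diff @ w)) * diff
    return w, x0

# Hypothetical means and a shared, non-diagonal covariance matrix.
mu_i = np.array([0.0, 0.0])
mu_j = np.array([2.0, 0.0])
cov = np.array([[1.0, 0.5], [0.5, 1.0]])

w, x0_equal = case2_boundary(mu_i, mu_j, cov, 0.5, 0.5)
_, x0_skewed = case2_boundary(mu_i, mu_j, cov, 0.8, 0.2)

# A nonzero 2-D cross product means w is not parallel to mu_i - mu_j,
# so the boundary is not orthogonal to the line between the means.
diff = mu_i - mu_j
cross = diff[0] * w[1] - diff[1] * w[0]

print(w, cross)
print(x0_equal, x0_skewed)
```

With unequal priors x_{0} still lies on the line through the two means, only shifted away from the more likely class.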

*Fig. The contour lines are elliptical in shape because the covariance matrix is not of the form σ^{2}I. However, both densities show the same elliptical shape. The prior probabilities are the same, and so the point x_{0} lies halfway between the 2 means. The decision boundary is not orthogonal to the red line. Instead, it is tilted so that its points are of equal distance to the contour lines in ω_{1} and those in ω_{2}.*

*Fig. The contour lines are elliptical, but the prior probabilities are different. Although the decision boundary is still parallel to the one obtained with equal priors, it has been shifted away from the more likely class. With sufficient bias, the decision boundary can be shifted so that it no longer lies between the 2 means.*

* Image Source: *link


**Case-3 ( Σ_{i} = arbitrary )**

We now come to the most realistic case, in which the covariance matrix is different for each class. Now the only term that can be dropped from the discriminant function of the multivariate normal density is (d/2) ln 2π.

The resulting discriminant functions are no longer linear; they are inherently quadratic:

g_{i}(x) = x^{t}W_{i}x + w_{i}^{t}x + w_{i0}

**W_{i} = −1/2 Σ_{i}^{−1}**

**w_{i} = Σ_{i}^{−1}μ_{i}**

**w_{i0} = −1/2 μ_{i}^{t}Σ_{i}^{−1}μ_{i} − 1/2 ln |Σ_{i}| + ln P(ω_{i})**
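As a sanity check, the quadratic form built from W_{i}, w_{i}, and w_{i0} should match the original discriminant with only the (d/2) ln 2π term removed. The sketch below assumes NumPy; `case3_params`, `case3_discriminant`, and the example class parameters are hypothetical.

```python
import numpy as np

def case3_params(mean, cov, prior):
    """Return (W_i, w_i, w_i0) for an arbitrary class covariance matrix."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mean
    _, log_det = np.linalg.slogdet(cov)  # log|cov|, cov assumed positive definite
    w0 = -0.5 * (mean @ cov_inv @ mean) - 0.5 * log_det + np.log(prior)
    return W, w, w0

def case3_discriminant(x, W, w, w0):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    return x @ W @ x + w @ x + w0

# Hypothetical class parameters and test point.
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.3], [0.3, 0.5]])
prior = 0.4
x = np.array([0.5, 0.5])

W, w, w0 = case3_params(mean, cov, prior)
quadratic_form = case3_discriminant(x, W, w, w0)

# Direct evaluation of the general discriminant without the (d/2) ln 2 pi term.
diff = x - mean
direct = (-0.5 * diff @ np.linalg.solve(cov, diff)
          - 0.5 * np.log(np.linalg.det(cov))
          + np.log(prior))

print(quadratic_form, direct)
```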

The decision surfaces are hyperquadrics, which can take any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids. The extension to more than two categories is straightforward. Some interesting decision boundaries in 2-D and 3-D can be seen below.

*Fig. Two bivariate normals with completely different covariance matrices, showing a hyperquadric decision boundary.*

*Fig. Example of parabolic decision surface.*

*Fig. Example of straight decision surface.*

*Fig. Example of hyperbolic decision surface.*

**Image Source: link**
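The quadratic nature of Case-3 is easiest to see in one dimension, where g_{1}(x) − g_{2}(x) is an ordinary quadratic in x and the decision boundary consists of its real roots. The following sketch assumes NumPy; `univariate_boundaries` and the example parameters are hypothetical and were picked only to give a symmetric result.

```python
import numpy as np

def univariate_boundaries(mu1, var1, p1, mu2, var2, p2):
    """Real roots of g_1(x) - g_2(x) = 0 for two univariate normal classes.

    Assumes var1 != var2, so the difference is a genuine quadratic in x.
    """
    a = -0.5 / var1 + 0.5 / var2
    b = mu1 / var1 - mu2 / var2
    c = (-0.5 * mu1**2 / var1 + 0.5 * mu2**2 / var2
         - 0.5 * np.log(var1) + 0.5 * np.log(var2)
         + np.log(p1) - np.log(p2))
    roots = np.roots([a, b, c])
    # Keep only real roots; complex roots would mean no boundary exists.
    real_roots = roots[np.isreal(roots)].real
    return np.sort(real_roots)

# Hypothetical example: same mean, different variances, equal priors.
boundaries = univariate_boundaries(mu1=0.0, var1=1.0, p1=0.5,
                                   mu2=0.0, var2=4.0, p2=0.5)
print(boundaries)
```

Here the narrower class wins between the two boundary points and the wider class wins outside them, so one class region is not connected. This cannot happen with the linear boundaries of Case-1 and Case-2.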

**This completes our Bayesian Decision Theory in a detailed manner!**

**Discussion Problems**

__Problem-1:__

Consider a two-class, one-feature classification problem with Gaussian densities p(x|ω_{1}) = N(0, 1) and p(x|ω_{2}) = N(1, 2). Assume equal prior probabilities. Now, answer the questions below:

1. Determine the decision boundary.

2. If the prior probabilities are changed to P(ω_{1}) = 0.6 and P(ω_{2}) = 0.4, calculate the new decision boundary.

__Problem-2:__

If there are 3n variables and one binary label, how many parameters are required to represent the naive Bayes classifier?

__Problem-3:__

In the discriminant functions for the normal density, where the features are independent and each feature has the same variance σ^{2}, we can easily calculate the determinant of **Σ** by ____?

Try to solve the Practice Questions and answer them in the comment section below.

For any further queries feel free to contact me.


**End Notes**

*Thanks for reading!*

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the **Link**

Please feel free to contact me on **Linkedin, Email**.

Something not mentioned, or want to share your thoughts? Feel free to comment below and I’ll get back to you.

__About the author__

**Chirag Goyal**

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the **Indian Institute of Technology Jodhpur (IITJ)**. I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.

*The media shown in this article on Bayesian Decision Theory are not owned by Analytics Vidhya and are used at the Author’s discretion.*