Bayesian Decision Theory – Discriminant Functions For Normal Density (Part 4)

Chirag Goyal Last Updated : 08 Jun, 2021

This article was published as a part of the Data Science Blogathon

Introduction

This is Part-4 of the four-part blog series on Bayesian Decision Theory.

In the previous article, we talked about discriminant functions, classifiers, Decision Surfaces, and Normal density including Univariate as well as Multivariate Normal Distribution. Now, in the final article, we will look at the discriminant functions for normal density under various considerations in Bayesian Decision Theory.

Links to the previous articles: Part-1, Part-2, and Part-3.

Let’s get started,

As discussed in the previous article, our discriminant function is given by,

g_i(x) = ln p(x|ω_i) + ln P(ω_i)

Since the class-conditional density p(x|ω_i) is multivariate normal, our discriminant function can be written as

g_i(x) = −1/2 (x − μ_i)^t Σ_i⁻¹ (x − μ_i) − (d/2) ln 2π − 1/2 ln |Σ_i| + ln P(ω_i)
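As a quick sanity check of the formula above, here is a minimal sketch for the one-dimensional case (d = 1), where Σ_i reduces to a scalar variance. The function name discriminant_1d and the numbers are made up for illustration; the closed form is compared against ln p(x|ω_i) + ln P(ω_i) computed from Python's statistics.NormalDist.

```python
import math
from statistics import NormalDist

def discriminant_1d(x, mu, var, prior):
    """g_i(x) for d = 1, where the covariance is the scalar variance `var`."""
    return (-0.5 * (x - mu) ** 2 / var
            - 0.5 * math.log(2 * math.pi)
            - 0.5 * math.log(var)
            + math.log(prior))

# Hypothetical class parameters, chosen only for illustration.
x, mu, var, prior = 0.7, 1.5, 2.0, 0.3

g_closed_form = discriminant_1d(x, mu, var, prior)
# Same quantity built directly from the density: ln p(x|w) + ln P(w).
g_from_density = math.log(NormalDist(mu, math.sqrt(var)).pdf(x)) + math.log(prior)

print(g_closed_form, g_from_density)
```

The two printed values should agree up to floating-point rounding.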

We will now examine this discriminant function in detail by considering different cases for the covariance matrix.

Case-1 ( Σ_i = σ² I )

This case occurs when σ_ij = 0 for i ≠ j, i.e., the covariances are zero, and the variance σ_ii of every feature is σ². This implies that the features are statistically independent and each feature has the same variance σ². Since both |Σ_i| and the (d/2) ln 2π term are independent of i, they are unimportant additive constants and can be ignored.

Substituting our assumption in the normal discriminant function we get,

g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)

Where the Euclidean norm is

||x − μ_i||² = (x − μ_i)^t(x − μ_i)

Here we notice that our discriminant function is the sum of two terms: the squared distance from the mean, normalized by the variance, and the log of the prior. If x is equally near two different means, the decision favors the class with the higher prior.

Expanding the squared norm, the equation can be written as,

g_i(x) = −1/(2σ²) [x^t x − 2μ_i^t x + μ_i^t μ_i] + ln P(ω_i)

Here the quadratic term x^t x is the same for all i, so it can also be ignored. Hence, we arrive at the equivalent linear discriminant functions,

g_i(x) = w_i^t x + w_i0

where

w_i = (1/σ²) μ_i

and w_i0 = −1/(2σ²) μ_i^t μ_i + ln P(ω_i)

Here w_i0 is the threshold or bias for the i-th category.

Such classifiers that use linear discriminant functions are often called linear machines.
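The linear machine for Case-1 can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the helper names (linear_machine, classify, distance_score), the three class means, and the priors are all hypothetical. It also checks that the linear form picks the same class as the distance form, since the two differ only by the class-independent term x^t x / (2σ²).

```python
import math

def linear_machine(means, sigma2, priors):
    """Return (w_i, w_i0) per class for the case where every covariance is sigma2 * I."""
    params = []
    for mu, p in zip(means, priors):
        w = [m / sigma2 for m in mu]
        w0 = -sum(m * m for m in mu) / (2 * sigma2) + math.log(p)
        params.append((w, w0))
    return params

def classify(x, params):
    """Index of the class with the largest g_i(x) = w_i^t x + w_i0."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) + w0 for w, w0 in params]
    return max(range(len(scores)), key=scores.__getitem__)

def distance_score(x, mu, sigma2, prior):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma2) + ln P, before dropping x^t x."""
    sq = sum((xk - mk) ** 2 for xk, mk in zip(x, mu))
    return -sq / (2 * sigma2) + math.log(prior)

# Hypothetical three-class problem in two dimensions.
means = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
priors = [0.5, 0.25, 0.25]
sigma2 = 1.0

params = linear_machine(means, sigma2, priors)
x = [2.0, 0.5]
label = classify(x, params)

# The distance form must select the same class as the linear form.
dist_scores = [distance_score(x, mu, sigma2, p) for mu, p in zip(means, priors)]
label_from_distance = max(range(len(dist_scores)), key=dist_scores.__getitem__)
print(label, label_from_distance)
```

For this point both forms select the class with index 1, whose mean (3, 0) is nearest.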

For a linear machine, the decision surfaces are pieces of hyperplanes defined by the equation g_i(x) = g_j(x) for the two categories with the highest posterior probabilities.

Applying the above condition we get,

w^t(x − x₀) = 0

w = μ_i − μ_j

x₀ = 1/2 (μ_i + μ_j) − [σ² / ||μ_i − μ_j||²] ln [P(ω_i)/P(ω_j)] (μ_i − μ_j)

The above equation describes a hyperplane through the point x₀ and orthogonal to the vector w. Since w = μ_i − μ_j, the hyperplane separating the two regions R_i and R_j is orthogonal to the line joining the means.

Further, if P(ω_i) = P(ω_j), the subtractive term in x₀ vanishes, and the hyperplane is the perpendicular bisector of the line between the means.
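To see how the priors move x₀, here is a small sketch that evaluates the Case-1 boundary formulas for two made-up means. The function names and all numbers are assumptions for illustration only. It checks that x₀ really lies on the boundary, i.e. that g_i(x₀) = g_j(x₀).

```python
import math

def boundary_case1(mu_i, mu_j, sigma2, p_i, p_j):
    """Return (w, x0) of the hyperplane w^t (x - x0) = 0 when every covariance is sigma2 * I."""
    w = [a - b for a, b in zip(mu_i, mu_j)]
    sq_dist = sum(c * c for c in w)  # ||mu_i - mu_j||^2
    shift = sigma2 / sq_dist * math.log(p_i / p_j)
    x0 = [0.5 * (a + b) - shift * c for a, b, c in zip(mu_i, mu_j, w)]
    return w, x0

def g(x, mu, sigma2, prior):
    """Case-1 discriminant: -||x - mu||^2 / (2 sigma2) + ln P."""
    sq = sum((xk - mk) ** 2 for xk, mk in zip(x, mu))
    return -sq / (2 * sigma2) + math.log(prior)

# Hypothetical means on the x-axis, unit variance.
mu_i, mu_j, sigma2 = [0.0, 0.0], [4.0, 0.0], 1.0

_, x0_equal = boundary_case1(mu_i, mu_j, sigma2, 0.5, 0.5)
_, x0_skewed = boundary_case1(mu_i, mu_j, sigma2, 0.8, 0.2)

print(x0_equal)   # midpoint of the two means
print(x0_skewed)  # pushed toward mu_j, away from the more likely class
print(g(x0_skewed, mu_i, sigma2, 0.8) - g(x0_skewed, mu_j, sigma2, 0.2))
```

With equal priors x₀ is the midpoint of the means; with P(ω_i) = 0.8 it moves toward μ_j, which enlarges the region assigned to the more likely class.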

Fig. As the priors change, the decision boundary through the point x₀ shifts away from the more common class mean (two-dimensional Gaussian distributions)

Fig. As the priors change, the decision boundary through the point x₀ shifts away from the more common class mean (one-dimensional Gaussian distributions)

Image Source: link

Case-2 ( Σ_i = Σ )

This case rarely occurs, but it is an important transition from case-1 to the more general case-3. In this case, the covariance matrices of all the classes are identical. Here too we can ignore the terms (d/2) ln 2π and 1/2 ln |Σ_i|, since they are the same for every class. This leads us to the equation,

g_i(x) = −1/2(x − μ_i)^tΣ⁻¹(x − μ_i) + ln P(ω_i)

If the prior probabilities P(ω_i) are the same for all the classes, the decision rule is to assign the feature vector x to the class whose mean vector is nearest, where distance is measured by the squared Mahalanobis distance (x − μ_i)^t Σ⁻¹ (x − μ_i). If the prior probabilities are unequal, the decision is biased in favor of the a priori more likely class.

Expanding the quadratic form (x − μ_i)^t Σ⁻¹ (x − μ_i), we again notice that the term x^t Σ⁻¹ x is independent of i and can be ignored. After this term is dropped, we get the linear discriminant function,

g_i(x) = w_i^tx + w_i0

w_i = Σ⁻¹μ_i

w_i0 = −1/2 μ_i^t Σ⁻¹ μ_i + ln P(ω_i)

As the discriminant functions are linear, the resulting decision boundaries are again hyperplanes with the equation,

w^t(x − x₀) = 0

w = Σ⁻¹(μ_i −μ_j)

x₀ = 1/2 (μ_i + μ_j) − { ln [P(ω_i)/P(ω_j)] / [(μ_i − μ_j)^t Σ⁻¹ (μ_i − μ_j)] } (μ_i − μ_j)

Unlike the hyperplane in case 1, this hyperplane is generally not orthogonal to the line between the means, because w = Σ⁻¹(μ_i − μ_j) is generally not in the direction of μ_i − μ_j. It does, however, intersect that line at the point x₀, which is halfway between the means if the priors are equal.
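The Case-2 boundary can be checked numerically with a small sketch. To stay free of external libraries it is restricted to two dimensions, with a hand-written 2x2 inverse. The helper names (inv2, matvec, dot, boundary_case2) and the shared covariance matrix are assumptions made for illustration.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mk * vk for mk, vk in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def boundary_case2(mu_i, mu_j, cov, p_i, p_j):
    """Return (w, x0) of the hyperplane w^t (x - x0) = 0 for a shared covariance."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    w = matvec(inv2(cov), diff)   # w = cov^-1 (mu_i - mu_j)
    maha_sq = dot(diff, w)        # (mu_i - mu_j)^t cov^-1 (mu_i - mu_j)
    shift = math.log(p_i / p_j) / maha_sq
    x0 = [0.5 * (a + b) - shift * c for a, b, c in zip(mu_i, mu_j, diff)]
    return w, x0

def g(x, mu, cov, prior):
    """Case-2 discriminant: -1/2 (x - mu)^t cov^-1 (x - mu) + ln P."""
    d = [a - b for a, b in zip(x, mu)]
    return -0.5 * dot(d, matvec(inv2(cov), d)) + math.log(prior)

# Hypothetical means and a shared, non-diagonal covariance matrix.
mu_i, mu_j = [0.0, 0.0], [2.0, 0.0]
cov = [[2.0, 1.0], [1.0, 1.0]]

w, x0 = boundary_case2(mu_i, mu_j, cov, 0.5, 0.5)
w_skew, x0_skew = boundary_case2(mu_i, mu_j, cov, 0.7, 0.3)

# A nonzero 2-D cross product means w is not parallel to mu_i - mu_j.
diff = [a - b for a, b in zip(mu_i, mu_j)]
cross = diff[0] * w[1] - diff[1] * w[0]
print(w, x0, cross)
```

Here w is not parallel to μ_i − μ_j, so the boundary is tilted relative to the line joining the means, while with equal priors x₀ is still their midpoint.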

Bayesian Decision Theory 3

Fig. The contour lines are elliptical in shape because the covariance matrix is not diagonal. However, both densities show the same elliptical shape. The prior probabilities are the same, and so the point x0 lies halfway between the 2 means. The decision boundary is not orthogonal to the red line. Instead, it is tilted so that its points are of equal distance to the contour lines in w1 and those in w2.

Fig. The contour lines are elliptical, but the prior probabilities are different. Although the decision boundary is a parallel line, it has been shifted away from the more likely class. With sufficient bias, the decision boundary can be shifted so that it no longer lies between the 2 means

Image Source: link

Case-3 ( Σ_i = arbitrary )

We now come to the most realistic case, in which the covariance matrix is different for each class. Now the only term that can be dropped from the discriminant function of the multivariate normal density is (d/2) ln 2π.

The resulting discriminant function is no longer linear; it is inherently quadratic:

g_i(x) = x^tW_ix + w_i^tx + w_i0

W_i = −1/2 Σ_i⁻¹

w_i = Σ_i⁻¹μ_i

w_i0 = −1/2 μ_i^t Σ_i⁻¹ μ_i − 1/2 ln |Σ_i| + ln P(ω_i)

The decision surfaces are hyperquadrics, which can take any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperparaboloids, and other more exotic shapes. The extension to more than two categories is straightforward. Some 2-D and 3-D decision boundaries can be seen below.
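Even in one dimension the quadratic discriminant can produce a decision region that is not connected. The sketch below is an illustration only, with assumed function names and made-up class parameters: it writes g_i(x) = W_i x² + w_i x + w_i0 for d = 1 and solves g_1(x) = g_2(x) with the quadratic formula.

```python
import math

def quadratic_coeffs(mu, var, prior):
    """(W_i, w_i, w_i0) of g_i(x) = W_i x^2 + w_i x + w_i0 for d = 1."""
    W = -0.5 / var
    w = mu / var
    w0 = -0.5 * mu * mu / var - 0.5 * math.log(var) + math.log(prior)
    return W, w, w0

def boundaries_1d(class_1, class_2):
    """Real solutions of g_1(x) = g_2(x); assumes the two variances differ."""
    a, b, c = (p - q for p, q in zip(quadratic_coeffs(*class_1),
                                     quadratic_coeffs(*class_2)))
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # the discriminants never cross
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

def g(x, mu, var, prior):
    """Evaluate the quadratic discriminant at x."""
    W, w, w0 = quadratic_coeffs(mu, var, prior)
    return W * x * x + w * x + w0

# Hypothetical classes (mean, variance, prior): same mean, different spread.
narrow = (0.0, 1.0, 0.5)
wide = (0.0, 4.0, 0.5)

lo, hi = boundaries_1d(narrow, wide)
print(lo, hi)
```

With these numbers the two boundary points are symmetric about zero: the narrow class is chosen between them, and the wide class is chosen on both sides, so its decision region consists of two separate intervals.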

Fig. Two bivariate normal densities with completely different covariance matrices, showing a hyperquadric decision boundary.

Fig. Example of parabolic decision surface.

Fig. Example of straight decision surface.

Fig. Example of hyperbolic decision surface.

Image Source: link

This completes our detailed discussion of Bayesian Decision Theory!

Discussion Problems

Problem-1:

Consider a two-class, one-feature classification problem with Gaussian densities p(x|ω_1) = N(0, 1) and p(x|ω_2) = N(1, 2). Assume equal prior probabilities. Now, answer the questions below:

1. Determine the decision boundary.

2. If the prior probabilities are changed to P(ω_1) = 0.6 and P(ω_2) = 0.4, what is the new decision boundary?

Problem-2:

If there are 3n variables and one binary label, how many parameters are required to represent the naive Bayes classifier?

Problem-3:

In the discriminant functions for the normal density, where the features are independent and each feature has the same variance σ², we can easily calculate the determinant of Σ by____?

Try to solve the Practice Questions and answer them in the comment section below.

For any further queries feel free to contact me.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned or want to share your thoughts? Feel free to comment below and I’ll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

The media shown in this article on Bayesian Decision Theory are not owned by Analytics Vidhya and are used at the Author’s discretion.

Chirag Goyal

I am a B.Tech. student (Computer Science major) currently in the pre-final year of my undergrad. My interest lies in the field of Data Science and Machine Learning. I have been pursuing this interest and am eager to work more in these directions. I feel proud to share that I am one of the best students in my class who has a desire to learn many new things in my field.
