Introduction

When looking at binary classification problems, a common modelling approach is logistic regression, which uses the logistic function to model the probability that an observation belongs to one of two classes. However, while logistic regression is a valid approach, alternative methods may be required, in particular for datasets where the classes are completely (or almost completely) separated, since the maximum likelihood estimates of the logistic regression coefficients become unstable in that setting. In this article, we discuss two methods that do not suffer from this class separation issue: linear discriminant analysis (“LDA”) and quadratic discriminant analysis (“QDA”).

Linear Discriminant Analysis

The Bayes classifier assigns an observation to the class, \(k\), that maximises \(p_{k}(x) = Pr(Y=k|\textbf{X}=x)\), given the predictor vector \(\textbf{X} = x = (x_{1},x_{2},...,x_{p})\). By making use of Bayes' theorem, we can approximate \(p_{k}(x)\) using estimates of the distribution of \(\textbf{X}\) within the \(k\)-th class and of the probability that an observation belongs to the \(k\)-th class. We denote the density of \(\textbf{X}\) within the \(k\)-th class as \(f_{k}(x)\) and the prior probability that an observation belongs to the \(k\)-th class as \(Pr(Y=k)=\pi_{k}\). Applying Bayes' theorem yields the following expression:

\[Pr(Y=k|\textbf{X}=x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}\pi_{l}f_{l}(x)}\]

Estimating \(Pr(Y=k)\) is relatively simple: it is just the proportion of the training observations that belong to class \(k\), so \(\pi_{k}\) is estimated as \(\frac{n_{k}}{n}\).

For LDA, in order to estimate \(f_{k}(x)\), we make two simplifying assumptions: that \(f_{k}(x)\) is a Gaussian (normal) density with a class-specific mean, \(\mu_{k}\), and that the variance, \(\sigma^{2}\), is the same across all \(K\) classes.

Plugging these into the expression for \(p_{k}(x)\), taking logarithms and discarding the terms that do not depend on \(k\) produces the following discriminant function for the case where we have only one predictor variable (\(p=1\)):

\[\delta_{k}(x) = x\frac{\mu_{k}}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} + log(\pi_{k})\]

An observation with predictor variable \(x\) is assigned to the class \(k\) for which \(\delta_{k}(x)\) is largest.
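
To make this concrete, the short R sketch below estimates \(\pi_{k}\), \(\mu_{k}\) and the shared \(\sigma^{2}\) from a training set and applies the classification rule. The data frame train, with a numeric predictor x and a class label y, is purely illustrative and not part of the original analysis.

# Illustrative only: estimate pi_k, mu_k and the pooled sigma^2 from a
# hypothetical training set 'train' with predictor x and class label y
classes    <- sort(unique(train$y))
pi_hat     <- table(train$y) / nrow(train)          # pi_k = n_k / n
mu_hat     <- tapply(train$x, train$y, mean)        # class-specific means
sigma2_hat <- sum(tapply(train$x, train$y,
                         function(v) sum((v - mean(v))^2))) /
              (nrow(train) - length(classes))       # pooled variance estimate

# Discriminant scores for a new observation x; assign to the largest score
delta <- function(x) x * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
predicted_class <- classes[which.max(delta(1.5))]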

The above expression is based on the one-dimensional case where there is only one predictor variable, but we can easily extend this to the \(p\)-dimensional case by assuming that \(f_{k}(x)\) is a multivariate Gaussian density. The assumptions on the mean and variance are the same, with the means expressed as a (\(p \times 1\)) vector, \(\mu_{k}\), containing a mean value for each predictor variable. The variances are contained in a (\(p \times p\)) covariance matrix, \(\Sigma\), shared by all classes, with the predictor variances on the leading diagonal and the covariances between pairs of predictor variables in the off-diagonal elements. In this case, our discriminant function, \(\delta_{k}(x)\), becomes:

\[\delta_{k}(x) = x^{T}\Sigma^{-1}\mu_{k} - \frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + log(\pi_{k})\]
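
As an illustrative sketch (not the original analysis code), this multivariate score can be computed directly with matrix operations, assuming estimates mu_k, Sigma and pi_k have already been obtained from the training data:

# Linear discriminant score for class k, given the estimated mean vector mu_k,
# the pooled covariance matrix Sigma (p x p) and the prior pi_k
delta_lda <- function(x, mu_k, Sigma, pi_k) {
  Sigma_inv <- solve(Sigma)                                   # Sigma^{-1}
  as.numeric(t(x) %*% Sigma_inv %*% mu_k
             - 0.5 * t(mu_k) %*% Sigma_inv %*% mu_k) + log(pi_k)
}
# The observation x is assigned to the class with the largest delta_lda(x).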

Quadratic Discriminant Analysis

In the section on LDA, we noted our assumption that the variance-covariance matrix is constant across classes. If this holds, the Bayes classifier decision boundary will be linear and the LDA model should provide a good fit. However, if the covariance matrix differs between classes, the Bayes classifier decision boundary may be quadratic (i.e. non-linear) and a QDA model may provide a better fit.

QDA makes the same assumptions as LDA with respect to the distribution of \(\textbf{X}\) and the mean of \(\textbf{X}\) within each class, but differs in allowing each class to have its own covariance matrix, \(\Sigma_{k}\), for the predictors \(\textbf{X}\).

QDA Assumptions: within the \(k\)-th class, the predictors are drawn from a multivariate Gaussian distribution with a class-specific mean vector, \(\mu_{k}\), and a class-specific covariance matrix, \(\Sigma_{k}\), i.e. \(\textbf{X} \sim N(\mu_{k}, \Sigma_{k})\).

In the case of QDA, our function \(\delta_{k}(x)\) becomes:

\[\delta_{k}(x) = -\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k}) - \frac{1}{2}log(|\Sigma_{k}|) + log(\pi_{k})\]

Given that this expression is quadratic in \(x\), we expect that this method will provide a closer approximation of the Bayes classifier for a dataset with non-constant covariance between classes.
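
A matching sketch for the quadratic score, again assuming class-specific estimates mu_k, Sigma_k and pi_k are available, could look like this:

# Quadratic discriminant score for class k, using a class-specific Sigma_k
delta_qda <- function(x, mu_k, Sigma_k, pi_k) {
  d <- x - mu_k
  as.numeric(-0.5 * t(d) %*% solve(Sigma_k) %*% d) -
    0.5 * log(det(Sigma_k)) + log(pi_k)
}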

LDA vs QDA: What happens when our assumptions aren’t met?

In order to illustrate the impact our assumptions have on the predictive power of the LDA and QDA methods, we simulate two datasets, each with \(n=2000\) observations drawn from three classes: one in which all classes share a common covariance matrix, and one in which each class has its own covariance matrix.
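
The exact parameters used for the simulation are not reproduced here; the sketch below shows one way such datasets could be generated with MASS::mvrnorm, with the class sizes, means and covariance matrices chosen purely for illustration.

library(MASS)                                  # provides mvrnorm()
set.seed(123)

n_k <- c(200, 600, 1200)                       # illustrative class sizes (n = 2000 in total)
mu  <- list(c(-2, 0), c(0, 2), c(2, 0))        # illustrative class means (p = 2)

# Dataset 1: all three classes share a common covariance matrix
Sigma_common <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
equal_cov <- do.call(rbind, lapply(1:3, function(k)
  data.frame(mvrnorm(n_k[k], mu[[k]], Sigma_common), class = factor(k))))

# Dataset 2: each class has its own covariance matrix
Sigma_k <- list(matrix(c(1,  0.8,  0.8, 1.0), 2, 2),
                matrix(c(1, -0.6, -0.6, 2.0), 2, 2),
                matrix(c(2,  0.0,  0.0, 0.5), 2, 2))
unequal_cov <- do.call(rbind, lapply(1:3, function(k)
  data.frame(mvrnorm(n_k[k], mu[[k]], Sigma_k[[k]]), class = factor(k))))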

We visualise the datasets in the figure below:

First, we fit LDA and QDA models to the dataset with equal covariance by splitting the data into “Train” and “Test” portions and evaluating the model performance. In the equal covariance case, we use a relatively small number of training observations (\(n_{train}=50\)) to illustrate the bias-variance trade-off between LDA and QDA.
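
The original model-fitting code is not shown, but a minimal sketch of an equivalent workflow with the MASS package (using the illustrative equal_cov data frame from above and a 50-observation training split) would be:

library(MASS)                                  # lda() and qda()
set.seed(456)

train_idx <- sample(nrow(equal_cov), 50)       # n_train = 50; the rest forms the test set
train_eq  <- equal_cov[train_idx, ]
test_eq   <- equal_cov[-train_idx, ]

lda_fit <- lda(class ~ X1 + X2, data = train_eq)
qda_fit <- qda(class ~ X1 + X2, data = train_eq)

lda_pred <- predict(lda_fit, newdata = test_eq)$class
qda_pred <- predict(qda_fit, newdata = test_eq)$class

table(Predicted = lda_pred, Truth = test_eq$class)   # confusion matrices as shown below
table(Predicted = qda_pred, Truth = test_eq$class)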

LDA Equal Covariance Test Performance:

##          Truth
## Predicted    1    2    3
##         1  193    8    0
##         2    4  553   14
##         3    0   22 1156

QDA Equal Covariance Test Performance:

##          Truth
## Predicted    1    2    3
##         1  168    5    0
##         2   29  555   13
##         3    0   23 1157

In this case, we see that LDA produces an accuracy of 97.54% versus 96.41% for QDA. Given the relatively small number of training observations and the clear separation between classes, it is likely that the increased flexibility of the QDA model has resulted in higher variance and lower overall accuracy compared to the LDA model. From the visualisation of the dataset, we would expect the decision boundaries to be roughly linear, so the stronger performance of the LDA model is not surprising.
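
For reference, the accuracy figures quoted here and below are simply the proportion of correctly classified test observations, i.e. the sum of the diagonal of the confusion matrix divided by the total number of test observations:

conf_lda <- table(Predicted = lda_pred, Truth = test_eq$class)
sum(diag(conf_lda)) / sum(conf_lda)            # overall test accuracy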

Next, we look at the dataset with unequal covariance between classes and fit the LDA and QDA models.

LDA Unequal Covariance Test Performance:

##          Truth
## Predicted   1   2   3
##         1   0   0   0
##         2  54 134   9
##         3  11  12 280

QDA Unequal Covariance Test Performance:

##          Truth
## Predicted   1   2   3
##         1  49   0   2
##         2   9 145   4
##         3   7   1 283

For this dataset, we see that the LDA model has an accuracy of 82.8% versus 95.4% for the QDA model; indeed, LDA fails to assign any test observations to class 1. From the visualisation of the dataset, it is clear that the optimal decision boundary is likely to be non-linear, so we would expect the QDA model to produce a higher accuracy in this case.

References

James, G., Witten, D., Hastie, T. & Tibshirani, R. (2021) An Introduction to Statistical Learning. 2nd ed. New York, NY, Springer Science+Business Media, LLC, part of Springer Nature.