Most ensemble methods reduce bias and/or variance, which in turn increases the accuracy of the predictions.

Bagging reduces variance, while Boosting works to reduce the bias of models that underfit.

The first thing required to understand Bagging is to understand how bootstrapping works.

A bootstrap sample is a "bag" of N examples picked from the dataset at random with replacement; in the same way, we make M bags from the same dataset.

Statistically, each bootstrap sample (bag) will contain approximately 63.2% unique examples if the dataset is large.
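
As a quick sanity check of that 63.2% figure, here is a minimal NumPy sketch (the dataset size N and the seed are my own choices for illustration) that draws one bag of N indices with replacement and measures the fraction of unique examples; for large N it comes out close to 1 − 1/e ≈ 0.632.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                       # size of the (hypothetical) dataset
indices = rng.integers(0, N, N)   # one bootstrap "bag": N draws with replacement

unique_fraction = len(np.unique(indices)) / N
print(f"fraction of unique examples in the bag: {unique_fraction:.3f}")
# For large N this approaches 1 - 1/e ≈ 0.632
```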

Now that we know how to create bootstrap samples, we use each bag as the training set for our model and obtain M hypotheses. At a high level, plain bootstrapping would select the hypothesis with the highest accuracy as the final model; in bagging, we instead take the mean of all the hypotheses created from the bags as our final hypothesis.
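
To make this concrete, here is a small from-scratch sketch of the averaging step, assuming a regression setting and scikit-learn's DecisionTreeRegressor as the base model (both assumptions are mine, not fixed by the text above): M bags are drawn with replacement, one tree is fit per bag, and the final prediction is the mean of the M individual predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Toy regression data: a noisy sine curve (illustrative only)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

M, N = 25, len(X)          # M bags, each of size N
hypotheses = []

for _ in range(M):
    bag = rng.integers(0, N, N)        # bootstrap indices, drawn with replacement
    tree = DecisionTreeRegressor()     # deep tree: low bias, high variance
    tree.fit(X[bag], y[bag])
    hypotheses.append(tree)

X_test = np.linspace(0, 6, 100).reshape(-1, 1)
# The bagged hypothesis is the mean of the M individual hypotheses
y_bagged = np.mean([h.predict(X_test) for h in hypotheses], axis=0)
```

For classification, the mean would typically be replaced by a majority vote or an average of the predicted class probabilities.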

To understand why this is effective, we need to understand three terms, Bias, Variance, and Noise, which will in turn help us understand how this algorithm works more clearly.

Let us now define each of the terms clearly:

  1. Variance: A model with high variance is probably overfitting the given data (variance measures the overfitting factor), i.e. how much the hypothesis h(x) varies from one training set to another. It is given as E[(h(x) − h̄(x))^2], where h(x) is one hypothesis and h̄(x) is the average of all hypotheses.

  2. Bias (a measure of mistake): A model with high bias is probably underfitting the given data (bias measures the underfitting factor), i.e. it describes the average deviation of h(x) from the true function f(x). It is given as:

    E[(h̄(x) − f̄(x))^2], where h̄(x) and f̄(x) are the means of the respective values.

  3. Noise: The difference between the true and observed output, which is inherently present in the data. It is given as:

    E[(y − f(x))^2], where y is the observed label and f(x) is the true function value.

Given a true label y* = f(x*) + e, where f(x*) is the true function and e is the added noise term, we now decompose the error of the prediction, i.e. the

MSE:

Z = E[(y*-h(x*))^2]

where h(x*) is the predicted value (ŷ) of a hypothesis h.

When we solve the above equation using the lemma:

E[(Z − Z̄)^2] = E[Z^2] − Z̄^2 (where Z̄ is the average value of Z)

We get:

Error = Variance + Bias^2 + Noise, where Bias^2 and Noise are exactly the expected squared quantities defined above.

(The math behind this is not that important for understanding the concept.)
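
For readers who do want to see the decomposition in action, here is a rough simulation sketch (the toy function, noise level, and depth-limited tree are my own choices): many training sets are drawn from y = f(x) + e, one hypothesis is fit per training set, and the error, variance, squared bias, and noise terms defined above are estimated and compared.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

f = lambda x: np.sin(3 * x)              # true function f(x)
sigma = 0.3                              # std of the label noise e
x_test = np.linspace(0, 2, 200).reshape(-1, 1)
n_train, n_rounds = 50, 500

preds, labels = [], []
for _ in range(n_rounds):
    # Fresh training set drawn from y = f(x) + e
    X = rng.uniform(0, 2, (n_train, 1))
    y = f(X).ravel() + rng.normal(0, sigma, n_train)
    h = DecisionTreeRegressor(max_depth=2).fit(X, y)   # deliberately simple, so it has some bias

    preds.append(h.predict(x_test))                                        # h(x*)
    labels.append(f(x_test).ravel() + rng.normal(0, sigma, len(x_test)))   # y* = f(x*) + e

preds, labels = np.array(preds), np.array(labels)
h_bar = preds.mean(axis=0)                                   # average hypothesis h̄(x*)

error    = np.mean((labels - preds) ** 2)                    # E[(y* - h(x*))^2]
variance = np.mean((preds - h_bar) ** 2)                     # E[(h(x*) - h̄(x*))^2]
bias_sq  = np.mean((h_bar - f(x_test).ravel()) ** 2)         # E[(h̄(x*) - f(x*))^2]
noise    = sigma ** 2                                        # E[(y* - f(x*))^2]

print(f"error               : {error:.3f}")
print(f"var + bias^2 + noise: {variance + bias_sq + noise:.3f}")
```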

Bagging is related to the trade-off between bias and variance. It reduces variance while keeping the bias constant or only slightly increasing it (this holds for many datasets, but ML is not absolute), which decreases the overall error. So we mostly apply bagging to classifiers with high variance that tend to overfit easily, such as deep decision trees, support vector machines, and kNN with low values of k, to increase the model's performance on test data.
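
As a concrete illustration of that last point, the sketch below (my own example) compares a single deep, unpruned decision tree against scikit-learn's BaggingClassifier, whose default base estimator is exactly such a tree, on a held-out test set; on most runs the bagged ensemble scores noticeably higher because averaging across the bags cancels out much of the individual trees' variance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (any dataset would do)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# A single deep (unpruned) decision tree: low bias, high variance
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Bagging: 100 such trees, each trained on its own bootstrap sample,
# with predictions aggregated across the trees
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree test accuracy :", single_tree.score(X_test, y_test))
print("bagged trees test accuracy:", bagged_trees.score(X_test, y_test))
```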

References used:

https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html

Chapter 14, Pattern Recognition and Machine Learning, Christopher Bishop