We want to learn about the distribution of the population from which the data were drawn. More specifically, we want to formally estimate the shape of the distribution, i.e. get a “reliable” visual representation such as a histogram.
Subintervals of the histogram are called bins, and the width of each subinterval is called the binwidth. A small binwidth leads to more bins and shows local details, which may or may not be meaningful. A large binwidth shows a smoother, large-scale picture, but we may lose interesting information. Here are some histograms of a random sample of size 200 generated from N(0, 1). We have a tradeoff between competing goals.

Histograms of the same sample with different numbers of bins.
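To see this tradeoff concretely, here is a minimal sketch (assuming numpy and matplotlib are available) that draws a sample of size 200 from N(0, 1) and plots histograms with a few different bin counts; the counts 5, 20, and 100 are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)   # fixed seed so the figure is reproducible
x = rng.standard_normal(200)      # sample of size 200 from N(0, 1)

# Few bins -> smooth, large-scale picture; many bins -> noisy local detail.
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, bins in zip(axes, [5, 20, 100]):
    ax.hist(x, bins=bins, density=True, edgecolor="black")
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()
```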
The histogram has some drawbacks:

1. It is not continuous: it is a step function with jumps at the bin edges.
2. Its shape depends on the choice of binwidth.
3. Its shape depends on the endpoints of the bins.
A smoother approach that gets around some of these drawbacks is called kernel density estimation. It gives a continuous estimate of the distribution and removes the dependence on endpoints, but the choice of binwidth (drawback 2) has an analogue here: the choice of bandwidth.
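For a quick look at what a kernel density estimate produces, scipy provides `gaussian_kde`; a minimal usage sketch (the evaluation grid is an arbitrary choice):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
x = rng.standard_normal(200)

kde = gaussian_kde(x)              # bandwidth chosen by Scott's rule by default
grid = np.linspace(-4, 4, 200)
density = kde(grid)                # continuous estimate of the PDF on the grid
```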
For any point $x$, each data value $x_i$ contributes to the estimate with a weight that depends on the distance of $x$ from $x_i$: nearby points are weighted more, and the weight decreases as the distance increases.
The weights are determined by a function called the kernel, which has an associated bandwidth. The kernel, $K(u)$, determines how the weights are calculated, and the bandwidth, $\Delta$ or $h$, sets the scale: how far from $x$ a data point can be and still be considered "close" enough to matter.
Let $\hat{f}(x)$ denote the kernel density estimator of $f(x)$, the PDF of the underlying distribution. $\hat{f}(x)$ is defined as
$$ \hat{f}(x) = \frac{1}{n\Delta} \sum\limits_{i=1}^n {K\left(\frac{x-x_i}{\Delta}\right)} $$
where $n$ is the sample size, $K$ is the kernel, and $x_1, \dots, x_n$ are the observed data.
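This formula translates directly into code. Below is a minimal sketch with a Gaussian kernel (one common choice of $K$; the bandwidth value 0.4 is an arbitrary choice for illustration):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal PDF, a common choice of kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, bandwidth):
    """Evaluate f_hat(x) = (1 / (n * Delta)) * sum_i K((x - x_i) / Delta)."""
    n = len(data)
    # Broadcast: one row per evaluation point x, one column per data point x_i.
    u = (np.asarray(x)[:, None] - np.asarray(data)[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(42)
data = rng.standard_normal(200)
grid = np.linspace(-4, 4, 100)
f_hat = kde(grid, data, bandwidth=0.4)
```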
To be called a kernel, $K(\cdot)$ has to satisfy certain properties: $K(\cdot)$ is a smooth function such that
$$ \begin{cases} K(x) \geq 0 \\ \int{K(x)dx} = 1 \\ \int{xK(x)dx = 0} \\ \int{x^2K(x)dx > 0} \end{cases} $$
The first two constraints make $K$ a density, and the last two ensure that this density has mean 0 and a positive variance.
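As a sanity check, these conditions can be verified numerically for the standard normal kernel, e.g. with scipy.integrate.quad (a sketch, not part of the definition):

```python
import numpy as np
from scipy.integrate import quad

K = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # Gaussian kernel

total, _ = quad(lambda x: K(x), -np.inf, np.inf)          # should be 1
mean, _  = quad(lambda x: x * K(x), -np.inf, np.inf)      # should be 0
var, _   = quad(lambda x: x**2 * K(x), -np.inf, np.inf)   # should be > 0 (here, 1)

print(total, mean, var)   # approximately 1.0, 0.0, 1.0
```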