A bagplot is a graphical tool for displaying the distribution of a bivariate dataset. It was introduced by Rousseeuw, Ruts, and Tukey in 1999 as a bivariate generalization of the classical boxplot, which is used for univariate data.
A bagplot is a scatterplot of a dataset with regions drawn around the points. The inner region, the “bag”, is a convex polygon containing the 50% of points with the greatest halfspace (Tukey) depth, that is, the bulk of the data. Inflating the bag by a factor of 3 about the deepest point (the depth median) gives the “fence”; points that fall outside the fence are flagged as potential outliers.
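In the standard bagplot construction, the bag is determined by halfspace (Tukey) depth: the depth of a point is the smallest number of data points in any closed half-plane whose boundary passes through it. The following is a minimal brute-force sketch of that idea, not a production algorithm. The function name `halfspace_depth` and the five-point dataset are assumptions introduced here for illustration, and the code assumes exact arithmetic (for instance, integer coordinates).

```python
def halfspace_depth(p, points):
    """Halfspace (Tukey) depth of p among 2-D points (brute force, O(n^2))."""
    px, py = p
    rel = [(x - px, y - py) for x, y in points]
    # Points coinciding with p lie in every closed half-plane through p.
    coincident = sum(1 for r in rel if r == (0, 0))
    others = [r for r in rel if r != (0, 0)]
    if not others:
        return coincident

    best = len(others)
    # The count can only change when the boundary line passes through a data
    # point, so it suffices to look at lines through p and each other point,
    # tilted infinitesimally to either side.
    for dx, dy in others:
        left = right = forward = backward = 0
        for x, y in others:
            cross = dx * y - dy * x
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif dx * x + dy * y > 0:
                forward += 1   # on the line, same side of p as (dx, dy)
            else:
                backward += 1  # on the line, opposite side of p
        best = min(best, min(left, right) + min(forward, backward))
    return best + coincident


# Hypothetical toy data: four corners of a square plus its center.
toy = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
print(halfspace_depth((2, 2), toy))  # 3: the center is the deepest point
print(halfspace_depth((0, 0), toy))  # 1: a corner sits on the convex hull
```

Points on the convex hull always have depth 1, while central points have larger depth; the bag collects the deepest half of the data.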
For example, consider the following dataset of two-dimensional points:
(1,1), (1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,3), (3,4), (3,5), (4,4), (4,5), (5,5)
We can create a bagplot of this dataset by first drawing a scatterplot of the points:
Next, we need to compute the median and median absolute deviation (MAD) of the x- and y-coordinates of the points. The median is simply the middle value of the sorted dataset, and the MAD is the median of the absolute deviations from the median. For the x-coordinates, the median is 2 and the MAD is 1. For the y-coordinates, the median is 4 and the MAD is 1.
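These values can be checked with a short Python sketch that uses only the standard library. The helper name `mad` is an assumption introduced here for illustration.

```python
from statistics import median

points = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
          (2, 2), (2, 3), (2, 4), (2, 5),
          (3, 3), (3, 4), (3, 5),
          (4, 4), (4, 5),
          (5, 5)]


def mad(values):
    """Median absolute deviation: median of |v - median(values)|."""
    m = median(values)
    return median(abs(v - m) for v in values)


xs = [x for x, _ in points]
ys = [y for _, y in points]

print(median(xs), mad(xs))  # 2 1
print(median(ys), mad(ys))  # 4 1
```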
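With these values, we can now draw the lines that approximate the bag. This coordinate-wise rule is a simplification: it produces a rectangle rather than the depth-based convex polygon of a full bagplot, and it does not guarantee that exactly 50% of the points fall inside. For each coordinate, the lines are defined as follows: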
The lower line is the median minus 1.5 times the MAD
The upper line is the median plus 1.5 times the MAD
For the x-coordinates, this gives us a lower line at 2 - 1.5 × 1 = 0.5 and an upper line at 2 + 1.5 × 1 = 3.5. For the y-coordinates, this gives us a lower line at 4 - 1.5 × 1 = 2.5 and an upper line at 4 + 1.5 × 1 = 5.5.
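The arithmetic above can be reproduced with a small sketch. The helper names `mad` and `bounds` and the parameter `k` are assumptions made for this illustration; `k = 1.5` is the constant used in the text.

```python
from statistics import median

points = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
          (2, 2), (2, 3), (2, 4), (2, 5),
          (3, 3), (3, 4), (3, 5),
          (4, 4), (4, 5),
          (5, 5)]


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median(abs(v - m) for v in values)


def bounds(values, k=1.5):
    """Return (median - k * MAD, median + k * MAD)."""
    m, spread = median(values), mad(values)
    return (m - k * spread, m + k * spread)


x_bounds = bounds([x for x, _ in points])
y_bounds = bounds([y for _, y in points])

print(x_bounds)  # (0.5, 3.5)
print(y_bounds)  # (2.5, 5.5)
```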
We can now draw these lines on the scatterplot, resulting in the following bagplot:
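As we can see, the lines enclose 9 of the 15 points. The six points outside the lines, (1,1), (1,2), (2,2), (4,4), (4,5), and (5,5), are indicated by the red dots. In this simplified construction they can be considered potential outliers, since they fall outside the region that holds the bulk of the data; a full bagplot would flag only points that lie beyond the fence.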
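To see which points fall inside and outside the lines, the classification can be sketched as follows. This follows the simplified coordinate-wise rule of this example, not the depth-based bag of a full bagplot; the names `mad`, `bounds`, and `is_inside` are assumptions introduced for illustration.

```python
from statistics import median

points = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
          (2, 2), (2, 3), (2, 4), (2, 5),
          (3, 3), (3, 4), (3, 5),
          (4, 4), (4, 5),
          (5, 5)]


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median(abs(v - m) for v in values)


def bounds(values, k=1.5):
    """Return (median - k * MAD, median + k * MAD)."""
    m, spread = median(values), mad(values)
    return (m - k * spread, m + k * spread)


x_lo, x_hi = bounds([x for x, _ in points])
y_lo, y_hi = bounds([y for _, y in points])


def is_inside(point):
    """True if the point lies within the lines in both coordinates."""
    x, y = point
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi


inside = [p for p in points if is_inside(p)]
outside = [p for p in points if not is_inside(p)]

print(len(inside), len(outside))  # 9 6
print(outside)  # [(1, 1), (1, 2), (2, 2), (4, 4), (4, 5), (5, 5)]
```

Here 9 of 15 points (60%) land inside, which illustrates that the coordinate-wise rule only roughly approximates a region holding half of the data.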
The bagplot is a useful tool for identifying potential outliers in bivariate data, and can be used in a variety of applications, such as quality control and statistical analysis. It is particularly useful when points are unusual only in the joint distribution of two variables, a case that separate univariate boxplots can miss.
One potential limitation of the bagplot is that it is based on arbitrary constants (e.g., the factor of 1.5 used to compute the lines in the example above, or the factor of 3 conventionally used to inflate the bag into the fence), which may not be appropriate for all datasets. Alternative approaches to outlier detection, such as robust distances based on the minimum covariance determinant (MCD) estimator, have also been proposed, although they rely on cutoff choices of their own.
Overall, the bagplot is a flexible tool for visualizing and analyzing bivariate data: it shows the location, spread, shape, and potential outliers of a dataset in a single plot.