Variance

The variance of a dataset is the mean-squared deviation between the data and the empirical mean. For a dataset of $n$ points $x_1, \dots, x_n$ with empirical mean $\bar{x}$, the variance is given by:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Variance is not a good measure of spread

Because the variance averages squared deviations, it is expressed in the square of the data's units (for data measured in metres, the variance is in square metres) and so is not on the same scale as the data. As a result, we introduce a new measure of spread:

Standard Deviation

The Standard Deviation is the root-mean-squared deviation from the mean, given by:

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Because it is expressed in the same units as the data, it is usually easier to interpret as a measure of spread than the variance.
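The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the "mean-squared" form that divides by the number of points $n$; the function names and the sample data are hypothetical.

```python
import math


def variance(data):
    """Mean of the squared deviations from the empirical mean (divides by n)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def standard_deviation(data):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(data))


# Hypothetical sample: lengths in centimetres, with mean 5.0
lengths_cm = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(variance(lengths_cm))            # 4.0 (square centimetres)
print(standard_deviation(lengths_cm))  # 2.0 (centimetres)
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same divide-by-$n$ quantities. `statistics.variance` and `statistics.stdev` divide by $n - 1$ instead, so they return slightly larger values on the same data.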

68-95-99.7 Rule

For data following a normal distribution, it can be useful to know that, roughly:

  • 68% of the data falls within 1 standard deviation of the mean;
  • 95% of the data falls within 2 standard deviations of the mean;
  • 99.7% of the data falls within 3 standard deviations of the mean.

As such, this is known as the 68-95-99.7 Rule.
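These percentages can be checked numerically. For a normal distribution, the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k / \sqrt{2})$, so a short sketch using only the standard library is enough; the helper name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rule's figures are therefore rounded values, and they apply only to data that is at least approximately normal.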

Mean Absolute Difference

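The Mean Absolute Difference (MAD) is the average absolute deviation from the mean, given by:

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$

and is typically more robust than the standard deviation. This means that it is less sensitive to outliers: deviations enter linearly rather than squared, so a single extreme value inflates it less.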
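The effect of an outlier on the two measures can be compared directly. The sketch below assumes the MAD is taken as the average absolute deviation from the mean; the function name and the sample data are hypothetical.

```python
import statistics


def mean_absolute_deviation(data):
    """Average absolute deviation from the empirical mean (assumed definition of MAD)."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


clean = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
with_outlier = clean + [50.0]  # one extreme value appended

# Growth factor of each measure once the outlier is added
mad_growth = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(clean)
sd_growth = statistics.pstdev(with_outlier) / statistics.pstdev(clean)

print(f"MAD grows by a factor of {mad_growth:.2f}")
print(f"SD  grows by a factor of {sd_growth:.2f}")
```

On this sample the standard deviation grows by a larger factor than the MAD, which is the sense in which the MAD is less sensitive to the outlier.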

Inter-quartile Range

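The inter-quartile range (IQR), sometimes called the inter-quartile distance (IQD), is the range of the central half of the dataset. That is,

$$\mathrm{IQR} = Q_3 - Q_1$$

where $Q_1$ and $Q_3$ are the first and third quartiles (the 25th and 75th percentiles) of the data.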

Detecting Outliers

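A common method for detecting outliers uses the IQR. Any datum that is

  • Larger than $Q_3 + 1.5 \times \mathrm{IQR}$; or
  • Smaller than $Q_1 - 1.5 \times \mathrm{IQR}$

is considered to be an outlier, where $Q_1$ and $Q_3$ are the first and third quartiles.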
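The IQR-based check can be sketched as follows. This assumes the commonly used multiplier of 1.5 on the IQR and the "inclusive" quartile convention of Python's `statistics.quantiles`; other quartile conventions give slightly different fences. The function name `iqr_outliers` and the sample data are hypothetical.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points [Q1, median, Q3]
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


sample = [2, 4, 4, 4, 5, 5, 7, 9, 50]
print(iqr_outliers(sample))  # [50]
```

Here $Q_1 = 4$ and $Q_3 = 7$, so the IQR is 3 and the fences sit at $-0.5$ and $11.5$; only the value 50 falls outside them.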