Robust Scale Estimates

A robust scale estimate, also known as a robust measure of scale, attempts to quantify the statistical dispersion (variability, scatter, spread) in a set of data which may contain outliers. In such datasets, the usual variance or standard deviation scale estimate can be rendered useless by even a single outlier.
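To make the sensitivity to a single outlier concrete, the following minimal sketch compares the sample standard deviation of a small dataset with and without one gross outlier. The two datasets are made up for illustration, and the code uses Python's standard `statistics` module rather than this library.

```python
from statistics import stdev

# Hypothetical illustration data: identical except for the last value.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = [1.0, 2.0, 3.0, 4.0, 500.0]

# The sample standard deviation grows from about 1.58 to about 222
# when a single value is replaced by an outlier.
print(stdev(clean))
print(stdev(contaminated))
```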

Median Absolute Deviation (MAD)

The median absolute deviation (MAD) is defined as

\[MAD = 1.4826 \times \textrm{median} \left\{ \left| x_i - \textrm{median} \left( x \right) \right| \right\}\]

In words, first the median of all samples is computed. Then the median is subtracted from each sample to find that sample's deviation from the median. The median of the absolute values of these deviations, multiplied by the factor \(1.4826\), is the MAD. The factor \(1.4826\) makes the MAD an unbiased estimator of the standard deviation for Gaussian data. The median absolute deviation has an asymptotic efficiency of 37%.

gsl_stats_mad0(data)
gsl_stats_mad(data)

These functions return the median absolute deviation of data. The mad0 function calculates

\(\textrm{median} \left\{ \left| x_i - \textrm{median} \left( x \right) \right| \right\}\)

(i.e. the \(MAD\) statistic without the bias correction scale factor).
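The definition above can be sketched directly in a few lines. This is a plain Python illustration of the formula, not the library's implementation; the function names `mad0` and `mad` merely mirror the naming used here, and the sample data are hypothetical.

```python
from statistics import median

GAUSSIAN_SCALE = 1.4826  # scale factor quoted in the definition above


def mad0(data):
    """Median of the absolute deviations from the median (no scale factor)."""
    center = median(data)
    return median(abs(x - center) for x in data)


def mad(data):
    """MAD including the Gaussian scale factor."""
    return GAUSSIAN_SCALE * mad0(data)


# Hypothetical sample with one gross outlier.
sample = [1.0, 2.0, 3.0, 4.0, 500.0]
print(mad0(sample))  # median is 3; absolute deviations are 2, 1, 0, 1, 497
print(mad(sample))
```

For this sample the unscaled statistic is 1, so the outlier at 500 has no influence on the result.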

\(S_n\) Statistic

The \(S_n\) statistic developed by Croux and Rousseeuw is defined as

\[S_n = 1.1926 \times c_n \times \textrm{median}_i \left\{ \textrm{median}_j \left( \left| x_i - x_j \right| \right) \right\}\]

For each sample \(x_i, 1 \le i \le n\), the median of the values \(\left| x_i - x_j \right|\) is computed for all \(x_j, 1 \le j \le n\). This yields \(n\) values, whose median then gives the final \(S_n\). The factor \(1.1926\) makes \(S_n\) an unbiased estimate of the standard deviation for Gaussian data. The factor \(c_n\) is a correction factor to correct bias in small sample sizes. \(S_n\) has an asymptotic efficiency of 58%.

gsl_stats_Sn0_from_sorted_data(sorted_data)
gsl_stats_Sn_from_sorted_data(sorted_data)

These functions return the \(S_n\) statistic of sorted_data. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the function gsl_sort() should always be used first. The Sn0 function calculates

\(\textrm{median}_i \left\{ \textrm{median}_j \left( \left| x_i - x_j \right| \right) \right\}\)

(i.e. the \(S_n\) statistic without the bias correction scale factors).
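As an illustration of the nested medians, here is a naive \(O(n^2)\) Python sketch of the uncorrected statistic. It rests on two assumptions that go beyond the text above: the inner median is taken as a high median and the outer median as a low median, following the convention in Rousseeuw and Croux's original paper, and the small-sample factor \(c_n\) is omitted because its values are not given here. The names `sn0` and `sn_asymptotic` are hypothetical, and this is not the library's algorithm.

```python
from statistics import median_high, median_low

GAUSSIAN_SCALE = 1.1926  # scale factor quoted in the definition above


def sn0(data):
    """Naive S_n without scale factors: outer low median of inner high medians."""
    inner = [median_high(abs(xi - xj) for xj in data) for xi in data]
    return median_low(inner)


def sn_asymptotic(data):
    """S_n with the Gaussian scale factor only (c_n is assumed to be 1 here)."""
    return GAUSSIAN_SCALE * sn0(data)


# Hypothetical sample with one gross outlier. Sorting is not required by
# this naive version, although the library functions expect sorted input.
sample = [1.0, 2.0, 3.0, 4.0, 500.0]
print(sn0(sample))  # inner medians are 2, 1, 1, 2, 497
```

With an odd number of samples the low and high medians coincide with the ordinary median, so the convention only matters for even \(n\).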

\(Q_n\) Statistic

The \(Q_n\) statistic developed by Croux and Rousseeuw is defined as

\[Q_n = 2.21914 \times d_n \times \left\{ \left| x_i - x_j \right|, i < j \right\}_{(k)}\]

The factor \(2.21914\) makes \(Q_n\) an unbiased estimate of the standard deviation for Gaussian data. The factor \(d_n\) corrects bias in small sample sizes. The subscript \((k)\) denotes the \(k\)-th order statistic, that is, the \(k\)-th smallest of the pairwise absolute differences, where

\[k = \binom{\left\lfloor \frac{n}{2} \right\rfloor + 1}{2}\]

\(Q_n\) has an asymptotic efficiency of 82%.
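As a quick check of the order statistic index, consider two small sample sizes chosen purely for illustration. For \(n = 5\) there are \(\binom{5}{2} = 10\) pairwise differences and

\[k = \binom{\left\lfloor \frac{5}{2} \right\rfloor + 1}{2} = \binom{3}{2} = 3,\]

while for \(n = 10\) there are \(\binom{10}{2} = 45\) pairwise differences and

\[k = \binom{\left\lfloor \frac{10}{2} \right\rfloor + 1}{2} = \binom{6}{2} = 15.\]

For large \(n\), \(k\) is roughly one quarter of the number of pairs, so \(Q_n\) is approximately a scaled first quartile of the pairwise absolute differences.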

gsl_stats_Qn0_from_sorted_data(sorted_data)
gsl_stats_Qn_from_sorted_data(sorted_data)

These functions return the \(Q_n\) statistic of sorted_data. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the function gsl_sort() should always be used first. The Qn0 function calculates \(\left\{ \left| x_i - x_j \right|, i < j \right\}_{(k)}\) (i.e. the \(Q_n\) statistic without the bias correction scale factors).
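The uncorrected statistic can be sketched naively by forming all pairwise differences and picking the \(k\)-th smallest. This plain Python version takes \(O(n^2)\) time and memory and is meant only to illustrate the definition; it is not the library's algorithm, the function name `qn0` is hypothetical, and the small-sample factor \(d_n\) is left out because its values are not given here. It assumes at least two samples and Python 3.8 or later for `math.comb`.

```python
from itertools import combinations
from math import comb


def qn0(data):
    """Naive Q_n without scale factors: k-th smallest pairwise difference."""
    n = len(data)
    if n < 2:
        raise ValueError("qn0 needs at least two samples")
    h = n // 2 + 1
    k = comb(h, 2)  # order statistic index, 1-based
    diffs = sorted(abs(a - b) for a, b in combinations(data, 2))
    return diffs[k - 1]


# Hypothetical sample with one gross outlier.
sample = [1.0, 2.0, 3.0, 4.0, 500.0]
print(qn0(sample))  # k = 3; sorted differences start 1, 1, 1, 2, 2, 3, ...
```

Multiplying the result by \(2.21914\) gives the Gaussian-scaled value without the \(d_n\) correction.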