Robust Location Estimates

A location estimate is a typical or central value that best describes a given dataset. The mean and median are both examples of location estimators. However, the mean is highly sensitive to outliers and can give misleading values when even a small number of outliers are present. The median, on the other hand, is largely insensitive to outliers, but because it is not a smooth function of the data it can behave unexpectedly in certain situations. GSL offers the following alternative location estimators, which are robust to the presence of outliers.

Trimmed Mean

The trimmed mean, or truncated mean, discards a fixed number of the smallest and largest samples from the input vector before computing the mean of the remaining samples. The amount of trimming is specified by a factor \(\alpha \in [0,0.5]\). The number of samples discarded from each end of the input vector is \(\left\lfloor \alpha n \right\rfloor\), where \(n\) is the length of the input. So to discard 25% of the samples from each end, one would set \(\alpha = 0.25\).
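
As a small worked case (the values \(n = 10\) and \(\alpha = 0.25\) are chosen here purely for illustration), the number of samples removed from each end is

\[\left\lfloor \alpha n \right\rfloor = \left\lfloor 0.25 \times 10 \right\rfloor = \left\lfloor 2.5 \right\rfloor = 2\]

so the trimmed mean averages the remaining \(10 - 2 \times 2 = 6\) samples. Because of the floor, the fraction actually discarded from each end (here 20%) can be smaller than \(\alpha\) whenever \(\alpha n\) is not an integer.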

gsl_stats_trmean_from_sorted_data(sorted_data, alpha)

This function returns the trimmed mean of sorted_data. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the function gsl_sort() should always be used first. The trimming factor \(\alpha\) is given in alpha. If \(\alpha \ge 0.5\), then the median of the input is returned.

Gastwirth Estimator

Gastwirth’s location estimator is a weighted sum of three order statistics,

\[\mathrm{gastwirth} = 0.3 \times Q_{\frac{1}{3}} + 0.4 \times Q_{\frac{1}{2}} + 0.3 \times Q_{\frac{2}{3}}\]

where \(Q_{\frac{1}{3}}\) is the one-third quantile, \(Q_{\frac{1}{2}}\) is the one-half quantile (i.e. median), and \(Q_{\frac{2}{3}}\) is the two-thirds quantile.

gsl_stats_gastwirth_from_sorted_data(sorted_data)

This function returns the Gastwirth location estimator of sorted_data. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the function gsl_sort() should always be used first.