Robust Location Estimates
A location estimate refers to a typical or central value which best describes a given
dataset. The mean and median are both examples of location estimators. However, the
mean is highly sensitive to outliers and can give erroneous values when
even a small number of outliers are present. The median, on the other hand, is
largely insensitive to outliers, but because it is not a smooth function of
the data it can behave unexpectedly in certain situations. GSL offers the following alternative
location estimators, which are robust to the presence of outliers.
Trimmed Mean
The trimmed mean, or truncated mean, discards a certain number of smallest and largest
samples from the input vector before computing the mean of the remaining samples. The
amount of trimming is specified by a factor \(\alpha \in [0,0.5]\). The
number of samples discarded from each end of the input vector is
\(\left\lfloor \alpha n \right\rfloor\), where \(n\) is the length of the input.
So to discard 25% of the samples from each end, one would set \(\alpha = 0.25\).
-
gsl_stats_trmean_from_sorted_data(sorted_data, alpha)
This function returns the trimmed mean of sorted_data.
The elements of the array must be in ascending numerical order.
There are no checks to see whether the data are sorted, so the function
gsl_sort() should always be used first.
The trimming factor \(\alpha\) is given in alpha.
If \(\alpha \ge 0.5\), then the median of the input is returned.
Gastwirth Estimator
Gastwirth’s location estimator is a weighted sum of three order statistics,
\[gastwirth = 0.3 \times Q_{\frac{1}{3}} + 0.4 \times Q_{\frac{1}{2}} + 0.3 \times Q_{\frac{2}{3}}\]
where \(Q_{\frac{1}{3}}\) is the one-third quantile, \(Q_{\frac{1}{2}}\) is the one-half
quantile (i.e. median), and \(Q_{\frac{2}{3}}\) is the two-thirds quantile.
-
gsl_stats_gastwirth_from_sorted_data(sorted_data)
This function returns the Gastwirth location estimator of sorted_data.
The elements of the array must be in ascending numerical order.
There are no checks to see whether the data are sorted, so the function
gsl_sort() should always be used first.