rcr

Github Repository Documentation Status Build Status arXiv Paper Zenodo (DOI)

RCR, or Robust Chauvenet Rejection, is advanced, and easy to use, outlier rejection. Originally published in Maples et al. 2018, this site will show you how to use RCR in Python. RCR can be applied to weighted data and used for model-fitting, and we have incorporated rejecting outliers in bulk to have the best of both computational efficiency and accuracy (see the Installation Guide).

RCR has been carefully calibrated, and extensively simulated. It can be applied to samples with both large contaminants and large contaminant fractions (sometimes in excess of 90% contaminated). Finally, because RCR runs on a C++ backend, it is quite fast.

Introduction

The simplest form of outlier rejection is sigma clipping, where measurements that are more than a specified number of standard deviations from the mean are rejected from the sample. This number of standard deviations should not be chosen arbitrarily, but is a function of your sample’s size. A simple prescription for this was introduced by William Chauvenet in 1863. Sigma clipping plus this prescription, applied iteratively, is what we call traditional Chauvenet rejection.

However, both sigma clipping and traditional Chauvenet rejection make use of non-robust quantities: the mean and the standard deviation are both sensitive to the very outliers that they are being used to reject. This limits such techniques to samples with small contaminants or small contamination fractions.

Robust Chauvenet Rejection (RCR) instead first makes use of robust replacements for the mean, such as the median and the half-sample mode, and similar robust replacements that we have developed for the standard deviation.

Basic Usage Example

Here’s a quick example of RCR in action: we have a dataset of \(N=1000\) measurements, 85% of which are contaminants. The contaminants are sampled from one side of a Gaussian/normal distribution with standard deviation \(\sigma=10\), while the uncontaminated points are from a regular, symmetric Gaussian with \(\sigma=1\). Both distributions are centered at \(\mu=0\).

The question is, how can we recover the \(\mu\) and \(\sigma\) of the underlying distribution, in the face of such heavy contamination? The example below shows how to do it with RCR.

import numpy as np
import rcr

np.random.seed(18318) # get consistent random results

N = 1000 # total measurement count
frac_contaminated = 0.85 # fraction of sample that will be contaminated

# symmetric, uncontaminated distribution
mu = 0 
sigma_uncontaminated = 1
uncontaminated_samples = np.random.normal(mu, sigma_uncontaminated, 
    int(N * (1 - frac_contaminated)))

# one-sided contaminants
sigma_contaminated = 10
contaminated_samples = np.abs(np.random.normal(mu, sigma_contaminated, 
    int(N * frac_contaminated)))

# create whole dataset
data = np.concatenate((uncontaminated_samples, contaminated_samples))
np.random.shuffle(data)

# perform RCR
# initialize RCR with rejection technique:
# (chosen from shape of uncontaminated + contaminated distribution)
r = rcr.RCR(rcr.LS_MODE_68)
r.performBulkRejection(data) # perform outlier rejection

# View results
cleaned_data = r.result.cleanY
cleaned_mu = r.result.mu
cleaned_sigma = r.result.stDev


# plot data
import matplotlib.pyplot as plt

ydata = np.random.uniform(0, 1, N) # project randomly into 2D for better visualization
plt.figure(figsize=(8,5))
ax = plt.subplot(111)
ax.plot(data, ydata, "k.", label="Data pre-RCR", alpha=0.75, ms=4)
ax.plot(cleaned_data, ydata[r.result.indices], "bo", 
    label="Data post-RCR", alpha=0.4, ms=4)

# plot results
cont_mean = np.mean(data)
cont_sigma = np.std(data)

ax.axvspan(mu - sigma_uncontaminated, mu + sigma_uncontaminated, color='g', 
    alpha=0.25, label="1-$\sigma$ region of true\nuncontaminated distribution")
ax.axvline(x=cont_mean, c='r', lw=3, ls="-", alpha=0.75, 
    label="Pre-RCR sample mean of data")
ax.axvspan(cont_mean - cont_sigma, cont_mean + cont_sigma,  color='r', 
    fill=False, alpha=0.75, hatch="/", label="1-$\sigma$ region of data, pre-RCR")

ax.axvline(x=cleaned_mu, c='b', lw=3, ls="-", alpha=0.75, 
    label="RCR-recovered $\mu$ of\nuncontaminated distribution")
ax.axvspan(cleaned_mu - cleaned_sigma, cleaned_mu + cleaned_sigma, color='b', 
    fill=False, alpha=0.75, hatch= '\\', 
    label="1-$\sigma$ region of uncontaminated\ndistribution, after RCR")

plt.xlim(-5, 30)
plt.ylim(0, 1)
plt.xlabel("data")
plt.title("Results of RCR being used on an" + 
    " {}% contaminated dataset".format(frac_contaminated*100))
plt.yticks([])

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.65, box.height])

ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

Output:

_images/intro.svg

For a more in-depth explanation of using RCR for this type of one-dimensional outlier rejection, see Rejecting 1D Outliers.

License and Attribution

The Paper

The original paper for RCR can be found here. If you use RCR, please cite it as:

@article{maples2018robust,
    title={Robust Chauvenet Outlier Rejection},
    author={{Maples}, M.P. and {Reichart}, D.E. and {Konz}, N.C. and {Berger}, T.A. and {Trotter}, A.S. and {Martin}, J.R. and {Dutton}, D.A. and {Paggen}, M.L. and {Joyner}, R.E. and {Salemi}, C.P.},
    journal={The Astrophysical Journal Supplement Series},
    volume={238},
    number={1},
    pages={2},
    year={2018},
    publisher={IOP Publishing}
}

and use it according to the license (the essential point here is just to contact us if you want to use RCR for commercial purposes).