Rejecting Outliers While Model Fitting

Introduction

In its simplest form, RCR is an excellent tool for detecting and rejecting outliers within heavily contaminated one-dimensional datasets, as shown in Rejecting 1D Outliers. However, this only scratches the surface of RCR. In its more generalized form, RCR can also be used to reject outliers within some \(n\)-dimensional dataset while simultaneously fitting a model to that dataset. This section will explain how this can be done fairly easily in practice, while avoiding unnecessary technicalities. We recommend reading Rejecting 1D Outliers before tackling this section, as the following is essentially a generalization of that material.

For the case of one-dimensional data (see Rejecting 1D Outliers), RCR can be thought of as being used to reject outliers from some dataset \(\left\{y_i\right\}_{i=1}^N\), distributed about a single, parameterized “true” value \(y\). In this case, we often wish to get a best estimate of \(y\), in order to properly characterize the underlying measurement distribution; \(y\) is just a one-dimensional model that we want to fit to the data, characterized by location and scale parameters like \(\mu\) and \(\sigma\).

If we generalize the dimensionality of this, we can imagine using RCR on measurements distributed about some \(n\)-dimensional model function \(y(\vec{x}|\vec{\theta})\), where \(\vec{x}\) is an \(n\)-dimensional vector of the model’s independent variable(s), and \(\vec{\theta}\) is an \(M\)-dimensional vector of the model’s parameters. In this case, we say that \(y(\vec{x}|\vec{\theta})\) is an \(n\)-dimensional, \(M\)-parameter model. For a more concrete example of this, consider a simple linear model \(y = b + mx\). In this case, \(n = 1\), and our parameter vector is just \(\vec{\theta} = (b, m)\).

For this more-general case, what does our dataset look like? Each datapoint will be some value of \(y(\vec{x}|\vec{\theta})\) associated with a value for \(\vec{x}\). As such, if we have \(N\) datapoints in total, indexing each by \(i\), our dataset can be written compactly as \(\left\{\left(\vec{x}_i, y_i\right)\right\}_{i=1}^N\) (be sure to avoid getting \(N\) confused with \(n\) here; the former is the number of datapoints that we’re fitting the model to, while the latter is the dimensionality of the dataset/model).

Last but not least, before we get into the code, it’s important to point out that in order for RCR to fit any arbitrary model function to a dataset, (partial) derivatives of the model function with respect to each model parameter must be supplied (due to the specific algorithm that is used for fitting). For example, consider a one-dimensional exponential model of the form \(y(\vec{x}|\vec{\theta})=be^{mx}\). If we choose to order the model parameters as \(\vec{\theta} = (b, m)\), then our model parameter derivatives are

\[\frac{\partial y(\vec{x}|\vec{\theta})}{\partial b} = e^{mx} \qquad \mathrm{ and } \qquad \frac{\partial y(\vec{x}|\vec{\theta})}{\partial m} = xbe^{mx} .\]
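For instance, this exponential model and its derivatives could be written in Python as follows (a sketch, using the x, params argument convention that rcr requires, described in the note in the next section):

import numpy as np

def exponential(x, params): # model function: y = b * exp(m * x)
    return params[0] * np.exp(params[1] * x)

def d_exponential_b(x, params): # partial derivative with respect to b
    return np.exp(params[1] * x)

def d_exponential_m(x, params): # partial derivative with respect to m
    return x * params[0] * np.exp(params[1] * x)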

Implementation

Finally, we have everything that we need to use RCR for outlier rejection during model fitting. Although rcr supports any arbitrary \(n\)-dimensional nonlinear model function (as long as the model parameter derivatives are well-defined), for simplicity let’s consider a simple linear model \(y(\vec{x}|\vec{\theta})=b + mx\). The parameter partial derivatives are then simply

\[\frac{\partial y(\vec{x}|\vec{\theta})}{\partial b} = 1 \qquad \mathrm{ and } \qquad \frac{\partial y(\vec{x}|\vec{\theta})}{\partial m} = x .\]

Before we start coding, it’s important to consider the following:

Note

Within rcr, model functions and their derivatives must be defined with exactly the arguments 1) x and 2) params, where x is the \(n\)-dimensional list or numpy array (or float, in the case of \(n=1\)) of independent variable(s), and params is the \(M\)-dimensional list/array of model parameters. Make sure to maintain a consistent ordering of the model parameter vector throughout your code.
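For instance, a hypothetical two-dimensional (\(n=2\)), three-parameter model \(y(\vec{x}|\vec{\theta}) = \theta_1 + \theta_2 x_1 + \theta_3 x_2\) would follow this convention as:

def plane(x, params): # x = (x1, x2); params = (theta1, theta2, theta3)
    return params[0] + params[1] * x[0] + params[2] * x[1]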

Now, onto the code; let’s start by defining our model function and its parameter derivatives:

def linear(x, params): # model function
    return params[0] + x * params[1]

def d_linear_1(x, params): # first model parameter derivative
    return 1

def d_linear_2(x, params): # second model parameter derivative
    return x

Next, let’s start creating our dataset. We’ll have \(N=200\) points total, with 85% of the datapoints being outliers. Our “true” model that the datapoints will be generated about will have parameters of \(b=0\) and \(m=1\). In code, this is simply:

import numpy as np

N = 200 # number of datapoints
f = 0.85 # fraction of datapoints that are outliers

params_true = [0, 1] # parameters of "true" model

We’ll generate our datapoints within a certain range of \(x\) values about the “true” model line. For this example, we’ll make the uncontaminated datapoints Gaussian/normally distributed about the true model, with standard deviation \(\sigma=1\). In order to highlight the power of RCR in dealing with especially difficult outliers, we’ll generate one-sided outliers/contaminants, sampled from the positive side of a Gaussian with \(\sigma=10\). In code, this will take the form of:

sigma_uncontaminated = 1 # standard deviations used to generate datapoints
sigma_contaminated = 10

# generate x-datapoints randomly in an interval
x_range = (-10, 10)
xdata_uncontaminated = np.random.uniform(
    x_range[0], x_range[1], int(N * (1 - f)))
xdata_contaminated = np.random.uniform(
    x_range[0], x_range[1], int(N * f))


# generate y-datapoints about the true model:
# symmetric uncontaminated distribution
ydata_uncontaminated = np.random.normal(
    loc=linear(xdata_uncontaminated, params_true),
    scale=sigma_uncontaminated
    )

# one-sided contaminated distribution
ydata_contaminated = linear(xdata_contaminated, params_true) + np.abs(
    np.random.normal(0, sigma_contaminated, int(N * f)))


# combine dataset
xdata = np.concatenate((xdata_contaminated, xdata_uncontaminated))
ydata = np.concatenate((ydata_contaminated, ydata_uncontaminated))

Let’s plot the dataset over the true, underlying model:

# plot dataset
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
ax = plt.subplot(111)

ax.plot(xdata_contaminated, ydata_contaminated, "k.",
    label="Pre-RCR dataset", alpha=0.75, ms=4)
ax.plot(xdata_uncontaminated, ydata_uncontaminated, "k.",
    alpha=0.75, ms=4)


# plot model
x_model = np.linspace(x_range[0], x_range[1], 1000)
ax.plot(x_model, linear(x_model, params_true),
    "b--", label="True model", alpha=0.5, lw=2)

plt.xlim(-10, 10)
plt.ylim(-15, 25)
plt.xlabel("$x$")
plt.ylabel("$y$")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.65, box.height])
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()

Output:

[Image: preRCR.svg — the contaminated dataset plotted over the true model line]

Clearly, these outliers are pretty nasty. This looks like a job for RCR. First, we need to supply an initial guess for the model parameters, to give the fitting engine within RCR a starting place. Approaching this dataset with no knowledge of what is or isn’t an outlier, it would be hard to tell what the true best fit should be; as such, we’ll use an initial guess that naively seems plausible for the data, but is pretty far off of the true values of \(b=0\) and \(m=1\): let’s try \(b=5\) and \(m=1.5\):

guess = [5, 1.5]

Next, we’ll need to initialize the model, as an instance of the rcr.FunctionalForm class. The required arguments (in order) to construct an instance of this class are 1) the model function, 2) the (\(n\)-dimensional) \(x\)-data, 3) the \(y\)-data, 4) a list of the model parameter derivative functions (in order), and 5) the guess for the parameters. This is implemented as:

import rcr

model = rcr.FunctionalForm(linear,
    xdata,
    ydata,
    [d_linear_1, d_linear_2],
    guess
)

Now, we’re finally ready to run RCR on the dataset/model. Our uncontaminated distribution of data is symmetric, while our contaminated distribution is one-sided/completely asymmetric. Therefore, following the Table of Rejection Techniques, the rejection technique that will perform best on this dataset is LS_MODE_68. Given this, we’ll perform RCR as usual, except now, we need to tell our instance of the RCR class that we’re fitting to our specific parametric model:

r = rcr.RCR(rcr.LS_MODE_68) # setting up for RCR with this rejection technique

r.setParametricModel(model)
# tell RCR that we are model fitting, and give it the model of choice

r.performBulkRejection(ydata) # perform RCR

That was only a few lines of code, but what actually happened here? Essentially (see The Paper for more details), RCR iteratively rejects outliers and fits the model to the data at the same time. As such, we can access the same outlier-rejection results from r.result as in Rejecting 1D Outliers, while also having model-fitting results from our model, via the member model.result:

best_fit_parameters = model.result.parameters # best fit parameters

rejected_data = r.result.rejectedY # rejected and non-rejected data
nonrejected_data = r.result.cleanY
nonrejected_indices = r.result.indices
# indices from original dataset of nonrejected data

print(best_fit_parameters)

Output:

[1.2367288755077883, 1.004037971689524]

Before we discuss this result, it’s instructive to compare it to the traditional method of ordinary least-squares fitting; we’ll summarize this in a plot, as follows:

# plot results

plt.figure(figsize=(8, 5))
ax = plt.subplot(111)

ax.plot(xdata_contaminated, ydata_contaminated, "k.",
    label="Pre-RCR dataset", alpha=0.75, ms=4)
ax.plot(xdata_uncontaminated, ydata_uncontaminated, "k.",
    alpha=0.75, ms=4)

ax.plot(xdata[nonrejected_indices], ydata[nonrejected_indices], "bo",
    label="Post-RCR dataset", alpha=0.4, ms=4)

# plot true model
ax.plot(x_model, linear(x_model, params_true),
    "b--", label="True model", alpha=0.5, lw=2)

# plot regular linear least squares best fit
from scipy.stats import linregress

slope_lsq, intercept_lsq, _, _, _ = linregress(xdata, ydata)

ax.plot(x_model, linear(x_model, [intercept_lsq, slope_lsq]),
    "r-", label="Least-squares best fit", alpha=0.5, lw=2)

# plot RCR-fitted model
ax.plot(x_model, linear(x_model, best_fit_parameters),
    "g-", label="RCR best fit", alpha=0.5, lw=2)


plt.xlim(-10, 10)
plt.ylim(-15, 25)
plt.xlabel("$x$")
plt.ylabel("$y$")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.65, box.height])
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

print("Least-squares fit results:", intercept_lsq, slope_lsq)

plt.show()

Output:

Least-squares fit results: 7.202089027278407 1.0412871059773106
[Image: postRCR.svg — dataset with post-RCR points, true model, least-squares fit, and RCR fit]

RCR gave us best fit values of \(b=1.237\) and \(m=1.004\), while traditional linear least squares gave \(b=7.202\) and \(m=1.041\). The slope (true value of \(m=1\)) was recovered very well in both cases, but this isn’t too surprising, given that both the contaminated and uncontaminated measurement distributions were generated without any scatter along the \(x\)-axis. However, due to the heavy scatter/contamination along the \(y\)-axis, the least-squares result for the intercept \(b\) is, as expected, heavily biased by the outliers, very far off of the true value of \(b=0\). RCR, on the other hand, was able to successfully reject many of the outliers, while maintaining almost all of the uncontaminated distribution (shown as blue circles), giving a best fit \(b=1.237\) that is significantly closer to the true value of \(b=0\) than the least-squares result.

Overall, the RCR best fit (green line) is clearly much closer to the true model (blue dashed line) than the least-squares best fit (red line).

Data with Uncertainties and/or Weights

Realistically, many datasets will have measurements that have uncertainties, or error bars, as practically all physical measurements cannot truly be made with exact precision. In most model-fitting scenarios, only uncertainties in the dependent variable (\(y\)) are considered, with any uncertainties in the independent variable(s) \(\vec{x}\) considered to be negligible (for a more generalized treatment, that includes such \(\vec{x}\)-uncertainties, as well as uncertainty in the dataset that cannot solely be attributed to the data error bars, see e.g. Konz 2020). In this case, which we take for RCR, our dataset becomes \(\left\{\left(\vec{x}_i, y_i \pm \sigma_{y,i}\right)\right\}_{i=1}^N\), i.e. our measurement error bars/uncertainties are \(\left\{\sigma_{y,i}\right\}_{i=1}^N\).

Just as in one-dimensional RCR, weights \(w_i\) can also be applied to model-fitting datasets (e.g. Weighting Data). We note that the inclusion of error bars as described in the previous paragraph is not mutually exclusive with such weighting; both weights and error bars can be used in practice.

To use a dataset with error bars and/or weights with model-fitting RCR, simply use the optional arguments error_y and weights of the rcr.FunctionalForm() constructor, where the former is an ordered vector/list of measurement uncertainties \(\left\{\sigma_{y,i}\right\}_{i=1}^N\), and the latter is an ordered vector/list of measurement weights \(\left\{ w_i\right\}_{i=1}^N\). A full worked example using error_y is given in the following section; a quick sketch of supplying weights is shown below.
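For instance (a minimal sketch, reusing the linear dataset and model definitions from above, with hypothetical weights):

weights = np.ones(N)  # hypothetical: equal weights for all datapoints...
weights[:10] = 2.0    # ...except up-weighting, say, the first ten measurements

model = rcr.FunctionalForm(linear,
    xdata,
    ydata,
    [d_linear_1, d_linear_2],
    guess,
    weights=weights
)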

Model Parameter Uncertainties/Error Bars

In many cases, we want not just best fit parameters for a model and dataset, but also uncertainties, or “error bars”, for these parameters. These are easily available in rcr, again via the model.result object, as model.result.parameter_uncertainties. However, before we go into a worked code example, note the following:

Note

In rcr, best fit model parameter uncertainties can only be calculated if error bars/uncertainties and/or weights were given for the dataset before fitting.

Now, let’s try adding error bars to our linear dataset, same as above. First, we’ll initialize the error bars randomly, giving higher error, on average, to the contaminants:

error_y_uncontaminated = np.random.uniform(low=0.1, high=1, size=int(N * (1 - f)))
error_y_contaminated = np.random.uniform(low=1, high=2, size=int(N * f))

error_y = np.concatenate((error_y_contaminated, error_y_uncontaminated))

Next, let’s initialize the model as before, except now using the optional keyword argument error_y to input the error bars. We can then perform RCR as usual.

# instantiate model
model = rcr.FunctionalForm(linear,
    xdata,
    ydata,
    [d_linear_1, d_linear_2],
    guess,
    error_y=error_y
)

# initialize and perform RCR as usual
r = rcr.RCR(rcr.LS_MODE_68) # setting up for RCR with this rejection technique
r.setParametricModel(model) # tell RCR that we are model fitting
r.performBulkRejection(ydata) # perform RCR

Let’s check out the results:

# view results
best_fit_parameters = model.result.parameters # best fit parameters
best_fit_parameter_errors = model.result.parameter_uncertainties # and their uncertainties

rejected_data = r.result.rejectedY # rejected and non-rejected data
nonrejected_data = r.result.cleanY
nonrejected_indices = r.result.indices

print(best_fit_parameters)
print(best_fit_parameter_errors)

Output:

[6.612942587028933, 0.9732622673909074]
[1.6299290812536242, 0.3258511725157285]

So, our RCR-recovered best fit is \(b = 6.61 \pm 1.63\) and \(m = 0.973 \pm 0.326\). Unfortunately, this fit isn’t nearly as good as when we didn’t have measurement uncertainties. But why? To see, let’s plot the dataset alongside the fit:

# plot results

plt.figure(figsize=(8, 5))
ax = plt.subplot(111)

ax.errorbar(xdata_contaminated, ydata_contaminated, yerr=error_y_contaminated,
    fmt="k.", label="Pre-RCR dataset", alpha=0.75, ms=4)
ax.errorbar(xdata_uncontaminated, ydata_uncontaminated, yerr=error_y_uncontaminated,
    fmt="k.", alpha=0.75, ms=4)

ax.plot(xdata[nonrejected_indices], ydata[nonrejected_indices], "bo",
    label="Post-RCR dataset", alpha=0.4, ms=4)

# plot true model
ax.plot(x_model, linear(x_model, params_true),
    "b--", label="True model", alpha=0.5, lw=2)

# plot RCR-fitted model
ax.plot(x_model, linear(x_model, best_fit_parameters),
    "g-", label="RCR best fit", alpha=0.5, lw=2)


plt.xlim(-10, 10)
plt.ylim(-15, 25)
plt.xlabel("$x$")
plt.ylabel("$y$")

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.65, box.height])
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()

Output:

[Image: postRCR_erry.svg — dataset with error bars, post-RCR points, true model, and RCR fit]

Adding error bars, or intrinsic uncertainties, to the measurements in the dataset introduced even more overall uncertainty to the data, beyond just the extrinsic uncertainty (the scatter/sample variance of the datapoints themselves). That, combined with the extremely high contaminant fraction of 85%, meant that RCR was unable to tell the contaminants apart from the non-outlier datapoints, under-rejecting the outliers, as shown in the plot. As such, the final dataset that the model was fit to included too many outliers, biasing the fitted line toward too high an intercept. RCR would have worked better if 1) the error bars were smaller and/or 2) the fraction of contaminants was lower.
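If you’d like to verify this, here’s a sketch of the second scenario: regenerate the dataset exactly as above, but with a lower contaminant fraction, and re-run RCR (exact numbers will vary from run to run):

f = 0.5 # a lower contaminant fraction than before

n_unc, n_con = int(N * (1 - f)), int(N * f)
xdata_unc = np.random.uniform(x_range[0], x_range[1], n_unc)
xdata_con = np.random.uniform(x_range[0], x_range[1], n_con)
ydata_unc = np.random.normal(linear(xdata_unc, params_true), sigma_uncontaminated)
ydata_con = linear(xdata_con, params_true) + np.abs(
    np.random.normal(0, sigma_contaminated, n_con))

xdata2 = np.concatenate((xdata_con, xdata_unc))
ydata2 = np.concatenate((ydata_con, ydata_unc))
error_y2 = np.concatenate((
    np.random.uniform(1, 2, n_con),
    np.random.uniform(0.1, 1, n_unc)))

model2 = rcr.FunctionalForm(linear, xdata2, ydata2,
    [d_linear_1, d_linear_2], guess, error_y=error_y2)
r2 = rcr.RCR(rcr.LS_MODE_68)
r2.setParametricModel(model2)
r2.performBulkRejection(ydata2)

print(model2.result.parameters) # should now land much closer to the true (0, 1)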

Applying Prior Knowledge to Model Parameters (Advanced)

Let’s say that we want to fit some model to a dataset, and we know certain prior information about one of the model’s parameters, \(a\), in advance. From the point of view of Bayesian inference, this can be formalized by specifying the prior probability distribution, or prior probability density function (PDF), of that parameter, \(p(a)\). For example, let’s say that for the linear dataset/model above, we know a priori that the intercept \(b\) should be \(b=0\), with uncertainty of \(1\), i.e. \(b=0 \pm 1\). This translates to a prior probability distribution that is Gaussian with mean \(\mu=0\) and standard deviation \(\sigma=1\), i.e.

\[p(b) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}b^2}.\]

However, let’s say that we don’t know anything in advance about the slope \(m\). In this case, we say that the prior on \(m\) is uninformative, i.e. all values are equally likely (again, this is before we even consider any data), which manifests mathematically as

\[p(m) \propto 1.\]
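Written as code, just to make the math concrete, these two prior densities are simply (a sketch; this isn’t yet the format rcr expects, for which see Custom Priors below):

def prior_b(b): # Gaussian prior on the intercept: mu = 0, sigma = 1
    return np.exp(-0.5 * b**2) / np.sqrt(2 * np.pi)

def prior_m(m): # uninformative prior on the slope: constant density
    return 1.0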

In rcr, prior probability distributions can be specified for any or all of the parameters of a model, which will affect the rejection of outliers (essentially by modifying the rejection probability of certain measurements according to the prior probabilities of all of the model parameter solutions that these measurements can contribute to). For simplicity and ease-of-use, we’ve included two types of common priors within the library, as well as allowing for any sort of custom prior PDF. These options are described in the table below.

Types of Model Parameter Priors in RCR

Prior Type           Parameters Needed To Specify
------------------   --------------------------------------------------------------------
GAUSSIAN_PRIORS      Means and standard deviations of some or all model parameters
CONSTRAINED_PRIORS   Lower and/or upper bounds on some or all model parameters
MIXED_PRIORS         Combination of some or all of the two above
CUSTOM_PRIORS        For some or all model parameters \(a_j\), a custom prior PDF \(p(a_j)\)

Now, how can we use these different types of priors in practice?

Gaussian Priors

Let’s say that you want to apply Gaussian/normal prior probability distributions on some (or all) of your model parameters. To do so, you’ll first need to create a list, where each element of the list corresponds to a model parameter, and is itself a list of 1) the mean of the Gaussian for that parameter’s prior and 2) the standard deviation of the same. If no Gaussian prior is desired for a certain parameter, just give NaNs for those fields.

This is pretty dense, so we’ll show a specific instance of this usage. Following the example within the introduction to this section (Applying Prior Knowledge to Model Parameters (Advanced)), let’s use the same linear model as before, and apply a Gaussian prior to the intercept \(b\), with mean \(\mu=0\) and standard deviation \(\sigma=1\). We’ll use no prior (uninformative) on the slope \(m\). From here, our list of parameters (not model parameters) that describe the Gaussian priors will be:

gaussianParams = [
        [0,             1], # mu = 0, sigma = 1
        [float('nan'),  float('nan')]
        # no prior on the slope parameter, so just use NaNs
    ]

Now, to introduce these priors before performing any fitting/RCR, we’ll need to create an instance of the Priors class from rcr, making sure to specify which type of prior we’re implementing using the correct object from the above table (in this case GAUSSIAN_PRIORS). Here it is in code:

mypriors = rcr.Priors(rcr.GAUSSIAN_PRIORS, gaussianParams)

From here, RCR can be performed as usual, by 1) supplying the optional argument has_priors=True to the FunctionalForm constructor when initializing the model, and then 2) initializing the priors attribute of your model with your Priors object, e.g.:

model = rcr.FunctionalForm(linear,
    xdata,
    ydata,
    [d_linear_1, d_linear_2],
    guess,
    has_priors=True
)

model.priors = mypriors

From here, RCR can be run with this model via the usual methods.

Constrained/Bounded Priors

Another very common type of prior is to place hard constraints/bounds on certain model parameters. Following the same linear example, let’s say that we know that the slope \(m\) of our model should be nonnegative (this type of prior is often physically motivated), but we don’t know anything about the intercept \(b\).

Similar to the usage of Gaussian priors, to implement this we’ll need to create a list where each element corresponds to a model parameter, and is itself a list of 1) the lower bound and 2) the upper bound that we want to give the corresponding parameter. If we only want to supply one (or neither) of the bounds, just use a NaN in its place. Following our chosen example, this list can be coded as

paramBounds = [
        [float('nan'),  float('nan')], # no bounds on the intercept b
        [0,             float('nan')]  # constrain m > 0
    ]

Next, we need to instantiate an rcr.Priors object, in a similar manner to the case of Gaussian priors (except now being sure to specify CONSTRAINED_PRIORS):

mypriors = rcr.Priors(rcr.CONSTRAINED_PRIORS, paramBounds)

Finally, we’ll need to initialize our model with the priors as in the end of the previous section (again with has_priors=True), and then we’re good to go.
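In code, that last step looks just like before (a sketch, reusing the dataset and model definitions from above):

model = rcr.FunctionalForm(linear, xdata, ydata,
    [d_linear_1, d_linear_2], guess, has_priors=True)
model.priors = mypriors # attach the constrained priors before running RCR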

Both Gaussian and/or Constrained (Mixed) Priors

What if we want to apply Gaussian priors to some model parameters, constrained priors to others, or even a mix of both for certain parameters (e.g. force a parameter to be positive, while also making it Gaussian-distributed)? To do this, simply create the lists that specify these priors (paramBounds and gaussianParams, following the previous examples) and supply them both to the constructor for your Priors object, making sure to specify the priors type as MIXED_PRIORS:

mypriors = rcr.Priors(rcr.MIXED_PRIORS, gaussianParams, paramBounds)

From here, RCR can be used as normal, after initializing our model (with has_priors=True) and supplying the model with the Priors object.

Custom Priors

In the most general case, RCR can work with any type of prior probability distributions/density functions. To implement this, you’ll need a function \(\vec{p}(\vec{\theta})\) that takes in a vector of model parameters \(\vec{\theta}\), and returns a vector of each parameter’s prior probability density function evaluated at the corresponding parameter’s value.

As an example, let’s say that for our linear model, we’d like to 1) place an (unusual) prior on \(b\):

\[p(b) = e^{-|b|}\left|\cos^2b\right|,\]

and 2) constrain \(m\) to be within the interval \((0, 2]\). We can then implement \(\vec{p}(\vec{\theta})\) as:

def prior_pdfs(model_parameters):
    pdfs = np.zeros(2) # vector of model parameter density function values

    b = model_parameters[0]
    pdfs[0] = np.exp(-np.abs(b)) * np.abs(np.cos(b)**2.)

    m = model_parameters[1]
    pdfs[1] = 1 if 0 < m <= 2 else 0
    # p(m) = 0 if m is outside the bounds of (0, 2]

    return pdfs

After such a \(\vec{p}(\vec{\theta})\) is defined, we’ll need to use it to instantiate an rcr.Priors object as usual, this time declaring our type of priors as CUSTOM_PRIORS:

mypriors = rcr.Priors(rcr.CUSTOM_PRIORS, prior_pdfs)

After creating our model (with has_priors=True) and supplying it with our Priors object mypriors, RCR can then be used as usual.
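Putting the pieces together, here’s a sketch of the full custom-priors workflow (again reusing the linear dataset and model definitions from above):

mypriors = rcr.Priors(rcr.CUSTOM_PRIORS, prior_pdfs)

model = rcr.FunctionalForm(linear,
    xdata,
    ydata,
    [d_linear_1, d_linear_2],
    guess,
    has_priors=True
)
model.priors = mypriors

r = rcr.RCR(rcr.LS_MODE_68)
r.setParametricModel(model)
r.performBulkRejection(ydata)

print(model.result.parameters) # best fit parameters, now influenced by the priors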