How do I check for outliers in a simple regression with one predictor variable?
A simple way to check for outliers is to evaluate either standardized or studentized residuals and see how many have large absolute values, e.g. greater than 2. The key reason for studentizing is that the variance of a residual differs from one predictor value to another.
This can be done as follows:
- Standardize both the response variable and the predictor variable by subtracting their means and dividing by their standard deviations; call these $$y_\text{s}$$ and $$x_\text{s}$$.
- Evaluate a Pearson or Spearman correlation, R, between $$y_\text{s}$$ and $$x_\text{s}$$. (For standardized variables the Pearson correlation is the least-squares slope.)
- Obtain the i-th raw residual as $$y_{\text{s}i} - R x_{\text{s}i}$$.
- To obtain the standardized residual, divide the raw residual by the standard deviation of the residuals. The mean raw residual should be zero.
- The studentized residual may also be used to identify potential outliers. This divides the raw residual by its standard error, SE_RES.
SE_RES equals $$s \sqrt{1 - h_{ii}}$$ where s equals $$\sqrt{\sum_{i}(y_{\text{s}i} - R x_{\text{s}i})^2/(N-2)}$$ for N observations and $$h_{ii}$$ equals $$\frac{1}{N} + \frac{x_{\text{s}i}^2}{\sum_{j} x_{\text{s}j}^2}$$.
Studentized residuals may be evaluated using this [attachment:student.xls spreadsheet].
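The steps above can also be sketched in Python with NumPy. This is only an illustrative sketch, not the contents of the spreadsheet: the function name `residual_diagnostics` and the example data are made up, the Pearson correlation is used for R, and sample standard deviations (N-1 in the denominator) are assumed throughout.

```python
import numpy as np

def residual_diagnostics(x, y):
    """Return (standardized, studentized) residuals for a one-predictor regression.

    Hypothetical helper; assumes Pearson R and sample SDs (ddof=1).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Standardize predictor and response.
    xs = (x - x.mean()) / x.std(ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)

    # Pearson correlation; for standardized data this is the least-squares slope.
    r = np.corrcoef(xs, ys)[0, 1]

    # Raw residuals (their mean is zero up to rounding error).
    raw = ys - r * xs

    # Standardized residuals: divide by the SD of the residuals.
    standardized = raw / raw.std(ddof=1)

    # Studentized residuals: divide by SE_RES = s * sqrt(1 - h_ii).
    s = np.sqrt(np.sum(raw**2) / (n - 2))
    h = 1.0 / n + xs**2 / np.sum(xs**2)
    studentized = raw / (s * np.sqrt(1.0 - h))
    return standardized, studentized

# Made-up data in which the last response value is unusually large.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 25.0])
standardized, studentized = residual_diagnostics(x, y)

# Indices of observations whose studentized residual exceeds 2 in absolute value.
print(np.flatnonzero(np.abs(studentized) > 2))
```

Because the studentized residual is unchanged by rescaling the variables, the same values should result from fitting the regression on the unstandardized data.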
Outliers without adjusting for other variables
In this case, where we are interested in outliers of a variable unadjusted for any others, the studentized residual is approximately equal to the standardized residual (i.e. a z-score) for large N.
In this case $$h_{ii}$$ equals 1/N and s is the standard deviation since the predicted value for Y is simply its mean.
So SE_RES equals $$s \sqrt{1 - h_{ii}}$$ = $$\mbox{SD} \sqrt{1 - 1/N}$$ = $$\mbox{SD} \sqrt{\frac{N-1}{N}}$$.
The studentized residual is therefore equal to $$\frac{Y - \mbox{mean(Y)}}{\mbox{SD} \sqrt{\frac{N-1}{N}}} \approx \frac{Y - \mbox{mean(Y)}}{\mbox{SD}}$$.
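As a rough numerical check of this approximation, the sketch below computes both quantities for a single variable. The function name `studentized_vs_z` and the simulated data are hypothetical, and SD is taken to be the sample standard deviation (N-1 in the denominator).

```python
import numpy as np

def studentized_vs_z(y):
    """Return (z-scores, studentized residuals) for one variable with no predictors."""
    y = np.asarray(y, dtype=float)
    n = y.size
    sd = y.std(ddof=1)  # sample SD, assumed here

    # Ordinary z-score: deviation from the mean divided by SD.
    z = (y - y.mean()) / sd

    # Studentized residual: same deviation divided by SD * sqrt((N-1)/N).
    studentized = (y - y.mean()) / (sd * np.sqrt((n - 1) / n))
    return z, studentized

# Simulated data purely for illustration.
rng = np.random.default_rng(0)
z, studentized = studentized_vs_z(rng.normal(size=1000))

# The two differ only by the constant factor sqrt(N/(N-1)), which tends to 1 as N grows.
print(np.max(np.abs(studentized - z)))
```

For small N the factor is not negligible (about 1.05 when N = 10), so the z-score slightly understates the studentized residual in small samples.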