Revision 17 as of 2016-01-19 11:23:02
How do I check for outliers in a simple regression with one predictor variable?
A simple way to check for outliers is to evaluate either standardized or studentized residuals and see how many have large absolute values, e.g. greater than 2. The key reason for studentizing is that the variances of the residuals differ across predictor values.
This can be done as follows:
- Standardize both the response variable and the predictor variable by subtracting their means and dividing by their standard deviations, call these y(s) and x(s).
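- Evaluate a Pearson or Spearman correlation, R, between y(s) and x(s). Only the Pearson correlation equals the least-squares slope of y(s) on x(s).
- Obtain the i-th raw residual as y(si) - R x(si).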
- To obtain the standardized residual, divide the raw residual by the standard deviation of the residuals. The mean raw residual should be zero.
- The studentized residual may also be used to identify potential outliers. This divides the raw residual by its standard error, SE_RES.
SE_RES equals s Sqrt[1 - h(ii)], where s equals Sqrt[Sum over i (y(si) - R x(si))^2 / (N-2)] for N observations and h(ii) equals 1/N + x(si)^2 / Sum over i x(si)^2.
Studentised residuals may be evaluated using this spreadsheet.
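The steps above can also be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name studentized_residuals is made up for this example, R is taken to be the Pearson correlation, and s is taken to be the residual standard deviation (the square root of the sum of squared raw residuals divided by N-2). It assumes the fit is not perfect and that no single point has a leverage of 1.

```python
import math


def studentized_residuals(x, y):
    """Studentized residuals for a regression of y on one predictor x.

    Hypothetical helper: standardize both variables, use the Pearson
    correlation R as the slope, then divide each raw residual by
    s * sqrt(1 - h_ii).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    def standardize(values):
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        return [(v - mean) / sd for v in values]

    xs, ys = standardize(x), standardize(y)

    # Pearson correlation of the standardized scores (sample SD, so n - 1).
    r = sum(a * b for a, b in zip(xs, ys)) / (n - 1)

    # Raw residuals on the standardized scale; their mean is zero.
    raw = [b - r * a for a, b in zip(xs, ys)]

    # Residual standard deviation with N - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))

    # Leverage h_ii = 1/N + x_i^2 / sum of x_i^2 (x is already centred).
    sum_x2 = sum(a ** 2 for a in xs)
    leverage = [1 / n + a ** 2 / sum_x2 for a in xs]

    # Assumes s > 0 and every leverage < 1; otherwise this divides by zero.
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(raw, leverage)]


# Made-up data: the last point lies well above the trend of the others.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 12.0]
t = studentized_residuals(x, y)
flagged = [i for i, value in enumerate(t) if abs(value) > 2]
print(flagged)
```

Because the leverage term is largest for predictor values far from the mean, two points with the same raw residual can get different studentized residuals, which is the reason for studentizing given above.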
Outliers without adjusting for other variables
In this case, where we are interested in outliers of a variable unadjusted for any others, the studentized residual is approximately equal to the standardized residual (i.e. a z-score) for large N.
In this case h(ii) equals 1/N and s is the standard deviation since the predicted value for Y is simply its mean.
So it follows that SE_RES = s Sqrt[1 - h(ii)] = SD Sqrt(1 - 1/N) = SD Sqrt[(N-1)/N].
The studentized residual is therefore equal to (Y - mean(Y))/[SD Sqrt((N-1)/N)], which approximately equals (Y - mean(Y))/SD when N is large.
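A short numerical check can make the large-N approximation concrete. The sketch below uses SE_RES = SD Sqrt[(N-1)/N] from the derivation above and compares the result with an ordinary z-score. The function names and the data are invented for illustration, and SD is assumed to be the sample standard deviation (divisor N-1).

```python
import math


def z_scores(values):
    """Ordinary z-scores: (Y - mean(Y)) / SD, with SD using divisor N - 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def studentized_scores(values):
    """Studentized residuals with no predictor: divide by SD * sqrt((N-1)/N)."""
    n = len(values)
    shrink = math.sqrt((n - 1) / n)  # SE_RES / SD
    return [z / shrink for z in z_scores(values)]


# Made-up samples: a small one and a large one.
small = [2.0, 4.0, 4.5, 5.0, 9.0]
large = [float(i % 17) for i in range(1000)]

for sample in (small, large):
    n = len(sample)
    z = z_scores(sample)
    t = studentized_scores(sample)
    # The two scores differ by the constant factor sqrt(N / (N - 1)).
    ratio = t[0] / z[0]
    print(n, round(ratio, 4))
```

The ratio between the two scores depends only on N, so for a large sample a cut-off such as 2 flags essentially the same observations under either score.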