How do I check for outliers in a simple regression with one predictor variable?
A simple way to check for outliers is to evaluate either standardized or studentized residuals and see whether many have large values, e.g. beyond +/- 2. The key reason for studentizing is that the residuals at different predictor values have different variances.
This can be done as follows:
- Standardize both the response variable and the predictor variable by subtracting their means and dividing by their standard deviations; call these $$y_s$$ and $$x_s$$.
- Evaluate a Pearson or Spearman correlation, R.
- Obtain the i-th raw residual as $$y_{si} - R x_{si}$$.
- To obtain the standardized residual, divide by the standard deviation of the residuals. The mean raw residual should be zero.
- The studentized residual may also be used to identify potential outliers. This divides the raw residual by its standard error, SE_RES.
SE_RES equals $$s \sqrt{1 - h_{ii}}$$, where $$s = \sqrt{\sum_{i}(y_{si} - R x_{si})^2 / (N-2)}$$ for N observations and $$h_{ii} = \frac{1}{N} + \frac{x_{si}^2}{\sum_{i} x_{si}^2}$$.
Studentized residuals may be evaluated using this spreadsheet.
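As an alternative to the spreadsheet, here is a minimal Python/NumPy sketch of the steps above (Pearson correlation only; the function name studentized_residuals and the simulated example data are illustrative assumptions, not part of the original page):

```python
import numpy as np

def studentized_residuals(x, y):
    """Standardized and studentized residuals for a one-predictor regression
    on standardized variables, following the steps listed above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)

    # Standardize predictor and response (subtract mean, divide by SD)
    xs = (x - x.mean()) / x.std(ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)

    # Correlation R, which is also the slope of the regression of ys on xs
    R = np.corrcoef(xs, ys)[0, 1]

    # Raw residuals
    e = ys - R * xs

    # Standardized residuals: divide by the SD of the residuals
    standardized = e / e.std(ddof=1)

    # Studentized residuals: divide by SE_RES = s * sqrt(1 - h_ii)
    s = np.sqrt(np.sum(e**2) / (N - 2))
    h = 1.0 / N + xs**2 / np.sum(xs**2)
    studentized = e / (s * np.sqrt(1.0 - h))

    return standardized, studentized

# Example: flag observations whose studentized residual exceeds +/- 2
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
_, stud = studentized_residuals(x, y)
print(np.where(np.abs(stud) > 2)[0])
```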
Outliers without adjusting for other variables
In this case, where we are interested in outliers of a variable unadjusted for any others, the studentized residual is approximately equal to the standardized residual (i.e. a z-score) for large N.
In this case $$h_{ii}$$ equals 1/N and s is the standard deviation, since the predicted value for Y is simply its mean.
So it follows that SE_RES, which equals $$s \sqrt{1 - h_{ii}}$$, equals $$\mathrm{SD} \sqrt{1 - 1/N} = \mathrm{SD} \sqrt{\frac{N-1}{N}}$$.
The studentized residual is therefore equal to $$\frac{Y - \mathrm{mean}(Y)}{\mathrm{SD}\sqrt{\frac{N-1}{N}}} \approx \frac{Y - \mathrm{mean}(Y)}{\mathrm{SD}}$$ when N is large.
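A quick numerical illustration of this approximation (a hedged sketch; the sample size and simulated data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10, scale=3, size=200)
N = len(y)

# Standardized residual: plain z-score
z = (y - y.mean()) / y.std(ddof=1)

# Studentized residual in the no-predictor case: divide by SD * sqrt((N-1)/N)
studentized = (y - y.mean()) / (y.std(ddof=1) * np.sqrt((N - 1) / N))

# The two differ only by the factor sqrt(N/(N-1)), about 0.25% for N = 200
print(np.max(np.abs(studentized - z)))
```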