Sunday, May 4, 2008

Learning Statistics using R: Detecting Outliers with IQR

Most of the material here is taken from the course website!
An outlier is an observation (data point) in a set of observations (data set) whose value is far removed from the other observations. An outlier is either a very large or a very small value and, as noted earlier, it affects the mean of the data set. More about outliers is here.

The 1.5 * IQR criterion for finding outliers (IQR is the interquartile range, Q3 - Q1):
An observation is a suspected outlier if it is:
  1. Below Q1 - (1.5 * IQR): that is, Q1 minus 1.5 times the IQR.
  2. Above Q3 + (1.5 * IQR): that is, Q3 plus 1.5 times the IQR.
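
The two conditions above can be wrapped in a small helper. This is only a sketch: the function name iqr_fences, the argument k and the sample ages are made up for illustration and are not part of the course material.

```r
# Hypothetical helper (not from the course): returns the lower and upper
# fences Q1 - k * IQR and Q3 + k * IQR for a numeric vector.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  spread <- q[2] - q[1]                 # the interquartile range, Q3 - Q1
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up ages, only for illustration (not the Oscar data)
ages <- c(21, 30, 32, 34, 35, 36, 40, 42, 80)

fences <- iqr_fences(ages)
fences

# Values outside either fence are the suspected outliers
suspects <- ages[ages < fences["lower"] | ages > fences["upper"]]
suspects
```

With these made-up ages Q1 is 32 and Q3 is 40, so the fences come out as 20 and 52 and only 80 is flagged.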

The following picture illustrates the 1.5 * IQR rule:
Example: Best Actress Oscar Winners:
We continue with our Best Actress Oscar winner data set. Here we will try to find the names of the actresses whose age falls outside the 1.5 * IQR range.
# We have the data in the 'actress' data frame (Age is assumed to be directly visible, e.g. after attach(actress)); first check the quartiles
> quantile(Age, names=TRUE)
0%   25%   50%   75%  100% 
21.00 32.50 35.00 41.25 80.00 

# Let's check the IQR
> IQR(Age)
[1] 8.75

# Now let's see how to retrieve the value of Q1, the first quartile
> quantile(Age, 0.25)
 25% 
32.5 

# (Is this the correct method?)
# As can be seen above, passing 0.25 returns the value of the first quartile.
# Now let's see the value of Q1 - (1.5 * IQR)
> quantile(Age, 0.25) - (IQR(Age) * 1.5)
   25% 
19.375 
# Okay, so this also works! (The "25%" label is just a name carried over from quantile().)

# Now how do we find all the actresses whose age is less than 19.375?
> which(Age < (quantile(Age, 0.25) - IQR(Age) * 1.5))
integer(0)
# So in our data set there is no actress whose age is less than 19.375, since the youngest is 21!

# Now let's try to find the upper (higher) suspected outliers.
# Remember, it's Q3 + (IQR * 1.5)
> quantile(Age, 0.75) + (IQR(Age) * 1.5)
  75% 
54.375

# Okay, so far so good; let's find the ages that are greater than 54.375
> Age > quantile(Age, 0.75) + (IQR(Age) * 1.5)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# See, it returns TRUE at the indices where our condition holds.

> which(Age > quantile(Age, 0.75) + (IQR(Age) * 1.5))
[1] 12 16 20
# which() returns the same information, but as the actual index numbers; but how do we get the names?

> actress[which(Age > quantile(Age,0.75) + (IQR(Age) * 1.5 )),]   
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# So indexing the data frame with [rows, ] (note the trailing comma) returns the matching rows with all their columns. Let's try to do it without the "which" command.

> actress[Age > (quantile(Age, 0.75, names=TRUE) + (IQR(Age) * 1.5)), ]
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# So finally we have our suspected outliers!
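
The steps above recompute the quantile and the IQR on every line. A slightly tidier sketch stores the pieces once and refers to the column as df$Age, so nothing needs to be attached. The data frame df and its values are invented stand-ins for 'actress', not the real Oscar data.

```r
# Tiny made-up data frame standing in for 'actress' (values are illustrative)
df <- data.frame(
  Name = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  Age  = c(21, 30, 32, 34, 35, 36, 40, 42, 80),
  stringsAsFactors = FALSE
)

# Compute the quartiles and the 1.5 * IQR step once
q1   <- quantile(df$Age, 0.25, names = FALSE)
q3   <- quantile(df$Age, 0.75, names = FALSE)
step <- 1.5 * IQR(df$Age)

# Keep the rows below the lower fence or above the upper fence
suspects <- df[df$Age < q1 - step | df$Age > q3 + step, ]
suspects
```

Checking both fences in one condition returns the low and the high suspected outliers together; here only the row with Age 80 is kept.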
Other methods:
We can draw a boxplot to detect outliers visually. (But the following do not seem that helpful?)

> boxplot(Age)
# or
> plot(lm(Age ~ 1))
# or using the car package (in newer versions of car this function is named outlierTest)
> library("car")
> outlier.test(lm(Age ~ 1))

max|rstudent| = 3.943262, degrees of freedom = 30,
unadjusted p = 0.0004461754, Bonferroni p = 0.01427761

Observation: 20
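
If the boxplot picture alone is not that helpful, the numbers behind it can be pulled out with boxplot.stats(), which ships with base R. A minimal sketch, again on made-up ages rather than the Oscar data:

```r
# Made-up ages, only for illustration (not the Oscar data)
ages <- c(21, 30, 32, 34, 35, 36, 40, 42, 80)

# boxplot.stats() does the computation behind boxplot() without plotting;
# coef = 1.5 is its default whisker multiplier
bs <- boxplot.stats(ages, coef = 1.5)

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values drawn as individual points, i.e. the suspected outliers
```

One caveat: boxplot.stats() works from the hinges computed by fivenum() rather than from quantile()'s default quartiles, so for some sample sizes its cut-offs can differ slightly from the Q1 and Q3 values used earlier.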
