An outlier is an observation (data point) in a set of observations (a data set) whose value is far removed from the other observations. An outlier is either a very large or a very small value and, as noted earlier, it affects the mean of the data set. More about outliers is here.
The 1.5 * IQR criterion for finding outliers:
An observation is a suspected outlier if it is:
- Below Q1 - (1.5 * IQR): that is, Q1 minus 1.5 times IQR.
- Above Q3 + (1.5 * IQR): that is, Q3 plus 1.5 times IQR.
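The two limits above can be wrapped in a small helper. This is only a sketch: the function name `iqr_fences` and the `ages` vector are made up for illustration and are not part of the Oscar data set used later.

```r
# Hypothetical ages, invented for this illustration only
ages <- c(21, 26, 29, 31, 33, 34, 35, 37, 39, 41, 45, 61, 80)

# Return the lower and upper 1.5 * IQR limits for a numeric vector
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  spread <- q[2] - q[1]                           # same value as IQR(x)
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

fences <- iqr_fences(ages)

# Keep the values that fall outside either limit
suspects <- ages[ages < fences["lower"] | ages > fences["upper"]]

print(fences)
print(suspects)
```

Keeping the multiplier as the argument `k` makes it easy to try a stricter or looser cut-off than 1.5.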
The following picture illustrates the 1.5 * IQR rule:
Example: Best Actress Oscar Winners:
We continue with our Best Actress Oscar winners data set. Here we will try to find the names of the actresses whose age lies outside the 1.5 * IQR range.
# we have the data in the 'actress' data frame; just check the quantiles
> quantile(Age, names=T)
   0%   25%   50%   75%  100%
21.00 32.50 35.00 41.25 80.00
# Let's check the IQR
> IQR(Age)
[1] 8.75
# Now let's see how to retrieve the value for Q1, the first quartile
> quantile(Age, 0.25)
 25%
32.5
# (Is this the correct method?)
# As can be seen above, passing the value 0.25 returns the first quartile.
# Now let's see what the value of Q1 - (1.5 * IQR) is
> quantile(Age, 0.25) - (IQR(Age) * 1.5)
   25%
19.375
# Okay, so this also works!
# Now how do we get the names of all actresses whose age is less than 19.375?
> which(Age < (quantile(Age, 0.25) - IQR(Age) * 1.5))
integer(0)
# So in our data set there is no actress whose age is less than 19.375, since the youngest is 21!
# Now let's try to find the upper suspected outliers.
# Remember, it's Q3 + (IQR * 1.5)
> quantile(Age, 0.75) + (IQR(Age) * 1.5)
   75%
54.375
# Okay, so far so good; let's get the ages that are greater than 54.375
> Age > quantile(Age, 0.75) + (IQR(Age) * 1.5)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# See, it returns TRUE at the indices where our condition matches.
> which(Age > quantile(Age, 0.75) + (IQR(Age) * 1.5))
[1] 12 16 20
# The which command returns the same information as actual index numbers; but how do we get the names?
> actress[which(Age > quantile(Age, 0.75) + (IQR(Age) * 1.5)),]
   Year            Name                   Movie Age
12 1981 Kathryn Hepburn          On Golden Pond  74
16 1985  Geraldine Page A Trip to the Bountiful  61
20 1989   Jessica Tandy      Driving Miss Daisy  80
# So the comma (,) at the end does the trick. Let's try to do it without the "which" command.
> actress[Age > (quantile(Age, .75, names=T) + (IQR(Age) * 1.5)),]
   Year            Name                   Movie Age
12 1981 Kathryn Hepburn          On Golden Pond  74
16 1985  Geraldine Page A Trip to the Bountiful  61
20 1989   Jessica Tandy      Driving Miss Daisy  80
# So finally we got the suspected outliers!

Other methods:
We can also draw a boxplot to detect outliers visually. (But the following does not seem that helpful?)
> boxplot(Age)
# or
> plot(lm(Age ~ 1))
# or using library car
# (newer versions of car name this function outlierTest)
> library("car")
> outlier.test(lm(Age ~ 1))
max|rstudent| = 3.943262, degrees of freedom = 30,
unadjusted p = 0.0004461754, Bonferroni p = 0.01427761
Observation: 20
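A boxplot can be made more useful by asking R for the numbers behind it. The sketch below uses base R's `boxplot.stats`, which returns the points that a boxplot would draw beyond its whiskers. The `ages` vector is hypothetical and stands in for the `Age` column. Note that `boxplot.stats` works from hinges rather than from `quantile`, so on some data its limits can differ slightly from the manual Q1/Q3 calculation.

```r
# Hypothetical ages, invented for this illustration only
ages <- c(21, 26, 29, 31, 33, 34, 35, 37, 39, 41, 45, 61, 80)

# coef = 1.5 is the default and corresponds to the 1.5 * IQR rule
stats <- boxplot.stats(ages, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# Values lying beyond the whiskers, i.e. the suspected outliers
print(stats$out)

# With a data frame such as 'actress', the matching rows could be pulled with:
# actress[actress$Age %in% boxplot.stats(actress$Age)$out, ]
```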