Tuesday, May 13, 2008

Learning Statistics using R: Five Number Summary

Most of the material here is taken from the course website!
Before we look at how the five-number summary is represented graphically, let's check out a few interesting course problems.
Here they are:
  1. Example 1: A survey taken of 140 sports fans asked the question: "What is the most you have ever spent for a ticket to a sporting event?"
    The five-number summary for the data collected is:
    min = 85, Q1 = 130, median = 145, Q3 = 150, max = 250
    Should the smallest observation be classified as an outlier?
  2. Example 2: A survey taken in a large statistics class contained the question: "What's the fastest you have driven a car (in mph)?"
    The five-number summary for the 87 males surveyed is:
    min = 55, Q1 = 95, median = 110, Q3 = 120, max = 155
    Should the largest observation in this data set be classified as an outlier?
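Both problems can be worked directly from the summaries using the 1.5 * IQR criterion: an observation is a suspected outlier if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Here is a minimal sketch in R; the helper name `fences` is made up for illustration and is not from the course:

```r
# Hypothetical helper: lower and upper 1.5 * IQR fences
# computed from the first and third quartiles.
fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Example 1: min = 85, Q1 = 130, Q3 = 150
ex1 <- fences(130, 150)                       # lower = 100, upper = 180
ex1_min_is_outlier <- 85 < ex1[["lower"]]     # TRUE: 85 is below 100

# Example 2: Q1 = 95, Q3 = 120, max = 155
ex2 <- fences(95, 120)                        # lower = 57.5, upper = 157.5
ex2_max_is_outlier <- 155 > ex2[["upper"]]    # FALSE: 155 is below 157.5
```

So the smallest observation in Example 1 is a suspected outlier, while the largest observation in Example 2 is not.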

Important Summary About Spread
  • The range covered by the data is the most intuitive measure of spread: it is exactly the distance between the smallest data point (min) and the largest one (Max).
  • Another measure of spread is the inter-quartile range (IQR) which is the range covered by the middle 50% of the data.
  • IQR = Q3-Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it. The third quartile is the value such that three quarters (75%) of the data points fall below it.
  • The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.
  • The IQR can be used to detect outliers using the 1.5 * IQR (1.5 times IQR) criterion.

Five Number Summary: So far, in our discussion about measures of spread, the key players were:
  • The extremes (min and Max) which provide the range covered by all the data, and
  • The quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.
The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.
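In R, the built-in fivenum() function returns these five numbers for a numeric vector. Here is a small sketch with made-up ages (not the Oscar data); note that fivenum() reports Tukey's hinges for the quartiles, which can differ slightly from what quantile() returns:

```r
# Made-up ages, for illustration only
ages <- c(21, 30, 32, 33, 35, 36, 41, 42, 80)

five <- fivenum(ages)            # min, lower hinge, median, upper hinge, max
names(five) <- c("min", "Q1", "M", "Q3", "Max")
print(five)

data_range <- five[["Max"]] - five[["min"]]   # range covered by all the data
data_iqr   <- five[["Q3"]] - five[["Q1"]]     # range covered by the middle 50%
```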
A boxplot can be used to graphically summarize the five number summary of the distribution of a quantitative variable. Let's start doing boxplots in R.

Example: Best Actor Oscar Winners:
This time we use best actor Oscar winners instead of actress and draw a boxplot using R.
# read the actor.csv file.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
> attach(actor)
> boxplot(Age, border='blue', xlab='Actor Age')
Here is how the boxplot of the actor data set looks:
The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.

Example: Best Actress Oscar Winners:
We use our actress data set again to draw another box plot:
# read the actress.csv file.
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
> attach(actress)
> boxplot(Age, border='magenta', xlab='Actress Age')
Here is how our actress boxplot, drawn using R, looks:
The following graph, taken from the course website, highlights various details of the boxplot for the actress data set. There are a couple of interactive boxplot examples on the course website; try them.

Example: Best Actress and Actor Oscar Winners: Side by Side Comparative Boxplots:
So far we have examined the age distributions of Oscar winners for males and females separately. It will be interesting to compare the age distributions of actors and actresses who won the best acting Oscar. To do that we will look at side-by-side boxplots of the age distributions by gender.
It's quite easy to draw side-by-side boxplots in R. The following code shows how to do it:
# read the actor.csv and actress.csv files.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
# Following is one single command 
# Bug-Fix: Gabriele Righetti 
> boxplot(actor$Age, actress$Age, border=c('blue','magenta'), names=c('Actor','Actress'), ylab='Age',main='Side-By-Side (Comparative) Boxplots\nAge of Best Actor/Actress Winners (1970-2001)')
This is how the final output from R looks:
Recall also that we found the five-number summary and means for both distributions:
  • Actors: min=31, Q1=37.25, M=42.5, Q3=50.25, Max=76
  • Actresses: min=21, Q1=32, M=35, Q3=41.5, Max=80
Based on the graph and numerical measures, we can make the following comparison between the two distributions:
  • Center: The graph reveals that the age distribution of the males is higher than the females' age distribution. This is supported by the numerical measures. The median age for females (35) is lower than for the males (42.5). Actually, it should be noted that even the third quartile of the females' distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than the actors do.
  • Spread: Judging by the range of the data, there is more variability in the females' distribution (range=59) than there is in the males' distribution (range=45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread among males (IQR=13) than the females (IQR=9.5). We conclude that among all the winners, the actors' ages are more alike than the actresses' ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors' age distribution.
  • Outliers: We see that we have outliers in both distributions. There is only one high outlier in the actors' distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses' distribution.
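
The spread figures in this comparison can be re-derived from the two five-number summaries quoted above. Here is a minimal sketch; the helper name `spread` is made up for illustration:

```r
# Five-number summaries quoted above
actors    <- c(min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76)
actresses <- c(min = 21, Q1 = 32,    M = 35,   Q3 = 41.5,  Max = 80)

# Hypothetical helper: range and IQR from a named five-number summary
spread <- function(s) {
  c(range = s[["Max"]] - s[["min"]], IQR = s[["Q3"]] - s[["Q1"]])
}

actor_spread   <- spread(actors)      # range = 45, IQR = 13
actress_spread <- spread(actresses)   # range = 59, IQR = 9.5
```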

Monday, May 5, 2008

Send Your Name to the Moon

NASA invites people of all ages to join the lunar exploration journey with an opportunity to send their names to the moon aboard the Lunar Reconnaissance Orbiter, or LRO, spacecraft.

Here is my certificate of participation!

Sunday, May 4, 2008

Learning Statistics using R: Detecting Outliers with IQR

Most of the material here is taken from the course website!
An outlier is an observation (data point) in a set of observations (a data set) whose value is far removed from the other observations. An outlier is either a very large or a very small value and, as noted earlier, affects the mean of the data set. More about outliers is here.

The 1.5 * IQR (1.5 times IQR) criterion for finding outliers:
An observation is a suspected outlier if it is:
  1. Below Q1 - (1.5 * IQR): that is, Q1 minus 1.5 times IQR.
  2. Above Q3 + (1.5 * IQR): that is, Q3 plus 1.5 times IQR.
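
The two conditions can be wrapped in a small function. This is only a sketch: the name `suspected_outliers` and the ages are made up for illustration, and the quartiles come from R's default quantile() rule:

```r
# Hypothetical helper: returns the values of x flagged by the 1.5 * IQR criterion
suspected_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
}

# Made-up ages, for illustration only
ages <- c(21, 30, 32, 33, 35, 36, 41, 42, 80)
flagged <- suspected_outliers(ages)   # Q1 = 32, Q3 = 41, cutoffs at 18.5 and 54.5
print(flagged)
```

Only the value 80 lies outside the cutoffs here, so it is the single suspected outlier in this made-up sample.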

The following picture illustrates the 1.5 * IQR rule:
Example: Best Actress Oscar Winners:
We continue with our Best Actress Oscar winner data set. Here we will try to find the names of the actresses whose age falls outside the 1.5 * IQR range.
# we have data in 'actress' data frame; just check the quantile
> quantile(Age, names=T)   
0%   25%   50%   75%  100% 
21.00 32.50 35.00 41.25 80.00 

# Lets check the IQR 
> IQR(Age)
[1] 8.75

# Now let's see how to retrieve the value of Q1, the first quartile
> quantile(Age, 0.25)
 25% 
32.5 

# (Is this the correct method?)
# As can be seen above, passing 0.25 returns the value of the first quartile.
# Now let's see what the value of Q1 - (1.5 * IQR) is
> quantile(Age, 0.25) - (IQR(Age) * 1.5)
   25% 
19.375 
# Okay so this also works!

# Now how do we find the actresses whose age is less than 19.375?
> which(Age < (quantile(Age, 0.25) - IQR(Age) * 1.5))
integer(0)
# So in our data set there is no actress whose age is less than 19.375, since the youngest is 21!

# now lets try to find upper/higher suspected outliers. 
# Remember its Q3 + (IQR * 1.5)
>quantile(Age,0.75) + (IQR(Age) * 1.5) 
  75% 
54.375

# Okay so far so good; lets get age that are greater than 54.375
> Age > quantile(Age,0.75) + (IQR(Age) * 1.5 ) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# See it returns TRUE in the indices where our condition matches.

> which(Age > quantile(Age,0.75) + (IQR(Age) * 1.5 ))
[1] 12 16 20
# which() returns the index numbers of the TRUE values; but how do we get the names?

> actress[which(Age > quantile(Age,0.75) + (IQR(Age) * 1.5 )),]   
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# The comma (,) at the end selects whole rows, which does the trick. Let's try without the "which" command.


>actress[ Age > (quantile(Age,.75, names=T) + (IQR(Age) * 1.5)),]  
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# so finally we got suspected outliers!
Other methods:
We can also draw a boxplot to detect outliers visually. (But the following do not seem that helpful?)

> boxplot(Age)
# or
> plot(lm(Age ~ 1))
# or using the car package (newer versions of car name this function outlierTest)
> library("car")
> outlier.test(lm(Age ~ 1))

max|rstudent| = 3.943262, degrees of freedom = 30,
unadjusted p = 0.0004461754, Bonferroni p = 0.01427761

Observation: 20
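
Another option in base R is boxplot.stats(), which returns the points a boxplot would draw as outliers in its `out` component. Here is a sketch with made-up ages (not the actress data); note that it computes the quartiles as hinges, so its cutoffs may differ slightly from those obtained with quantile():

```r
# Made-up ages, for illustration only
ages <- c(21, 30, 32, 33, 35, 36, 41, 42, 80)

bs <- boxplot.stats(ages, coef = 1.5)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points lying more than 1.5 * IQR beyond the hinges
```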

Friday, May 2, 2008

Learning Statistics using R: IQR

I am including the notes from the course itself. None of the material is mine other than the errors!
As seen earlier, the range gives us the overall range of the distribution, while the IQR measures the spread of a distribution by giving us the range covered by the middle 50% of the data.
The picture taken from the course website makes this clearer.
Here is how the course suggests finding the IQR:
  1. First sort the data so that the median is easy to find. The median divides the data set so that 50% of the data points lie below it and 50% lie above it. That is, the data set splits into two equal halves: a lower half running from the min to the median and a top half running from the median to the max.
  2. Find the median of the lower 50% of the data (the first half). This is the first quartile of the distribution, denoted Q1, so called because one quarter of the data points fall below it.
  3. Repeat for the top 50% of the data: find its median. This is the third quartile of the distribution, denoted Q3, so called because three quarters of the data points fall below it.
  4. The IQR is the range covered by the middle 50% of the data, between Q1 and Q3, and is calculated as IQR = Q3 - Q1.

Here is another picture taken from the course website that visually explains how the first quartile, Q1, is found:

The course makes a few very important observations:
  • From the first picture we can see that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR=Q3-Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
  • We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the LEFT (RIGHT) of the location of the overall median M. The following picture will visually illustrate this for the simple cases of n=7 and n=8.
Note that when n is odd (like in n=7 above), the median is not included in either the bottom or top half of the data; when n is even (like in n=8 above), the data are naturally divided into two halves.
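
The halves-based procedure can be sketched in R as follows. The function name `course_quartiles` is made up for illustration; it is not a built-in and not code from the course:

```r
# Hypothetical helper implementing the halves-based method described above:
# Q1 is the median of the observations to the left of the overall median's
# position, Q3 the median of those to its right. When n is odd, the median
# itself belongs to neither half.
course_quartiles <- function(x) {
  stopifnot(length(x) >= 2)
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)              # number of observations in each half
  lower <- x[seq_len(half)]         # bottom 50%
  upper <- x[(n - half + 1):n]      # top 50%
  c(Q1 = median(lower), M = median(x), Q3 = median(upper))
}

q7 <- course_quartiles(1:7)   # Q1 = 2,   M = 4,   Q3 = 6
q8 <- course_quartiles(1:8)   # Q1 = 2.5, M = 4.5, Q3 = 6.5
```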

Example: Best Actress Oscar Winners:
The course uses a stemplot to find the IQR; we use a simple R command instead. Please note that the IQR found using R differs from the one in the course example.
# we have data in 'actress' data frame.
> quantile(Age, names=T)
   0%   25%    50%   75%    100%
21.00  32.50  35.00  41.25  80.00 
> IQR(Age)
[1] 8.75

As can be seen in the code above, the 'quantile' R command outputs the following quartiles:
  • Q1 32.50 (shown as 25%; the course's calculated value is 32)
  • Q3 41.25 (shown as 75%; the course's calculated value is 41.5)
The IQR R function calculates 8.75 as the IQR, while the course's calculated value is 9.5 (41.5 - 32).
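
The difference most likely comes from how the quartiles are defined: by default quantile() interpolates between data points (its `type = 7` rule), whereas the course takes the medians of the two halves. For an even number of observations, fivenum() agrees with the halves approach. Here is a sketch with made-up data (not the actress ages):

```r
x <- 1:8                                   # made-up data, n is even

q_default   <- quantile(x, c(0.25, 0.75))  # type = 7 interpolation: 2.75 and 6.25
iqr_default <- IQR(x)                      # 3.5

hinges     <- fivenum(x)[c(2, 4)]          # medians of the two halves: 2.5 and 6.5
iqr_hinges <- diff(hinges)                 # 4
```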