Tuesday, May 13, 2008

Learning Statistics using R: Five Number Summary

Most of the material here is taken from the course website!
Before we go ahead and learn about the graphical representation of the five-number summary, let's check out a few interesting course problems.
Here they are:
  1. Example 1: A survey taken of 140 sports fans asked the question: "What is the most you have ever spent for a ticket to a sporting event?"
    The five-number summary for the data collected is:
    min = 85, Q1 = 130, median = 145, Q3 = 150, max = 250
    Should the smallest observation be classified as an outlier?
  2. Example 2: A survey taken in a large statistics class contained the question: "What's the fastest you have driven a car (in mph)?"
    The five-number summary for the 87 males surveyed is:
    min = 55, Q1 = 95, median = 110, Q3 = 120, max = 155
    Should the largest observation in this data set be classified as an outlier?

Important Summary About Spread
  • The range covered by the data is the most intuitive measure of spread: it is exactly the distance between the smallest data point (min) and the largest one (Max), that is, range = Max - min.
  • Another measure of spread is the inter-quartile range (IQR) which is the range covered by the middle 50% of the data.
  • IQR = Q3-Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it. The third quartile is the value such that three quarters (75%) of the data points fall below it.
  • The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.
  • The IQR can be used to detect outliers using the 1.5 * IQR (1.5 times IQR) criterion.
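To make the 1.5 * IQR criterion concrete, here is a minimal sketch in R that applies it to the two course problems above. The helper name `iqr_fences` is made up for this illustration and is not part of R or the course material. It assumes the usual form of the rule, where an observation is a suspected outlier if it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

```r
# Hypothetical helper: lower and upper cutoffs of the 1.5 * IQR criterion.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Example 1: ticket prices (min = 85, Q1 = 130, Q3 = 150)
tickets <- iqr_fences(130, 150)
tickets_min_is_outlier <- 85 < tickets[["lower"]]

# Example 2: fastest speed driven (Q1 = 95, Q3 = 120, Max = 155)
speeds <- iqr_fences(95, 120)
speeds_max_is_outlier <- 155 > speeds[["upper"]]

print(tickets)                 # lower = 100, upper = 180
print(tickets_min_is_outlier)  # TRUE
print(speeds)                  # lower = 57.5, upper = 157.5
print(speeds_max_is_outlier)   # FALSE
```

Under this rule the smallest ticket price (85) lies below the lower cutoff of 100, so it is a suspected outlier, while the fastest speed (155) stays below the upper cutoff of 157.5 and is not.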

Five Number Summary: So far, in our discussion about measures of spread, the key players were:
  • The extremes (min and Max) which provide the range covered by all the data, and
  • The quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.
The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.
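In R, the five-number summary can be computed directly. The sketch below uses a small made-up vector (not the course data) together with base R's `fivenum()`, `quantile()` and `IQR()`. Note that `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles that `quantile()` reports for some sample sizes; for this vector the two agree.

```r
# Made-up sample for illustration only (not from the course data).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# min, lower hinge, median, upper hinge, max
five <- fivenum(x)
print(five)

# The same five positions via quantile() with its default method
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# Range and inter-quartile range
data_range <- max(x) - min(x)
data_iqr <- IQR(x)
print(c(range = data_range, IQR = data_iqr))
```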
A boxplot can be used to graphically summarize the five-number summary of the distribution of a quantitative variable. Let's start drawing boxplots in R.

Example: Best Actor Oscar Winners:
This time we use best actor Oscar winners instead of actress and draw a boxplot using R.
# Read the actor.csv file.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
> attach(actor)
> boxplot(Age, border=c('blue'), xlab='Actor Age')
Here is how the boxplot of our actor data set looks:
The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.
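The numbers behind a boxplot can also be inspected without drawing it. As a sketch, base R's `boxplot.stats()` returns the values that `boxplot()` uses: the whisker ends, the hinges and the median in `$stats`, and the suspected outliers in `$out`. The ages below are made up for illustration; they are not the contents of actor.csv.

```r
# Made-up ages for illustration only (not the actor.csv data).
ages <- c(31, 37, 40, 42, 43, 45, 50, 52, 76)

# coef = 1.5 is the default and corresponds to the 1.5(IQR) criterion
bs <- boxplot.stats(ages, coef = 1.5)

# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# observations beyond the whiskers, drawn as separate points
print(bs$out)
```

Here the upper whisker stops at 52, the largest age that is not flagged, and 76 is reported separately as a suspected outlier.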

Example: Best Actress Oscar Winners:
We use our actress data set again to draw another box plot:
# Read the actress.csv file.
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
# If actor is still attached, this masks its Age column, so Age below is actress$Age.
> attach(actress)
> boxplot(Age, border=c('magenta'), xlab='Actress Age')
Here is how our actress boxplot looks when drawn using R:
The following graph, taken from the course website, highlights various details of the boxplot for the actress data set. There are a couple of interactive boxplot examples on the course website; try them out.

Example: Best Actress and Actor Oscar Winners: Side by Side Comparative Boxplots
So far we have examined the age distributions of Oscar winners for males and females separately. It will be interesting to compare the age distributions of actors and actresses who won the best acting Oscar. To do that we will look at side-by-side boxplots of the age distributions by gender.
It's quite easy to draw side-by-side boxplots in R. The following code shows how to do it:
# Read the actor.csv and actress.csv files.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
# The following is one single command.
# Bug-Fix: Gabriele Righetti
> boxplot(actor$Age, actress$Age, border=c('blue','magenta'), names=c('Actor','Actress'), ylab='Age', main='Side-By-Side (Comparative) Boxplots\nAge of Best Actor/Actress Winners (1970-2001)')
This is how our final output from R looks:
Recall also that we found the five-number summaries for both distributions:
  • Actors: min=31, Q1=37.25, M=42.5, Q3=50.25, Max=76
  • Actresses: min=21, Q1=32, M=35, Q3=41.5, Max=80
Based on the graph and numerical measures, we can make the following comparison between the two distributions:
  • Center: The graph reveals that the age distribution of the males is higher than the females' age distribution. This is supported by the numerical measures. The median age for females (35) is lower than for the males (42.5). Actually, it should be noted that even the third quartile of the females' distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than the actors do.
  • Spread: Judging by the range of the data, there is more variability in the females' distribution (range=59) than there is in the males' distribution (range=45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread among males (IQR=13) than the females (IQR=9.5). We conclude that among all the winners, the actors' ages are more alike than the actresses' ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors' age distribution.
  • Outliers: We see that we have outliers in both distributions. There is only one high outlier in the actors' distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses' distribution.
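The spread and outlier figures in this comparison can be checked from the two five-number summaries alone. The following is a minimal sketch; the function and variable names are chosen for this illustration only.

```r
# Five-number summaries quoted above
actors    <- c(min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76)
actresses <- c(min = 21, Q1 = 32,    M = 35,   Q3 = 41.5,  Max = 80)

# Range, IQR and the upper cutoff of the 1.5 * IQR criterion
spread <- function(s) {
  iqr <- s[["Q3"]] - s[["Q1"]]
  c(range = s[["Max"]] - s[["min"]],
    IQR = iqr,
    upper_cutoff = s[["Q3"]] + 1.5 * iqr)
}

comparison <- rbind(Actors = spread(actors), Actresses = spread(actresses))
print(comparison)
```

Any age above the upper cutoff (69.75 for actors, 55.75 for actresses) is flagged as a high outlier under the criterion, which is why Henry Fonda's 76 shows up as a separate point.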
