Friday, May 2, 2008

Learning Statistics using R: IQR

I am including the notes from the course itself. None of the material is mine other than the errors!
As seen earlier, the range gives us the overall spread of the distribution, while the IQR (interquartile range) measures spread by giving us the range covered by the middle 50% of the data.
The picture taken from the course website makes this clearer.
Here is how the course suggests finding the IQR:
  1. First sort the data so that we can easily find the median. The median divides the dataset so that 50% of the data points are below it and 50% are above it. That is, the data set divides into two equal halves, a lower half running from the min to the median and a top half running from the median to the max.
  2. Find the median of the lower 50% of the data (the first half). This is called the first quartile of the distribution and is denoted by Q1, since one quarter of the data points fall below it.
  3. Repeat this for the top 50% of the data: find the median of the top half. This is called the third quartile of the distribution and is denoted by Q3, since three quarters of the data points fall below it.
  4. The middle 50% of the data lies between Q1 and Q3, and the IQR is the width of that range: IQR = Q3 - Q1.
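
The four steps above can be sketched in R roughly as follows. This is only an illustration: `course_iqr` is a made-up helper name (it is not part of base R), the `ages` vector is invented data rather than the course data set, and the sketch assumes that when n is odd the overall median is left out of both halves.

```r
# Hypothetical helper: IQR by the "median of each half" method.
# Assumption: for odd n, the overall median belongs to neither half.
course_iqr <- function(x) {
  stopifnot(length(x) >= 2)
  x <- sort(x)                      # step 1: sort the data
  n <- length(x)
  half <- floor(n / 2)              # number of points in each half
  lower <- x[seq_len(half)]         # bottom 50% of the data
  upper <- x[(n - half + 1):n]      # top 50% of the data
  q1 <- median(lower)               # step 2: median of the lower half
  q3 <- median(upper)               # step 3: median of the top half
  c(Q1 = q1, Q3 = q3, IQR = q3 - q1)  # step 4: IQR = Q3 - Q1
}

ages <- c(2, 4, 6, 8, 10, 12, 14, 16)  # made-up values for illustration
course_iqr(ages)
```

For this invented vector the lower half is 2, 4, 6, 8 and the top half is 10, 12, 14, 16, so the helper should report Q1 = 5, Q3 = 13 and IQR = 8.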

Here is another picture taken from the course website that visually explains how the first quartile, Q1, is found:

The course makes a few very important observations:
  • From the first picture we can see that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR=Q3-Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
  • We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the LEFT (RIGHT) of the location of the overall median M. The following picture illustrates this for the simple cases of n=7 and n=8.
Note that when n is odd (like in n=7 above), the median is not included in either the bottom or top half of the data; when n is even (like in n=8 above), the data are naturally divided into two halves.
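
To make the odd and even cases concrete, here is a small R sketch that splits an ordered list into its bottom and top halves using the rule just described. `split_halves` is a hypothetical helper written for this note, and the vectors `1:7` and `1:8` simply stand in for the positions in the n=7 and n=8 cases.

```r
# Hypothetical helper: split data into the bottom and top halves.
# Observations left of the median's location form the bottom half,
# those right of it form the top half.
split_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)   # for odd n this skips the middle observation
  list(lower = x[seq_len(half)],
       upper = x[(n - half + 1):n])
}

h7 <- split_halves(1:7)  # n odd: position 4 (the median) is in neither half
h8 <- split_halves(1:8)  # n even: every observation lands in one half
print(h7)
print(h8)
```

With n=7 the halves are positions 1 to 3 and 5 to 7, and with n=8 they are positions 1 to 4 and 5 to 8.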

Example: Best Actress Oscar Winners: The course uses a stemplot to find the IQR; we use a simple R command instead. Please note that the IQR found using R differs from the one in the course example.
# The data are in the 'actress' data frame (assumed to be attached, so Age is visible).
> quantile(Age, names=TRUE)
   0%   25%   50%   75%  100% 
21.00 32.50 35.00 41.25 80.00 
> IQR(Age)
[1] 8.75

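As the output above shows, R's 'quantile' command reports the following quartiles:
  • Q1 = 32.50 (shown as 25%; the course's calculated value is 32)
  • Q3 = 41.25 (shown as 75%; the course's calculated value is 41.5)
R's IQR function gives 8.75 as the IQR, while the course's quartiles give 41.5 - 32 = 9.5. The difference arises because R's quantile, by default, interpolates linearly between neighbouring data points (its 'type = 7' method) rather than taking the median of each half.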
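
The gap between the two methods can be reproduced on a tiny example. The sketch below uses the invented vector `1:8` (not the actress data) and compares R's default quartiles with the median-of-halves quartiles computed by hand; the variable names are just for illustration.

```r
x <- 1:8   # made-up data, chosen so the arithmetic is easy to follow

# R's default (type = 7) quartiles and IQR
r_quartiles <- quantile(x, probs = c(0.25, 0.75))
r_iqr <- IQR(x)

# Course-style quartiles: median of the lower and of the top half
xs <- sort(x)
half <- floor(length(xs) / 2)
course_q1 <- median(xs[seq_len(half)])
course_q3 <- median(xs[(length(xs) - half + 1):length(xs)])
course_iqr_value <- course_q3 - course_q1

print(r_quartiles)                  # 25% = 2.75, 75% = 6.25
print(r_iqr)                        # 3.5
print(c(course_q1, course_q3))      # 2.5 and 6.5
print(course_iqr_value)             # 4
```

Both `quantile` and `IQR` accept a `type` argument that selects a different quantile definition, so it is worth checking which definition a textbook or course uses before comparing numbers.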
