Friday, June 13, 2008

Learning Statistics using R: Standard Deviation and Histogram

Most of the material here is taken from the course website!
In the following example we will see how a histogram can help clarify the concept of standard deviation.
Example:
At the end of a statistics course, the 27 students in the class were asked to rate the instructor on a number scale of 1 to 9 (1 being "very poor", and 9 being "best instructor I've ever had"). The following table provides three hypothetical sets of rating data:
The following are the histograms of the data for each class:
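The histograms can be regenerated in R. This is only a minimal sketch, assuming the same rating vectors that are entered later in this post; the bin edges, the shared y-axis, the side-by-side layout and the titles are my own choices, not something taken from the course material.

```r
# Ratings for the three hypothetical classes (same values as used later in the post)
class1 <- c(rep(1, 2), rep(5, 23), rep(9, 2))
class2 <- c(rep(1, 13), 5, rep(9, 13))
class3 <- rep(1:9, times = 3)

# One bin per rating: edges at 0.5, 1.5, ..., 9.5 so each integer sits in its own bar
breaks <- seq(0.5, 9.5, by = 1)

# Draw the three histograms side by side with a common y-axis for easy comparison
old_par <- par(mfrow = c(1, 3))
hist(class1, breaks = breaks, ylim = c(0, 25), main = "Class I",   xlab = "Rating")
hist(class2, breaks = breaks, ylim = c(0, 25), main = "Class II",  xlab = "Rating")
hist(class3, breaks = breaks, ylim = c(0, 25), main = "Class III", xlab = "Rating")
par(old_par)

# The same bar heights without plotting, in case you only want the counts
counts1 <- hist(class1, breaks = breaks, plot = FALSE)$counts
counts2 <- hist(class2, breaks = breaks, plot = FALSE)$counts
counts3 <- hist(class3, breaks = breaks, plot = FALSE)$counts
```

Using the same breaks and y-axis limits for all three plots keeps the bars directly comparable across classes.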
What can we say about the standard deviation by looking at these histograms and data sets?
Let's assume that the mean of all three data sets is 5 (which is reasonably clear from the histograms), and recall that the standard deviation is, roughly, the average distance of the data points from their mean.
  • For class I, most of the ratings are at 5, which is also the mean of the data set. So the average distance between the mean and the data points is very small (since most of the data points are at the mean).
  • For class II, most of the ratings are far from the mean of 5: nearly all of the data points are at the two extremes, 1 and 9. So the average distance between the mean and the data points is larger.
  • For class III, the data points are evenly distributed around the mean. We can safely say that in this case the average distance between the mean and the data points is greater than that of class I but smaller than that of class II, i.e. the standard deviation lies in between those of class I and class II.
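The "average distance from the mean" intuition can be checked literally before turning to the standard deviation itself. The sketch below is only an illustration: `avg_distance` is a hypothetical helper of my own, and what it computes is the mean absolute deviation, which is a related but different quantity from the standard deviation.

```r
# Ratings for the three hypothetical classes (same values as used later in the post)
class1 <- c(rep(1, 2), rep(5, 23), rep(9, 2))
class2 <- c(rep(1, 13), 5, rep(9, 13))
class3 <- rep(1:9, times = 3)

# Hypothetical helper: literal average of the distances |x - mean(x)|
avg_distance <- function(x) mean(abs(x - mean(x)))

# Apply it to each class and collect the results in a named vector
distances <- sapply(
  list(classI = class1, classII = class2, classIII = class3),
  avg_distance
)
print(distances)

# Order the classes from least to most spread out
print(names(sort(distances)))
```

The values come out at roughly 0.59, 3.85 and 2.22, so the ordering is class I, then class III, then class II, which matches the reasoning in the bullets above.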

Let's check our assumption by loading these data sets into R and computing the standard deviation of each. The data set, in an Excel file, can be downloaded from here.
> class1 <- c(1,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,9,9)
> sd(class1)
[1] 1.568929
> summary(class1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       5       5       5       5       9 

> class2 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,5,9,9,9,9,9,9,9,9,9,9,9,9,9)
> sd(class2)
[1] 4
> summary(class2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       5       5       9       9 

> class3 <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
> sd(class3)
[1] 2.631174
> summary(class3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 
So we have the following standard deviations for our three class ratings:
  • Class I : 1.568929
  • Class II : 4.0
  • Class III : 2.631174
(Note that Excel's results may vary slightly if you use its stdev function.) So the calculated standard deviations confirm the assumption we made by looking at the histograms.
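To see what sd() is doing, the calculation can be written out by hand. R's sd() divides the sum of squared deviations by n - 1 (the sample standard deviation), so it is not exactly the "average distance" used in the rough definition earlier. The sketch below uses two hypothetical helpers of my own, `sample_sd` and `population_sd`; the second divides by n instead, which is one possible reason why different tools or functions can report slightly different values.

```r
# Ratings for the three hypothetical classes (same values as above)
class1 <- c(rep(1, 2), rep(5, 23), rep(9, 2))
class2 <- c(rep(1, 13), 5, rep(9, 13))
class3 <- rep(1:9, times = 3)

# Hypothetical helper: sample standard deviation, dividing by n - 1 (what sd() does)
sample_sd <- function(x) sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Hypothetical helper: population standard deviation, dividing by n
population_sd <- function(x) sqrt(sum((x - mean(x))^2) / length(x))

classes <- list(classI = class1, classII = class2, classIII = class3)

# Compare the built-in sd() with both hand-written versions
comparison <- data.frame(
  builtin    = sapply(classes, sd),
  sample     = sapply(classes, sample_sd),
  population = sapply(classes, population_sd)
)
print(comparison)
```

The builtin and sample columns should agree, while the population column is slightly smaller for every class because it divides by 27 rather than 26.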
