Tuesday, June 17, 2008

Learning Statistics using R: Rule Of Standard Deviation

Most of the material here is taken from the course website!
The following explains the Standard Deviation Rule, also known as the Empirical Rule. The rule applies only to data whose distribution is approximately normal (symmetric and mound-shaped).
  • Approximately 68% of the observations fall within 1 standard deviation of the mean.
  • Approximately 95% of the observations fall within 2 standard deviations of the mean.
  • Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
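These three percentages come from the normal distribution itself. As a quick check (a sketch, not part of the course material), R's `pnorm` function gives the area under the standard normal curve within k standard deviations of the mean. The names `k` and `within_k` are introduced here only for illustration.

```r
# Area under the standard normal curve within k SDs of the mean, for k = 1, 2, 3.
# pnorm() is the cumulative distribution function of the normal distribution.
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)

# Express the areas as percentages, rounded to one decimal place.
round(within_k * 100, 1)
# [1] 68.3 95.4 99.7
```

The rule's 68%, 95%, and 99.7% are rounded versions of these values.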
This rule gives more insight into what the standard deviation tells us. The following picture, taken from the course website, illustrates the rule:
[Image from the course website: illustration of the Standard Deviation Rule]
Let's understand this with an example. The following data are the heights of 50 males. Let's use R to find the five-number summary of these data and to confirm that the distribution is normal (mound-shaped).
# We use the 'c' function to populate the male vector, on which we will carry out our operations.
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)
> hist(male)
In the code sample above, the 'hist' function draws a histogram that is roughly mound-shaped, close to normal. Here is the image that R draws for us:
[Image: histogram of male drawn by R]
Let's find the five-number summary and check whether the Standard Deviation Rule holds for this data set.
# The summary command gives the five-number summary, plus the mean
> summary(male)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  64.00   68.25   70.50   70.58   72.00   77.00
> sd(male)
[1] 2.857786
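The numbers above can be cross-checked by hand. The sketch below assumes the same `male` vector. `quantile` returns just the five-number summary (without the mean), and the standard deviation is rebuilt from its definition: the square root of the sum of squared deviations from the mean, divided by n - 1. The names `five_num`, `deviations`, and `manual_sd` are illustrative.

```r
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69,
          70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72,
          72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
five_num <- quantile(male)
five_num

# Sample standard deviation computed from its definition.
deviations <- male - mean(male)
manual_sd <- sqrt(sum(deviations^2) / (length(male) - 1))

# Compare with R's built-in sd(); all.equal() allows for rounding error.
all.equal(manual_sd, sd(male))
```

The quartiles printed by `quantile` should agree with the ones reported by `summary`.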
# Let's apply the first rule: about 68% of data points are within (mean - 1 * SD) and (mean + 1 * SD)
> male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# The command above returns a logical vector that is TRUE at each position of the male vector where our condition holds.
# Let's count how many such observations there are.
> length(male[male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))])
[1] 34
# So 34 of the 50 observations are within mean +/- 1 SD, i.e.
> 34/50 * 100
[1] 68
# So, as the rule suggests, about 68% of the observations are within mean +/- 1 SD.
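Because R treats TRUE as 1 and FALSE as 0, the same count can be obtained more directly by summing the logical vector, and its mean gives the proportion. A minimal sketch, assuming the same `male` vector; `lower`, `upper`, and `inside` are names chosen here for illustration.

```r
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69,
          70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72,
          72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)

# Bounds of the interval mean +/- 1 SD, computed once and reused.
lower <- mean(male) - sd(male)
upper <- mean(male) + sd(male)

# Logical vector: TRUE where an observation falls inside the interval.
inside <- male >= lower & male <= upper

sum(inside)         # number of observations within 1 SD
# [1] 34
mean(inside) * 100  # percentage of observations within 1 SD
# [1] 68
```

Storing the bounds in variables also avoids recomputing the mean and the SD inside every comparison.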

# Let's check the second rule: about 95% of data points are within (mean - 2 * SD) and (mean + 2 * SD)
> length(male[male >= (mean(male) - (2 * sd(male))) & male <= (mean(male) + (2 * sd(male)))])
[1] 48
> 48/50 * 100
[1] 96
# So 96% of data points are within mean +/- 2 SD, close to the 95% the rule predicts.

# Let's check the third rule: about 99.7% of data points are within (mean - 3 * SD) and (mean + 3 * SD)
> length(male[male >= (mean(male) - (3 * sd(male))) & male <= (mean(male) + (3 * sd(male)))])
[1] 50
> 50/50*100
[1] 100
# All 50 data points (100%) are within mean +/- 3 SD, consistent with the rule's 99.7% (virtually all).
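The three checks follow the same pattern, so they can be folded into one small helper. This is only a sketch, assuming the `male` vector defined earlier; `pct_within` is a hypothetical function name, not something from the course or from base R.

```r
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69,
          70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72,
          72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)

# Hypothetical helper: percentage of values in x within k SDs of the mean of x.
pct_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x)) * 100
}

# Observed percentages for k = 1, 2, 3, next to what the rule predicts.
observed <- sapply(1:3, function(k) pct_within(male, k))
data.frame(k = 1:3, rule = c(68, 95, 99.7), observed = observed)
```

For this data set the observed percentages are 68, 96, and 100, matching the counts found above.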
The following table, taken from the course website, makes this clearer:
[Table from the course website]
Summary:
  • The standard deviation measures the spread by reporting a typical (average) distance between the data points and their average.
  • It is appropriate to use the SD as a measure of spread with the mean as the measure of center.
  • Since the mean and standard deviation are highly influenced by extreme observations, they should be used as numerical descriptions of the center and spread only for distributions that are roughly symmetric and have no outliers.
  • For symmetric mound-shaped distributions, the Standard Deviation Rule tells us what percentage of the observations falls within 1, 2, and 3 standard deviations of the mean, and thus provides another way to interpret the standard deviation's value for distributions of this type.
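To see why the summary warns about extreme observations, one can add a single made-up outlier to the data and compare the results. The value 100 below is hypothetical, chosen only for illustration, and `male_out` and `compare` are illustrative names.

```r
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69,
          70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72,
          72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)

# Hypothetical extreme observation, not part of the original data.
male_out <- c(male, 100)

# Mean and SD next to the more resistant median and IQR, before and after.
compare <- rbind(
  original     = c(mean = mean(male), sd = sd(male),
                   median = median(male), IQR = IQR(male)),
  with_outlier = c(mean = mean(male_out), sd = sd(male_out),
                   median = median(male_out), IQR = IQR(male_out))
)
round(compare, 2)
```

In this sketch the one added value pushes the SD from roughly 2.9 to roughly 5.0, while the median only moves from 70.5 to 71.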
