Wednesday, June 4, 2008

Learning Statistics using R: Standard Deviation

Most of the material here is taken from the course website!
Earlier we examined two measures of spread: the range (max - min) and the IQR (the range covered by the middle 50% of the data). We also noted that the IQR should be used when the median is used as the measure of center. Now we move on to another measure of spread, the standard deviation.

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from the mean of the distribution. The standard deviation gives the average (or typical) distance between an observation (data point) and the mean, X-bar.

Let's understand the standard deviation through an example, calculating it step by step with R commands. (There is also a single R function that does the same!)
Assume we have the following data set of 8 observations:
7, 9, 5, 13, 3, 11, 15, 9
  • Calculate mean:
    We use R's 'c' function to combine the observations into a vector, and then use the mean function to calculate the mean of the data set.
    > dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)
    > is.vector(dataset)
    [1] TRUE
    # we have stored our observations in a vector called dataset
    > mean(dataset)
    [1] 9
    # so 9 is the mean of our data set; now we need to find the
    # distance of each observation from this value: 9.
    > deviation <- dataset - 9
    > deviation
    [1] -2  0 -4  4 -6  2  6  0
    # the subtraction is applied to every element of the vector, so the
    # command above gives the deviation of each observation from the mean;
    # we have stored these in another vector called deviation.
    
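    Since the standard deviation is meant to be an average (typical) distance between the data points and their mean, it would seem natural to simply average the deviations we just calculated. Note, however, that the sum of the deviations from the mean is always 0, because the positive and negative deviations cancel out, so their average is always 0 as well and tells us nothing about spread.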

  • Square the deviations:
    So we square each deviation, which makes every value non-negative; the squares are averaged in the next step. The following R code computes the squares:
    # we can use either of the following two methods to calculate the
    # squared deviations.
    > deviation ^ 2
    [1]  4  0 16 16 36  4 36  0
    
    > deviation * deviation
    [1]  4  0 16 16 36  4 36  0
    
  • Average the squared deviations by adding them up and dividing by n-1 (one less than the sample size). Let's do that in R:
    > (sum(deviation ^ 2)) / (length(dataset) - 1)
    [1] 16
    
    This average of the squared deviations is called the variance of the data.

  • Find standard deviation:
    The standard deviation of the data is the square root of the variance. So in our case it is the square root of 16.
    > sqrt(16)
    [1] 4
    
    Why do we take the square root? Note that 16 is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in squared units of the original data, which are hard to interpret. We therefore take the square root to compensate for having squared the deviations and to get back to the original units of measurement.
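
The steps above can be collected into one short script and checked against R's built-in var and sd functions. This is only a sketch of the same calculation; the names manual_var and manual_sd are made up here for illustration.

```r
# Same eight observations as in the walkthrough above.
dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)

# Deviations from the mean always sum to 0, which is why we cannot
# simply average them.
deviation <- dataset - mean(dataset)
sum(deviation)                 # 0

# Manual variance (divide by n - 1) and standard deviation.
# manual_var and manual_sd are illustrative names, not R built-ins.
manual_var <- sum(deviation^2) / (length(dataset) - 1)
manual_sd  <- sqrt(manual_var)

# The built-in functions give the same results.
var(dataset)                   # 16
sd(dataset)                    # 4
```

Both var and sd divide by n-1, so they agree with the step-by-step calculation.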

Properties of the Standard Deviation:
  1. It should be clear from the discussion thus far that the standard deviation should be paired as a measure of spread with the mean as a measure of center.
  2. Mathematically, the standard deviation is 0 only when all the observations have the same value. In that case not only is the standard deviation 0, but the range and the IQR are 0 as well.
  3. Like the mean, the SD is strongly influenced by outliers in the data. Consider our last example: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation were wrongly recorded as 150, the mean would jump from 9 up to 25.9, and the standard deviation would jump from 4 up to 50.3. This simple example makes it easy to see that while the standard deviation is strongly influenced by outliers, the IQR is not! In both cases the IQR is the same since, like the median, the calculation of the quartiles depends only on the order of the data rather than on the actual values.
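
The outlier example in property 3 can be checked quickly in R. This is a small sketch; the vector names original and with_error are made up for illustration, and IQR here uses R's default quartile method.

```r
# Ordered data from the example, and the same data with the largest
# value wrongly recorded as 150 (vector names are illustrative).
original   <- c(3, 5, 7, 9, 9, 11, 13, 15)
with_error <- c(3, 5, 7, 9, 9, 11, 13, 150)

# The mean and the SD react strongly to the bad value.
round(mean(with_error), 1)     # 25.9 (was 9)
round(sd(with_error), 1)       # 50.3 (was 4)

# The median and the IQR do not change at all.
median(original)               # 9
median(with_error)             # 9
IQR(original)                  # 5
IQR(with_error)                # 5
```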

Choosing Numerical Summaries
  • Use the mean and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers.
  • Use the five-number summary (which gives the median, IQR and range) for all other cases.
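
As a rough illustration of the two kinds of summary, here is how they might be computed in R for our example data set. Note that fivenum reports "hinges", which can differ slightly from the quartiles that IQR uses by default.

```r
dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)

# Mean and SD: suitable for reasonably symmetric data with no outliers.
mean(dataset)                  # 9
sd(dataset)                    # 4

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
fivenum(dataset)               # 3 6 9 12 15

# Median, IQR and range computed individually.
median(dataset)                # 9
IQR(dataset)                   # 5
diff(range(dataset))           # 12 (max - min)
```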
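
R function for Standard Deviation:
There is a single R function, "sd", that calculates the standard deviation of a data set. Just be careful to use the "na.rm=TRUE" argument if you have NA values in your data; otherwise the result is NA. In the R version current when this was written, sd also returned a vector of column SDs (columns, not rows) when given a data frame or matrix. That behavior has since been removed from R, so on current versions compute column SDs explicitly, for example with sapply(df, sd) for a data frame or apply(m, 2, sd) for a matrix.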
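
A short sketch of sd with missing values and with several columns follows. The vector with_missing and the data frame scores are made-up examples, and sapply is used for the column-wise case because it works regardless of how sd itself treats a data frame in your R version.

```r
# A vector with a missing value (illustrative data).
with_missing <- c(7, 9, NA, 13)

sd(with_missing)                 # NA, because of the missing value
sd(with_missing, na.rm = TRUE)   # SD of 7, 9 and 13 only

# Column-wise SDs of a small, made-up data frame.
scores <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
col_sds <- sapply(scores, sd)    # named vector: a = 1, b = 2
col_sds
```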
