Wednesday, March 3, 2010

Why numerical measures? Mean and Median

As the previous example showed, a graphical representation alone is not enough to describe a quantitative variable: from a graph we can only get a rough estimate of the center and spread. A description of the distribution of a quantitative variable must include, in addition to the graphical display, a more precise numerical description of the center and spread of the distribution. So we will learn:
  • how to quantify the center and spread of a distribution with various numerical measures.
  • a few important properties of these numerical measures.
  • how to choose the appropriate numerical measures of center and spread to supplement the histogram/graphical representation.

Measures of Center (1 of 2):
The two most important measures of the center of a distribution are the mean and the median. They take completely different approaches to describing the center of a distribution.
  • Mean/Arithmetic mean: the sum of the observations (values) divided by the number (count) of observations. If X1, X2, X3, ..., Xn are the n observations, then their mean X̄ (x bar) is:

    X̄ = (X1 + X2 + X3 + ... + Xn)/ n

    Let's take an example using random values. We use R's runif function to generate 10 random values (uniformly distributed between 0 and 1 by default) and then compute their mean:
    
    # runif(10) generates 10 random observations/values.
    > observation <- runif(10) 
    > print(observation) 
     [1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
     [8] 0.4568466 0.5952122 0.4650702
    > mean(observation)
    [1] 0.4565523
    > sum(observation)/10.00
    [1] 0.4565523
    

    If we take our example of the Best Actress Oscar winners from 1970 to 2001, the ages at which the actresses won the Oscar are:
    34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

    Their sum: 34 + 34 + 26 + 37 + 42 + ... + 35 + 33 = 1233
    and their count (total number of observations) is 32. Hence the mean age for this dataset is:
    X̄ = 1233/32 = 38.53125 ≈ 38.5

  • Median: M is the center/midpoint of the distribution. M is a number such that half of the observations fall above it and half fall below it. To find the median:
    • Order (sort) the data from smallest to largest.
    • Consider whether n, the number of observations, is even or odd.
      • If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
      • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
    The course website provides a visualization that explains this better. Now let's again take our example of the Best Actress Oscar winners from 1970 to 2001. The ages are:
    34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

    We paste these values (ages) into a simple R vector:
    > age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
    
    # First check the mean, to confirm that we have copied all the values.
    > mean(age)
    [1] 38.53125
    # Good. In R we can order/sort this dataset with one command.
    > sort(age)
     [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
    [26] 43 45 49 49 61 74 80
    > length(age)
    [1] 32
    # n = 32 is even, so the median is the mean of the 16th and 17th sorted observations, that is, (35+35)/2.
    > (sort(age)[16] + sort(age)[17])/2
    [1] 35
    # The 16th observation is 35 and the 17th is also 35. Let's cross-check using R's median function.
    > median(age)
    [1] 35
    > median(sort(age))
    [1] 35
    # So R's median function correctly returns 35 even when the dataset is not sorted. But if we do
    > (age[16]+age[17])/2
    [1] 41
    # then the answer is incorrect, because the age vector is not sorted.
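
The median-finding steps listed earlier (sort, then look at whether n is odd or even) can also be written as a small function. This is only a minimal sketch: the name median_by_hand is made up for illustration and is not part of R. In practice, R's built-in median function should be preferred.

```r
# Hypothetical helper (not part of base R): follows the manual steps for the median.
median_by_hand <- function(x) {
  s <- sort(x)      # order the data from smallest to largest
  n <- length(s)    # number of observations
  if (n %% 2 == 1) {
    # n is odd: the median is the observation in the (n+1)/2 spot
    s[(n + 1) / 2]
  } else {
    # n is even: the median is the mean of the n/2 and n/2 + 1 spots
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

# Ages of the Best Actress Oscar winners, 1970 to 2001 (same data as above).
age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
         21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)

median_by_hand(age)            # even n (32): 35, the same as median(age)
median_by_hand(c(7, 1, 5))     # odd n (3): sorted values are 1 5 7, so 5
```

Because the function sorts its input first, it avoids the mistake shown above of indexing into the unsorted vector.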
    
