Monday, March 15, 2010

Measure of Center: Mean Vs. Median

So now we know that there are two common numerical measures of the center of a distribution: the mean and the median.
The mean describes the center as an average value, where the actual values of the data points play an important role since we are summing values and then dividing by count (number of values). The median, on the other hand, locates the middle value as the center, and the order of the data is the key to finding it (here we are sorting and only counting).

Let's understand the difference between the two with an example.
Assume we have the following two data sets:
Data set A -> 64 65 66 68 70 71 73
Data set B -> 64 65 66 68 70 71 730
Note that only the last value differs between the two sets, i.e. 73 in data set A becomes 730 in data set B.

For data set A, the mean is 68.1 and the median is 68. In data set B, the observation 730 is far larger than every other value and is certainly an outlier. The median of data set B is still 68, but the mean is pulled up by the high outlier to 162. The message that we should take from this example is: the mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers.
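The numbers in this example can be checked in R with the built-in mean and median functions. This is only a minimal sketch; the variable names a and b are illustrative and not part of the original example.

```r
# The two data sets from the example; only the last value differs.
a <- c(64, 65, 66, 68, 70, 71, 73)
b <- c(64, 65, 66, 68, 70, 71, 730)

mean_a   <- mean(a)    # 477 / 7, about 68.1
median_a <- median(a)  # 68
mean_b   <- mean(b)    # 1134 / 7 = 162
median_b <- median(b)  # still 68: the outlier does not move the middle value

print(c(mean_a = mean_a, median_a = median_a,
        mean_b = mean_b, median_b = median_b))
```

Changing one value by a factor of ten more than doubles the mean, while the median does not move at all.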

Types of distributions and the mean and median:
Let's see what happens to the mean and median for our 3 basic distribution shapes.
  • Symmetric distributions with no outliers: Mean (X̄) is approximately equal to median (M).
  • Skewed right distributions and/or datasets with high outliers: Mean (X̄) is typically greater than median (M). (X̄ > M)
  • Skewed left distributions and/or datasets with low outliers: Mean (X̄) is typically less than median (M). (X̄ < M)
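
To make these three cases concrete, here is a small sketch in R. The three data sets are made up purely for illustration (they are not from any real data): one symmetric, one with a long right tail, and the mirror image of that one, which has a long left tail.

```r
# Hypothetical data sets, invented for illustration only.
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)
skewed_right <- c(1, 2, 2, 3, 3, 3, 4, 10, 25)          # long tail of high values
skewed_left  <- max(skewed_right) + 1 - skewed_right     # mirror image: long tail of low values

# Return the mean and median of a vector side by side.
center <- function(x) c(mean = mean(x), median = median(x))

print(center(symmetric))     # mean equals median
print(center(skewed_right))  # mean is above the median
print(center(skewed_left))   # mean is below the median
```

In each skewed case the mean is dragged toward the tail, while the median stays with the bulk of the data.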


Let's Summarize
  • The two main numerical measures for the center of a distribution are the mean (X̄) and the median (M). The mean is the average of the values, while the median is the middle value.
  • The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers.
  • The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.

Wednesday, March 3, 2010

Why numerical measures? Mean and Median

As we saw in the previous example, a graphical representation of a quantitative variable is not enough on its own. From a graph we can only get a rough estimate of the center and spread. A description of the distribution of a quantitative variable must include, in addition to the graphical display, a more precise numerical description of the center and spread of the distribution. So we learn:
  • how to quantify the center and spread of a distribution with various numerical measures.
  • a few important properties of numerical measures.
  • how to choose the appropriate numerical measures of center and spread to supplement the histogram/graphical representation.

Measure of Center (1 of 2):
The two most important measures of the center of a distribution are the mean and the median. They take completely different approaches to describing the center of a distribution.
  • Mean/Arithmetic mean: the sum of the observations (values) divided by the number (count) of observations. If X1, X2, X3, ..., Xn are the n observations, then their mean X̄ (x bar) is:

    X̄ = (X1 + X2 + X3 + ... + Xn)/ n

    Let's take an example using random values. We use R's runif function to generate 10 random values (uniform between 0 and 1 by default) and then compute their mean:
    
    # runif(10) generates 10 random observations/values.
    > observation <- runif(10) 
    > print(observation) 
     [1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
     [8] 0.4568466 0.5952122 0.4650702
    > mean(observation)
    [1] 0.4565523
    > sum(observation)/10.00
    [1] 0.4565523
    

    If we take our example of Best Actress Oscar winners from 1970 to 2001, the ages at which the actresses won the Oscar are:
    34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

    Their sum is 34+34+26+37+42+...+35+33 = 1233,
    and their count (total number of observations) is 32. Hence the mean age of this data set is:
    X̄ = 1233/32 ≈ 38.5

  • Median: M is the center/midpoint of the distribution. M is a number such that half of the observations fall above it and half fall below it. To find the median:
    • Order the data from smallest to largest. (sort).
    • Consider whether n, the number of observations, is even or odd.
      • If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
      • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
    This is better explained using a visualization provided on the course website.
    Now let's again take our example of Best Actress Oscar winners from 1970 to 2001. The ages are:
    34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

    We paste these values (ages) into a simple R vector:
    > age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
    
    # First check the mean, to confirm that we have copied all the values.
    > mean(age)
    [1] 38.53125
    # Good. In R we can order/sort this data set with one command.
    > sort(age)
     [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
    [26] 43 45 49 49 61 74 80
    > length(age)
    [1] 32
    # n = 32 is even, so the median is the mean of the 16th and 17th sorted observations.
    > (sort(age)[16] + sort(age)[17])/2
    [1] 35
    # The 16th and 17th sorted observations are both 35. Let's cross-check using R's median function.
    > median(age)
    [1] 35
    > median(sort(age))
    [1] 35
    # R's median function correctly returns 35 even when the data set is not sorted. But if we index the unsorted vector:
    > (age[16]+age[17])/2
    [1] 41
    # then the answer is incorrect, since the age vector itself is not sorted.
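
The steps for finding the median (sort, then pick the center observation if n is odd, or average the two center observations if n is even) can also be written as a small R function. This is only a sketch for illustration: manual_median is a hypothetical helper name, and in practice R's built-in median function already does this work.

```r
# Hypothetical helper: compute the median by hand, following the odd/even rule.
manual_median <- function(x) {
  s <- sort(x)       # order the data from smallest to largest
  n <- length(s)     # number of observations
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # n odd: the single center observation
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # n even: mean of the two center observations
  }
}

# Ages of the Best Actress Oscar winners, 1970 to 2001 (n = 32, even).
age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
         21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)

even_case <- manual_median(age)
odd_case  <- manual_median(c(64, 65, 66, 68, 70, 71, 730))  # n = 7, odd

print(even_case)
print(odd_case)
```

Because the function sorts its input first, it avoids the mistake shown above of indexing the unsorted vector.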