~Krishna Dagli: Why numerical measures? Mean and Median

As we showed in the previous example, a graphical representation of a quantitative variable is not enough on its own. From a graph we can only get a rough estimate of the center and spread. A description of the distribution of a quantitative variable must include, in addition to the graphical display, a more precise numerical description of the center and spread of the distribution. So we will learn:

- how to quantify the center and spread of a distribution with various numerical measures;
- a few important properties of these numerical measures;
- how to choose the appropriate numerical measures of center and spread to supplement the histogram or other graphical representation.

Measures of Center (1 of 2):
The two most important measures of the center of a distribution are the mean and the median. They take completely different approaches to describing the center of a distribution.

Mean (arithmetic mean): the sum of the observations (values) divided by the number (count) of observations. If X1, X2, X3, ..., Xn are n observations, then their mean X̄ (x bar) is:

X̄ = (X1 + X2 + X3 + ... + Xn)/ n

Let's take an example using random values. We use R's runif function to generate 10 random values and then compute their mean:
```
# runif(10) generates 10 random observations (values) between 0 and 1.
> observation <- runif(10) 
> print(observation) 
 [1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
 [8] 0.4568466 0.5952122 0.4650702
> mean(observation)
[1] 0.4565523
> sum(observation)/10.00
[1] 0.4565523
```
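Because runif draws new values on every call, the numbers above will differ from run to run. The following is a minimal sketch, assuming we fix the random seed with set.seed (the seed value 42 and the names n, manual_mean and builtin_mean are arbitrary choices for illustration, not part of the original example). It also replaces the hard-coded 10 with length(), so the check keeps working if the sample size changes:

```r
set.seed(42)                         # arbitrary seed, chosen only for illustration
observation <- runif(10)             # 10 values drawn uniformly between 0 and 1
n <- length(observation)             # count the observations instead of hard-coding 10
manual_mean <- sum(observation) / n  # the formula: sum divided by count
builtin_mean <- mean(observation)    # R's built-in mean
# all.equal compares the two numbers up to a small floating-point tolerance
print(isTRUE(all.equal(manual_mean, builtin_mean)))
```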
If we take our example of Best Actress Oscar winners from 1970 to 2001, the ages at which the actresses won the Oscar are:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

Their sum: 34 + 34 + 26 + 37 + 42 + ... + 35 + 33 = 1233,
and their count (total number of observations) is 32. Hence the mean age of this dataset is:
X̄ = 1233/32 ≈ 38.5

Median: M is the center (midpoint) of the distribution. M is a number such that half of the observations fall above it and half fall below it. To find the median:
- Order the data from smallest to largest. (sort).
- Consider whether n, the number of observations, is even or odd.
  - If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
  - If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
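
The steps above can be sketched as a small R function. This is only an illustration: manual_median, odd_sample and even_sample are hypothetical names introduced here, and the sketch assumes a non-empty numeric vector. In practice R's built-in median function does the same job.

```r
# Hypothetical helper (not from the original post): median by the odd/even rule.
manual_median <- function(x) {
  ordered <- sort(x)      # step 1: order the data from smallest to largest
  n <- length(ordered)    # step 2: count the observations
  if (n %% 2 == 1) {
    # n is odd: take the observation in spot (n + 1) / 2
    ordered[(n + 1) / 2]
  } else {
    # n is even: average the observations in spots n / 2 and n / 2 + 1
    (ordered[n / 2] + ordered[n / 2 + 1]) / 2
  }
}

odd_sample <- c(7, 1, 5)       # sorted: 1 5 7, center observation is 5
even_sample <- c(7, 1, 5, 3)   # sorted: 1 3 5 7, center pair is 3 and 5

print(manual_median(odd_sample))    # 5
print(manual_median(even_sample))   # (3 + 5) / 2 = 4
```

Sorting inside the function means the caller does not have to remember to sort first, which is the easiest step to forget when indexing by hand.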
These steps are illustrated by a visualization provided on the course website. Now let's again take our example of Best Actress Oscar winners from 1970 to 2001. The ages are:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

We paste these values (ages) into a simple R vector:
```
> age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)

# First check the mean, to confirm that we have copied all the values.
> mean(age)
[1] 38.53125
# Good. In R we can sort this dataset with one command.
> sort(age)
 [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
[26] 43 45 49 49 61 74 80
> length(age)
[1] 32
# n = 32 is even, so the median is the mean of the 16th and 17th sorted observations.
> (sort(age)[16] + sort(age)[17])/2
[1] 35
# The 16th and 17th sorted observations are both 35. Let's cross-check using R's median function.
> median(age)
[1] 35
> median(sort(age))
[1] 35
# So R's median function correctly returns 35 even when the dataset is not sorted. But if we do
> (age[16] + age[17])/2
[1] 41
# then the answer is incorrect, because the age vector itself is not sorted.
```

~Krishna Dagli

Wednesday, March 3, 2010
