- how to quantify the center and spread of a distribution with various numerical measures.
- a few important properties of numerical measures.
- how to choose the appropriate numerical measures of center and spread to supplement the histogram/graphical representation.
Measure of Center (1 of 2):
The two most important measures of the center of a distribution are the mean and the median. They take completely different approaches to describing the center of a distribution.
- Mean/Arithmetic mean: the sum of the
observations (values) divided by the number (count) of observations. If
X1, X2, X3, ..., Xn are n observations, then their mean
X̄ (x bar) is:
X̄ = (X1 + X2 + X3 + ... + Xn)/ n
Let's take an example using random values. We use R's runif function to generate 10 random values and then compute their mean:
# runif(10) generates 10 random observations/values (uniform between 0 and 1).
> observation <- runif(10)
> print(observation)
 [1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
 [8] 0.4568466 0.5952122 0.4650702
> mean(observation)
[1] 0.4565523
> sum(observation)/10.00
[1] 0.4565523
If we take our example of the Best Actress Oscar winners from 1970 to 2001, the ages at which the actresses won the Oscar are:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
Their sum: 34 + 34 + 26 + 37 + 42 + ... + 35 + 33 = 1233
and their count (total number of observations) is 32. Hence the mean age in this dataset is:
X̄ = 1233/32 = 38.53125 ≈ 38.5
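The arithmetic above can be checked in R. This is only a small sketch: the variable names total and n are chosen here for illustration, and the ages are copied from the list above.

```r
# Ages of the Best Actress Oscar winners, 1970 to 2001, copied from the list above.
age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
         21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)

total <- sum(age)     # sum of all observations: 1233
n <- length(age)      # number of observations: 32

total / n             # mean computed by hand: 38.53125
mean(age)             # R's built-in mean gives the same value
```

If the sum or the count differs from 1233 or 32, a value was most likely dropped or mistyped while copying.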
- Median: M is the center/midpoint of the
distribution. M is a number such that half of the observations
fall above it and half fall below. To find the median:
- Order (sort) the data from smallest to largest.
- Consider whether n, the number of observations, is even or
odd.
- If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
- If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
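The steps above can be sketched as a small R function. The name manual_median and the two sample vectors are made up for this illustration; in practice R's built-in median function does the same job.

```r
# Hypothetical helper (not a built-in): median computed by the steps above.
manual_median <- function(x) {
  ordered <- sort(x)        # order the data from smallest to largest
  n <- length(ordered)      # n, the number of observations
  if (n %% 2 == 1) {
    # n is odd: the center observation sits in the (n+1)/2 spot
    ordered[(n + 1) / 2]
  } else {
    # n is even: mean of the observations in the n/2 and n/2 + 1 spots
    (ordered[n / 2] + ordered[n / 2 + 1]) / 2
  }
}

odd_sample  <- c(7, 1, 5)       # n = 3, sorted: 1 5 7
even_sample <- c(7, 1, 5, 3)    # n = 4, sorted: 1 3 5 7

manual_median(odd_sample)       # 5, the 2nd sorted value
manual_median(even_sample)      # 4, the mean of 3 and 5
```

Because the function sorts its input first, the order in which the values are typed does not matter.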
Now let's again take our example of the Best Actress Oscar winners from 1970 to 2001. The ages are:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
We paste these values (ages) into a simple R vector:
> age = c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
# First check the mean so we know that we have copied all the values.
> mean(age)
[1] 38.53125
# Good. In R we can order/sort this dataset with one command.
> sort(age)
 [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
[26] 43 45 49 49 61 74 80
> length(age)
[1] 32
# n = 32 is even, so the median is the mean of the 16th and 17th sorted observations, that is (35+35)/2.
> (sort(age)[16] + sort(age)[17])/2
[1] 35
# The 16th observation is 35 and the 17th is also 35. Let's cross-check using R's median function.
> median(age)
[1] 35
> median(sort(age))
[1] 35
# So R's median function correctly returns 35 even when the dataset is not sorted. But if we do
> (age[16]+age[17])/2
[1] 41
# then the answer is incorrect, since the age vector is not sorted.