- How to quantify the center and spread of a distribution with various numerical measures.
- A few important properties of these numerical measures.
- How to choose the appropriate numerical measures of center and spread to supplement the histogram or other graphical representation.
Measure of Center (1 of 2):
The two most important measures of the center of a distribution are the mean and the median. They take completely different approaches to describing where the center of a distribution lies.
- Mean (arithmetic mean): the sum of the
observations (values) divided by the number (count) of observations. If
X1, X2, X3, ..., Xn are the n observations, then their mean
X̄ (x bar) is:
X̄ = (X1 + X2 + X3 + ... + Xn) / n
Let's take an example using random values. We use R's runif function to generate 10 random values and then compute their mean:
# runif(10) generates 10 random observations (uniform on [0, 1] by default).
> observation <- runif(10)
> print(observation)
 [1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
 [8] 0.4568466 0.5952122 0.4650702
> mean(observation)
[1] 0.4565523
> sum(observation)/10.00
[1] 0.4565523
Returning to our example of Best Actress Oscar winners from 1970 to 2001, these are the ages at which the actresses won:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
Their sum: 34 + 34 + 26 + 37 + 42 + ... + 35 + 33 = 1233
and their count (total number of observations) is 32. Hence the mean age of this dataset is:
X̄ = 1233/32 ≈ 38.5
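The arithmetic above can be checked in R. This is only a small sketch: the names oscar_ages, total and n are illustrative choices, and the vector simply holds the 32 ages listed above.

```r
# Ages of the Best Actress Oscar winners, 1970 to 2001 (the values listed above).
oscar_ages <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
                21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)

total <- sum(oscar_ages)     # sum of all observations: 1233
n     <- length(oscar_ages)  # number of observations: 32

total / n         # mean computed from the definition: 38.53125
mean(oscar_ages)  # R's built-in mean returns the same value
```

The unrounded result is 38.53125, which rounds to the 38.5 quoted above.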
- Median: M is the center (midpoint) of the
distribution. M is the number such that half of the observations
fall above it and half fall below. To find the median:
- Order the data from smallest to largest (sort).
- Consider whether n, the number of observations, is even or
odd.
- If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
- If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
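The steps above can be sketched as a short R function. The name manual_median is a hypothetical helper introduced here for illustration, not part of base R, and it assumes a non-empty numeric vector with no missing values. In practice R's built-in median function does the same job.

```r
# Hypothetical helper (not part of base R): the median computed by hand.
# Assumes x is a non-empty numeric vector without missing values.
manual_median <- function(x) {
  ordered <- sort(x)      # order the data from smallest to largest
  n <- length(ordered)    # number of observations
  if (n %% 2 == 1) {
    # n is odd: the single center observation, in the (n + 1) / 2 spot
    ordered[(n + 1) / 2]
  } else {
    # n is even: the mean of the two center observations,
    # in the n / 2 and n / 2 + 1 spots
    (ordered[n / 2] + ordered[n / 2 + 1]) / 2
  }
}

manual_median(c(7, 1, 5))     # odd n:  sorted 1 5 7, center value is 5
manual_median(c(7, 1, 5, 3))  # even n: sorted 1 3 5 7, (3 + 5) / 2 = 4
```

Sorting inside the function is what makes the result independent of the order in which the observations were entered.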
As an example, take the Best Actress ages again:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
We paste these values (ages) into a simple R vector:
> age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
# First check the mean, to confirm that all values were copied correctly.
> mean(age)
[1] 38.53125
# Good. In R we can sort this dataset with one command.
> sort(age)
 [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
[26] 43 45 49 49 61 74 80
> length(age)
[1] 32
# n = 32 is even, so the median is the mean of the 16th and 17th sorted observations.
> (sort(age)[16] + sort(age)[17])/2
[1] 35
# The 16th and 17th sorted observations are both 35, giving (35 + 35)/2 = 35. Let's cross-check with R's median function.
> median(age)
[1] 35
> median(sort(age))
[1] 35
# So R's median function correctly returns 35 even when the dataset is not sorted. But if we index the unsorted vector directly:
> (age[16]+age[17])/2
[1] 41
# the answer is incorrect, because age is not sorted.