Sunday, April 27, 2008

Learning Statistics using R: Numerical Measure

    Numerical Measures: We proceed to numerical measures of the distribution of a quantitative variable. From this point onward I am including the notes from the course itself. None of the material is mine other than the errors! The distribution of a quantitative variable is described by its shape, center, and spread. With a histogram we can describe the shape of the distribution, but we can only get a rough estimate of the center and spread. So along with the graphical display we need a more precise numerical description of the center and spread of the distribution.

    In this section we will learn:
    • how to quantify the center and spread of a distribution with various numerical measures,
    • some of the properties of these numerical measures, and
    • how to choose the appropriate numerical measures of center and spread to supplement the histogram.

    1. Measure of Center: The two main numerical measures for the center of a distribution are the mean and the median. Each one of these measures is based on a completely different idea of describing the center of a distribution. We will first present each one of the measures, and then compare their properties.

      1. Mean: The mean is the sum of the observations divided by the number of observations. If the n observations are X1, X2, ..., Xn, their mean, which we denote by X̄ (read "X-bar"), is therefore: X̄ = (X1 + X2 + ... + Xn)/n.

        Example: Best Actress Oscar Winners: We continue with our Best Actress Oscar Winners dataset.
        # read the actress.csv file into an actress data frame. [Bug-Fix: Gabriele Righetti]
        > actress <- read.csv("actress.csv", header=TRUE, sep=",")
        # with the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
        > attach(actress)
        # a single summary command can give us all the details, but we use separate commands to learn a few more R commands.
        > mean(Age)
        [1] 38.53125
        As the example above shows, "mean" is an R command that returns the average of the observations (a measure of center).
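
To tie the "mean" command back to the definition, here is a small sketch that computes the mean by hand. The vector x is made up for illustration and is not part of the Oscar data; manual_mean is just a name chosen for this example.

```r
# x is a hypothetical vector, used only for illustration
x <- c(2, 4, 9)

# sum of the observations divided by the number of observations
manual_mean <- sum(x) / length(x)
manual_mean

# the built-in command should return the same value
mean(x)
```

Both lines should print the same number, since "mean" carries out exactly the sum-divided-by-n calculation from the definition.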

      2. Median: The median M is the midpoint of the distribution. It is the number such that half of the observations fall above and half fall below. To find the median:
        • Order the data from smallest to largest.
        • Consider whether n, the number of observations, is even or odd.
          • If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
          • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
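
The steps above can be sketched as a small R function. manual_median is a hypothetical helper written only to mirror the odd/even rule; in practice the built-in "median" command does the same job. The two example vectors are made up.

```r
# manual_median is a hypothetical helper that mirrors the steps listed above
manual_median <- function(x) {
  x <- sort(x)      # order the data from smallest to largest
  n <- length(x)    # number of observations
  if (n %% 2 == 1) {
    # n is odd: take the observation in the (n+1)/2 spot
    x[(n + 1) / 2]
  } else {
    # n is even: take the mean of the observations in the n/2 and n/2 + 1 spots
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_example  <- c(3, 1, 2)      # made-up data, n = 3
even_example <- c(4, 1, 3, 2)   # made-up data, n = 4

manual_median(odd_example)      # middle value of 1, 2, 3
manual_median(even_example)     # mean of the two middle values of 1, 2, 3, 4
```

Comparing the results with median(odd_example) and median(even_example) is a quick way to check that the hand-written rule agrees with R.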

          Finding the median using the Best Actress dataset:
          # the data is already in the actress data frame; attach it if that has not been done in this session.
          > attach(actress)
          > median(Age)
          [1] 35
          As the code above shows, we can use R's "median" command to find the median of the distribution.

          Example: Finding the median. Here are the numbers of hours that 9 students spend on the computer on a typical day:
          1, 6, 7, 5, 5, 8, 11, 12, 15
          # store the numbers of hours spent in a vector called hours.
          > hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)
          > median(hours)
          [1] 7
          > mean(hours)
          [1] 7.777778
          # as we have 9 observations in total, the median is the (n+1)/2 = 5th observation in the sorted data, which is 7.
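
The position rule for this example can be checked directly by sorting the vector and picking out the middle spot. The names sorted_hours, n and middle are chosen here for illustration only.

```r
hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)

sorted_hours <- sort(hours)   # 1 5 5 6 7 8 11 12 15
n <- length(hours)            # 9 observations, so n is odd
middle <- (n + 1) / 2         # the 5th spot in the ordered list

sorted_hours[middle]          # 7, the same value that median(hours) returns
```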
