Wednesday, April 23, 2008

Learning Statistics using R: One Quantitative Variable

This section of the course teaches how to explore data collected from one quantitative variable and summarize the important features of its distribution. It starts with graphs and then moves on to numerical measures of the distribution and the five-number summary. As the course suggests, we can display data from one quantitative variable graphically using a histogram, a stemplot, or a boxplot. We will try to work through the course examples using R.

  1. Histogram: For a histogram we break the range of values into intervals and count how many observations fall into each interval. The first example illustrates a histogram using the exam grades of 15 students. In the example the bins (intervals) are 10 points wide; each bin includes its lower endpoint and excludes its upper endpoint, except the last bin, which includes both.

    Here are details:

    Exam grades of 15 students: 88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73. The chosen bins (intervals), with the count of observations in each:

    [40-50) 1
    [50-60) 2
    [60-70) 4
    [70-80) 5
    [80-90) 2
    [90-100] 1
    As shown above, each interval is closed at one end and open at the other (only the last interval is closed at both ends). Let's try to plot the histogram using R.
    >grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
    # The command above uses R's 'c' (combine) function to create a vector of grades.
    >hist(grades, right=FALSE)
    # This one simple command draws the following histogram.

    Note that if you omit the "right=FALSE" argument from the "hist" command above, the observation with value 60 will be counted in the (50, 60] interval. The R help page says the "right" argument is logical; if TRUE, the histogram cells are right-closed (left-open) intervals. In our command we used "right=FALSE", which means the intervals are of the form [a, b), which is what we wanted.
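
To check the effect of "right" on the counts without drawing anything, here is a minimal sketch. It reuses the "grades" vector and passes explicit break points, which is my assumption (the command above lets R choose them). With "plot=FALSE", "hist" returns the counts instead of drawing the plot.

```r
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# Explicit break points 40, 50, ..., 100 (assumed here; the post lets R pick them).
edges <- seq(40, 100, by = 10)

# Left-closed bins [a, b): the grade 60 falls into [60, 70).
left_closed <- hist(grades, breaks = edges, right = FALSE, plot = FALSE)$counts

# Right-closed bins (a, b], which is R's default: the grade 60 falls into (50, 60].
right_closed <- hist(grades, breaks = edges, right = TRUE, plot = FALSE)$counts

print(left_closed)   # 1 2 4 5 2 1
print(right_closed)  # 1 3 3 5 2 1
```

The first result matches the table of bins above; in the second, one observation has moved from the fourth-from-last bin into the one before it, which is exactly the grade 60.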

  2. More on histogram: The center of the distribution is its midpoint: the value that divides the distribution so that approximately half the observations take smaller values and approximately half take larger values.
    # we use our "grades" vector from above.
    >summary(grades)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    48.0    61.5    71.0    70.2    77.0    97.0
    As you can see in the example above, the "summary" command displays several important numbers. The output shows the center of the distribution (the median), along with the minimum, the maximum, and other values.
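
If you want the individual numbers rather than the printed table, here is a small sketch using base R functions on the same "grades" vector. The variable names are mine. Note that "fivenum" returns Tukey's five-number summary, whose hinges happen to agree with the quartiles for this data.

```r
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

center_median <- median(grades)   # middle value of the sorted grades
center_mean   <- mean(grades)     # arithmetic average
five_numbers  <- fivenum(grades)  # min, lower hinge, median, upper hinge, max

print(center_median)  # 71
print(center_mean)    # 70.2
print(five_numbers)   # 48.0 61.5 71.0 77.0 97.0
```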

  3. Best Actor Oscar Winners: Applying the histogram to actual data. The actual data set is here. As usual, we save the Excel file as a CSV (comma-separated) file. Below we show how to do interesting things with a few R commands.
    # read the actor.csv file into an actor data frame.
    >actor <- read.csv("actor.csv", header=TRUE, sep=",")
    # we run the summary command, just to see the five-number summary.
    >summary(actor)
    Min.    :31.00   
    1st Qu. :37.75   
    Median  :42.50   
    Mean    :44.72   
    3rd Qu. :48.75   
    Max.    :76.00 
    # with the following command we do not have to keep writing actor$Age to refer to the Age column of actor.
    >attach(actor)
    # as we have only one column, summary(Age) would also give the same output as above.
    As explained here, our minimum data point is 31 and our maximum is 76. We'll use a bin width of 5 and make bins from 30 to 80, so we have 10 bins in total.
    > hist(Age, breaks=10, xlim=c(30,80))
    In the command above we refer to the "Age" column directly. The "breaks" parameter can specify either the number of bins or the actual break points; a single number is only a suggestion, because R rounds the break points to "pretty" values. The "xlim" parameter defines the range of x values, with sensible defaults. The help page also mentions that "xlim" is not used to define the histogram breaks, only for plotting. Here is what the final histogram looks like:
    Play with the "xlim" and "ylim" options and see what happens to the histogram. In our example the frequency count on the Y-axis only goes up to 8; adjust it to 9.
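
To get exactly the ten bins of width 5 described above, one option is to pass the break points themselves. This is a sketch, not the command used in the post: the "edges" and "n_bins" names, the plot title, and the axis label are mine, and the plotting part only runs if actor.csv (with an "Age" column) is in the working directory.

```r
# Eleven bin edges 30, 35, ..., 80 define ten bins of width 5.
edges <- seq(30, 80, by = 5)
n_bins <- length(edges) - 1
print(n_bins)  # 10

# Only runs if actor.csv (with an "Age" column, as in the post) is available.
if (file.exists("actor.csv")) {
  actor <- read.csv("actor.csv", header = TRUE)
  # A vector of break points is used exactly as given; right = FALSE gives bins [a, b).
  hist(actor$Age, breaks = edges, right = FALSE, xlim = c(30, 80),
       main = "Age of Best Actor Oscar winners", xlab = "Age")
}
```

The break points must cover the whole range of the data (here 31 to 76), otherwise "hist" stops with an error.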

  4. Stemplot: The stemplot, also called a stem-and-leaf plot, is another graphical display of the distribution of quantitative data. As the course suggests, to draw a stemplot we separate each data point into a stem and a leaf, where
    • Leaf - right-most digit.
    • Stem - anything but the right-most digit.

    So if the data point is 34, the leaf is 4 and the stem is 3; if the data point is 3.41, the leaf is 1 and the stem is 3.4.
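
The stem/leaf split can be sketched in a few lines of R. The helper "split_stem_leaf" is hypothetical (my own name), and it only illustrates the definition by cutting the printed form of the number; it is not how R's "stem" function works internally.

```r
# Hypothetical helper: split a number's printed form into stem and leaf.
split_stem_leaf <- function(x) {
  s <- as.character(x)           # e.g. 34 -> "34", 3.41 -> "3.41"
  n <- nchar(s)
  c(stem = substr(s, 1, n - 1),  # everything but the right-most digit
    leaf = substr(s, n, n))      # the right-most digit
}

print(split_stem_leaf(34))    # stem "3",   leaf "4"
print(split_stem_leaf(3.41))  # stem "3.4", leaf "1"
```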

  5. Best Actress Oscar Winners: Stemplot of actual data. We now use the Best Actress data set. The actual data set is here. It is also included at the end of the document.
    # read the actress.csv file into an actress data frame.
    >actress <- read.csv("actress.csv", header=TRUE, sep=",")
    # with the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
    >attach(actress)
    # as we are interested in the "Age" column of the data frame, we will work with it only, but check the summary
    >stem(Age, scale=2)
    As you can see, a simple "stem" command in R draws the stemplot. The "scale" argument to the function expands the scale of the plot. Here is what our stemplot looks like:
    When we use scale=2:
    The decimal point is 1 digit(s) to the right of the | 
     2 | 1 
     2 | 56669  
     3 | 013333444  
     3 | 555789
     4 | 11123  
     4 | 599  
     5 |  
     5 |  
     6 | 1 
     6 |  
     7 | 4  
     7 |  
     8 | 0 
    When we use scale=1
    The decimal point is 1 digit(s) to the right of the |  
    2 | 156669  
    3 | 013333444555789  
    4 | 11123599  
    5 |   
    6 | 1 
    7 | 4  
    8 | 0
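
If you do not have the CSV file at hand, the ages can be read back off the stemplot itself (stem = tens digit, leaf = ones digit). The sketch below does that, assuming the ages are whole numbers, and calls "stem" with both scales so you can compare the two layouts yourself.

```r
# Ages read off the scale=1 stemplot shown above (stem = tens, leaf = ones).
ages <- c(21, 25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43, 45, 49, 49,
          61, 74, 80)

print(length(ages))    # 32 observations

stem(ages)             # default scale = 1
stem(ages, scale = 2)  # scale = 2 makes the plot roughly twice as long
```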

    I do not know how to rotate the stemplot 90 degrees counterclockwise in R so that it visually resembles a histogram; if anybody knows, please let me know.


Navneet Nandan Jha said...

"Actor" data set is not present in the specified link.
Please have a look into it.
