Tuesday, April 29, 2008

Learning Statistics using R: Measure of Spread

Most of the material here is taken from the course website!
To describe a distribution, along with a measure of center we also need to know its spread, also known as the variability of the distribution. As the course describes, there are three commonly used measures of spread/variability, each describing spread differently:
  • Range
  • Inter-quartile range (IQR)
  • Standard deviation

  • Range:
  • Range is the simplest measure of spread: it is exactly the distance (difference) between the smallest data point (min) and the largest data point (max). Let's find the range of our Best Actress dataset:
    > actress <- read.csv("actress.csv", sep=",", header=TRUE)
    > attach(actress)
    > summary(Age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      21.00   32.50   35.00   38.53   41.25   80.00
    > range(Age)
    [1] 21 80
    > diff(range(Age))
    [1] 59
    
    Yes, the summary command gives us all these details, but we want to learn a few more R commands. As the above example shows, the "range" function returns the minimum and maximum values of the "Age" distribution. Subtracting the min from the max gives the number of years covered, which is what the "diff" command computes:
    80 (max) - 21 (min) = 59 (Range)
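
The same numbers can be reproduced without the CSV file. The sketch below assumes the 32 ages as transcribed from the scale=1 stemplot in the "One Quantitative Variable" post; the vector name "ages" is only an illustrative choice. It also computes the inter-quartile range from the list above, which should equal the difference between the two quartiles that summary reported.

```r
# Ages of the 32 Best Actress winners, transcribed from the stemplot in the
# "One Quantitative Variable" post (assumption: that stemplot is complete).
ages <- c(21, 25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43, 45, 49, 49,
          61, 74, 80)

age_range <- range(ages)             # smallest and largest value
age_span  <- diff(age_range)         # max - min
same_span <- max(ages) - min(ages)   # the same number, computed by hand
age_iqr   <- IQR(ages)               # 3rd quartile - 1st quartile

print(age_range)
print(age_span)
print(age_iqr)
```

Here the span is 59 years, while the IQR (41.25 - 32.50 = 8.75) covers only the middle half of the data and is therefore not affected by the single winner aged 80.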

Monday, April 28, 2008

Learning Statistics using R: Comparing Mean and Median

Mean and median are both measures of center, each describing the center in a different way. The mean is the average value of all observations, so the actual value of every observation affects it, while the median is the middle value in an ordered data set.

Let's understand this with a few simple examples:
  • Assume we have a dataset with these three values: 1, 2, 5. The median is 2 and the mean is (1+2+5)/3 = 8/3 ≈ 2.67.
  • If we change just the last observation from 5 to 50, the median is still 2 but the mean changes to (1+2+50)/3 ≈ 17.67.
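
A quick check of these two examples in R (the vector names are only illustrative):

```r
small <- c(1, 2, 5)
big   <- c(1, 2, 50)   # same data, last value replaced by an outlier

print(c(median = median(small), mean = mean(small)))
print(c(median = median(big),   mean = mean(big)))
```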


The course brings out the main point: "The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers."

So, as the course explains (writing X-bar for the mean and M for the median):
  • For symmetric distributions with no outliers: X-bar is approximately equal to M.
  • For skewed right distributions and/or datasets with high outliers: X-bar > M.
  • For skewed left distributions and/or datasets with low outliers: X-bar < M.
Hence the mean is an appropriate measure of center for symmetric distributions with no outliers, while the median is the better choice in all other cases.
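
For instance, the Best Actress ages contain a few high outliers (61, 74 and 80), so we would expect the mean to exceed the median. A minimal sketch, assuming the ages as transcribed from the stemplot in the "One Quantitative Variable" post (the names "ages" and "without_high" are illustrative):

```r
# Ages transcribed from the Best Actress stemplot (assumption: stemplot is complete)
ages <- c(21, 25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43, 45, 49, 49,
          61, 74, 80)

# With the high outliers: the mean is pulled above the median
print(c(mean = mean(ages), median = median(ages)))

# Drop the three highest ages and compare again
without_high <- ages[ages < 60]
print(c(mean = mean(without_high), median = median(without_high)))
```

Removing the three outliers moves the mean by several years but the median by only one, which is the resistance to outliers described above.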

Sunday, April 27, 2008

Learning Statistics using R: Numerical Measure

    Numerical Measures: We proceed to numerical measures of the distribution of a quantitative variable. From this point onward I am including the notes from the course itself. None of the material is mine other than the errors! The distribution of a quantitative variable is described by its shape, center, and spread. With a histogram we can describe the shape of the distribution, but we can only get a rough estimate of the center and spread. So along with the graphical display we need a more precise numerical description of the center and spread of the distribution.

    In this section we will learn:
    • how to quantify the center and spread of a distribution with various numerical measures.
    • some of the properties of these numerical measures, and
    • how to choose the appropriate numerical measures of center and spread to supplement the histogram.


    1. Measure of Center: The two main numerical measures for the center of a distribution are the mean and the median. Each one of these measures is based on a completely different idea of describing the center of a distribution. We will first present each one of the measures, and then compare their properties.

      1. Mean: The mean is the sum of the observations divided by the number of observations. If the n observations are X1, X2, ..., Xn, their mean, which we denote by X-bar, is therefore: X-bar = (X1 + X2 + ... + Xn)/n.

        Example: Best Actress Oscar Winners: We continue with our Best Actress Oscar Winners dataset.
        # read the actress.csv file into an actress data frame. [Bug-Fix: Gabriele Righetti]
        > actress <- read.csv("actress.csv", header=TRUE, sep=",")
        # after the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
        > attach(actress)
        # a single summary command would give us all the details, but we want to learn a few more R commands.
        > mean(Age)
        [1] 38.53125
        
        As can be seen from the above example, "mean" is an R function that returns the average of a distribution (a measure of center).
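
To connect the "mean" function with the definition (sum of the observations divided by their number), here is a small sketch. It assumes the 32 ages as transcribed from the stemplot in the "One Quantitative Variable" post, since actress.csv itself is not reproduced here; "ages", "n" and "x_bar" are illustrative names.

```r
# Ages transcribed from the Best Actress stemplot (assumption: stemplot is complete)
ages <- c(21, 25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43, 45, 49, 49,
          61, 74, 80)

n     <- length(ages)     # number of observations
total <- sum(ages)        # X1 + X2 + ... + Xn
x_bar <- total / n        # the mean by its definition

print(c(n = n, total = total, x_bar = x_bar))
print(mean(ages))         # R's built-in function gives the same value
```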

      2. Median: The median M is the midpoint of the distribution. It is the number such that half of the observations fall above and half fall below. To find the median:
        • Order the data from smallest to largest.
        • Consider whether n, the number of observations, is even or odd.
          • If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
          • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.

          Finding median using Best Actress data set:
          # we already have data read in the actress data frame.
          > attach(actress)
          > median(Age)
          [1] 35
          
          As seen in the above code, we can use R's "median" function to find the median value of a distribution.

          Example: Finding median. Here are the numbers of hours that 9 students spend on the computer on a typical day:
          1, 6, 7, 5, 5, 8, 11, 12, 15
          # store the numbers of hours spent in an hours vector.
          > hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)
          > median(hours)
          [1] 7
          > mean(hours)
          [1] 7.777778
          # with 9 observations, the median is the (n+1)/2 = 5th observation of the sorted data, i.e. 7.
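
The sorting rule for odd and even n described above can be sketched as a small function. "manual_median" is a hypothetical helper written for illustration, not part of R; in practice the built-in "median" does the same job.

```r
# Hypothetical helper: median by the sorting rule described above.
manual_median <- function(x) {
  s <- sort(x)                        # order the data from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the single center observation
  } else {
    mean(s[c(n / 2, n / 2 + 1)])      # even n: mean of the two center observations
  }
}

hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)

print(manual_median(hours))        # odd n = 9: the 5th sorted value
print(manual_median(hours[-9]))    # even n = 8, after dropping the last value (15)
```

With the value 15 dropped, the sorted data are 1, 5, 5, 6, 7, 8, 11, 12, so the median is the mean of the 4th and 5th values, 6 and 7.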
          

Wednesday, April 23, 2008

Learning Statistics using R : One Quantitative Variable:

In this section the course teaches how to explore data collected from a quantitative variable and summarize important features of its distribution. The section starts with graphs and then moves on to numerical measures of the distribution and the five-number summary. As the course suggests, to display data from one quantitative variable graphically we can use a histogram, a stemplot, or a boxplot. We will try to do the course examples using R.

  1. Histogram: To draw a histogram we break the range of values into intervals and count how many observations fall into each interval. The first example illustrates a histogram using the exam grades of 15 students. In the example the bins/intervals are 10 points wide, and each includes its lower end value while excluding its upper end value.

    Here are details:

    Exam grades of 15 students: 88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73. The bins (intervals) chosen, along with the count of observations in each:

    Score      Count
    [40-50)    1
    [50-60)    2
    [60-70)    4
    [70-80)    5
    [80-90)    2
    [90-100]   1
    As shown above, the intervals are closed at one end and open at the other (except the last one, which is closed at both ends). Let's try to plot the histogram using R.
    > grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
    # The above command uses R's 'c' (combine) function to create a vector of grades.
    > hist(grades, right=FALSE)
    # This one simple command draws the histogram.
    

    Note that if you omit the "right=FALSE" argument in the above "hist" command, the observation with value 60 will be counted in the (50, 60] interval. The R help page says the "right" argument is logical; if TRUE, the histogram cells are right-closed (left-open) intervals. In our command we used "right=FALSE", which means the intervals are of the form [a, b), which is what we wanted.
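
The effect of "right" is easy to see by asking "hist" for the counts instead of the picture. This is a small sketch: the explicit break points 40, 50, ..., 100 are my assumption of the bins in the table above, and "plot=FALSE" makes "hist" return its result without drawing.

```r
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
bins   <- seq(40, 100, by = 10)    # explicit break points: 40, 50, ..., 100

# plot = FALSE returns the histogram object without drawing it
left_closed  <- hist(grades, breaks = bins, right = FALSE, plot = FALSE)$counts
right_closed <- hist(grades, breaks = bins, right = TRUE,  plot = FALSE)$counts

print(left_closed)     # intervals of the form [a, b): matches the table above
print(right_closed)    # intervals of the form (a, b]: the grade 60 moves down one bin
```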

  2. More on histogram: The center of the distribution is its midpoint, the value that divides the distribution so that approximately half the observations take smaller values and approximately half the observations take larger values.
    # we use our "grades" vector from above.
    > summary(grades)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       48.0    61.5    71.0    70.2    77.0    97.0
    
    As you can see in the above example, the "summary" command displays several important numbers. The output shows the center of the distribution (median and mean) along with the min, the max, and the quartiles.
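
If only the five-number summary is needed, base R also has "fivenum" and "quantile". A short sketch on the same grades; note that "fivenum" reports Tukey's hinges rather than quartiles, which happen to agree for this data set but need not agree in general.

```r
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

five  <- fivenum(grades)     # min, lower hinge, median, upper hinge, max
quart <- quantile(grades)    # 0%, 25%, 50%, 75%, 100% quantiles

print(five)
print(quart)
print(mean(grades))          # the mean is not part of the five-number summary
```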

  3. Best Actor Oscar Winners: Applying a histogram to actual data. The actual data set is here. As usual we save the Excel file as a csv (comma separated) file. In the following text we show how to do interesting things with a few R commands.
    # read the actor.csv file into an actor data frame.
    > actor <- read.csv("actor.csv", header=TRUE, sep=",")
    # we run the summary command, just to see the five-number summary (plus the mean).
    > summary(actor)
          Age        
    Min.    :31.00   
    1st Qu. :37.75   
    Median  :42.50   
    Mean    :44.72   
    3rd Qu. :48.75   
    Max.    :76.00 
    # after the following command we do not have to keep writing actor$Age to refer to the Age column of actor.
    > attach(actor)
    # as we have only one column, summary(Age) would give the same numbers as above.
    
    As explained here, our minimum data point is 31 and our maximum is 76. We'll use a bin width of 5 and make bins from 30 to 80, so we have 10 bins in total.
    > hist(Age, breaks=10, xlim=c(30,80))
    
    In the above command we refer directly to the "Age" column. The "breaks" parameter can specify either the number of bins or the actual break points; when it is a single number, R treats it only as a suggestion and picks "pretty" break points near it. The "xlim" parameter defines the range of x values, with sensible defaults. The help page also mentions that "xlim" is not used to define the histogram breaks, only for plotting. Play with the "xlim" and "ylim" options and see what happens to the histogram. In our example the frequency count goes only up to 8 on the Y-axis; try adjusting it to 9.



  4. Stemplot: The stemplot, also called a stem-and-leaf plot, is another graphical display of the distribution of quantitative data. As the course suggests, to draw a stemplot we separate each data point into a stem and a leaf, where
    • Leaf - right-most digit.
    • Stem - anything but the right-most digit.

    So if the data point is 34 then the leaf is 4 and the stem is 3, but if the data point is 3.41 then the leaf is 1 and the stem is 3.4.
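
For whole numbers the split can be sketched with integer division and remainder. This is only an illustration of the idea (the variable names are made up, and "stem" itself does this work for us); a value such as 3.41 would first have to be scaled to a whole number.

```r
x <- c(34, 21, 80)

stem_part <- x %/% 10    # everything but the right-most digit
leaf_part <- x %% 10     # the right-most digit

print(data.frame(value = x, stem = stem_part, leaf = leaf_part))
```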

  5. Best Actress Oscar Winners: Stemplot of actual data. We now use the Best Actress data set. The actual data set is here. It is also included in the document at the end.
    # read the actress.csv file into an actress data frame.
    > actress <- read.csv("actress.csv", header=TRUE, sep=",")
    # after the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
    > attach(actress)
    # we are interested only in the "Age" column of the data frame, so we work with it alone (but do check the summary too).
    > stem(Age, scale=2)
    
    As you can see, a simple "stem" R command draws the stemplot. The "scale" argument to the function expands the scale of the plot. Here is what our stemplot looks like:
    When we use scale=2
    The decimal point is 1 digit(s) to the right of the | 
     2 | 1 
     2 | 56669  
     3 | 013333444  
     3 | 555789
     4 | 11123  
     4 | 599  
     5 |  
     5 |  
     6 | 1 
     6 |  
     7 | 4  
     7 |  
     8 | 0 
    
    When we use scale=1
    The decimal point is 1 digit(s) to the right of the |  
    2 | 156669  
    3 | 013333444555789  
    4 | 11123599  
    5 |   
    6 | 1 
    7 | 4  
    8 | 0
    

    Well, I do not know how to rotate a stemplot 90 degrees counterclockwise using R so that it visually resembles a histogram; if anybody knows, please let me know.

Tuesday, April 22, 2008

Learning Statistics using R : One Categorical Variable

In the course, the second activity teaches how to summarize collected data. The course also teaches how to draw pie charts and other basic charts using Excel. We will do the same using R. The data set used for this activity can be downloaded from here. The data set consists of the answers of 1200 US college students to the question: "With whom do you find it easiest to make friends: opposite sex, same sex or no difference?"

In this activity we will use the collected data to:
  1. Learn how to use R to tally our data into a table of counts and percents.
  2. Learn how to use R to produce a pie chart and other charts.
  1. Learn how to use R to tally our data into a table of counts and percents. We save the "friend.xls" file as a friend.csv file. We can use R's simple "summary" command to get a summary of our data set.
    # Bug-Fix: Gabriele Righetti from Italy.
    > friend <- read.csv("c:/friend.csv", header=TRUE)
    # just print the content of our data set/frame.
    > print(friend)
    # note: printing is the default action, so typing just "friend" does the same.
    # let's run the summary command and see what the output is.
    > summary(friend)
    # the following is the output of the command.
    Friends    
    No difference:602   
    Opposite sex :434   
    Same sex     :164
    
    From the above output we can see the counts that the data set contains. In our example, 164 students find it easiest to make friends with the same sex, 434 find it easiest with the opposite sex, and 602 students find that sex makes no difference when they make friends. Now let's use a few R commands to do more.
    # we have the data in the friend object/data.frame.
    > table(friend)
    friend
    No difference  Opposite sex      Same sex
              602           434           164
    
    As the above output shows, R's "table" command gives us something similar to the "summary" command, but "table" is more powerful and can also be used for cross-classifying data. Let's draw pie and bar charts.
    >pie(table(friend))
    >barplot(table(friend))
    
    These two simple commands draw a pie chart and a bar chart. Look at the help page of each to make the charts more interesting.

    Following are our pie and bar charts!
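
The activity also asks for percents, which "prop.table" can derive from the counts. The sketch below does not read the csv file; it rebuilds the 1200 answers from the counts printed by "summary(friend)" above, and the names "answers" and "easiest" are illustrative. As a side note, in R 4.0 or later "read.csv" keeps text columns as character by default, so "stringsAsFactors=TRUE" is needed for "summary" to print counts like those shown above.

```r
# Rebuild the 1200 answers from the counts reported by summary(friend)
answers <- rep(c("No difference", "Opposite sex", "Same sex"),
               times = c(602, 434, 164))
easiest <- factor(answers)

counts   <- table(easiest)                       # tally of each answer
percents <- round(100 * prop.table(counts), 1)   # share of each answer, in percent

print(counts)
print(percents)
```

The same "counts" table can be passed to "pie" and "barplot" exactly as in the commands above.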





Monday, April 21, 2008

Learning Statistics using R : Exploring Dataset

Wikipedia: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

For statistics, there is plenty of material available online, but I found Carnegie Mellon University's introduction to statistics easy to understand, so I decided to do the course examples in R.

We start with the first activity: exploring a dataset and classifying variables.
  1. Exploring a Dataset: As the course explains, variables can be classified into one of two types: quantitative or categorical. The first activity explains how to explore a data set and identify variable types using Microsoft Excel; we will try to do the same using R.

    Here are the objectives of the first activity:
    1. Learn how to open and examine a dataset using Excel.
    2. Practice classifying variables by their type: quantitative or categorical.
    3. Learn to change numeric codes to meaningful labels.

    For the following examples, most of the stuff is taken from Ajay Shah's R pages. The data set used in this activity can be downloaded from the course site.

    1. Learn how to open and examine a dataset using R: We save the data set as a CSV (comma delimited) file using Excel. Assuming that the file name is "depression.csv" and it is stored in "c:\", we can use the following code to read it using R.
      # read the csv file; notice the forward slash / (or use two backslashes).
      > depress <- read.table("c:/depression.csv", sep=",", header=TRUE)
      # check if we have the correct number of records.
      > nrow(depress)
      # check if we have the correct number of columns.
      > ncol(depress)
      # let's display the internal structure of our data set/data frame.
      > str(depress)
      # display the content of the data frame (the default action is to print the content).
      > depress
      You can even read an Excel file directly in R; here is an example, and go here for more information and dependency details on reading an Excel file.

    2. Practice classifying variables by their type: quantitative or categorical. In our last example, the "str" command, which is used for displaying the internal structure of an R data frame, also provides additional information about the variables. Let's check the output of the "str(depress)" command.
      >str(depress)
      'data.frame':   109 obs. of  7 variables: 
      $ Hospt  : int  1 1 1 1 1 1 1 1 1 1 ... 
      $ Treat  : Factor w/ 3 levels "Imipramine","Lithium",..: 2 1 1 2 2 3 2 3 3 3 ... 
      $ Outcome: Factor w/ 2 levels "No Recurrence",..: 2 1 1 2 1 2 1 2 1 2 ... 
      $ Time   : num   36.1 105.1  74.6  49.7  14.4 ... 
      $ AcuteT : int  211 176 191 206 63 70 55 512 162 306 ... 
      $ Age    : int  33 49 50 29 29 30 56 48 22 61 ... 
      $ Gender : int  1 1 1 2 1 2 1 1 2 2 ...
      
      From the output of the "str" command we can see that "Treat" and "Outcome" are factor variables. R uses factors to store categorical variables. But we also know from examining the data set that "Hospt" and "Gender" are categorical variables too, even though R read them as integers. We can use the following R command to confirm that a variable is a factor.
      # Bug-Fix: Gabriele Righetti 
      > is.factor(depress$Treat)
      [1] TRUE 
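
One caveat: the output above comes from an older version of R. Since R 4.0.0, "read.table" and "read.csv" keep text columns as character vectors by default, so "Treat" would be shown as chr and "is.factor" would return FALSE unless "stringsAsFactors=TRUE" is passed. A small sketch of the difference, using a temporary file as a stand-in for depression.csv; it holds only the first five "Treat", "Age" and "Gender" values visible in the "str" output.

```r
# Stand-in for depression.csv: first five rows of three columns, as shown by str()
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Treat,Age,Gender",
             "Lithium,33,1",
             "Imipramine,49,1",
             "Imipramine,50,1",
             "Lithium,29,2",
             "Lithium,29,1"), csv_path)

as_text   <- read.table(csv_path, sep = ",", header = TRUE, stringsAsFactors = FALSE)
as_factor <- read.table(csv_path, sep = ",", header = TRUE, stringsAsFactors = TRUE)

print(is.factor(as_text$Treat))      # FALSE: the column is character
print(is.factor(as_factor$Treat))    # TRUE: the column is a factor
print(is.factor(as_factor$Gender))   # FALSE: numeric codes are not converted automatically
```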
    3. Learn to change numeric codes to meaningful labels: Let's try to convert the "Gender" variable to a factor. To do this in R we can use the "factor" function. In our data set 1 indicates Female and 2 indicates Male. Let's change those numbers to meaningful category labels.
      >depress$Gender <- factor(depress$Gender, labels = c("Female", "Male"))
      >depress$Gender
      
      In the above command the first label, Female, corresponds to value 1 and the second label, Male, is assigned to value 2. We do not have to do a search and replace; R does that for us.
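
The same conversion can be tried without the data file. This sketch uses only the first ten "Gender" codes shown in the "str" output, and additionally spells out "levels" so that the code-to-label mapping is explicit rather than implied by the sorted order of the codes (that extra argument is my addition, not part of the original command).

```r
gender_codes <- c(1, 1, 1, 2, 1, 2, 1, 1, 2, 2)   # first ten values from str(depress)

# levels says which codes exist; labels names them in the same order
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Female", "Male"))

print(gender)
print(table(gender))    # counts per label
```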