Monday, August 25, 2008

Learning Statistics Using R: Role Type Classification: Case III (1 of 5)

Let's start with our next example: understanding the relationship between two quantitative variables, i.e. both the explanatory and the response variables are quantitative.

In the previous two cases we compared the distribution of the response variable across the values of the explanatory variable. To be specific, in Case I we compared the distribution of a quantitative response across a categorical explanatory variable, and in Case II we compared the distribution of a categorical response across a categorical explanatory variable.
Now both variables are quantitative; more importantly, the explanatory variable is quantitative. We will start exploring their relationship using a scatterplot.

We start with an example taken from the course site.
A Pennsylvania research firm conducted a study in which 30 drivers (of ages 18 to 82 years old) were sampled, and for each one the maximum distance at which he/she could read a newly designed sign was determined. The goal of this study was to explore the relationship between a driver's age and the maximum distance at which signs were legible, and then use the study's findings to improve safety for older drivers. (Reference: Utts and Heckard, Mind on Statistics (2002). Original source: Data collected by Last Resource, Inc, Bellefonte, PA.)

Since the purpose of this study is to explore the effect of age on maximum legibility distance,
  • the explanatory variable is Age, and
  • the response variable is Distance.
Here is what the raw data look like; the file is available here.
Note that the data structure is such that for each individual (in this case driver 1 ... driver 30) we have a pair of values (in this case representing the driver's age and distance). We can therefore think of this data as 30 pairs of values: (18,510), (32,410), (55,420), ..., (82,360).
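
A minimal sketch of this pair structure in R. It uses only the four pairs quoted above, not the full 30-driver data set, and the column names Age and Distance are assumptions chosen for illustration (the header of the actual data file is not shown here).

```r
# Only the four (age, distance) pairs quoted in the text; the full data set has 30.
# The column names Age and Distance are assumptions, not taken from the file.
drivers <- data.frame(
  Age      = c(18, 32, 55, 82),
  Distance = c(510, 410, 420, 360)
)

# Each row is one driver, i.e. one (Age, Distance) pair.
print(drivers)

# Number of pairs held in the data frame.
print(nrow(drivers))
```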

The first step in exploring the relationship between driver age and sign legibility distance is to create an appropriate and informative graphical display. The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot.

Here is how a scatterplot is constructed for our example:

To create a scatterplot, each pair of values is plotted so that the value of the explanatory variable (X) is on the horizontal axis and the value of the response variable (Y) is on the vertical axis. In other words, each individual (driver, in our example) appears on the scatterplot as a single point whose x-coordinate is the value of the explanatory variable for that individual and whose y-coordinate is the value of the response variable. The following images, taken from the course website, illustrate this.

As we have the data set, let's do the same using R.

# Read the data using the read.csv function; the separator here is a tab.
# Bug-Fix: Gabriele Righetti 
> signdist <- read.csv("signdistance.txt", sep="\t", header=TRUE)
> plot(signdist, type="b", main="Sign and Distance", col="blue", xlab="Driver Age", 
ylab="Sign Legibility Distance (Feet)")
The plot command above draws our scatterplot. Note that this scatterplot differs from the usual one: it has lines along with the points. That comes from the type="b" argument of plot, which draws both points and lines. See the help page of the plot command (?plot) for all available types.
Here is another version, using the type="p" argument (points only).
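
As a hedged sketch of the points-only version: type="b" joins the points in the order the rows appear in the file, so a conventional scatterplot is usually drawn with type="p". The example below uses only the four pairs quoted earlier in this post, and the column names Age and Distance are assumptions for illustration, not the names in signdistance.txt.

```r
# Four pairs quoted in the text; column names are assumed for illustration.
drivers <- data.frame(
  Age      = c(18, 32, 55, 82),
  Distance = c(510, 410, 420, 360)
)

# Formula interface: response ~ explanatory, so Age goes on the x-axis
# and Distance on the y-axis. type = "p" draws points only.
plot(Distance ~ Age, data = drivers, type = "p", pch = 19, col = "blue",
     main = "Sign Legibility Distance by Driver Age",
     xlab = "Driver Age",
     ylab = "Sign Legibility Distance (Feet)")
```

Writing the call as Distance ~ Age keeps the roles explicit, so the explanatory variable cannot end up on the wrong axis if the columns are reordered.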

Sunday, August 3, 2008

Learning Statistics Using R: Role Type Classification : Case II (2 of 2)

The following example from the course website provides an opportunity to analyze the relationship between two categorical variables using R.

Study: An Associated Press article captured the attention of readers with the headline "Night lights bad for kids?" The article was based on a 1999 study at the University of Pennsylvania and Children's Hospital of Philadelphia, in which parents were surveyed about the lighting conditions under which their children slept between birth and age 2 (lamp, night-light, or no light) and whether or not their children developed nearsightedness (myopia). The purpose of the study was to explore the effect of a young child's night-time exposure to light on later nearsightedness.

The actual Excel file can be downloaded from here. It can also be downloaded as a CSV file, along with other data sets, from here.

Let's try to do this analysis using R.
  1. Find the explanatory and response variables:
    • Light (No light/Night light/Lamp): this categorical variable is the explanatory variable.
    • Nearsightedness (Yes/No): this categorical variable is the response variable, since its value (yes/no) may depend on the value of the Light variable.

  2. Create a two-way summary table: The following steps show how to summarize the data in a two-way table using R. Note that there are multiple ways of creating a two-way table in R, but I have not been successful with the others.
    # Read data using the read.csv function.
    > nightlight <- read.csv("nightlight.csv", header=TRUE, sep=",")
    # Find row and column totals using the addmargins function.
    > addmargins(table(nightlight), FUN=sum, quiet=FALSE)
    Margins computed over dimensions
    in the following order:
    1: Light
    2: Nearsightedness
    Light          No Yes sum
      lamp         34  41  75
      night light 153  79 232
      no light    155  17 172
      sum         342 137 479
    As we have seen earlier, the "table" command is used to tabulate data, but we also need row and column totals; the "addmargins" function provides them. (I think this can also be achieved using a combination of apply, t, cut and a few other functions, but I have not managed it yet with my limited knowledge of R. Have a look at this discussion thread.)
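
One possible base R route to the same totals, offered as a sketch rather than the only way: since nightlight.csv is not reproduced here, the counts are typed in from the table above, and the object names (counts, with_rows, with_both) are made up for illustration.

```r
# Counts copied from the two-way table above (rows: Light, columns: Nearsightedness).
counts <- matrix(c( 34, 41,
                   153, 79,
                   155, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Light = c("lamp", "night light", "no light"),
                                 Nearsightedness = c("No", "Yes")))

# Append the row totals as a new column ...
with_rows <- cbind(counts, sum = rowSums(counts))

# ... then append the column totals (including the grand total) as a new row.
with_both <- rbind(with_rows, sum = colSums(with_rows))
print(with_both)
```

With the raw data frame loaded, the same two steps should apply to table(nightlight) in place of counts.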

  3. Find percentages: As noted earlier, the totals alone do not help; we need percentages to compare the distribution of the response variable across groups.
    Again, as I am still learning R, I could not do this percentage calculation with a combination of basic R commands, but I found a package (two, in fact) that does exactly what we want.
    • Using the "CrossTable" function from the "gregmisc" library:
      # Load the library
      > library("gregmisc")
      Loading required package: gdata
      Loading required package: gmodels
      Loading required package: gplots
      Loading required package: gtools
      # Read data using the read.csv function.
      > nightlight <- read.csv("nightlight.csv", header=TRUE, sep=",")
      > CrossTable(table(nightlight), prop.t=FALSE, prop.c=FALSE, prop.r=TRUE, prop.chisq=FALSE)
         Cell Contents
      |                       N |
      |           N / Row Total |
      Total Observations in Table:  479 
                   | Nearsightedness 
             Light |        No |       Yes | Row Total | 
              lamp |        34 |        41 |        75 | 
                   |     0.453 |     0.547 |     0.157 | 
       night light |       153 |        79 |       232 | 
                   |     0.659 |     0.341 |     0.484 | 
          no light |       155 |        17 |       172 | 
                   |     0.901 |     0.099 |     0.359 | 
      Column Total |       342 |       137 |       479 | 

      As we can see from the above output, the "CrossTable" function gives us what we want. (The "prop.r=TRUE" argument requests row proportions.)

    • Using the "Rcmdr" library:
      This library provides a nice GUI with which we can find row percentages and other details. Here is a screenshot of it.
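
The row proportions can also be computed without an extra package. The sketch below uses base R's prop.table with margin = 1 (divide each cell by its row total); the counts are typed in from the table above because the CSV file is not reproduced here, and the object names are made up for illustration.

```r
# Counts copied from the two-way table above (rows: Light, columns: Nearsightedness).
counts <- matrix(c( 34, 41,
                   153, 79,
                   155, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Light = c("lamp", "night light", "no light"),
                                 Nearsightedness = c("No", "Yes")))

# margin = 1 divides every cell by its row total, so each row sums to 1.
row_prop <- prop.table(counts, margin = 1)
print(round(row_prop, 3))

# The same values expressed as percentages.
print(round(100 * row_prop, 1))
```

With the raw data loaded, prop.table(table(nightlight), margin = 1) should give the same proportions.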

  4. Interpret these results: Let's try to analyze these results. We want to explore the effect of a young child's night-time exposure to light on later nearsightedness. From the above results, let's note a few things:
    • The results show that a proportion of 0.547 (or 54.7%) of the children who slept with a lamp developed nearsightedness. This proportion is higher than the corresponding proportions for night light and no light, which are 0.341 (or 34.1%) and 0.099 (or 9.9%) respectively.
    • That is, children who slept with a lamp were about 5.5 times as likely to develop nearsightedness as children who slept without any light (0.547 / 0.099 ≈ 5.5).
    • Note, though, that 9.9% of the children who slept without any light still developed nearsightedness.
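
The comparison in the bullets above can be checked with a few lines of R. This is only a sketch: the counts are copied from the two-way table, and the vector names (yes, total, p_yes, ratio) are made up for illustration.

```r
# "Yes" counts and row totals copied from the two-way table above.
yes   <- c(lamp = 41, "night light" = 79, "no light" = 17)
total <- c(lamp = 75, "night light" = 232, "no light" = 172)

# Proportion of children in each lighting group who developed nearsightedness.
p_yes <- yes / total
print(round(p_yes, 3))

# Each group's proportion relative to the "no light" group.
ratio <- p_yes / p_yes[["no light"]]
print(round(ratio, 2))
```

The ratio for the lamp group is where the "about 5.5 times" figure comes from.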