Sunday, August 3, 2008

Learning Statistics Using R: Role Type Classification: Case II (2 of 2)

The following example from the course website provides an opportunity to analyze the relationship between two categorical variables using R.

Study: An Associated Press article captured the attention of readers with the headline "Night lights bad for kids?" The article was based on a 1999 study at the University of Pennsylvania and Children's Hospital of Philadelphia, in which parents were surveyed about the lighting conditions under which their children slept between birth and age 2 (lamp, night-light, or no light) and whether or not their children developed nearsightedness (myopia). The purpose of the study was to explore the effect of a young child's night-time exposure to light on later nearsightedness.

The actual Excel file can be downloaded from here, or it can be downloaded as a CSV file along with other data sets from here.

Let's try to do this analysis using R.
  1. Identify the explanatory and response variables:
    • Light (No light/Night light/Lamp): this categorical variable is the explanatory variable.
    • Nearsightedness (Yes/No): this categorical variable is the response variable, since its value (yes/no) depends on the value of the Light variable.

  2. Create a two-way summary table: The following steps show how we summarize the data in a two-way table using R. Note that there are multiple ways of creating a two-way table in R, but I have not been successful with them.
    # Read the data using the read.csv function.
    > nightlight <- read.csv("nightlight.csv", header=TRUE, sep=",")
    
    # Find row and column totals using the addmargins function.
    > addmargins(table(nightlight), FUN=sum, quiet=FALSE)
    Margins computed over dimensions
    in the following order:
    1: Light
    2: Nearsightedness
                 Nearsightedness
    Light          No Yes sum
      lamp         34  41  75
      night light 153  79 232
      no light    155  17 172
      sum         342 137 479
    
    As we have seen earlier, the "table" command is used to tabulate data, but we also need row and column totals; these are available through the "addmargins" function. (I think this can also be achieved using a combination of apply, t, cut and a few other functions, but I was not able to do so due to my limited knowledge of R. Have a look at this discussion thread.)
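
    The totals can also be computed with apply, as guessed above. This is only a minimal sketch: it assumes the counts are typed in by hand from the table above instead of being read from nightlight.csv, and the names counts, tab and with_totals are made up for illustration.

```r
# Counts copied from the two-way table above (assumption: entered by hand,
# not read from nightlight.csv).
counts <- matrix(
  c( 34, 41,
    153, 79,
    155, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Light = c("lamp", "night light", "no light"),
    Nearsightedness = c("No", "Yes")
  )
)
tab <- as.table(counts)

# apply(tab, 1, sum) sums across each row; apply(tab, 2, sum) sums down each column.
row_totals <- apply(tab, 1, sum)
col_totals <- apply(tab, 2, sum)

# Attach the row totals as an extra column, then the column totals
# (plus the grand total) as an extra row.
with_rows   <- cbind(tab, sum = row_totals)
with_totals <- rbind(with_rows, sum = c(col_totals, sum(tab)))
print(with_totals)
```

    rowSums(tab) and colSums(tab) give the same totals as the two apply calls.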

  3. Find percentages: As noted earlier, having the totals alone does not help; we need percentages to compare the distribution of the response variable across groups.
    Again, as I am still learning R, I could not do this percentage calculation using a combination of R commands, but I found a package (rather, two) that does exactly what we want.
    • Using the "CrossTable" function from the "gregmisc" library (the function itself comes from gmodels, one of the packages that gregmisc loads):
      # Load the library.
      > library("gregmisc")
      Loading required package: gdata
      Loading required package: gmodels
      Loading required package: gplots
      Loading required package: gtools
      
      # Read the data using the read.csv function.
      > nightlight <- read.csv("nightlight.csv", header=TRUE, sep=",")
      > CrossTable(table(nightlight), prop.t=FALSE, prop.c=FALSE, prop.r=TRUE, prop.chisq=FALSE)
      
         Cell Contents
      |-------------------------|
      |                       N |
      |           N / Row Total |
      |-------------------------|
       
      Total Observations in Table:  479 
       
                   | Nearsightedness 
             Light |        No |       Yes | Row Total | 
      -------------|-----------|-----------|-----------|
              lamp |        34 |        41 |        75 | 
                   |     0.453 |     0.547 |     0.157 | 
      -------------|-----------|-----------|-----------|
       night light |       153 |        79 |       232 | 
                   |     0.659 |     0.341 |     0.484 | 
      -------------|-----------|-----------|-----------|
          no light |       155 |        17 |       172 | 
                   |     0.901 |     0.099 |     0.359 | 
      -------------|-----------|-----------|-----------|
      Column Total |       342 |       137 |       479 | 
      -------------|-----------|-----------|-----------|
      

      As we can see from the above output, the "CrossTable" function gives us what we want: "prop.r=TRUE" adds the row proportions under each count.

    • Using the "Rcmdr" library:
      This library provides a nice GUI through which we can find row percentages and other details. Here is a screenshot of it.
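
    The row proportions can also be obtained in base R without an extra package. This is a minimal sketch that assumes the counts are typed in by hand from the two-way table rather than read from nightlight.csv; the names counts and row_props are made up for illustration. prop.table with margin = 1 divides each cell by its row total.

```r
# Counts copied from the two-way table (assumption: entered by hand,
# not read from nightlight.csv).
counts <- matrix(
  c( 34, 41,
    153, 79,
    155, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Light = c("lamp", "night light", "no light"),
    Nearsightedness = c("No", "Yes")
  )
)

# margin = 1 means "within each row": every cell is divided by its row total,
# so each row of the result sums to 1.
row_props <- prop.table(as.table(counts), margin = 1)
print(round(row_props, 3))

# The same values expressed as percentages.
print(round(100 * row_props, 1))
```

    The rounded values should match the "N / Row Total" lines in the CrossTable output.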

  4. Interpret these results: Let's try to analyze these results. We want to explore the effect of a young child's night-time exposure to light on later nearsightedness. From the above results, let's note a few things:
    • The results show that 0.547 (or 54.7%) of the children who slept with a lamp developed nearsightedness. This proportion is higher than the proportions for night light and no light, which are 0.341 (or 34.1%) and 0.099 (or 9.9%) respectively.
    • That is, children who slept with a lamp were roughly 5.5 times as likely (0.547 / 0.099) to develop nearsightedness as children who slept without any light.
    • Even so, 9.9% of the children who slept without any light still developed nearsightedness.
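
    The comparison between groups can be checked with a little arithmetic in R. This is a small sketch using the counts from the table; the names p_yes and ratio are made up for illustration.

```r
# Proportion of children who developed nearsightedness in each lighting group,
# computed from the counts in the two-way table (Yes count / row total).
p_yes <- c("lamp" = 41 / 75, "night light" = 79 / 232, "no light" = 17 / 172)

# Ratio of each group's proportion to the "no light" proportion.
ratio <- p_yes / p_yes["no light"]
print(round(ratio, 2))
```

    The ratio comes out at about 5.5 for lamp and about 3.4 for night light, with no light as the baseline of 1.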
