Monday, April 21, 2008

Learning Statistics using R : Exploring Dataset

Wikipedia: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

There is plenty of statistics material available online, but I found Carnegie Mellon University's introduction to statistics easy to understand, so I decided to work through the course examples in R.

We start with the first activity: exploring a dataset and classifying variables.
  1. Exploring a Dataset: As the course explains, variables can be classified into one of two types: quantitative or categorical. The first activity explains how to explore a data set and identify variable types using Microsoft Excel; we will try to do the same using R.

    Here are the objectives of the first activity:
    1. Learn how to open and examine a dataset using Excel.
    2. Practice classifying variables by their type: quantitative or categorical.
    3. Learn to change numeric codes to meaningful labels.

    Most of the material in the following examples is taken from Ajay Shah's R pages. The data set used in this activity can be downloaded from the course site.

    1. Learn how to open and examine a dataset using R: We save the data set as a CSV (comma-delimited) file using Excel. Assuming the file is named "depression.csv" and stored in "c:\", we can use the following code to read it in R.
      # read the CSV file; note the forward slash / (or use two backslashes \\).
      # stringsAsFactors=TRUE keeps text columns as factors on R 4.0.0 and later,
      # where that is no longer the default.
      >depress <- read.table("c:/depression.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)
      # check that we have the correct number of records.
      >nrow(depress)
      # check that we have the correct number of columns.
      >ncol(depress)
      # let's display the internal structure of our data set/data frame.
      >str(depress)
      # display the content of the data frame - printing is the default action
      >depress
      You can even read an Excel file directly in R; here is an example, and go here for more information and dependency details on reading an Excel file.
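
As a self-contained sketch of the reading step above, the following writes a few made-up rows to a temporary CSV file and reads them back. The rows and the temporary file are assumptions for illustration only, not the course data. stringsAsFactors = TRUE is passed explicitly because R 4.0.0 and later no longer convert text columns to factors by default.

```r
# Made-up rows standing in for depression.csv (NOT the course data).
csv_lines <- c(
  "Hospt,Treat,Outcome,Time,Gender",
  "1,Lithium,Recurrence,36.1,1",
  "1,Imipramine,No Recurrence,105.1,1",
  "2,Placebo,Recurrence,74.6,2"
)

# Write the rows to a temporary file so the example runs anywhere.
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# read.csv wraps read.table with sep = "," and header = TRUE among its defaults.
toy <- read.csv(tmp, stringsAsFactors = TRUE)

nrow(toy)  # number of records
ncol(toy)  # number of columns
str(toy)   # internal structure of the data frame

# Remove the temporary file.
unlink(tmp)
```

Using a temporary file keeps the sketch independent of any particular drive or folder.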

    2. Practice classifying variables by their type: quantitative or categorical. In the last example, the "str" command that we used to display the internal structure of the R data frame also provides information about the variables. Let's check the output of the "str(depress)" command.
      >str(depress)
      'data.frame':   109 obs. of  7 variables: 
      $ Hospt  : int  1 1 1 1 1 1 1 1 1 1 ... 
      $ Treat  : Factor w/ 3 levels "Imipramine","Lithium",..: 2 1 1 2 2 3 2 3 3 3 ... 
      $ Outcome: Factor w/ 2 levels "No Recurrence",..: 2 1 1 2 1 2 1 2 1 2 ... 
      $ Time   : num   36.1 105.1  74.6  49.7  14.4 ... 
      $ AcuteT : int  211 176 191 206 63 70 55 512 162 306 ... 
      $ Age    : int  33 49 50 29 29 30 56 48 22 61 ... 
      $ Gender : int  1 1 1 2 1 2 1 1 2 2 ...
      
      From the output of the "str" command we can see that "Treat" and "Outcome" are factor variables. R uses factors to store categorical variables. But we also know from examining the data set that "Hospt" and "Gender" are categorical variables too, even though R read them as integers. We can use the following R command to confirm that a variable is stored as a factor.
      # Bug-Fix: Gabriele Righetti 
      >is.factor(depress$Treat)
      [1] TRUE 
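
To run the same check on every column at once, sapply can apply is.factor across the whole data frame. The sketch below uses a small made-up data frame (toy, with hypothetical values) in place of depress. It also shows why integer-coded variables such as "Hospt" and "Gender" need a manual look: R has no way to know that the codes stand for categories.

```r
# Made-up data frame standing in for depress (hypothetical values).
toy <- data.frame(
  Treat  = factor(c("Lithium", "Imipramine", "Placebo")),
  Time   = c(36.1, 105.1, 74.6),
  Gender = c(1L, 1L, 2L)
)

# TRUE for columns stored as factors, FALSE otherwise.
is_fac <- sapply(toy, is.factor)
print(is_fac)

# Gender is categorical in meaning but stored as integer codes,
# so is.factor() reports FALSE for it.
print(class(toy$Gender))
```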
    3. Learn to change numeric codes to meaningful labels: Let's convert the "Gender" variable to a factor using the "factor" function. In our data set, 1 indicates Female and 2 indicates Male. Let's change those numbers to meaningful category labels.
      >depress$Gender <- factor(depress$Gender, labels = c("Female", "Male"))
      >depress$Gender
      
      In the above command, the first label, Female, corresponds to value 1 and the second label, Male, is assigned to value 2. We do not have to search and replace; R does that for us.
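
A slightly more defensive variant, sketched here on a made-up vector rather than the course data, passes levels as well as labels. The code-to-label mapping is then explicit and does not depend on the sort order of the values. table() gives a quick check of the result.

```r
# Made-up gender codes (hypothetical values): 1 = Female, 2 = Male.
gender_codes <- c(1, 1, 1, 2, 1, 2)

# levels lists the codes, labels gives the name for each code in the same order.
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Female", "Male"))

print(gender)

# Count how many observations fall under each label.
print(table(gender))
```

With explicit levels, a code that never occurs in the data still gets its label, and an unexpected code becomes NA instead of silently shifting the labels.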

3 comments:

madanksuwal said...

hi
do i need to change the numeric data to factorial (the presence/ absence data) for ordination study
the matrix is too big
how to change the how matrix to factorial from numeric at once, not column by column
please help me (urgent)

email id: loginms@hotmail.com

Krishna Dagli said...

see ?as.factor

> a <- matrix(1:10, 2)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> as.factor(a)
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10

madanksuwal said...

i could not solve, it say :
> as.factor(bird.df)
Error in sort.list(unique.default(x), na.last = TRUE) :
'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
>
i don't understand what is this

thank you