There is plenty of statistics material available online, but I found Carnegie Mellon University's introduction to statistics easy to understand, so I decided to work through the course examples in R.
We start with the first activity: exploring a dataset and classifying variables.
- Exploring a Dataset: As the course explains, variables can be
classified into one of two types: quantitative or categorical.
The first activity explains how to explore a data set and identify
variable types using Microsoft Excel; we will try to do the same using R.
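To make the two variable types concrete, here is a minimal sketch on a tiny made-up data frame. The object name `toy` and its values are hypothetical and are not taken from the course data set. It uses the storage type R reports as a first hint: numeric columns are candidates for quantitative variables, and factor columns are categorical.

```r
# Hypothetical toy data, not the course data set.
toy <- data.frame(
  Age   = c(33, 49, 50, 29),                                          # quantitative
  Treat = factor(c("Lithium", "Imipramine", "Lithium", "Imipramine")) # categorical
)

# is.numeric() flags columns stored as numbers; is.factor() flags factors.
quantitative <- names(toy)[sapply(toy, is.numeric)]
categorical  <- names(toy)[sapply(toy, is.factor)]

print(quantitative)
print(categorical)
```

The storage type is only a hint. A categorical variable recorded as numeric codes would still be flagged as numeric here, so the final classification has to come from knowing what the variable means.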
Here are the objectives of the first activity:
- Learn how to open and examine a dataset using Excel.
- Practice classifying variables by their type: quantitative or categorical.
- Learn to change numeric codes to meaningful labels.
Most of the material in the following examples is taken from Ajay Shah's R pages. The data set used in this activity can be downloaded from the course site.
- Learn how to open and examine a dataset using R: We save the data set
as a CSV (comma-delimited) file using Excel. Assuming that the file is
named "depression.csv" and is stored in "c:\", we can use the following
code to read it into R.
# Read the CSV file. Note the forward slash /; alternatively, use two backslashes.
# In R 4.0.0 and later, stringsAsFactors = TRUE is needed to read text columns as factors.
> depress <- read.table("c:/depression.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)
# Check that we have the correct number of records.
> nrow(depress)
# Check that we have the correct number of columns.
> ncol(depress)
# Display the internal structure of our data set/data frame.
> str(depress)
# Display the content of the data frame; the default action is to print it.
> depress
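If you do not have the course file at hand, the same steps can be tried on a throwaway file. This is a minimal sketch: the rows below are made up, and `csv_path` and `sample_df` are hypothetical names. It uses `read.csv()`, a wrapper around `read.table()` that defaults to `sep = ","` and `header = TRUE`.

```r
# Hypothetical sample rows written to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Hospt,Treat,Age,Gender",
  "1,Lithium,33,1",
  "1,Imipramine,49,1",
  "2,Lithium,29,2"
), csv_path)

# stringsAsFactors = TRUE asks R to store text columns as factors.
sample_df <- read.csv(csv_path, stringsAsFactors = TRUE)

print(nrow(sample_df))     # number of records
print(ncol(sample_df))     # number of columns
print(head(sample_df, 2))  # first two rows instead of the whole data frame
```

For a large data set, `head()` is usually a more practical first look than printing the full data frame.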
You can even read an Excel file directly in R; the linked example and documentation give more information and list the dependencies needed for reading an Excel file.
- Practice classifying variables by their type: quantitative or categorical.
In the last example, the "str" command that we used to display the
internal structure of the R data frame also provides additional
information about the variables. Let's check the output of the
"str(depress)" command.
> str(depress)
'data.frame': 109 obs. of 7 variables:
 $ Hospt  : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Treat  : Factor w/ 3 levels "Imipramine","Lithium",..: 2 1 1 2 2 3 2 3 3 3 ...
 $ Outcome: Factor w/ 2 levels "No Recurrence",..: 2 1 1 2 1 2 1 2 1 2 ...
 $ Time   : num 36.1 105.1 74.6 49.7 14.4 ...
 $ AcuteT : int 211 176 191 206 63 70 55 512 162 306 ...
 $ Age    : int 33 49 50 29 29 30 56 48 22 61 ...
 $ Gender : int 1 1 1 2 1 2 1 1 2 2 ...
From the output of the "str" command we can see that "Treat" and "Outcome" are factor variables. R uses factors to store categorical variables. But we also know from examining the data set that "Hospt" and "Gender" are categorical variables too, even though R read them as integers. We can also use the following R command to confirm that a variable is stored as a factor.
# Bug-Fix: Gabriele Righetti
> is.factor(depress$Treat)
[1] TRUE
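The same check can be run across every column at once. The sketch below assumes a small hypothetical data frame that mimics a few columns of the depression data (the values are made up). It then converts the integer-coded columns to factors in a single step instead of one column at a time.

```r
# Hypothetical stand-in for a few columns of the depression data.
toy <- data.frame(
  Hospt  = c(1L, 1L, 2L),
  Treat  = factor(c("Lithium", "Imipramine", "Lithium")),
  Gender = c(1L, 2L, 1L)
)

# One TRUE/FALSE per column: which ones does R already store as factors?
is_factor <- sapply(toy, is.factor)
print(is_factor)

# Columns that are categorical by meaning but still stored as integers.
coded <- c("Hospt", "Gender")
toy[coded] <- lapply(toy[coded], factor)

# After the conversion every column is a factor.
print(sapply(toy, is.factor))
```

`lapply()` applies `factor()` to each selected column separately, which is why it works on a data frame where calling a single conversion on the whole object would not.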
- Learn to change numeric codes to meaningful labels: Let's convert
the "Gender" variable to a factor. To do this in R we can use the
"factor" function. In our data set, 1 indicates Female and 2 indicates
Male. Let's change those numbers to meaningful category labels.
> depress$Gender <- factor(depress$Gender, labels = c("Female", "Male"))
> depress$Gender
In the above command, the first label, Female, corresponds to value 1 and the second label, Male, is assigned to value 2. We do not have to do a search and replace; R does that for us.
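When only labels are given, factor() takes the sorted unique values as the levels, so the mapping depends on which codes actually occur in the data. A slightly more defensive variant states the levels explicitly. This is a minimal sketch with made-up codes; `gender_codes` and `gender` are hypothetical names.

```r
# Hypothetical codes in the same 1/2 scheme described above.
gender_codes <- c(1, 1, 2, 1, 2)

# Stating levels explicitly pins 1 to "Female" and 2 to "Male",
# even if one of the codes never appears in the data.
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Female", "Male"))

print(gender)
print(table(gender))  # counts per label
```

Using table() afterwards is a quick way to confirm that the counts per label match what you expect from the raw codes.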
3 comments:
hi
do i need to change the numeric data to factorial (the presence/ absence data) for ordination study
the matrix is too big
how to change the how matrix to factorial from numeric at once, not column by column
please help me (urgent)
email id: loginms@hotmail.com
see ?as.factor
> a <- matrix(1:10, 2)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> as.factor(a)
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10
i could not solve, it say :
> as.factor(bird.df)
Error in sort.list(unique.default(x), na.last = TRUE) :
'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
>
i don't understand what is this
thank you