Showing posts with label Course. Show all posts

Thursday, July 17, 2008

Learning Statistics Using R: Role Type Classification : Case II (1 of 2)

Case II of our role type classification covers the relationship between a categorical explanatory variable and a categorical response variable.

We start with an example from the course web site to explore the relationship between two categorical variables.
Example: In a survey, 1200 U.S. college students were asked about their body image: underweight, overweight, or about right. We want to answer the following questions:
If we had separated our sample of 1200 U.S. college students by gender and looked at males and females separately, would we have found a similar distribution across body-image categories?
More specifically, are men and women just as likely to think their weight is about right? Among those students who do not think their weight is about right, is there a difference between the genders in feelings about body image?

Answering these questions requires us to study the relationship between two categorical variables. Both the response and the explanatory variables are categorical, since we want to find how gender (male/female) affects body image (underweight, overweight, about right). In this study we have the following:
  • Gender (male/female): the explanatory variable, which is categorical.
  • Body-image (underweight, overweight, about right): the response variable, which is also categorical.

As I could not find the raw data for this example, we will directly use the results derived on the course site instead of reading raw data into R and computing the results ourselves.

To understand how body image is related to gender, we need an informative display that summarizes the data. In order to summarize the relationship between two categorical variables, we create a display called a two-way table.

Here is the two-way table for our example:

So our two-way table summarizes data of all 1200 students by gender and their body image as counts. The "Total" row or column is a summary of one of the two categorical variables, ignoring the other. In our example:
  • The Total row gives the summary of the categorical variable Body-image:
  • The Total column gives the summary of the categorical variable Gender:

Remember, though, that our primary goal is to explore how body image is related to gender. Exploring the relationship between two categorical variables (in this case Body-image and Gender) amounts to comparing the distributions of the response (in this case Body-image) across the different values of the explanatory (in this case males and females):
Note that it does not make sense to compare raw counts, because there are more females than males overall. For example, it is not very informative to say "there are 560 females who responded 'About Right' compared to only 295 males," since the 560 females are out of a total of 760, and the 295 males are only out of a total of 440. We need to supplement our display, the two-way table, with some numerical summaries that will allow us to compare the distributions. These numerical summaries are found by simply converting the counts to percents within (or restricted to) each value of the explanatory variable separately. In our example, we look at each gender separately and convert the counts to percents within that gender. Let's start with females:

Note that each count is converted to a percent by dividing by the total number of females, 760. These numerical summaries are called conditional percents, since we find them by conditioning on one of the genders.
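As a rough sketch of this calculation in R: the full two-way table is not reproduced in this post, so the block below uses only the two counts quoted above (560 of 760 females and 295 of 440 males answered 'About right') and lumps all other answers into an assumed "Other" column. The object names are made up for illustration.

```r
# Counts quoted in the text; "Other" is simply the row total minus "About right".
counts <- matrix(c(560, 760 - 560,
                   295, 440 - 295),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 BodyImage = c("About right", "Other")))

# Add the Total row and Total column of the two-way table.
addmargins(counts)

# Conditional (row) percents: each count divided by its own row total.
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

With margin = 1 the percents are computed within each row; margin = 2 would compute them within each column instead.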

Comments
  1. In our example, we chose to organize the data with the explanatory variable Gender in rows and the response variable Body-image in columns, and thus our conditional percents were row percents, calculated within each row separately. Similarly, if the explanatory variable happens to sit in columns and the response variable in rows, our conditional percents will be column percents, calculated within each column separately.
  2. Another way to visualize the conditional percents, instead of a table, is the double bar chart. This display is quite common in newspapers.

After looking at the numerical summary and graph, let's try to put the results in words:
  • The results suggest that the proportion of males who are happy with their body image ('About right') is slightly lower than the proportion among female students: 73.7% of female students (560 of 760) are happy with their body image, compared to only 67% of males (295 of 440).
  • Female students who are not happy with their body image mostly feel they are overweight: 21.4% feel they are overweight, compared to only 4.9% who feel underweight.
  • Male students who are not happy with their body image feel they are overweight about as often as they feel they are underweight: 16.6% feel they are overweight, while roughly the same share, 16.2%, feel they are underweight.

Wednesday, June 4, 2008

Learning Statistics using R: Standard Deviation

Most of the material here is taken from the course website!
Earlier we examined two measures of spread: the range (max - min) and the IQR (the range covered by the middle 50% of the data). We also noted that the IQR should be used when the median is used as the measure of center. Now we move to another measure of spread called the standard deviation.

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from the mean of the distribution. The standard deviation gives the average (or typical) distance between an observation (data point) and the mean, X-bar.

Let's understand the standard deviation using an example; we calculate it step by step using R commands. (There is a single R function to do the same!)
Assume we have the following data set of 8 observations:
7, 9, 5, 13, 3, 11, 15, 9
  • Calculate mean:
    We use R's 'c' function to combine them in a vector and then use mean function to calculate mean of the data set.
    > dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)
    > is.vector(dataset)
    [1] TRUE
    # we have stored our observation in a vector called dataset
    > mean(dataset)
    [1] 9
    # so we have 9 as the mean of our data set; now we need to find
    # distance of each observation from this value : 9.
    > deviation <- c (dataset - 9)
    > deviation
    [1] -2  0 -4  4 -6  2  6  0
    # above command directly gives us deviation of each observation from 
    # the mean; that we have stored in another vector called deviation.
    
    Thinking about the idea behind the standard deviation being an average (typical) distance between the data points and their mean, it would make sense to average the deviations we got. Note, however, that the sum of the deviations from the mean is always 0, because the positive and negative deviations cancel out, so a plain average would tell us nothing.

  • Square the deviations:
    So instead we square each of the deviations, which the following R code does (the averaging comes in the next step):
    # we can use either of following two methods to calculate square of
    # deviations.
    > deviation ^ 2
    [1]  4  0 16 16 36  4 36  0
    
    > deviation * deviation
    [1]  4  0 16 16 36  4 36  0
    
  • Average the squared deviations by adding them up and dividing by n-1 (one less than the sample size). Let's do that in R:
    > (sum(deviation ^ 2)) / (length(dataset) - 1)
    [1] 16
    
    This average of the squared deviations is called the variance of the data.

  • Find standard deviation:
    The standard deviation of the data is the square root of the variance. So in our case it would be square root of 16.
    >sqrt(16)
    [1] 4
    
    Why do we take the square root? Note that 16 is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in squared units, which cannot be interpreted directly. We therefore take the square root to compensate for the fact that we squared our deviations, and to go back to the original units of measurement.
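As a quick cross-check of the steps above, here is a small sketch that repeats the manual calculation and compares it with R's built-in var and sd functions. The names manual_var and manual_sd are made up for this illustration.

```r
dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)

# Deviations from the mean always sum to zero.
deviation <- dataset - mean(dataset)
sum(deviation)

# Step-by-step variance and standard deviation (dividing by n - 1).
manual_var <- sum(deviation^2) / (length(dataset) - 1)
manual_sd  <- sqrt(manual_var)

# Built-in equivalents; they should agree with the manual values.
var(dataset)
sd(dataset)
```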

Properties of the Standard Deviation:
  1. It should be clear from the discussion thus far that the standard deviation should be paired as a measure of spread with the mean as a measure of center.
  2. Note that the only way, mathematically, in which the standard deviation = 0, is when all the observations have the same value. Indeed in this case not only the standard deviation is 0, but also the range and the IQR are 0.
  3. Like the mean, the SD is strongly influenced by outliers in the data. Consider our last example: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation was wrongly recorded as 150, then the average would jump from 9 up to 25.9, and the standard deviation would jump from 4 up to 50.3. Note that in this simple example it is easy to see that while the standard deviation is strongly influenced by outliers, the IQR is not! In both cases, the IQR will be the same since, like the median, the calculation of the quartiles depends only on the order of the data rather than the actual values.
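The effect described in property 3 can be checked with a few lines of R. This is only a sketch; the vector names are made up, and the IQR values come from R's default quantile method.

```r
original       <- c(3, 5, 7, 9, 9, 11, 13, 15)
recorded_wrong <- c(3, 5, 7, 9, 9, 11, 13, 150)  # 15 mis-recorded as 150

# Mean and SD react strongly to the single wrong value.
round(mean(recorded_wrong), 1)
round(sd(recorded_wrong), 1)

# The IQR is unchanged, because only the order of the data matters.
IQR(original)
IQR(recorded_wrong)
```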

Choosing Numerical Summaries
  • Use mean and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers.
  • Use the five-number summary (which gives the median, IQR and range) for all other cases.
R function for Standard Deviation: There is a single R function, "sd", that calculates the standard deviation of a data set; just be careful to use the "na.rm=TRUE" argument if you have NA values in your data. When this post was written, "sd" returned a vector of column SDs when given a data frame or matrix (columns, not rows). Later versions of R dropped that behaviour, so for a data frame use something like sapply(df, sd) instead.
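A minimal sketch of both points, using made-up numbers (the vector x and the data frame df are purely illustrative):

```r
# Hypothetical measurements with one missing value.
x <- c(7, 9, NA, 13, 3)
sd(x)                 # NA, because of the missing value
sd(x, na.rm = TRUE)   # the NA is dropped before computing

# Column-wise SDs of a small made-up data frame.
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
col_sd <- sapply(df, sd)
col_sd
```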

Tuesday, May 13, 2008

Learning Statistics using R: Five Number Summary

Most of the material here is taken from the course website!
Before we go ahead and learn the graphical representation of the five-number summary, let's check out a few interesting course problems.
Here they are:
  1. Example 1: A survey taken of 140 sports fans asked the question: "What is the most you have ever spent for a ticket to a sporting event?"
    The five-number summary for the data collected is:
    min = 85, Q1 = 130, median = 145, Q3 = 150, max = 250
    Should the smallest observation be classified as an outlier?
  2. Example 2: A survey taken in a large statistics class contained the question: "What's the fastest you have driven a car (in mph)?"
    The five-number summary for the 87 males surveyed is:
    min = 55, Q1 = 95, median = 110, Q3 = 120, max = 155
    Should the largest observation in this data set be classified as an outlier?
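Both questions can be worked out with the 1.5 * IQR criterion. The following is only a sketch: the helper name fences is made up, and it simply applies Q1 - 1.5 * IQR and Q3 + 1.5 * IQR to the quartiles given in the two examples.

```r
# Hypothetical helper: fences of the 1.5 * IQR rule, computed from Q1 and Q3.
fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Example 1: ticket prices, smallest observation 85.
f1 <- fences(q1 = 130, q3 = 150)
f1
85 < f1["lower"]     # is the minimum below the lower fence?

# Example 2: fastest speed driven, largest observation 155.
f2 <- fences(q1 = 95, q3 = 120)
f2
155 > f2["upper"]    # is the maximum above the upper fence?
```

In Example 1 the lower fence is 100, so 85 counts as a suspected outlier; in Example 2 the upper fence is 157.5, so 155 does not.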

Important Summary About Spread
  • The range covered by the data is the most intuitive measure of spread and is exactly the distance between the smallest data point (min) and the largest one (Max).
  • Another measure of spread is the inter-quartile range (IQR) which is the range covered by the middle 50% of the data.
  • IQR = Q3-Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it. The third quartile is the value such that three quarters (75%) of the data points fall below it.
  • The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.
  • The IQR can be used to detect outliers using the 1.5 * IQR (1.5 times IQR) criterion.

Five Number Summary: So far, in our discussion about measures of spread, the key players were:
  • The extremes (min and Max) which provide the range covered by all the data, and
  • The quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.
The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.
A boxplot can be used to graphically summarize the five-number summary of the distribution of a quantitative variable. Let's start doing boxplots in R.

Example: Best Actor Oscar Winners:
This time we use best actor Oscar winners instead of actress and draw a boxplot using R.
# read the actor.csv file.
>actor<-read.csv("actor.csv", header=T, sep=",")
> attach(actor)
>boxplot(Age, border=c('blue'), xlab='Actor Age')
Here is how our boxplot of actor data set looks:
The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.

Example: Best Actress Oscar Winners:
We use our actress data set again to draw another box plot:
# read the actress.csv file.
>actress<-read.csv("actress.csv", header=T, sep=",")
> attach(actress)
> boxplot(Age, border=c('magenta'), xlab='Actress Age')
Here is how our actress boxplot looks when drawn using R: The following graph, taken from the course website, highlights various details of the boxplot for the actress data set. There are a couple of interactive boxplot examples on the course website; try doing them.

Example: Best Actress and Actor Oscar Winners: Side by Side Comparative Boxplots. So far we have examined the age distributions of Oscar winners for males and females separately. It will be interesting to compare the age distributions of actors and actresses who won the best acting Oscar. To do that we will look at side-by-side boxplots of the age distributions by gender.
It's quite easy to do side-by-side boxplots in R. The following code shows how to do it:
# read the actor.csv and actress.csv files.
>actor <-read.csv("actor.csv", header=T, sep=",")
>actress <-read.csv("actress.csv", header=T, sep=",")
# Following is one single command 
# Bug-Fix: Gabriele Righetti 
> boxplot(actor$Age, actress$Age, border=c('blue','magenta'), names=c('Actor','Actress'), ylab='Age',main='Side-By-Side (Comparative) Boxplots\nAge of Best Actor/Actress Winners (1970-2001)')
This is how our final output from R looks:
Recall also that we found the five-number summaries for both distributions:
  • Actors: min=31, Q1=37.25, M=42.5, Q3=50.25, Max=76
  • Actresses: min=21, Q1=32, M=35, Q3=41.5, Max=80
Based on the graph and numerical measures, we can make the following comparison between the two distributions:
  • Center: The graph reveals that the age distribution of the males is higher than the females' age distribution. This is supported by the numerical measures. The median age for females (35) is lower than for the males (42.5). Actually, it should be noted that even the third quartile of the females' distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than the actors do.
  • Spread: Judging by the range of the data, there is more variability in the females' distribution (range = 80 - 21 = 59) than there is in the males' distribution (range = 76 - 31 = 45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread among males (IQR=13) than the females (IQR=9.5). We conclude that among all the winners, the actors' ages are more alike than the actresses' ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors' age distribution.
  • Outliers: We see that we have outliers in both distributions. There is only one high outlier in the actors' distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses' distribution.
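The spread figures in this comparison can be recomputed from the two five-number summaries listed above (for the actors, the range is 76 - 31 = 45). A small sketch, where the helper name spread is made up and the upper fence uses the 1.5 * IQR criterion:

```r
actors    <- c(min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76)
actresses <- c(min = 21, Q1 = 32,    M = 35,   Q3 = 41.5,  Max = 80)

# Hypothetical helper: range, IQR and upper 1.5 * IQR fence from a summary.
spread <- function(s) {
  iqr <- unname(s["Q3"] - s["Q1"])
  c(range = unname(s["Max"] - s["min"]),
    IQR = iqr,
    upper_fence = unname(s["Q3"]) + 1.5 * iqr)
}

comparison <- rbind(Actors = spread(actors), Actresses = spread(actresses))
comparison
```

Any age above the upper fence is a suspected high outlier, which is why 76 stands out among the actors.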

Sunday, May 4, 2008

Learning Statistics using R:Detecting Outliers with IQR

Most of the material here is taken from the course website!
An outlier is an observation (data point) in a data set that is far removed in value from the other observations. An outlier is either a very large or a very small value and, as noted earlier, affects the mean of the data set. More about outliers is here.

The 1.5 * IQR criterion for finding outliers:
An observation is a suspected outlier if it is:
  1. Below Q1 - (1.5 * IQR): that is, Q1 minus 1.5 times the IQR.
  2. Above Q3 + (1.5 * IQR): that is, Q3 plus 1.5 times the IQR.
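The two conditions can be wrapped in a small function. This is a sketch only: suspected_outliers is a made-up name, the quartiles come from R's default quantile method, and the input is a small illustrative vector rather than the Oscar data.

```r
# Hypothetical helper returning the suspected outliers of a numeric vector.
suspected_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  step <- 1.5 * (q3 - q1)              # 1.5 times the IQR
  x[x < q1 - step | x > q3 + step]     # below the lower or above the upper fence
}

# Illustrative data: the largest value is deliberately extreme.
suspected_outliers(c(3, 5, 7, 9, 9, 11, 13, 150))

# The same data without the extreme value has no suspected outliers.
suspected_outliers(c(3, 5, 7, 9, 9, 11, 13, 15))
```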

The following picture illustrates the 1.5 * IQR rule:
Example: Best Actress Oscar Winners:
We continue with our Best Actress Oscar winner data set. Here we will try to find the names of the actresses whose age lies outside the 1.5 * IQR range.
# we have data in 'actress' data frame; just check the quantile
> quantile(Age, names=T)   
0%   25%   50%   75%  100% 
21.00 32.50 35.00 41.25 80.00 

# Lets check the IQR 
> IQR(Age)
[1] 8.75

# Now let's see how to retrieve the value of Q1, the first quartile
> quantile(Age, 0.25)
 25% 
32.5 

# Passing 0.25 to quantile returns the first quartile (the 25th percentile).
# Now let's see what the value of Q1 - (1.5 * IQR) is
> quantile(Age, 0.25) - (IQR(Age) * 1.5)
   25% 
19.375 
# Okay, so this also works!

# now how do we find the actresses whose age is less than 19.375?
>which(Age < (quantile(Age, 0.25) - IQR(Age) * 1.5))
integer(0)
# So in our data set there is no actress whose age is less than 19.375, since the youngest is 21!

# now let's try to find the upper (high) suspected outliers.
# Remember, the upper limit is Q3 + (IQR * 1.5)
>quantile(Age,0.75) + (IQR(Age) * 1.5) 
  75% 
54.375

# Okay, so far so good; let's flag the ages that are greater than 54.375
> Age > quantile(Age,0.75) + (IQR(Age) * 1.5 ) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# See it returns TRUE in the indices where our condition matches.

> which(Age > quantile(Age,0.75) + (IQR(Age) * 1.5 ))
[1] 12 16 20
# the which command returns the index numbers of the TRUE values; but how do we get the names?

> actress[which(Age > quantile(Age,0.75) + (IQR(Age) * 1.5 )),]   
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# so the comma (,) at the end does the trick: it selects whole rows with all their columns. Let's try to do it without the "which" command.


>actress[ Age > (quantile(Age,.75, names=T) + (IQR(Age) * 1.5)),]  
    Year      Name                   Movie              Age
12 1981   Kathryn Hepburn    On Golden Pond             74
16 1985   Geraldine Page     A Trip to the Bountiful    61
20 1989   Jessica Tandy      Driving Miss Daisy         80

# so finally we got suspected outliers!
Other methods:
We can also draw a boxplot to detect outliers visually. (But the following does not seem that helpful?)

> boxplot(Age)
# or
> plot(lm(Age ~ 1))
# or using library car
# (in current versions of car this function is called outlierTest)
> library("car")
> outlier.test(lm(Age ~ 1))

max|rstudent| = 3.943262, degrees of freedom = 30,
unadjusted p = 0.0004461754, Bonferroni p = 0.01427761

Observation: 20

Friday, May 2, 2008

Learning Statistics using R: IQR

I am including the notes from the course itself. None of the material is mine other than the errors!
As seen earlier, the range gives us the overall extent of the distribution, while the IQR measures the spread of a distribution by giving us the range covered by the middle 50% of the data.
The picture taken from the course website makes it clearer.
Here is how the course suggests finding the IQR:
  1. First sort the data so that we can easily find the median. As we know, the median divides the data set in such a way that 50% of the data points are below the median and 50% are above it. That is, the data set divides into two equal halves: the lower half running from the min to the median and the top half running from the median to the max.
  2. Find the median of the lower 50% of the data (the first half). This is called the first quartile of the distribution and is denoted by Q1. It is called the first quartile since one quarter of the data points fall below it.
  3. Repeat this for the top 50% of the data: find the median of the top 50% of the data. This is called the third quartile of the distribution and is denoted by Q3. It is called the third quartile since three quarters of the data points fall below it.
  4. The IQR is the range covered by the middle 50% of the data, between Q1 and Q3, and is calculated as IQR = Q3 - Q1.

Here is another picture taken from the course website that visually explains how the first quartile Q1 is found:

The course makes a few very important observations:
  • From the first picture we can see that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR=Q3-Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
  • We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the LEFT (RIGHT) of the location of the overall median M. The following picture will visually illustrate this for the simple cases of n=7 and n=8.
Note that when n is odd (like in n=7 above), the median is not included in either the bottom or top half of the data; when n is even (like in n=8 above), the data are naturally divided into two halves.
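The course's procedure (median of each half, leaving out the overall median when n is odd) can be sketched in R as follows. The function name course_quartiles and the two small vectors are made up for illustration, and the sketch assumes at least two observations. R's own quantile function interpolates by default (type = 7), so it can give slightly different quartiles.

```r
# Hypothetical helper: quartiles as the medians of the lower and upper halves.
course_quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2                  # size of each half; when n is odd the
                                   # overall median belongs to neither half
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(Q1 = median(lower), Q3 = median(upper))
}

x8 <- c(3, 5, 7, 9, 9, 11, 13, 15)   # n = 8 (even)
x7 <- c(3, 5, 7, 9, 11, 13, 15)      # n = 7 (odd)

course_quartiles(x8)
course_quartiles(x7)

# R's default quartiles for the n = 8 case, for comparison.
quantile(x8, c(0.25, 0.75))
```

For the n = 8 vector the course method gives Q1 = 6 and Q3 = 12, while quantile returns 6.5 and 11.5.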

Example: Best Actress Oscar Winners: The course uses a stemplot to find the IQR; we use simple R commands instead. Please note that the IQR found using R is different from the one in the course example.
# we have data in 'actress' data frame.
> quantile(Age, names=T)
   0%   25%    50%   75%    100%
21.00  32.50  35.00  41.25  80.00 
> IQR(Age)
[1] 8.75

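As can be seen in the above code, the 'quantile' R command outputs the following quartiles:
  • Q1 32.50 (shown as 25%; course calculated value is 32)
  • Q3 41.25 (shown as 75%; course calculated value is 41.5)
The IQR R function calculates 8.75 as the IQR, while the course calculated value is 41.5 - 32 = 9.5. The difference arises because R's quantile function by default interpolates between data points instead of taking the medians of the two halves.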

Tuesday, April 29, 2008

Learning Statistics using R: Measure of Spread

Most of the material here is taken from the course website!
To describe a distribution we need, along with a measure of center, a measure of its spread, also known as the variability of the distribution. As the course describes, there are 3 commonly used measures of spread/variability, each describing spread differently:
  • Range
  • Inter-quartile range (IQR)
  • Standard deviation

  • Range: The range is the simplest measure of spread; it is exactly the distance (difference) between the smallest data point (min) and the largest data point (max). Let's find the range of our Best Actress data set:
    > actress <- read.csv("actress.csv", sep=",", header=T)
    > attach(actress)
    > summary(Age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      21.00   32.50   35.00   38.53   41.25   80.00
    > range(Age)
    [1] 21 80
    > diff(range(Age))
    [1] 59
    
    Yes, the summary command gives us all the details, but we want to learn a few more R commands. As can be seen in the above example, the "range" function gives the minimum and maximum values of the "Age" distribution. If we subtract the min from the max we get the number of years covered, as shown by the "diff" command.
    80 (max) - 21 (min) = 59 (Range)

Monday, April 28, 2008

Learning Statistics using R: Comparing Mean and Median

Mean and median are both measures of center, each describing the center in a different way. The mean is the average value of all observations, so the actual value of every observation affects it, while the median is the middle value in an ordered data set.

Let's understand this with a few simple examples:
  • Assume we have a dataset with these three values: 1, 2, 5. The median is 2 and the mean is (1+2+5)/3 = 8/3 = 2.67.
  • If we just change the last observation from 5 to 50, the median is still 2 but the mean changes to (1+2+50)/3 = 17.67.
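The two cases above are easy to reproduce in R; the vector names below are made up for this sketch.

```r
small        <- c(1, 2, 5)
with_outlier <- c(1, 2, 50)   # last observation changed from 5 to 50

# The median stays at 2, while the mean moves with the changed value.
c(mean = mean(small), median = median(small))
c(mean = mean(with_outlier), median = median(with_outlier))
```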


As the course brings out, the main point is: "The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers."

So, as the course explains:
  • For symmetric distributions with no outliers: X-bar is approximately equal to M.
  • For skewed right distributions and/or datasets with high outliers: X-bar > M.
  • For skewed left distributions and/or datasets with low outliers: X-bar < M.
Hence the mean is used as the measure of center for symmetric distributions with no outliers, while the median is used in all other cases.

Sunday, April 27, 2008

Learning Statistics using R: Numerical Measure

    Numerical Measures: We proceed to numerical measures of the distribution of a quantitative variable. From this point onward I am including the notes from the course itself. None of the material is mine other than the errors! The distribution of a quantitative variable is described by its shape, center, and spread. With a histogram we can describe the shape of the distribution, but we can only get a rough estimate of the center and spread. So along with the graphical display we need a more precise numerical description of the center and spread of the distribution.

    In this section we will learn:
    • how to quantify the center and spread of a distribution with various numerical measures.
    • some of the properties of these numerical measures, and
    • how to choose the appropriate numerical measures of center and spread to supplement the histogram.


    1. Measure of Center: The two main numerical measures for the center of a distribution are the mean and the median. Each one of these measures is based on a completely different idea of describing the center of a distribution. We will first present each one of the measures, and then compare their properties.

      1. Mean: The mean is the sum of the observations divided by the number of observations. If the n observations are X1, X2, ..., Xn, their mean, which we denote by X-bar, is therefore: X-bar = (X1 + X2 + ... + Xn)/n.

        Example: Best Actress Oscar Winners: We continue with our Best Actress Oscar Winners dataset.
        # read the actress.csv file in an actress data frame. [Bug-Fix: Gabriele Righetti]
        >actress <- read.csv ("actress.csv", header=T, sep=",")
        # with following command we do not have to keep writing actress$Age to refer to Age column of actress.
        >attach(actress)
        # a single command summary can give us all details, but just to learn few more R commands.
        >mean(Age)
        [1] 38.53125
        
        As it can be seen from above example, "mean" is an R command that gives average of distribution (measure of center).

      2. Median: The median M is the midpoint of the distribution. It is the number such that half of the observations fall above and half fall below. To find the median:
        • Order the data from smallest to largest.
        • Consider whether n, the number of observations, is even or odd.
          • If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
          • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.

          Finding median using Best Actress data set:
          # we already have data read in the actress data frame.
          > attach(actress)
          > median(Age)
          [1] 35
          
          As seen in above code we can use "median" command of R to find the median value of the distribution.

          Example: Finding median. Here are the numbers of hours that 9 students spend on the computer on a typical day:
          1, 6, 7, 5, 5, 8, 11, 12, 15
          # store the numbers of hours spent in an hours vector.
          > hours<-c(1 , 6 , 7 , 5 , 5 , 8 , 11 , 12 , 15)
          > median(hours)
          [1] 7
          > mean(hours)
          [1] 7.777778
          # as we have 9 observations in total, the median is the (n+1)/2 = 5th observation in the sorted data.
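The odd/even rule for locating the median can be written out by hand and compared with R's median function. This is a sketch; median_by_position is a made-up name, and the second vector is just an illustrative even-sized data set.

```r
# Hypothetical helper that follows the odd/even rule for the median.
median_by_position <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # odd n: the single center observation
  } else {
    mean(x[c(n / 2, n / 2 + 1)])    # even n: mean of the two center observations
  }
}

hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)     # n = 9, odd
median_by_position(hours)

even_case <- c(7, 9, 5, 13, 3, 11, 15, 9)    # n = 8, even
median_by_position(even_case)
```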
          

Wednesday, April 23, 2008

Learning Statistics using R : One Quantitative Variable:

In this section the course teaches how to explore data collected from a quantitative variable and summarize the important features of its distribution. The section starts with graphs and then moves on to numerical measures of the distribution and the five-number summary. As the course suggests, to display data from one quantitative variable graphically we can use a histogram, a stemplot, or a boxplot. We will try to do the course examples using R.

  1. Histogram: For a histogram we break the range of values into intervals and count how many observations fall into each interval. The first example illustrates the histogram with the exam grades of 15 students. In the example, the bins (intervals) are 10 points wide; each includes its first value and excludes its last value.

    Here are details:

    Exam grades of 15 students: 88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73. The chosen bins (intervals), along with the count of observations in each:

    Score     Count
    [40-50)   1
    [50-60)   2
    [60-70)   4
    [70-80)   5
    [80-90)   2
    [90-100]  1
    As shown above, the intervals are closed at one end and open at the other (except the last, which is closed at both ends). Let's try to plot the histogram using R.
    >grades<-c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
    # The above command uses R's 'c' (combine) function to create a vector of grades.
    >hist(grades, right=FALSE)
    # Just a simple command draws the following histogram.
    

    Note that if you do not use the "right=FALSE" argument in the above "hist" command, the observation with value 60 will be counted in the (50, 60] interval. The R help page says the "right" argument is logical; if TRUE, the histogram cells are right-closed (left-open) intervals. In our command we used "right=FALSE", which means the intervals are of the form [a, b), which is what we wanted.
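The effect of "right" on the counts can be checked without drawing anything. The sketch below uses cut and table as a stand-in for the binning that hist performs; the break points 40, 50, ..., 100 are the ones from the table above, and the object names are made up.

```r
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
breaks <- seq(40, 100, by = 10)

# Left-closed bins [a, b); include.lowest closes the last bin as [90, 100].
left_closed <- table(cut(grades, breaks = breaks, right = FALSE,
                         include.lowest = TRUE))
left_closed

# Right-closed bins (a, b]: the grade 60 now falls into (50, 60].
right_closed <- table(cut(grades, breaks = breaks, right = TRUE,
                          include.lowest = TRUE))
right_closed
```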

  2. More on histogram: The Center of distribution is its midpoint - the value that divides the distribution so that approximately half the observations take smaller values, and approximately half observations take larger values.
    # we use our above "grades"  vector.
    >summary(grades)   
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    
    48.0    61.5    71.0    70.2    77.0    97.0
    
    As you can see in the above example, the "summary" command displays several important numbers. The output shows the center of the distribution (median and mean) along with the min, the max and the quartiles.

  3. Best Actor Oscar Winners: Applying histogram to actual data. The actual data set is here. As usual we save the excel file in a csv (comma separated) file. In following text we show how to do interesting things with few R commands.
    # read the actor.csv file in an actor data frame.
    >actor <- read.csv ("actor.csv", header=T, sep=",")
    # we run summary command, just to see five-point summary.
    >summary(actor)
          Age        
    Min.    :31.00   
    1st Qu. :37.75   
    Median  :42.50   
    Mean    :44.72   
    3rd Qu. :48.75   
    Max.    :76.00 
    # with following command we do not have to keep writing actor$Age to refer to Age column of actor.
    >attach(actor)
    # as we have only one column, summary(Age) would also give same output as above.
    
    As explained here, our minimum data point is 31, and our max is 76. We'll use a bin width of 5, and make bins from 30 to 80. So we have a total of 10 bins.
    > hist(Age, breaks=10, xlim=c(30,80))
    
    In the above command we refer directly to the "Age" column. The "breaks" parameter can specify either the number of bins or the actual break points; when it is a single number, R treats it only as a suggestion and picks "pretty" break points itself, so passing the break points explicitly, for example breaks=seq(30, 80, by=5), is the way to force bins of width 5. The "xlim" parameter defines the range of x values shown on the plot; as the help page mentions, "xlim" is not used to define the histogram breaks, only for plotting. Here is what the final histogram looks like: Play with the "xlim" and "ylim" options and see what happens to the histogram. In our example the frequency axis only goes up to 8; try extending it to 9 using "ylim".



  4. Stemplot: The stemplot also called stem and leaf plot is another graphical display of the distribution of quantitative data. As course suggest for drawing stemplot we separate each data point into stem and leaf, where
    • Leaf - right-most digit.
    • Stem - anything but the right-most digit.

    So if the data point is 34, the leaf is 4 and the stem is 3; if the data point is 3.41, the leaf is 1 and the stem is 3.4.
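For whole-number data such as ages, the stem/leaf split described above can be sketched with integer division and remainder. The values below are example numbers only (34 is the one used in the text):

```r
# Example values only; 34 is the data point discussed in the text.
ages <- c(34, 21, 45, 80)

stems  <- ages %/% 10   # integer division: everything but the right-most digit
leaves <- ages %% 10    # remainder: the right-most digit

# Show each value next to its stem and leaf.
data.frame(value = ages, stem = stems, leaf = leaves)

# Group the leaves by stem, which is what a stemplot displays row by row.
split(leaves, stems)
```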

  5. Best Actress Oscar Winners: Stemplot of actual data. We now use the Best Actress data set. The actual data set is here. It is also included in the document at the end.
    # read the actress.csv file in an actress data frame.
    >actress <- read.csv ("actress.csv", header=T, sep=",")
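    # after the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
    # (if actor is still attached, its Age is masked by this one; run detach(actor) first to avoid confusion.)
    >attach(actress)
    # we are only interested in the "Age" column of the data frame, so we draw its stemplot directly.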
    >stem(Age, scale=2)
    
    As you can see, a simple "stem" R command draws the stemplot. The "scale" argument of the function expands the scale of the plot. Here is what our stemplot looks like:
    When we use scale=2:
    The decimal point is 1 digit(s) to the right of the | 
     2 | 1 
     2 | 56669  
     3 | 013333444  
     3 | 555789
     4 | 11123  
     4 | 599  
     5 |  
     5 |  
     6 | 1 
     6 |  
     7 | 4  
     7 |  
     8 | 0 
    
    When we use scale=1
    The decimal point is 1 digit(s) to the right of the |  
    2 | 156669  
    3 | 013333444555789  
    4 | 11123599  
    5 |   
    6 | 1 
    7 | 4  
    8 | 0
    

    Well, I do not know how to rotate the stemplot 90 degrees counterclockwise in R so that it visually resembles a histogram; if anybody knows, please let me know.
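This is not a true rotation of the text output, but one possible way to get a similar picture is to draw a histogram with one bin per stem. The sketch below assumes the ages can be read back from the scale=1 stemplot above (stem = tens digit, leaf = units digit); "ages" is just a name chosen for this example:

```r
# Ages read back from the scale=1 stemplot above (stem = tens, leaf = units).
ages <- c(21, 25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43, 45, 49, 49,
          61, 74, 80)

# One bin per stem: [20,30), [30,40), ..., [80,90].
# right = FALSE closes the bins on the left, so 30 lands with the other 30s.
h <- hist(ages, breaks = seq(20, 90, by = 10), right = FALSE,
          main = "Best Actress ages, one bar per stem", xlab = "Age")

# Bar heights equal the number of leaves on each stem.
h$counts
```

Each bar is then as tall as the corresponding stemplot row is long, which is the shape one would see after turning the stemplot on its side.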

Tuesday, April 22, 2008

Learning Statistics using R : One Categorical Variable

In the course, the second activity teaches how to summarize collected data. The course also teaches how to draw pie charts and other basic charts using Excel. We will do the same using R. The data set used for this activity can be downloaded from here. The data set consists of the answers of 1200 US college students. The question asked was: "With whom do you find it easiest to make friends: opposite sex, same sex or no difference?"

In this activity we will use the collected data to:
  1. Learn how to use R to tally our data into a table of counts and percents.
  2. Learn how to use R to produce a pie chart and other charts.
  1. Learn how to use R to tally our data into a table of counts and percents. We save the "friend.xls" file as a CSV file (named "friends.csv" in the code below). We can use the simple "summary" command of R to get a summary of our data set.
    # Bug-Fix: Gabriele Righetti from Italy.
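    # stringsAsFactors=TRUE keeps the answers as a factor so that summary() shows counts
    # (this was the default before R 4.0.0).
    >friend <- read.csv("c:/friends.csv", header=TRUE, stringsAsFactors=TRUE)
    # print the content of our data set/data frame.
    >print (friend) 
    # note: typing just friend also prints it, since printing is the default action.
    # let's run the summary command and see the output.
    >summary(friend) 
    # the following is the output of the command.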
    Friends    
    No difference:602   
    Opposite sex :434   
    Same sex     :164
    
    From the above output we can see the counts for each answer in the data set. In our example, 164 students find it easiest to make friends with the same sex, 434 find it easiest with the opposite sex, and 602 students find that sex makes no difference when they make friends. Now let's use a few R commands to do more.
    # we have data in friend object/data.frame.
    > table(friend)
    friend
    No difference   Opposite sex      Same sex
      602           434                   164
    
    As the above output shows, the R "table" command gives us something similar to the "summary" command, but "table" is more powerful and can also be used for cross-classifying data. Let us draw pie and bar charts.
    >pie(table(friend))
    >barplot(table(friend))
    
    These two simple commands draw a pie chart and a bar chart. Look at the help pages of each to make them more interesting.

    Following are our pie and bar charts!
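The activity also asks for percents, not just counts. A minimal sketch, assuming we rebuild the answers from the counts reported by "summary" above rather than reading the file (the names "answers", "counts" and "percents" are chosen just for this example):

```r
# Rebuild the 1200 answers from the counts shown in the summary output above.
answers <- rep(c("No difference", "Opposite sex", "Same sex"),
               times = c(602, 434, 164))

counts   <- table(answers)                      # table of counts
percents <- round(100 * prop.table(counts), 1)  # the same table as percents

counts
percents

# Put the percents into the pie chart labels.
pie(counts, labels = paste0(names(counts), " (", percents, "%)"))
```

"prop.table" divides each count by the total, so multiplying by 100 gives percents; with the real data frame, "table(friend)" can be used in place of "table(answers)".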





Monday, April 21, 2008

Learning Statistics using R : Exploring Dataset

Wikipedia: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

For statistics, there is plenty of material available online but I found Carnegie Mellon University's introduction to statistics easy to understand, so I decided to do course examples in R.

We start with the first activity: exploring a dataset and classifying variables.
  1. Exploring a Dataset: As the course explains, variables can be classified into one of two types: quantitative or categorical. The first activity explains how to explore a data set and identify variable types using Microsoft Excel; we will try to do the same using R.

    Here are the objectives of the first activity:
    1. Learn how to open and examine a dataset using Excel.
    2. Practice classifying variables by their type: quantitative or categorical.
    3. Learn to change numeric codes to meaningful labels.

    For the following examples, most of the material is taken from Ajay Shah's R pages. The data set used in this activity can be downloaded from the course site.

    1. Learn how to open and examine a dataset using R: We save the data set as a CSV (comma-delimited) file using Excel. Assuming that the file name is "depression.csv" and it is stored in "c:\", we can use the following code to read it using R.
      # read the csv file; notice the forward slash / (or use two backslashes).
      >depress <- read.table ("c:/depression.csv", sep=",", header=TRUE)
      # check if we have correct number of records.
      >nrow(depress)
      # check if we have correct number of columns. 
      >ncol(depress)
      # let's display the internal structure of our data set/data frame.
      >str(depress)
      # display the content of data frame - default action is to print content
      >depress
      You can even read an Excel file directly in R; here is an example, and go here for more information and dependency details on reading an Excel file.
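The same read-and-inspect steps can be tried without the course file. The sketch below reads a tiny made-up CSV from a string (the three rows and the "Placebo" and "Recurrence" values are invented for illustration and are not the real depression data). It also passes stringsAsFactors = TRUE, since from R 4.0.0 onward text columns are no longer converted to factors by default:

```r
# A tiny made-up CSV standing in for depression.csv (hypothetical rows).
csv_text <- "Hospt,Treat,Outcome,Age,Gender
1,Lithium,Recurrence,33,1
1,Imipramine,No Recurrence,49,1
2,Placebo,Recurrence,29,2"

# text= reads from a string instead of a file; stringsAsFactors = TRUE asks
# for text columns as factors (the default changed to FALSE in R 4.0.0).
demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

nrow(demo)   # number of records
ncol(demo)   # number of columns
str(demo)    # internal structure, including which columns are factors
```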

    2. Practice classifying variables by their type: quantitative or categorical. In our last example, the "str" command that we used for displaying the internal structure of an R data frame also provides additional information about the variables. Let's check the output of the "str(depress)" command.
      >str(depress)
      'data.frame':   109 obs. of  7 variables: 
      $ Hospt  : int  1 1 1 1 1 1 1 1 1 1 ... 
      $ Treat  : Factor w/ 3 levels "Imipramine","Lithium",..: 2 1 1 2 2 3 2 3 3 3 ... 
      $ Outcome: Factor w/ 2 levels "No Recurrence",..: 2 1 1 2 1 2 1 2 1 2 ... 
      $ Time   : num   36.1 105.1  74.6  49.7  14.4 ... 
      $ AcuteT : int  211 176 191 206 63 70 55 512 162 306 ... 
      $ Age    : int  33 49 50 29 29 30 56 48 22 61 ... 
      $ Gender : int  1 1 1 2 1 2 1 1 2 2 ...
      
      From the output of the "str" command we can see that "Treat" and "Outcome" are factor variables. R uses factors to store categorical variables. But we also know from examining the data set that "Hospt" and "Gender" are categorical variables too, even though R read their numeric codes as integers. We can use the following R command to confirm that a variable is stored as a factor.
      # Bug-Fix: Gabriele Righetti 
      >is.factor(depress$Treat) 
      [1] TRUE 
    3. Learn to change numeric codes to meaningful labels: Let's try to convert the "Gender" variable to a factor. To convert "Gender" to a factor variable in R we can use the "factor" function. In our data set 1 indicates Female and 2 is used for Male. Let's change those numbers to meaningful category labels.
      >depress$Gender <- factor(depress$Gender, labels = c("Female", "Male"))
      >depress$Gender
      
      In the above command the first label, Female, corresponds to value 1 and the second label, Male, is assigned to value 2. We do not have to do a search and replace; R does that for us.
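The label order relies on R sorting the codes it finds in the data. A slightly more explicit variant, shown here as a sketch on made-up codes rather than the real column, names the levels as well so that the mapping is spelled out:

```r
# Made-up numeric codes in the style of the Gender column (1 = Female, 2 = Male).
gender_codes <- c(1, 1, 2, 1, 2)

# levels= states which code each label belongs to, so the mapping does not
# depend on the sort order of the codes found in the data.
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Female", "Male"))

gender
table(gender)       # counts per label
is.factor(gender)   # confirms the result is a factor
```

The "Hospt" column could be turned into a factor in the same way, for example with factor(depress$Hospt), since its numbers are hospital codes rather than quantities.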