Wednesday, June 18, 2008

Learning Statistics using R: Role-type classification (1 of 2)

Most of the material here is taken from the course website!
The second module of the course explains the relationship between two variables. In earlier sections we learned how to work with a distribution of a single variable, either quantitative or categorical.
This section starts with the role-type classification of two variables. In most studies involving two variables, each of the variables has a role. We distinguish between:
  • Response variable: the outcome of the study.
  • Explanatory variable: the variable that claims to explain, predict, or affect the response.
The response variable is also known as the "dependent" variable and the explanatory variable as the "independent" variable: the dependent variable depends on the independent variable, hence the names. A simple example is a function that computes the sum of the arguments passed to it; the arguments (the values to be summed) are the independent variables, while the output (their sum) is the dependent variable.
Let's take 8 examples from the course website to make this clear. We will use these same examples later for variable-type classification.
  1. We want to explore whether the outcome of the study - the score on a test - is affected by the test-taker's gender. Therefore:
    • Gender is the explanatory variable
    • Test score is the response variable
  2. How is the number of calories a hot dog has related to (or affected by) the type of hot dog (beef, meat, or poultry)? (In other words, are there differences in the number of calories between the three types of hot dogs?)
    • Number of calories is the response variable
    • Type of hot dog is the explanatory variable
  3. In this study we explore whether nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:
    • Light Type is the explanatory variable
    • Nearsightedness is the response variable
  4. Are the smoking habits of a person (yes/no) related to the person's gender?
    • Gender of the person (male/female) is the explanatory variable
    • Smoking habit is the response variable
  5. Here we are examining whether a student's SAT score is a good predictor for the student's GPA in freshman year. Therefore:
    • SAT score is the explanatory variable
    • GPA of Freshman Year is the response variable
  6. In an attempt to improve highway safety for older drivers, a government agency funded a research that explored the relationship between drivers' age and sign legibility distance (the maximum distance at which the driver can read a sign).
    • Driver's age is the explanatory variable
    • Sign legibility distance is the response variable
  7. Here we are examining whether a person's outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:
    • Time is the explanatory variable
    • Driving Test Outcome is the response variable
  8. Can you predict a person's favorite type of music (Classical/Rock/Jazz) based on his/her IQ level?
    • IQ level is the explanatory variable
    • Type of music is the response variable

The above examples help in identifying the response and explanatory variables, but is the role classification always clear? In other words, is it always clear which of the variables is the explanatory and which is the response?
Answer: NO! There are studies in which the role classification is not really clear. This mainly happens when both variables are categorical or both are quantitative. An example could be a study that explores the relationship between SAT Math and SAT Verbal scores. In cases like this, either classification choice is fine (as long as it is consistent throughout the analysis).


We know that a variable is either categorical or quantitative. We use this information to further classify the response and explanatory variables. With this role-type classification we get the following 4 possibilities:
  1. Case I: Explanatory is categorical and response is quantitative.
  2. Case II: Explanatory is categorical and response is categorical.
  3. Case III: Explanatory is quantitative and response is quantitative.
  4. Case IV: Explanatory is quantitative and response is categorical.
The following table, taken from the course website, summarizes the above 4 cases:
The course warns us that this role-type classification serves as the infrastructure for the entire section. In each of the 4 cases, different statistical tools (displays and numerical measures) should be used in order to explore the relationship between the two variables.

Along with this, the course also suggests the following important rule:
Principle:
When confronted with a research question that involves exploring the relationship between two variables, the first and most crucial step is to determine which of the 4 cases represents the data structure of the problem. In other words, the first step should be classifying the two relevant variables according to their role and type, and only then can we determine the appropriate statistical tools.
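As an illustration of this first step, here is a minimal R sketch (the data here are made up for illustration) that checks each variable's type before choosing any statistical tools:

```r
# Hypothetical data for example 1: does gender affect test score?
study <- data.frame(
  gender = factor(c("M", "F", "F", "M", "F")),  # explanatory variable
  score  = c(72, 85, 90, 68, 77)                # response variable
)

# Inspect the type of each column
sapply(study, class)

# Categorical explanatory + quantitative response => Case I
is.factor(study$gender)   # TRUE: explanatory is categorical
is.numeric(study$score)   # TRUE: response is quantitative
```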
Let's go back to our 8 examples and classify the explanatory and response variables as categorical or quantitative.
  1. We want to explore whether the outcome of the study - the score on a test - is affected by the test-taker's gender. Therefore:
    • Gender is the explanatory variable and it is a categorical variable.
    • Test score is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case I.
  2. How is the number of calories a hot dog has related to (or affected by) the type of hot dog (beef, meat, or poultry)? (In other words, are there differences in the number of calories between the three types of hot dogs?)
    • Type of hot dog is the explanatory variable and it is a categorical variable.
    • Number of calories is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case I.
  3. In this study we explore whether nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:
    • Light type is the explanatory variable and it is a categorical variable.
    • Nearsightedness is the response variable and it is a categorical variable.
    • Therefore this is an example of Case II.
  4. Are the smoking habits of a person (yes/no) related to the person's gender?
    • Gender of the person (male/female) is the explanatory variable and it is a categorical variable.
    • Smoking habit is the response variable and it is a categorical variable.
    • Therefore this is an example of Case II.
  5. Here we are examining whether a student's SAT score is a good predictor for the student's GPA in freshman year. Therefore:
    • SAT score is the explanatory variable and it is a quantitative variable.
    • GPA of freshman year is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case III.
  6. In an attempt to improve highway safety for older drivers, a government agency funded a research that explored the relationship between drivers' age and sign legibility distance (the maximum distance at which the driver can read a sign).
    • Driver's age is the explanatory variable and it is a quantitative variable.
    • Sign legibility distance is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case III.
  7. Here we are examining whether a person's outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:
    • Time is the explanatory variable and it is a quantitative variable.
    • Driving test outcome is the response variable and it is a categorical variable.
    • Therefore this is an example of Case IV.
  8. Can you predict a person's favorite type of music (Classical/Rock/Jazz) based on his/her IQ level?
    • IQ level is the explanatory variable and it is a quantitative variable.
    • Type of music is the response variable and it is a categorical variable.
    • Therefore this is an example of Case IV.
After this we learn more about role-type classification and which tools to use in each case.

Tuesday, June 17, 2008

Learning Statistics using R: Rule Of Standard Deviation

Most of the material here is taken from the course website!
The following explains the rule of standard deviation, also known as the Empirical Rule. The rule applies only to normal (symmetric, mound-shaped) data distributions.
  • Approximately 68% of the observations fall within 1 standard deviation of the mean.
  • Approximately 95% of the observations fall within 2 standard deviations of the mean.
  • Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
This rule provides more insight into the standard deviation, and the following picture taken from the course website illustrates it. Let's understand this with an example. The following data represent the heights of 50 males. Let's use R to find the five-number summary of these data and to confirm that the distribution is normal (mound-shaped).
# We use the 'c' function to populate the male vector, on which we will carry out our operations.
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)
> hist(male)
In the above code sample the 'hist' command draws a histogram that has an almost normal, mound shape. Here is the image that R draws for us. Let's find the five-number summary and confirm that the standard deviation rule applies to this data set.
# A simple summary command gives the five-number summary (plus the mean)
> summary(male)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  64.00   68.25   70.50   70.58   72.00   77.00
> sd(male)
[1] 2.857786
# Let's apply the first rule: 68% of data points are within (mean - 1 * SD) and (mean + 1 * SD)
> male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# The above command returns TRUE at the indices of the male vector where our condition is satisfied.
# Let's count how many such observations there are.
> length(male[male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))])
[1] 34
# So out of 50 observations, 34 are within mean +/- 1 SD, i.e.
> 34/50 * 100
[1] 68
# So, as the rule suggests, 68% of the observations are within mean +/- 1 SD.

# Let's check the second rule: 95% of data points are within (mean - 2 * SD) and (mean + 2 * SD)
> length(male[male >= (mean(male) - (2 * sd(male))) & male <= (mean(male) + (2 * sd(male)))])
[1] 48
> 48/50 * 100
[1] 96
# So indeed about 95% (here 96%) of the data points are within mean +/- 2 SD.

# Let's check the third rule: 99.7% of data points are within (mean - 3 * SD) and (mean + 3 * SD)
> length(male[male >= (mean(male) - (3 * sd(male))) & male <= (mean(male) + (3 * sd(male)))])
[1] 50
> 50/50*100
[1] 100
# All 50 observations (100%) are within mean +/- 3 SD, consistent with the 99.7% rule.
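The three checks above can be collapsed into one line. The following sketch recomputes, for k = 1, 2, 3, the proportion of observations within k standard deviations of the mean:

```r
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69,
          69, 69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71,
          72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76,
          76, 77)

# Proportion of observations within k SDs of the mean, for k = 1, 2, 3
sapply(1:3, function(k) mean(abs(male - mean(male)) <= k * sd(male)))
# [1] 0.68 0.96 1.00
```

Taking the mean of a logical vector counts the TRUE values and divides by the length, which is exactly the proportion we computed by hand above.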
The following table, taken from the course website, makes this clearer:
Summary:
  • The standard deviation measures the spread by reporting a typical (average) distance between the data points and their average.
  • It is appropriate to use the SD as a measure of spread with the mean as the measure of center.
  • Since the mean and standard deviation are highly influenced by extreme observations, they should be used as numerical descriptions of the center and spread only for distributions that are roughly symmetric and have no outliers.
  • For symmetric mound-shaped distributions, the Standard Deviation Rule tells us what percentage of the observations falls within 1, 2, and 3 standard deviations of the mean, and thus provides another way to interpret the standard deviation's value for distributions of this type.

Friday, June 13, 2008

Learning Statistics using R: Standard Deviation and Histogram

Most of the material here is taken from the course website!
In the following example we will see how a histogram can help clarify the concept of standard deviation.
Example:
At the end of a statistics course, the 27 students in the class were asked to rate the instructor on a number scale of 1 to 9 (1 being "very poor", and 9 being "best instructor I've ever had"). The following table provides three hypothetical rating data sets:
Following are the histograms of the data for each class:
What can we say about the standard deviations by looking at these histograms and data sets?
Let's assume that the mean of all three data sets is 5 (which is reasonably clear by looking at the histograms), and recall that (roughly) the standard deviation is the average distance of the data points from their mean.
  • In the class I histogram most of the ratings are at 5, which is also the mean of the data set. So the average distance between the mean and the data points is very small (since most of the data points sit at the mean).
  • In the class II histogram most of the ratings are far from the mean of 5; most of the data points lie at the two extremes, 1 and 9. So the average distance between the mean and the data points is larger.
  • In the class III histogram the data points are evenly distributed around the mean. We can safely say that in this case the average distance between the mean and the data points is greater than that of class I but smaller than that of class II, i.e. a standard deviation in between those of class I and class II.

Let's check our assumption by loading these data sets into R and verifying the standard deviation of each. The Excel file containing the data sets can be downloaded from here.
> class1 <- c(1,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,9,9)
> sd(class1)
[1] 1.568929
> summary(class1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       5       5       5       5       9 

> class2 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,5,9,9,9,9,9,9,9,9,9,9,9,9,9)
> sd(class2)
[1] 4
> summary(class2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       5       5       9       9 

> class3 <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
> sd(class3)
[1] 2.631174
> summary(class3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 
So we have the following standard deviations for our 3 class ratings:
  • Class I : 1.568929
  • Class II : 4.0
  • Class III : 2.631174
(Note that Excel may show slightly different results for the standard deviation if you are using the STDEV function.) So the calculated standard deviations confirm the assumption we made by looking at the histograms.
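As a further check on the intuition that the standard deviation is roughly an "average distance from the mean", we can compute the mean absolute deviation for each class; it follows the same ordering (class I smallest, class II largest, class III in between):

```r
class1 <- c(1,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,9,9)
class2 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,5,9,9,9,9,9,9,9,9,9,9,9,9,9)
class3 <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)

# Mean absolute distance of the ratings from their mean, per class
sapply(list(class1, class2, class3), function(x) mean(abs(x - mean(x))))
```

The values differ numerically from the standard deviations (the SD averages squared distances before taking a root), but they rank the three classes the same way.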

Wednesday, June 4, 2008

Learning Statistics using R: Standard Deviation

Most of the material here is taken from the course website!
Earlier we examined measures of spread using the range (max - min) and the IQR (the range covered by the middle 50% of the data). We also noted that the IQR should be used when the median is used as the measure of center. Now we move to another measure of spread called the standard deviation.

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from the mean of the distribution. The standard deviation gives the average (or typical) distance between a data point and the mean, x-bar.

Let's understand the standard deviation using an example; we calculate it step by step using R commands. (There is a single R function to do the same!)
Assume we have following data set of 8 observations:
7, 9, 5, 13, 3, 11, 15, 9
  • Calculate mean:
    We use R's 'c' function to combine them into a vector and then use the 'mean' function to calculate the mean of the data set.
    > dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)
    > is.vector(dataset)
    [1] TRUE
    # we have stored our observations in a vector called dataset
    > mean(dataset)
    [1] 9
    # so we have 9 as the mean of our data set; now we need to find
    # the distance of each observation from this value: 9.
    > deviation <- dataset - 9
    > deviation
    [1] -2  0 -4  4 -6  2  6  0
    # above command directly gives us deviation of each observation from 
    # the mean; that we have stored in another vector called deviation.
    
    Thinking about the idea behind the standard deviation being an average (typical) distance between the data points and their mean, it will make sense to average the deviations we got. Note, however, that the sum of the deviations from the mean is always 0.

  • Square the deviations:
    So we square each of the deviations and then take their average; the following R code does the squaring:
    # we can use either of following two methods to calculate square of
    # deviations.
    > deviation ^ 2
    [1]  4  0 16 16 36  4 36  0
    
    > deviation * deviation
    [1]  4  0 16 16 36  4 36  0
    
  • Average the squared deviations by adding them up and dividing by n-1 (one less than the sample size). Let's do that in R.
    > (sum(deviation ^ 2)) / (length(dataset) - 1)
    [1] 16
    
    This average of the squared deviations is called the variance of the data.

  • Find standard deviation:
    The standard deviation of the data is the square root of the variance. So in our case it would be square root of 16.
    > sqrt(16)
    [1] 4
    
    Why do we take the square root? Note that 16 is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in "squared deviations", which obviously cannot be interpreted. We therefore take the square root in order to compensate for the fact that we squared our deviations, and in order to go back to the original units of measurement.
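All the steps above are wrapped up in R's built-in functions: 'var' computes the average of the squared deviations (with the n-1 denominator) and 'sd' its square root, so we can confirm our hand calculation in two calls:

```r
dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)

var(dataset)   # variance: sum of squared deviations divided by n - 1
# [1] 16
sd(dataset)    # standard deviation: square root of the variance
# [1] 4
```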

Properties of the Standard Deviation:
  1. It should be clear from the discussion thus far that the standard deviation should be paired as a measure of spread with the mean as a measure of center.
  2. Note that the only way, mathematically, in which the standard deviation can be 0 is when all the observations have the same value. Indeed, in this case not only is the standard deviation 0, but the range and the IQR are 0 as well.
  3. Like the mean, the SD is strongly influenced by outliers in the data. Consider our last example: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation were wrongly recorded as 150, then the average would jump up to 25.9 and the standard deviation would jump up to 50.3. Note that in this simple example it is easy to see that, while the standard deviation is strongly influenced by outliers, the IQR is not! In both cases the IQR is the same since, like the median, the calculation of the quartiles depends only on the order of the data rather than on the actual values.
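The outlier example in point 3 is easy to reproduce in R; the numbers match the ones quoted above, and R's IQR() function (which may use a slightly different quartile convention than the hand method) confirms that the interquartile range does not move:

```r
x     <- c(3, 5, 7, 9, 9, 11, 13, 15)
x_bad <- c(3, 5, 7, 9, 9, 11, 13, 150)  # largest value wrongly recorded

mean(x);     sd(x)       # 9 and 4
mean(x_bad); sd(x_bad)   # jump to about 25.9 and 50.3
IQR(x);      IQR(x_bad)  # identical: quartiles depend only on the order
```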

Choosing Numerical Summaries
  • Use mean and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers.
  • Use the five-number summary (which gives the median, IQR and range) for all other cases.
R function for Standard Deviation: There is a single R function, "sd", that calculates the standard deviation of a data set; just be careful to use the "na.rm=TRUE" argument if you have NA values in your data set. Applied to a data frame or matrix, this function returns a vector of the SDs of the columns; remember it is the columns' SD, not the rows', by default.
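A short sketch of both points (the NA value here is made up for illustration). Note that in current versions of R, sd() no longer accepts a whole data frame directly, so per-column SDs are usually computed with sapply():

```r
x <- c(7, 9, 5, NA, 3, 11, 15, 9)   # our data set with one missing value

sd(x)                # NA: missing values propagate by default
sd(x, na.rm = TRUE)  # drops the NA before computing the SD

# Per-column standard deviations of a data frame
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
sapply(df, sd)       # one SD per column, not per row
```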