Thursday, July 17, 2008

Learning Statistics Using R: Role Type Classification : Case II (1 of 2)

Case II of our role type classification covers the study of the relationship between a categorical explanatory variable and a categorical response variable.

We start with an example from the course web site to explore the relationship between two categorical variables.
Example: In a survey, 1200 U.S. college students were asked about their body image: underweight, overweight, or about right. We want to answer the following questions:
If we had separated our sample of 1200 U.S. college students by gender and looked at males and females separately, would we have found a similar distribution across body-image categories?
More specifically, are men and women just as likely to think their weight is about right? Among those students who do not think their weight is about right, is there a difference between the genders in feelings about body image?

Answering these questions requires us to study the relationship between two categorical variables. Both the response and the explanatory variable are categorical, since we want to find out how gender (male/female) affects body image (underweight, overweight, about right). In this study we have the following:
  • Gender (male/female) is the explanatory variable, and it is categorical.
  • Body-image (underweight, overweight, about right) is the response variable, and it is categorical.

As I could not find the raw data for this example, we will use the results given on the course site directly instead of reading raw data into R and deriving them.

To understand how body image is related to gender, we need an informative display that summarizes the data. In order to summarize the relationship between two categorical variables, we create a display called a two-way table.

Here is the two-way table for our example:

So our two-way table summarizes data of all 1200 students by gender and their body image as counts. The "Total" row or column is a summary of one of the two categorical variables, ignoring the other. In our example:
  • The Total row gives the summary of the categorical variable Body-image:
  • The Total column gives the summary of the categorical variable Gender:

Remember, though, that our primary goal is to explore how body image is related to gender. Exploring the relationship between two categorical variables (in this case Body-image and Gender) amounts to comparing the distributions of the response (in this case Body-image) across the different values of the explanatory (in this case males and females):
Note that it does not make sense to compare raw counts, because there are more females than males overall. For example, it is not very informative to say "there are 560 females who responded 'About Right' compared to only 295 males," since the 560 females are out of a total of 760, and the 295 males are out of a total of only 440. We need to supplement our display, the two-way table, with numerical summaries that allow us to compare the distributions. These numerical summaries are found by converting the counts to percents within (or restricted to) each value of the explanatory variable separately. In our example, we look at each gender separately and convert the counts to percents within that gender. Let's start with females:

Note that each count is converted to a percent by dividing by the total number of females, 760. These numerical summaries are called conditional percents, since we find them by conditioning on one value of the explanatory variable (here, one of the genders).
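As a rough sketch of this calculation in R, the block below uses only the counts quoted in the text above (560 of 760 females and 295 of 440 males answered 'About Right'). Because the full two-way table is not reproduced here, the other two answers are lumped into a single "Not about right" column; that collapsed column and the object names are assumptions made for illustration.

```r
# Counts quoted in the text; the two remaining answers are collapsed into
# one column because only the "About right" counts and row totals are given.
about_right <- c(Female = 560, Male = 295)
totals      <- c(Female = 760, Male = 440)

counts <- cbind("About right"     = about_right,
                "Not about right" = totals - about_right)

# Conditional (row) percents: divide each count by its own row total.
row_percents <- round(100 * prop.table(counts, margin = 1), 1)
print(row_percents)
```

Each row of row_percents adds up to 100, which is a quick check that we conditioned on gender and not on body image.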

Comments
  1. In our example, we chose to organize the data with the explanatory variable Gender in rows and the response variable Body-image in columns, and thus our conditional percents were row percents, calculated within each row separately. Similarly, if the explanatory variable happens to sit in columns and the response variable in rows, our conditional percents will be column percents, calculated within each column separately.
  2. Another way to visualize the conditional percents, instead of a table, is the double bar chart. This display is quite common in newspapers.
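Both comments can be tried out in R. The sketch below again uses only the 'About Right' counts and gender totals quoted earlier, with the other answers collapsed into one "Not about right" column (an assumption, since the full table is not shown here). It computes row percents, shows that a transposed table needs column percents instead, and draws a double bar chart with barplot().

```r
# "About right" counts and gender totals as quoted in the text;
# the remaining answers are collapsed into one column (assumption).
counts <- rbind(Female = c(560, 760 - 560),
                Male   = c(295, 440 - 295))
colnames(counts) <- c("About right", "Not about right")

# Explanatory variable in rows: condition on rows (margin = 1).
row_pct <- 100 * prop.table(counts, margin = 1)

# If the table is transposed (explanatory variable in columns),
# condition on columns instead (margin = 2); the percents are the same.
col_pct <- 100 * prop.table(t(counts), margin = 2)

# Double bar chart: one group of bars per response category,
# one bar per gender within each group.
barplot(row_pct, beside = TRUE, legend.text = TRUE,
        ylab = "Percent within gender",
        main = "Body image by gender (conditional percents)")
```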

After looking at the numerical summary and the graph, let's try to put the results in words:
  • The results suggest that the proportion of male students who are happy with their body image ('About right') is somewhat lower than the proportion of female students. That is, 73.7% of female students (560 out of 760) are happy with their body image, compared to only 67% of males.
  • Female students who are not happy with their body image mostly feel they are overweight. That is, 73.7% are happy, and of the rest, 21.4% feel they are overweight compared to only 4.9% who feel underweight.
  • Male students who are not happy with their body image feel they are overweight about as often as they feel they are underweight. That is, 16.6% of male students feel they are overweight, while roughly the same share, 16.2%, feel they are underweight.

Tuesday, July 15, 2008

Learning Statistics Using R: Role Type Classification : Case I (1 of 1)

As we saw earlier, Case I of our role type classification covers the study of the relationship between a categorical explanatory variable and a quantitative response variable.

We start with an example from the course website.
Example: People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat).

The purpose of the study was to examine whether the number of calories a hot dog has is related to (or affected by) its type. (Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Consumer Reports, June 1986, pp. 366-367.) Answering this question requires us to examine the relationship between the categorical variable Type and the quantitative variable Calories. Because the question of interest is whether the type of hot dog affects calorie content,
  • the explanatory variable is Type, and
  • the response variable is Calories.
To explore how the number of calories is related to the type of hot dog, we need an informative visual display of the data that will compare the three types of hot dogs with respect to their calorie content. The visual display that we'll use is side-by-side boxplots. The side-by-side boxplots will allow us to compare the distribution of calorie counts within each category of the explanatory variable, hot dog type. We use R to draw the side-by-side boxplots:
> hotdogdata <- read.csv("TA01_009.TXT", header=TRUE, sep="\t")
> attach(hotdogdata)
> boxplot(Calories[HotDog == 'Beef'], Calories[HotDog == 'Meat'], Calories[HotDog == 'Poul'], 
  border=c('blue', 'magenta','red'), 
  names=c('Beef','Meat', 'Poul'), 
  ylab='Calories', 
  main='Side By Side Comparative Boxplot of Calories')

Gabriele Righetti from Italy pointed out that it should be "names" and not "xlab" for naming boxplots. Thanks Gabriele.

The above command displays a boxplot like the following:
Let's also find the five-number summary for each type using R (summary() also reports the mean). Here is the output:
> summary(Calories[HotDog=='Beef'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  111.0   140.5   152.5   156.8   177.2   190.0 
> summary(Calories[HotDog=='Meat'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  107.0   139.0   153.0   158.7   179.0   195.0 
> summary(Calories[HotDog=='Poul'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   86.0   102.0   129.0   122.5   143.0   170.0 
Let's summarize the results we got and interpret them in the context of the question we posed: By examining the three side-by-side boxplots and the numerical summaries, we see at once that poultry hotdogs as a group contain fewer calories than beef or meat.
The median number of calories in poultry hotdogs (129) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153).
The spread of the three distributions is about the same if the IQR is considered (roughly 37 for beef, 40 for meat, and 41 for poultry, from the summaries above), but the full ranges vary slightly more (beef: 79, meat: 88, poultry: 84). The general recommendation to the health-conscious consumer is to eat poultry hotdogs.
It should be noted, though, that since each of the three types of hotdogs shows quite a large spread among brands, simply buying a poultry hotdog does not guarantee a low calorie food. What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case I), we essentially compare the distributions of the quantitative response for each category of the explanatory variable using side-by-side boxplots supplemented by descriptive statistics.
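The spread figures in this discussion can be checked directly from the summary() output quoted above. The sketch below copies those printed (rounded) values into a small matrix and computes IQR = Q3 - Q1 and range = Max - Min for each type. The object names are mine, and because summary() uses R's default quartile rule, the IQRs may differ slightly from values computed with another rule.

```r
# Quartiles and extremes copied from the summary() output quoted above.
five_num <- rbind(
  Beef = c(Min = 111, Q1 = 140.5, Median = 152.5, Q3 = 177.2, Max = 190),
  Meat = c(Min = 107, Q1 = 139.0, Median = 153.0, Q3 = 179.0, Max = 195),
  Poul = c(Min =  86, Q1 = 102.0, Median = 129.0, Q3 = 143.0, Max = 170)
)

# IQR = Q3 - Q1 and range = Max - Min, one row per hot dog type.
spread <- cbind(IQR   = five_num[, "Q3"]  - five_num[, "Q1"],
                Range = five_num[, "Max"] - five_num[, "Min"])
print(spread)
```

With these values the IQRs come out at 36.7, 40 and 41, and the ranges at 79, 88 and 84.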

So we can safely say, the relationship between a categorical explanatory and a quantitative response variable is summarized using:
  • Data display: side-by-side boxplots
  • Numerical summaries: descriptive statistics
That is, we compare the distributions of the quantitative (response) variable across the categories of the categorical variable (a factor, in R terms).
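As a compact recap of both steps, here is a small sketch that uses boxplot()'s formula interface and tapply(). The data frame below is invented purely for illustration and is NOT the hot dog data set; only the column names Calories and HotDog are borrowed from the example above.

```r
# Toy data, invented for illustration only (NOT the real hot dog data).
toy <- data.frame(
  HotDog   = rep(c("Beef", "Meat", "Poul"), each = 4),
  Calories = c(150, 160, 175, 185,
               140, 155, 170, 190,
                95, 110, 130, 145)
)

# Data display: Calories ~ HotDog draws one box per category and
# labels the boxes from the values of HotDog.
boxplot(Calories ~ HotDog, data = toy,
        ylab = "Calories",
        main = "Side By Side Comparative Boxplot (toy data)")

# Numerical summaries: apply summary() and median() within each category.
by_type <- tapply(toy$Calories, toy$HotDog, summary)
medians <- tapply(toy$Calories, toy$HotDog, median)
print(by_type)
print(medians)
```

The formula interface avoids attach() and the repeated subsetting, which helps when the categorical variable has many levels.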

Tuesday, July 8, 2008

Learning Statistics using R: Role-type classification (2 of 2)

Let's try a few examples from the course web site. In each example we are given a brief description of a study involving two variables, and we have to determine which of the four cases the study represents. That is, we need to identify whether each variable is categorical or quantitative, and which variable is the explanatory variable and which is the response.
  1. A store asked 250 of its customers whether they were satisfied with the service or not. The purpose of this study was to examine the relationship between the customer's satisfaction and gender.

    In this example, Gender is the explanatory variable and Satisfaction is the response variable. Both variables are categorical, and hence this is an example of Case II.

  2. A study was conducted in order to explore the relationship between the number of beers a person drinks, and his/her Blood Alcohol Level (in %).
    In this example, both the explanatory variable (number of beers) and the response variable (blood alcohol level) are quantitative, and therefore this is an example of Case III.

  3. A study was conducted in order to determine whether longevity (how long a person lives) is somehow related to the person's handedness (right-handed/left-handed).

    In this case the explanatory variable (handedness) is categorical and the response variable (longevity) is quantitative. This is, therefore, an example of Case I.
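The reasoning in these three examples can be written down as a small lookup. The helper below is hypothetical (the name classify_case is mine). It encodes Cases I to III as they are described in these posts, and takes Case IV to be the one remaining combination, quantitative explanatory with categorical response, which is an inference since that case is not worked through here.

```r
# Role-type classification: the case depends only on whether the
# explanatory and the response variable are categorical ("C") or
# quantitative ("Q"). Case IV is inferred as the remaining combination.
classify_case <- function(explanatory, response) {
  key <- paste(explanatory, response, sep = "->")
  switch(key,
         "C->Q" = "Case I",
         "C->C" = "Case II",
         "Q->Q" = "Case III",
         "Q->C" = "Case IV",
         stop("types must be 'C' or 'Q'"))
}

classify_case("C", "C")  # satisfaction by gender
classify_case("Q", "Q")  # blood alcohol level by number of beers
classify_case("C", "Q")  # longevity by handedness
```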

Learning Statistics using R: Data sets used in examples

The actual data is available from the course website, but lately my links to the data sets have not been working; perhaps they have changed? I am also not sure how to upload these data sets to Blogger. So I have uploaded a zip file containing all the data sets to MegaShare. Here is the link to download the file: http://www.MegaShare.com/497357 (Updated: 01-OCT-2008, size: ~5.2 K)