Tuesday, July 15, 2008

Learning Statistics Using R: Role Type Classification : Case I (1 of 1)

As we showed earlier, Case I of our role type classification covers the relationship between a categorical explanatory variable and a quantitative response variable.

We start with an example from the course website.
Example: People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat).

The purpose of the study was to examine whether the number of calories a hot dog has is related to (or affected by) its type. (Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Consumer Reports, June 1986, pp. 366-367.) Answering this question requires us to examine the relationship between the categorical variable Type and the quantitative variable Calories. Because the question of interest is whether the type of hot dog affects calorie content,
  • the explanatory variable is Type, and
  • the response variable is Calories.
To explore how the number of calories is related to the type of hot dog, we need an informative visual display of the data that compares the three types of hot dogs with respect to their calorie content. The display we'll use is side-by-side boxplots, which let us compare the distribution of calorie counts within each category of the explanatory variable, hot dog type. We use R to draw the side-by-side boxplots:
> hotdogdata <- read.csv("TA01_009.TXT", header=TRUE, sep="\t")
> attach(hotdogdata)
> boxplot(Calories[HotDog == 'Beef'], Calories[HotDog == 'Meat'], Calories[HotDog == 'Poul'], 
  border=c('blue', 'magenta','red'), 
  names=c('Beef','Meat', 'Poul'), 
  ylab='Calories', 
  main='Side By Side Comparative Boxplot of Calories')

Gabriele Righetti from Italy pointed out that it should be "names" and not "xlab" for naming boxplots. Thanks Gabriele.
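
As an alternative to subsetting Calories by hand, boxplot() also accepts a formula of the form response ~ explanatory, which avoids attach() and takes the box labels from the group names. The following is only a sketch: it uses a small data frame of made-up values (not the study data) so that it runs without TA01_009.TXT, and the name toy is an assumption for illustration.

```r
# Toy data with MADE-UP values (not the 54-brand study data).
# Column names mirror the ones used in the post.
toy <- data.frame(
  HotDog   = rep(c("Beef", "Meat", "Poul"), each = 5),
  Calories = c(150, 160, 140, 180, 155,
               145, 170, 150, 190, 160,
               100, 130, 110, 140, 120)
)

# Formula interface: Calories (response) split by HotDog (explanatory).
# Boxes are labelled with the group names, so 'names' is not needed.
bp <- boxplot(Calories ~ HotDog, data = toy,
              border = c("blue", "magenta", "red"),
              ylab = "Calories",
              main = "Side By Side Comparative Boxplot of Calories")

# boxplot() invisibly returns the statistics it plotted;
# row 3 of bp$stats holds the median of each group.
print(bp$names)
print(bp$stats[3, ])
```

With the real data, replacing toy with hotdogdata should give the same plot as the command above, assuming the file has columns named HotDog and Calories as used in this post.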

The above command produces side-by-side boxplots of Calories for the three types (Beef, Meat, Poul).
Let's also find the five-number summary for each type using R (summary() reports the mean as well). Here is the output:
> summary(Calories[HotDog=='Beef'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  111.0   140.5   152.5   156.8   177.2   190.0 
> summary(Calories[HotDog=='Meat'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  107.0   139.0   153.0   158.7   179.0   195.0 
> summary(Calories[HotDog=='Poul'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   86.0   102.0   129.0   122.5   143.0   170.0 
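
The three summary() calls can also be collapsed into a single call with tapply(), which applies a function to the response within each level of the grouping variable. This is a sketch on made-up values (not the study data); the name toy is hypothetical.

```r
# Toy data with MADE-UP values (not the study data).
toy <- data.frame(
  HotDog   = rep(c("Beef", "Meat", "Poul"), each = 5),
  Calories = c(150, 160, 140, 180, 155,
               145, 170, 150, 190, 160,
               100, 130, 110, 140, 120)
)

# summary() of Calories within each hot dog type, in one call.
by_type <- tapply(toy$Calories, toy$HotDog, summary)
print(by_type)

# fivenum() returns only the five-number summary (min, lower hinge,
# median, upper hinge, max). Hinges can differ slightly from the
# quartiles that summary() reports.
five <- tapply(toy$Calories, toy$HotDog, fivenum)
print(five)
```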
Let's summarize the results and interpret them in the context of the question we posed. By examining the three side-by-side boxplots and the numerical summaries, we see at once that poultry hot dogs as a group contain fewer calories than beef or meat hot dogs.
The median number of calories in poultry hot dogs (129) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153).
The spread of the three distributions is about the same if we consider the IQR (about 37 for beef, 40 for meat, and 41 for poultry, from the summaries above), but the full ranges vary slightly more (beef: 79, meat: 88, poultry: 84). The general recommendation to the health-conscious consumer is to eat poultry hot dogs.
It should be noted, though, that since each of the three types of hot dogs shows quite a large spread among brands, simply buying a poultry hot dog does not guarantee a low-calorie food. What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case I), we essentially compare the distributions of the quantitative response for each category of the explanatory variable, using side-by-side boxplots supplemented by descriptive statistics.
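
The spread figures discussed above can be checked directly from the printed summaries: the IQR is the third quartile minus the first, and the range is the maximum minus the minimum. The sketch below copies the values from the summary() output in this post; the data frame name q and its column names are assumptions for illustration.

```r
# Values copied from the summary() output printed in this post.
q <- data.frame(
  type = c("Beef", "Meat", "Poul"),
  min  = c(111.0, 107.0,  86.0),
  q1   = c(140.5, 139.0, 102.0),
  q3   = c(177.2, 179.0, 143.0),
  max  = c(190.0, 195.0, 170.0)
)

q$iqr   <- q$q3 - q$q1     # interquartile range
q$range <- q$max - q$min   # full range

print(q[, c("type", "iqr", "range")])
```

This gives IQRs of about 36.7, 40 and 41 and ranges of 79, 88 and 84 for beef, meat and poultry. Because summary() rounds what it prints, the IQRs are approximate; calling IQR() on the raw Calories values for each type would give the unrounded figures.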

In short, the relationship between a categorical explanatory variable and a quantitative response variable is summarized using:
  • Data display: side-by-side boxplots
  • Numerical summaries: descriptive statistics
That is, we compare the distributions of the quantitative response variable across the categories of the explanatory variable (a factor, in R terms).
